Hi Ryan, It is not easy to calculate. For the column indexes feature we introduced two new structures saved before the footer: column indexes and offset indexes. If the min/max values are not too long, then the truncation might not decrease the file size because of the offset indexes. Moreover, we also introduced parquet.page.row.count.limit which might increase the number of pages which leads to increasing the file size. The footer itself is also changed and we are saving more values in it: the offset values to the column/offset indexes, the new logical type structures, the CRC checksums (we might have some others). So, the size of the files with small amount of data will be increased (because of the larger footer). The size of the files where the values can be encoded very well (RLE) will probably be increased (because we will have more pages). The size of some files where the values are long (>64bytes by default) might be decreased because of truncating the min/max values.
Regards, Gabor On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <[email protected]> wrote: > Gabor, do we have an idea of the additional overhead for a non-test data > file? It should be easy to validate that this doesn't introduce an > unreasonable amount of overhead. In some cases, it should actually be > smaller since the column indexes are truncated and page stats are not. > > On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky > <[email protected]> wrote: > > > Hi Fokko, > > > > For the first point. The referenced constructor is private and Iceberg > uses > > it via reflection. It is not a breaking change. I think, parquet-mr shall > > not keep private methods only because of clients might use them via > > reflection. > > > > About the checksum. I've agreed on having the CRC checksum write enabled > by > > default because the benchmarks did not show significant performance > > penalties. See https://github.com/apache/parquet-mr/pull/647 for > details. > > > > About the file size change. 1.11.0 is introducing column indexes, CRC > > checksum, removing the statistics from the page headers and maybe other > > changes that impact file size. If only file size is in question I cannot > > see a breaking change here. > > > > Regards, > > Gabor > > > > > > > > On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <[email protected]> > > wrote: > > > > > Unfortunately, a -1 from my side (non-binding) > > > > > > I've updated Iceberg to Parquet 1.11.0, and found three things: > > > > > > - We've broken backward compatibility of the constructor of > > > ColumnChunkPageWriteStore > > > < > > > > > > https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80 > > > >. > > > This required a change > > > < > > > > > > https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176 > > > > > > > to the code. This isn't a hard blocker, but if there will be a new > RC, > > > I've > > > submitted a patch: https://github.com/apache/parquet-mr/pull/699 > > > - Related, that we need to put in the changelog, is that checksums > are > > > enabled by default: > > > > > > > > > https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54 > > > This > > > will impact performance. I would suggest disabling it by default: > > > https://github.com/apache/parquet-mr/pull/700 > > > < > > > > > > https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277 > > > > > > > - Binary compatibility. While updating Iceberg, I've noticed that > the > > > split-test was failing: > > > > > > > > > https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199 > > > The > > > two records are now divided over four Spark partitions. Something in > > the > > > output has changed since the files are bigger now. Has anyone any > idea > > > to > > > check what's changed, or a way to check this? The only thing I can > > > think of > > > is the checksum mentioned above. > > > > > > $ ls -lah ~/Desktop/parquet-1-1* > > > -rw-r--r-- 1 fokkodriesprong staff 562B 17 nov 21:09 > > > /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet > > > -rw-r--r-- 1 fokkodriesprong staff 611B 17 nov 21:05 > > > /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet > > > > > > $ parquet-tools cat > /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet > > > id = 1 > > > data = a > > > > > > $ parquet-tools cat > /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet > > > id = 1 > > > data = a > > > > > > A binary diff here: > > > https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8 > > > > > > Cheers, Fokko > > > > > > Op za 16 nov. 2019 om 04:18 schreef Junjie Chen < > > [email protected] > > > >: > > > > > > > +1 > > > > Verified signature, checksum and ran mvn install successfully. > > > > > > > > Wang, Yuming <[email protected]> 于2019年11月14日周四 下午2:05写道: > > > > > > > > > > +1 > > > > > Tested Parquet 1.11.0 with Spark SQL module: build/sbt > > "sql/test-only" > > > > -Phadoop-3.2 > > > > > > > > > > On 2019/11/13, 21:33, "Gabor Szadovszky" <[email protected]> > wrote: > > > > > > > > > > Hi everyone, > > > > > > > > > > I propose the following RC to be released as official Apache > > > Parquet > > > > 1.11.0 > > > > > release. > > > > > > > > > > The commit id is 18519eb8e059865652eee3ff0e8593f126701da4 > > > > > * This corresponds to the tag: apache-parquet-1.11.0-rc7 > > > > > * > > > > > > > > > > > > > > > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Ftree%2F18519eb8e059865652eee3ff0e8593f126701da4&data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&sdata=ToLFrTB9lU%2FGzH6UpXwy7PAY7kaupbyKAgdghESCfgg%3D&reserved=0 > > > > > > > > > > The release tarball, signature, and checksums are here: > > > > > * > > > > > > > > > > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdist.apache.org%2Frepos%2Fdist%2Fdev%2Fparquet%2Fapache-parquet-1.11.0-rc7&data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&sdata=MPaHiYJT7ZcqreAYUkvDvZugthUhRPrySdXpN2ytT5k%3D&reserved=0 > > > > > > > > > > You can find the KEYS file here: > > > > > * > > > > > > > > > > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapache.org%2Fdist%2Fparquet%2FKEYS&data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&sdata=IwG4MUGsP2lVzlD4bwZUEPuEAPUg%2FHXRYtxc5CQupBM%3D&reserved=0 > > > > > > > > > > Binary artifacts are staged in Nexus here: > > > > > * > > > > > > > > > > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Frepository.apache.org%2Fcontent%2Fgroups%2Fstaging%2Forg%2Fapache%2Fparquet%2F&data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&sdata=lHtqLRQqQFwsyoaLSVaJuau5gxPKsCQFFVJaY8H0tZQ%3D&reserved=0 > > > > > > > > > > This release includes the changes listed at: > > > > > > > > > > > > > > > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Fblob%2Fapache-parquet-1.11.0-rc7%2FCHANGES.md&data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&sdata=82BplI3bLAL6qArLHvVoYReZOk%2BboSP655rI8VX5Q5I%3D&reserved=0 > > > > > > > > > > Please download, verify, and test. > > > > > > > > > > Please vote in the next 72 hours. > > > > > > > > > > [ ] +1 Release this as Apache Parquet 1.11.0 > > > > > [ ] +0 > > > > > [ ] -1 Do not release this because... > > > > > > > > > > > > > > > > > > > > > > -- > Ryan Blue > Software Engineer > Netflix >
