+1 Downloaded release code, checked hashes/signatures, run full tests and installed locally with zero errors. Tested integration on a downstream project (Apache Beam) and no issues (Note that we don't use any of the new features yet).
Gabor, can you please create a corresponding parquet-1.11.x branch. I expected to compare the release with the branch and tag but I found the branch is not present. Thanks, Ismaël On Tue, Nov 19, 2019 at 8:35 AM Gabor Szadovszky <ga...@apache.org> wrote: > Hi Ryan, > > It is not easy to calculate. For the column indexes feature we introduced > two new structures saved before the footer: column indexes and offset > indexes. If the min/max values are not too long, then the truncation might > not decrease the file size because of the offset indexes. Moreover, we also > introduced parquet.page.row.count.limit which might increase the number of > pages which leads to increasing the file size. > The footer itself is also changed and we are saving more values in it: the > offset values to the column/offset indexes, the new logical type > structures, the CRC checksums (we might have some others). > So, the size of the files with small amount of data will be increased > (because of the larger footer). The size of the files where the values can > be encoded very well (RLE) will probably be increased (because we will have > more pages). The size of some files where the values are long (>64bytes by > default) might be decreased because of truncating the min/max values. > > Regards, > Gabor > > On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <rb...@netflix.com.invalid> > wrote: > > > Gabor, do we have an idea of the additional overhead for a non-test data > > file? It should be easy to validate that this doesn't introduce an > > unreasonable amount of overhead. In some cases, it should actually be > > smaller since the column indexes are truncated and page stats are not. > > > > On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky > > <gabor.szadovs...@cloudera.com.invalid> wrote: > > > > > Hi Fokko, > > > > > > For the first point. The referenced constructor is private and Iceberg > > uses > > > it via reflection. It is not a breaking change. I think, parquet-mr > shall > > > not keep private methods only because of clients might use them via > > > reflection. > > > > > > About the checksum. I've agreed on having the CRC checksum write > enabled > > by > > > default because the benchmarks did not show significant performance > > > penalties. See https://github.com/apache/parquet-mr/pull/647 for > > details. > > > > > > About the file size change. 1.11.0 is introducing column indexes, CRC > > > checksum, removing the statistics from the page headers and maybe other > > > changes that impact file size. If only file size is in question I > cannot > > > see a breaking change here. > > > > > > Regards, > > > Gabor > > > > > > > > > > > > On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <fo...@driesprong.frl > > > > > wrote: > > > > > > > Unfortunately, a -1 from my side (non-binding) > > > > > > > > I've updated Iceberg to Parquet 1.11.0, and found three things: > > > > > > > > - We've broken backward compatibility of the constructor of > > > > ColumnChunkPageWriteStore > > > > < > > > > > > > > > > https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80 > > > > >. > > > > This required a change > > > > < > > > > > > > > > > https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176 > > > > > > > > > to the code. This isn't a hard blocker, but if there will be a new > > RC, > > > > I've > > > > submitted a patch: https://github.com/apache/parquet-mr/pull/699 > > > > - Related, that we need to put in the changelog, is that checksums > > are > > > > enabled by default: > > > > > > > > > > > > > > https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54 > > > > This > > > > will impact performance. I would suggest disabling it by default: > > > > https://github.com/apache/parquet-mr/pull/700 > > > > < > > > > > > > > > > https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277 > > > > > > > > > - Binary compatibility. While updating Iceberg, I've noticed that > > the > > > > split-test was failing: > > > > > > > > > > > > > > https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199 > > > > The > > > > two records are now divided over four Spark partitions. Something > in > > > the > > > > output has changed since the files are bigger now. Has anyone any > > idea > > > > to > > > > check what's changed, or a way to check this? The only thing I can > > > > think of > > > > is the checksum mentioned above. > > > > > > > > $ ls -lah ~/Desktop/parquet-1-1* > > > > -rw-r--r-- 1 fokkodriesprong staff 562B 17 nov 21:09 > > > > /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet > > > > -rw-r--r-- 1 fokkodriesprong staff 611B 17 nov 21:05 > > > > /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet > > > > > > > > $ parquet-tools cat > > /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet > > > > id = 1 > > > > data = a > > > > > > > > $ parquet-tools cat > > /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet > > > > id = 1 > > > > data = a > > > > > > > > A binary diff here: > > > > https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8 > > > > > > > > Cheers, Fokko > > > > > > > > Op za 16 nov. 2019 om 04:18 schreef Junjie Chen < > > > chenjunjied...@gmail.com > > > > >: > > > > > > > > > +1 > > > > > Verified signature, checksum and ran mvn install successfully. > > > > > > > > > > Wang, Yuming <yumw...@ebay.com.invalid> 于2019年11月14日周四 下午2:05写道: > > > > > > > > > > > > +1 > > > > > > Tested Parquet 1.11.0 with Spark SQL module: build/sbt > > > "sql/test-only" > > > > > -Phadoop-3.2 > > > > > > > > > > > > On 2019/11/13, 21:33, "Gabor Szadovszky" <ga...@apache.org> > > wrote: > > > > > > > > > > > > Hi everyone, > > > > > > > > > > > > I propose the following RC to be released as official Apache > > > > Parquet > > > > > 1.11.0 > > > > > > release. > > > > > > > > > > > > The commit id is 18519eb8e059865652eee3ff0e8593f126701da4 > > > > > > * This corresponds to the tag: apache-parquet-1.11.0-rc7 > > > > > > * > > > > > > > > > > > > > > > > > > > > > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Ftree%2F18519eb8e059865652eee3ff0e8593f126701da4&data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&sdata=ToLFrTB9lU%2FGzH6UpXwy7PAY7kaupbyKAgdghESCfgg%3D&reserved=0 > > > > > > > > > > > > The release tarball, signature, and checksums are here: > > > > > > * > > > > > > > > > > > > > > > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdist.apache.org%2Frepos%2Fdist%2Fdev%2Fparquet%2Fapache-parquet-1.11.0-rc7&data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&sdata=MPaHiYJT7ZcqreAYUkvDvZugthUhRPrySdXpN2ytT5k%3D&reserved=0 > > > > > > > > > > > > You can find the KEYS file here: > > > > > > * > > > > > > > > > > > > > > > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapache.org%2Fdist%2Fparquet%2FKEYS&data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&sdata=IwG4MUGsP2lVzlD4bwZUEPuEAPUg%2FHXRYtxc5CQupBM%3D&reserved=0 > > > > > > > > > > > > Binary artifacts are staged in Nexus here: > > > > > > * > > > > > > > > > > > > > > > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Frepository.apache.org%2Fcontent%2Fgroups%2Fstaging%2Forg%2Fapache%2Fparquet%2F&data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&sdata=lHtqLRQqQFwsyoaLSVaJuau5gxPKsCQFFVJaY8H0tZQ%3D&reserved=0 > > > > > > > > > > > > This release includes the changes listed at: > > > > > > > > > > > > > > > > > > > > > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Fblob%2Fapache-parquet-1.11.0-rc7%2FCHANGES.md&data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&sdata=82BplI3bLAL6qArLHvVoYReZOk%2BboSP655rI8VX5Q5I%3D&reserved=0 > > > > > > > > > > > > Please download, verify, and test. > > > > > > > > > > > > Please vote in the next 72 hours. > > > > > > > > > > > > [ ] +1 Release this as Apache Parquet 1.11.0 > > > > > > [ ] +0 > > > > > > [ ] -1 Do not release this because... > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > Ryan Blue > > Software Engineer > > Netflix > > >