I forgot to mention: please also move master to 1.12.0-SNAPSHOT, so we can validate that everything transitions correctly and that nobody accidentally merges a PR that ends up in the wrong branch.
On Tue, Nov 19, 2019 at 10:35 AM Ismaël Mejía <ieme...@gmail.com> wrote:

> +1
>
> Downloaded release code, checked hashes/signatures, run full tests and
> installed locally with zero errors. Tested integration on a downstream
> project (Apache Beam) and no issues (Note that we don't use any of the new
> features yet).
>
> Gabor, can you please create a corresponding parquet-1.11.x branch. I
> expected to compare the release with the branch and tag but I found the
> branch is not present.
>
> Thanks,
> Ismaël
>
> On Tue, Nov 19, 2019 at 8:35 AM Gabor Szadovszky <ga...@apache.org> wrote:
>
>> Hi Ryan,
>>
>> It is not easy to calculate. For the column indexes feature we introduced
>> two new structures saved before the footer: column indexes and offset
>> indexes. If the min/max values are not too long, then the truncation might
>> not decrease the file size because of the offset indexes. Moreover, we also
>> introduced parquet.page.row.count.limit which might increase the number of
>> pages which leads to increasing the file size.
>> The footer itself is also changed and we are saving more values in it: the
>> offset values to the column/offset indexes, the new logical type
>> structures, the CRC checksums (we might have some others).
>> So, the size of the files with small amount of data will be increased
>> (because of the larger footer). The size of the files where the values can
>> be encoded very well (RLE) will probably be increased (because we will have
>> more pages). The size of some files where the values are long (>64bytes by
>> default) might be decreased because of truncating the min/max values.
>>
>> Regards,
>> Gabor
>>
>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>> > Gabor, do we have an idea of the additional overhead for a non-test data
>> > file? It should be easy to validate that this doesn't introduce an
>> > unreasonable amount of overhead. In some cases, it should actually be
>> > smaller since the column indexes are truncated and page stats are not.
>> >
>> > On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
>> > <gabor.szadovs...@cloudera.com.invalid> wrote:
>> >
>> > > Hi Fokko,
>> > >
>> > > For the first point. The referenced constructor is private and Iceberg
>> > > uses it via reflection. It is not a breaking change. I think, parquet-mr
>> > > shall not keep private methods only because of clients might use them
>> > > via reflection.
>> > >
>> > > About the checksum. I've agreed on having the CRC checksum write enabled
>> > > by default because the benchmarks did not show significant performance
>> > > penalties. See https://github.com/apache/parquet-mr/pull/647 for details.
>> > >
>> > > About the file size change. 1.11.0 is introducing column indexes, CRC
>> > > checksum, removing the statistics from the page headers and maybe other
>> > > changes that impact file size. If only file size is in question I cannot
>> > > see a breaking change here.
>> > >
>> > > Regards,
>> > > Gabor
>> > >
>> > > On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <fo...@driesprong.frl>
>> > > wrote:
>> > >
>> > > > Unfortunately, a -1 from my side (non-binding)
>> > > >
>> > > > I've updated Iceberg to Parquet 1.11.0, and found three things:
>> > > >
>> > > > - We've broken backward compatibility of the constructor of
>> > > >   ColumnChunkPageWriteStore
>> > > >   <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>.
>> > > >   This required a change
>> > > >   <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176>
>> > > >   to the code. This isn't a hard blocker, but if there will be a new RC,
>> > > >   I've submitted a patch: https://github.com/apache/parquet-mr/pull/699
>> > > > - Related, that we need to put in the changelog, is that checksums are
>> > > >   enabled by default:
>> > > >   https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
>> > > >   This will impact performance. I would suggest disabling it by default:
>> > > >   https://github.com/apache/parquet-mr/pull/700
>> > > >   <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
>> > > > - Binary compatibility. While updating Iceberg, I've noticed that the
>> > > >   split-test was failing:
>> > > >   https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
>> > > >   The two records are now divided over four Spark partitions. Something
>> > > >   in the output has changed since the files are bigger now. Has anyone
>> > > >   any idea to check what's changed, or a way to check this? The only
>> > > >   thing I can think of is the checksum mentioned above.
>> > > >
>> > > > $ ls -lah ~/Desktop/parquet-1-1*
>> > > > -rw-r--r--  1 fokkodriesprong  staff  562B 17 nov 21:09 /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>> > > > -rw-r--r--  1 fokkodriesprong  staff  611B 17 nov 21:05 /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>> > > >
>> > > > $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>> > > > id = 1
>> > > > data = a
>> > > >
>> > > > $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>> > > > id = 1
>> > > > data = a
>> > > >
>> > > > A binary diff here:
>> > > > https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
>> > > >
>> > > > Cheers, Fokko
>> > > >
>> > > > On Sat, Nov 16, 2019 at 04:18, Junjie Chen <chenjunjied...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > +1
>> > > > > Verified signature, checksum and ran mvn install successfully.
>> > > > >
>> > > > > On Thu, Nov 14, 2019 at 2:05 PM, Wang, Yuming
>> > > > > <yumw...@ebay.com.invalid> wrote:
>> > > > > >
>> > > > > > +1
>> > > > > > Tested Parquet 1.11.0 with Spark SQL module:
>> > > > > > build/sbt "sql/test-only" -Phadoop-3.2
>> > > > > >
>> > > > > > On 2019/11/13, 21:33, "Gabor Szadovszky" <ga...@apache.org> wrote:
>> > > > > >
>> > > > > >     Hi everyone,
>> > > > > >
>> > > > > >     I propose the following RC to be released as official Apache
>> > > > > >     Parquet 1.11.0 release.
>> > > > > >
>> > > > > >     The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
>> > > > > >     * This corresponds to the tag: apache-parquet-1.11.0-rc7
>> > > > > >     * https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Ftree%2F18519eb8e059865652eee3ff0e8593f126701da4&data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&sdata=ToLFrTB9lU%2FGzH6UpXwy7PAY7kaupbyKAgdghESCfgg%3D&reserved=0
>> > > > > >
>> > > > > >     The release tarball, signature, and checksums are here:
>> > > > > >     * https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdist.apache.org%2Frepos%2Fdist%2Fdev%2Fparquet%2Fapache-parquet-1.11.0-rc7&data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&sdata=MPaHiYJT7ZcqreAYUkvDvZugthUhRPrySdXpN2ytT5k%3D&reserved=0
>> > > > > >
>> > > > > >     You can find the KEYS file here:
>> > > > > >     * https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapache.org%2Fdist%2Fparquet%2FKEYS&data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&sdata=IwG4MUGsP2lVzlD4bwZUEPuEAPUg%2FHXRYtxc5CQupBM%3D&reserved=0
>> > > > > >
>> > > > > >     Binary artifacts are staged in Nexus here:
>> > > > > >     * https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Frepository.apache.org%2Fcontent%2Fgroups%2Fstaging%2Forg%2Fapache%2Fparquet%2F&data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&sdata=lHtqLRQqQFwsyoaLSVaJuau5gxPKsCQFFVJaY8H0tZQ%3D&reserved=0
>> > > > > >
>> > > > > >     This release includes the changes listed at:
>> > > > > >     https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Fblob%2Fapache-parquet-1.11.0-rc7%2FCHANGES.md&data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&sdata=82BplI3bLAL6qArLHvVoYReZOk%2BboSP655rI8VX5Q5I%3D&reserved=0
>> > > > > >
>> > > > > >     Please download, verify, and test.
>> > > > > >
>> > > > > >     Please vote in the next 72 hours.
>> > > > > >
>> > > > > >     [ ] +1 Release this as Apache Parquet 1.11.0
>> > > > > >     [ ] +0
>> > > > > >     [ ] -1 Do not release this because...
>> >
>> > --
>> > Ryan Blue
>> > Software Engineer
>> > Netflix
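
A note on the "download, verify" step that several voters in this thread describe: the checksum part of that check could look roughly like the Python sketch below. It is only an illustration under stated assumptions, not the project's official procedure. The function names are made up for this example, the file names in the usage comment are hypothetical, and the parsing assumes the checksum file is either "<hex digest>  <filename>" or "<filename>: <hex digest in groups>". Signature verification against the KEYS file is a separate step and is not covered here.

```python
import hashlib


def sha512_hex(path, chunk_size=1 << 20):
    """Return the SHA-512 digest of a file as lowercase hex, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_text(text):
    """Extract the hex digest from the text of a checksum file.

    Assumption (not confirmed by the thread): the file uses one of two
    layouts, "<digest>  <filename>" or "<filename>: <digest in groups>".
    """
    text = text.strip()
    if ":" in text:
        # "<filename>: AAAA BBBB ..." -> keep what follows the last colon
        # and drop the whitespace that separates the hex groups.
        digest = "".join(text.rsplit(":", 1)[1].split())
    else:
        # "<digest>  <filename>" -> the digest is the first token.
        digest = text.split()[0]
    return digest.lower()


def checksum_matches(artifact_path, checksum_text):
    """True if the artifact's SHA-512 equals the digest in checksum_text."""
    return sha512_hex(artifact_path) == parse_checksum_text(checksum_text)


# Hypothetical usage; these file names are placeholders, not taken from the RC:
#   with open("release.tar.gz.sha512") as f:
#       print(checksum_matches("release.tar.gz", f.read()))
```

Reading the file in chunks keeps memory use flat for large tarballs, and lowercasing both sides avoids false mismatches when the published digest is uppercase.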