Hi Ryan/Gabor,

I will do some tests on real data with checksum enabled.

Xinli

On Wed, Nov 20, 2019 at 1:29 AM Gabor Szadovszky <[email protected]> wrote:

> Thanks, Fokko.
>
> Ryan, we did not do such measurements yet. I'm afraid I won't have enough
> time to do that in the next couple of weeks.
>
> Cheers,
> Gabor
>
> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <[email protected]> wrote:
>
> > Thanks Gabor for the explanation. I'd like to change my vote to +1
> > (non-binding).
> >
> > Cheers, Fokko
> >
> > On Tue, Nov 19, 2019 at 18:03, Ryan Blue <[email protected]> wrote:
> >
> > > Gabor, what I meant was: have we tried this with real data to see the
> > > effect? I think those results would be helpful.
> > >
> > > On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <[email protected]> wrote:
> > >
> > > > Hi Ryan,
> > > >
> > > > It is not easy to calculate. For the column indexes feature we introduced
> > > > two new structures saved before the footer: column indexes and offset
> > > > indexes. If the min/max values are not too long, then the truncation might
> > > > not decrease the file size because of the offset indexes. Moreover, we also
> > > > introduced parquet.page.row.count.limit which might increase the number of
> > > > pages which leads to increasing the file size.
> > > > The footer itself is also changed and we are saving more values in it: the
> > > > offset values to the column/offset indexes, the new logical type
> > > > structures, the CRC checksums (we might have some others).
> > > > So, the size of the files with small amount of data will be increased
> > > > (because of the larger footer). The size of the files where the values can
> > > > be encoded very well (RLE) will probably be increased (because we will have
> > > > more pages). The size of some files where the values are long (>64 bytes by
> > > > default) might be decreased because of truncating the min/max values.
> > > >
> > > > Regards,
> > > > Gabor
> > > >
> > > > On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <[email protected]> wrote:
> > > >
> > > > > Gabor, do we have an idea of the additional overhead for a non-test data
> > > > > file? It should be easy to validate that this doesn't introduce an
> > > > > unreasonable amount of overhead. In some cases, it should actually be
> > > > > smaller since the column indexes are truncated and page stats are not.
> > > > >
> > > > > On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky <[email protected]> wrote:
> > > > >
> > > > > > Hi Fokko,
> > > > > >
> > > > > > For the first point. The referenced constructor is private and Iceberg uses
> > > > > > it via reflection. It is not a breaking change. I think, parquet-mr shall
> > > > > > not keep private methods only because of clients might use them via
> > > > > > reflection.
> > > > > >
> > > > > > About the checksum. I've agreed on having the CRC checksum write enabled by
> > > > > > default because the benchmarks did not show significant performance
> > > > > > penalties. See https://github.com/apache/parquet-mr/pull/647 for
> > > > > > details.
> > > > > >
> > > > > > About the file size change. 1.11.0 is introducing column indexes, CRC
> > > > > > checksum, removing the statistics from the page headers and maybe other
> > > > > > changes that impact file size. If only file size is in question I cannot
> > > > > > see a breaking change here.
> > > > > >
> > > > > > Regards,
> > > > > > Gabor
> > > > > >
> > > > > > On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <[email protected]> wrote:
> > > > > >
> > > > > > > Unfortunately, a -1 from my side (non-binding)
> > > > > > >
> > > > > > > I've updated Iceberg to Parquet 1.11.0, and found three things:
> > > > > > >
> > > > > > >    - We've broken backward compatibility of the constructor of
> > > > > > >    ColumnChunkPageWriteStore
> > > > > > >    <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>.
> > > > > > >    This required a change
> > > > > > >    <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176>
> > > > > > >    to the code. This isn't a hard blocker, but if there will be a new RC,
> > > > > > >    I've submitted a patch: https://github.com/apache/parquet-mr/pull/699
> > > > > > >    - Related, that we need to put in the changelog, is that checksums are
> > > > > > >    enabled by default:
> > > > > > >    https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> > > > > > >    This will impact performance. I would suggest disabling it by default:
> > > > > > >    https://github.com/apache/parquet-mr/pull/700
> > > > > > >    <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
> > > > > > >    - Binary compatibility. While updating Iceberg, I've noticed that the
> > > > > > >    split-test was failing:
> > > > > > >    https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> > > > > > >    The two records are now divided over four Spark partitions. Something in
> > > > > > >    the output has changed since the files are bigger now. Has anyone any idea
> > > > > > >    to check what's changed, or a way to check this? The only thing I can
> > > > > > >    think of is the checksum mentioned above.
> > > > > > >
> > > > > > > $ ls -lah ~/Desktop/parquet-1-1*
> > > > > > > -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09 /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > > > -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05 /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > > >
> > > > > > > $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > > > id = 1
> > > > > > > data = a
> > > > > > >
> > > > > > > $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > > > id = 1
> > > > > > > data = a
> > > > > > >
> > > > > > > A binary diff here:
> > > > > > > https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > > > > > >
> > > > > > > Cheers, Fokko
> > > > > > >
> > > > > > > On Sat, Nov 16, 2019 at 04:18, Junjie Chen <[email protected]> wrote:
> > > > > > >
> > > > > > > > +1
> > > > > > > > Verified signature, checksum and ran mvn install successfully.
> > > > > > > >
> > > > > > > > On Thu, Nov 14, 2019 at 2:05 PM, Wang, Yuming <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > +1
> > > > > > > > > Tested Parquet 1.11.0 with Spark SQL module: build/sbt "sql/test-only" -Phadoop-3.2
> > > > > > > > >
> > > > > > > > > On 2019/11/13, 21:33, "Gabor Szadovszky" <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > >     Hi everyone,
> > > > > > > > >
> > > > > > > > >     I propose the following RC to be released as official Apache Parquet 1.11.0
> > > > > > > > >     release.
> > > > > > > > >
> > > > > > > > >     The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
> > > > > > > > >     * This corresponds to the tag: apache-parquet-1.11.0-rc7
> > > > > > > > >     * https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
> > > > > > > > >
> > > > > > > > >     The release tarball, signature, and checksums are here:
> > > > > > > > >     * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
> > > > > > > > >
> > > > > > > > >     You can find the KEYS file here:
> > > > > > > > >     * https://apache.org/dist/parquet/KEYS
> > > > > > > > >
> > > > > > > > >     Binary artifacts are staged in Nexus here:
> > > > > > > > >     * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > > > > > > > >
> > > > > > > > >     This release includes the changes listed at:
> > > > > > > > >     https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
> > > > > > > > >
> > > > > > > > >     Please download, verify, and test.
> > > > > > > > >
> > > > > > > > >     Please vote in the next 72 hours.
> > > > > > > > >
> > > > > > > > >     [ ] +1 Release this as Apache Parquet 1.11.0
> > > > > > > > >     [ ] +0
> > > > > > > > >     [ ] -1 Do not release this because...
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix

--
Xinli Shang
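
For reference while measuring on real data: the size difference in the ls listing quoted above can be put into numbers with a short Python sketch. The helper name size_overhead is hypothetical (it is not part of parquet-mr or parquet-tools), and the two byte counts are the ones Fokko reported for the same single-record file written by 1.10.1 and 1.11.0. It only quantifies the change and says nothing about which feature causes it.

```python
from typing import Tuple


def size_overhead(old_bytes: int, new_bytes: int) -> Tuple[int, float]:
    """Return the (absolute, percent) size change going from old to new."""
    if old_bytes <= 0:
        raise ValueError("old_bytes must be positive")
    delta = new_bytes - old_bytes
    return delta, 100.0 * delta / old_bytes


# Sizes taken from the listing in this thread (assumed to be exact bytes).
delta, percent = size_overhead(562, 611)
print(f"{delta} bytes larger ({percent:.1f}%)")  # prints "49 bytes larger (8.7%)"
```

For larger files the same function could be fed os.path.getsize() results for the outputs of both writer versions, which would show whether the roughly fixed footer growth seen here stays significant as the data grows.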
