Hi Ryan/Gabor,

I will do some tests on real data with checksum enabled.
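
One way to run such a comparison, as a minimal sketch. It assumes the same data set is written once with 1.10.1 and once with 1.11.0 into two directories with matching file names; that layout and the helper names size_overhead and compare_dirs are hypothetical, not part of parquet-mr. The sanity check at the end uses the two file sizes from Fokko's listing quoted further down (562 B and 611 B).

```python
import os


def size_overhead(old_bytes, new_bytes):
    """Return (absolute, relative) growth from old_bytes to new_bytes."""
    if old_bytes <= 0:
        raise ValueError("old_bytes must be positive")
    delta = new_bytes - old_bytes
    return delta, delta / old_bytes


def compare_dirs(old_dir, new_dir):
    """Compare same-named .parquet files written by two writer versions.

    Hypothetical layout: old_dir and new_dir hold the same data set,
    written by 1.10.1 and 1.11.0 respectively.
    """
    report = {}
    for name in sorted(os.listdir(old_dir)):
        new_path = os.path.join(new_dir, name)
        # Skip anything that is not a Parquet file present on both sides.
        if not name.endswith(".parquet") or not os.path.isfile(new_path):
            continue
        old_size = os.path.getsize(os.path.join(old_dir, name))
        report[name] = size_overhead(old_size, os.path.getsize(new_path))
    return report


# Sizes from the ls output quoted in this thread: 562 B vs 611 B.
delta, ratio = size_overhead(562, 611)
print(f"{delta} bytes, {ratio:.1%}")
```

For the tiny two-column test file this is 49 bytes on 562, so the relative number looks large; whether the same holds for files with real data is what the measurement should show.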

Xinli

On Wed, Nov 20, 2019 at 1:29 AM Gabor Szadovszky <[email protected]> wrote:

> Thanks, Fokko.
>
> Ryan, we have not done such measurements yet. I'm afraid I won't have enough
> time to do that in the next couple of weeks.
>
> Cheers,
> Gabor
>
> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <[email protected]>
> wrote:
>
> > Thanks Gabor for the explanation. I'd like to change my vote to +1
> > (non-binding).
> >
> > Cheers, Fokko
> >
> > On Tue, Nov 19, 2019 at 6:03 PM Ryan Blue <[email protected]> wrote:
> >
> > > Gabor, what I meant was: have we tried this with real data to see the
> > > effect? I think those results would be helpful.
> > >
> > > On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <[email protected]>
> > > wrote:
> > >
> > > > Hi Ryan,
> > > >
> > > > It is not easy to calculate. For the column indexes feature we
> > > > introduced two new structures saved before the footer: column indexes
> > > > and offset indexes. If the min/max values are not too long, then the
> > > > truncation might not decrease the file size because of the offset
> > > > indexes. Moreover, we also introduced parquet.page.row.count.limit,
> > > > which might increase the number of pages and therefore the file size.
> > > > The footer itself has also changed and we are saving more values in
> > > > it: the offset values to the column/offset indexes, the new logical
> > > > type structures, the CRC checksums (we might have some others).
> > > > So, the size of files with a small amount of data will increase
> > > > (because of the larger footer). The size of files where the values
> > > > can be encoded very well (RLE) will probably increase (because we
> > > > will have more pages). The size of some files where the values are
> > > > long (>64 bytes by default) might decrease because of truncating the
> > > > min/max values.
> > > >
> > > > Regards,
> > > > Gabor
> > > >
> > > > On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <[email protected]>
> > > > wrote:
> > > >
> > > > > Gabor, do we have an idea of the additional overhead for a non-test
> > > > > data file? It should be easy to validate that this doesn't introduce
> > > > > an unreasonable amount of overhead. In some cases, it should actually
> > > > > be smaller since the column indexes are truncated and page stats are
> > > > > not.
> > > > >
> > > > > On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi Fokko,
> > > > > >
> > > > > > For the first point. The referenced constructor is private and
> > > > > > Iceberg uses it via reflection. It is not a breaking change. I think
> > > > > > parquet-mr should not keep private methods only because clients
> > > > > > might use them via reflection.
> > > > > >
> > > > > > About the checksum. I've agreed on having the CRC checksum write
> > > > > > enabled by default because the benchmarks did not show significant
> > > > > > performance penalties. See
> > > > > > https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_parquet-2Dmr_pull_647&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=FQ88AmOZ4TMjDdqNBGu-ag&m=CoznEc8bzT5Gkp9UNE3EwFMcEadunf3b2ewl8BcbNjI&s=1QqG8osJ05dcq3kAsALygqXBr-LVzQdSs_hRCp3ljWg&e=
> > > > > > for details.
> > > > > >
> > > > > > About the file size change. 1.11.0 is introducing column indexes and
> > > > > > CRC checksums, removing the statistics from the page headers, and
> > > > > > maybe other changes that impact file size. If only file size is in
> > > > > > question, I cannot see a breaking change here.
> > > > > >
> > > > > > Regards,
> > > > > > Gabor
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Unfortunately, a -1 from my side (non-binding)
> > > > > > >
> > > > > > > I've updated Iceberg to Parquet 1.11.0, and found three things:
> > > > > > >
> > > > > > >    - We've broken backward compatibility of the constructor of
> > > > > > >    ColumnChunkPageWriteStore
> > > > > > >    <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_parquet-2Dmr_commit_e7db9e20f52c925a207ea62d6dda6dc4e870294e-23diff-2Dd007a18083a2431c30a5416f248e0a4bR80&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=FQ88AmOZ4TMjDdqNBGu-ag&m=CoznEc8bzT5Gkp9UNE3EwFMcEadunf3b2ewl8BcbNjI&s=kjdKf5G4aBgAjWGMzvaBHm3qQwrn1lyDYjFfWjdqKbc&e=>.
> > > > > > >    This required a change
> > > > > > >    <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_incubator-2Diceberg_pull_297_files-23diff-2Db877faa96f292b851c75fe8bcc1912f8R176&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=FQ88AmOZ4TMjDdqNBGu-ag&m=CoznEc8bzT5Gkp9UNE3EwFMcEadunf3b2ewl8BcbNjI&s=wGGSIQ9tS_WY5Xx6OLcgVlPblirY01kM_W9o0YmzG28&e=>
> > > > > > >    to the code. This isn't a hard blocker, but if there will be a new
> > > > > > >    RC, I've submitted a patch:
> > > > > > >    https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_parquet-2Dmr_pull_699&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=FQ88AmOZ4TMjDdqNBGu-ag&m=CoznEc8bzT5Gkp9UNE3EwFMcEadunf3b2ewl8BcbNjI&s=404yTBuM-XBj9OfBM_x5artWTHDOFnZLj3iuCT3n0iU&e=
> > > > > > >    - Related, that we need to put in the changelog, is that checksums
> > > > > > >    are enabled by default:
> > > > > > >    https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_parquet-2Dmr_blob_master_parquet-2Dcolumn_src_main_java_org_apache_parquet_column_ParquetProperties.java-23L54&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=FQ88AmOZ4TMjDdqNBGu-ag&m=CoznEc8bzT5Gkp9UNE3EwFMcEadunf3b2ewl8BcbNjI&s=N1r-at2NYbuKi71Z6xwzy2c6DjtbpbOzc2gOHcsrlkk&e=
> > > > > > >    This will impact performance. I would suggest disabling it by
> > > > > > >    default:
> > > > > > >    https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_parquet-2Dmr_pull_700&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=FQ88AmOZ4TMjDdqNBGu-ag&m=CoznEc8bzT5Gkp9UNE3EwFMcEadunf3b2ewl8BcbNjI&s=s3EiQII3WgIqr0yiiUQKHa33W9vxw1oOCh5Rh5VHraQ&e=
> > > > > > >    <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_parquet-2Dmr_commit_e7db9e20f52c925a207ea62d6dda6dc4e870294e-23diff-2Dd007a18083a2431c30a5416f248e0a4bR277&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=FQ88AmOZ4TMjDdqNBGu-ag&m=CoznEc8bzT5Gkp9UNE3EwFMcEadunf3b2ewl8BcbNjI&s=I-IP-iYjMxPh-25Sog01-VziM_wp0v1riNYPfJRiVpM&e=>
> > > > > > >    - Binary compatibility. While updating Iceberg, I've noticed that
> > > > > > >    the split-test was failing:
> > > > > > >    https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_incubator-2Diceberg_pull_297_files-23diff-2D4b64b7014f259be41b26cfb73d3e6e93L199&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=FQ88AmOZ4TMjDdqNBGu-ag&m=CoznEc8bzT5Gkp9UNE3EwFMcEadunf3b2ewl8BcbNjI&s=9LPLuFRv8lnWCGDVNU-FoGSLuY_GofaQ-tYA_jgRZPQ&e=
> > > > > > >    The two records are now divided over four Spark partitions.
> > > > > > >    Something in the output has changed since the files are bigger
> > > > > > >    now. Does anyone have an idea what has changed, or a way to check
> > > > > > >    this? The only thing I can think of is the checksum mentioned
> > > > > > >    above.
> > > > > > >
> > > > > > > $ ls -lah ~/Desktop/parquet-1-1*
> > > > > > > -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09 /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > > > -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05 /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > > >
> > > > > > > $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > > > id = 1
> > > > > > > data = a
> > > > > > >
> > > > > > > $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > > > id = 1
> > > > > > > data = a
> > > > > > >
> > > > > > > A binary diff is here:
> > > > > > > https://urldefense.proofpoint.com/v2/url?u=https-3A__gist.github.com_Fokko_1c209f158299dc2fb5878c5bae4bf6d8&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=FQ88AmOZ4TMjDdqNBGu-ag&m=CoznEc8bzT5Gkp9UNE3EwFMcEadunf3b2ewl8BcbNjI&s=sMuaS6b28yXjYslQvpfSzR_ocwBjXx1kM6bXa7Nue_c&e=
> > > > > > >
> > > > > > > Cheers, Fokko
> > > > > > >
> > > > > > > On Sat, Nov 16, 2019 at 4:18 AM Junjie Chen <[email protected]> wrote:
> > > > > > >
> > > > > > > > +1
> > > > > > > > Verified signature, checksum and ran mvn install successfully.
> > > > > > > >
> > > > > > > > On Thu, Nov 14, 2019 at 2:05 PM Wang, Yuming <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > +1
> > > > > > > > > Tested Parquet 1.11.0 with Spark SQL module:
> > > > > > > > > build/sbt "sql/test-only" -Phadoop-3.2
> > > > > > > > >
> > > > > > > > > On 2019/11/13, 21:33, "Gabor Szadovszky" <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > >     Hi everyone,
> > > > > > > > >
> > > > > > > > > >     I propose the following RC to be released as official Apache
> > > > > > > > > >     Parquet 1.11.0 release.
> > > > > > > > > >
> > > > > > > > > >     The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
> > > > > > > > > >     * This corresponds to the tag: apache-parquet-1.11.0-rc7
> > > > > > > > >     *
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__nam01.safelinks.protection.outlook.com_-3Furl-3Dhttps-253A-252F-252Fgithub.com-252Fapache-252Fparquet-2Dmr-252Ftree-252F18519eb8e059865652eee3ff0e8593f126701da4-26amp-3Bdata-3D02-257C01-257Cyumwang-2540ebay.com-257C8d588ca5855842a94bed08d7683e1221-257C46326bff992841a0baca17c16c94ea99-257C0-257C0-257C637092488114756267-26amp-3Bsdata-3DToLFrTB9lU-252FGzH6UpXwy7PAY7kaupbyKAgdghESCfgg-253D-26amp-3Breserved-3D0&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=FQ88AmOZ4TMjDdqNBGu-ag&m=CoznEc8bzT5Gkp9UNE3EwFMcEadunf3b2ewl8BcbNjI&s=UVepLy1MDaX4CT1EPcDESsF_lCp6B_Wf73oJw4j_xnE&e=
> > > > > > > > >
> > > > > > > > >     The release tarball, signature, and checksums are here:
> > > > > > > > >     *
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__nam01.safelinks.protection.outlook.com_-3Furl-3Dhttps-253A-252F-252Fdist.apache.org-252Frepos-252Fdist-252Fdev-252Fparquet-252Fapache-2Dparquet-2D1.11.0-2Drc7-26amp-3Bdata-3D02-257C01-257Cyumwang-2540ebay.com-257C8d588ca5855842a94bed08d7683e1221-257C46326bff992841a0baca17c16c94ea99-257C0-257C0-257C637092488114756267-26amp-3Bsdata-3DMPaHiYJT7ZcqreAYUkvDvZugthUhRPrySdXpN2ytT5k-253D-26amp-3Breserved-3D0&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=FQ88AmOZ4TMjDdqNBGu-ag&m=CoznEc8bzT5Gkp9UNE3EwFMcEadunf3b2ewl8BcbNjI&s=ZwnVnpNGRVFQ3_Hw_sJoUBl6U3CCbT0-uTRMzUQiKJc&e=
> > > > > > > > >
> > > > > > > > >     You can find the KEYS file here:
> > > > > > > > >     *
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__nam01.safelinks.protection.outlook.com_-3Furl-3Dhttps-253A-252F-252Fapache.org-252Fdist-252Fparquet-252FKEYS-26amp-3Bdata-3D02-257C01-257Cyumwang-2540ebay.com-257C8d588ca5855842a94bed08d7683e1221-257C46326bff992841a0baca17c16c94ea99-257C0-257C0-257C637092488114756267-26amp-3Bsdata-3DIwG4MUGsP2lVzlD4bwZUEPuEAPUg-252FHXRYtxc5CQupBM-253D-26amp-3Breserved-3D0&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=FQ88AmOZ4TMjDdqNBGu-ag&m=CoznEc8bzT5Gkp9UNE3EwFMcEadunf3b2ewl8BcbNjI&s=RA0T1Q_BTgA6gwN8EK2CBeZ0nf7340zDgEMadjjqXmQ&e=
> > > > > > > > >
> > > > > > > > >     Binary artifacts are staged in Nexus here:
> > > > > > > > >     *
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__nam01.safelinks.protection.outlook.com_-3Furl-3Dhttps-253A-252F-252Frepository.apache.org-252Fcontent-252Fgroups-252Fstaging-252Forg-252Fapache-252Fparquet-252F-26amp-3Bdata-3D02-257C01-257Cyumwang-2540ebay.com-257C8d588ca5855842a94bed08d7683e1221-257C46326bff992841a0baca17c16c94ea99-257C0-257C0-257C637092488114756267-26amp-3Bsdata-3DlHtqLRQqQFwsyoaLSVaJuau5gxPKsCQFFVJaY8H0tZQ-253D-26amp-3Breserved-3D0&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=FQ88AmOZ4TMjDdqNBGu-ag&m=CoznEc8bzT5Gkp9UNE3EwFMcEadunf3b2ewl8BcbNjI&s=kdM7O8WCtNwj3f7wg3YHQZu2kAaBfh4QjWfG3i5b690&e=
> > > > > > > > >
> > > > > > > > >     This release includes the changes listed at:
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__nam01.safelinks.protection.outlook.com_-3Furl-3Dhttps-253A-252F-252Fgithub.com-252Fapache-252Fparquet-2Dmr-252Fblob-252Fapache-2Dparquet-2D1.11.0-2Drc7-252FCHANGES.md-26amp-3Bdata-3D02-257C01-257Cyumwang-2540ebay.com-257C8d588ca5855842a94bed08d7683e1221-257C46326bff992841a0baca17c16c94ea99-257C0-257C0-257C637092488114756267-26amp-3Bsdata-3D82BplI3bLAL6qArLHvVoYReZOk-252BboSP655rI8VX5Q5I-253D-26amp-3Breserved-3D0&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=FQ88AmOZ4TMjDdqNBGu-ag&m=CoznEc8bzT5Gkp9UNE3EwFMcEadunf3b2ewl8BcbNjI&s=Pg6nebaAqfj7qh-_b_3PStcrWu-dpBVbjtY9OLp4_G4&e=
> > > > > > > > >
> > > > > > > > >     Please download, verify, and test.
> > > > > > > > >
> > > > > > > > >     Please vote in the next 72 hours.
> > > > > > > > >
> > > > > > > > >     [ ] +1 Release this as Apache Parquet 1.11.0
> > > > > > > > >     [ ] +0
> > > > > > > > >     [ ] -1 Do not release this because...
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > > > >
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> >
>


-- 
Xinli Shang
