+1

Downloaded release code, checked hashes/signatures, run full tests and
installed locally with zero errors. Tested integration on a downstream
project (Apache Beam) and no issues (Note that we don't use any of the new
features yet).

Gabor, can you please create a corresponding parquet-1.11.x branch. I
expected to compare the release with the branch and tag but I found the
branch is not present.

Thanks,
Ismaël



On Tue, Nov 19, 2019 at 8:35 AM Gabor Szadovszky <ga...@apache.org> wrote:

> Hi Ryan,
>
> It is not easy to calculate. For the column indexes feature we introduced
> two new structures saved before the footer: column indexes and offset
> indexes. If the min/max values are not too long, then the truncation might
> not decrease the file size because of the offset indexes. Moreover, we also
> introduced parquet.page.row.count.limit which might increase the number of
> pages which leads to increasing the file size.
> The footer itself is also changed and we are saving more values in it: the
> offset values to the column/offset indexes, the new logical type
> structures, the CRC checksums (we might have some others).
> So, the size of the files with small amount of data will be increased
> (because of the larger footer). The size of the files where the values can
> be encoded very well (RLE) will probably be increased (because we will have
> more pages). The size of some files where the values are long (>64bytes by
> default) might be decreased because of truncating the min/max values.
>
> Regards,
> Gabor
>
> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
> > Gabor, do we have an idea of the additional overhead for a non-test data
> > file? It should be easy to validate that this doesn't introduce an
> > unreasonable amount of overhead. In some cases, it should actually be
> > smaller since the column indexes are truncated and page stats are not.
> >
> > On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
> > <gabor.szadovs...@cloudera.com.invalid> wrote:
> >
> > > Hi Fokko,
> > >
> > > For the first point. The referenced constructor is private and Iceberg
> > uses
> > > it via reflection. It is not a breaking change. I think, parquet-mr
> shall
> > > not keep private methods only because of clients might use them via
> > > reflection.
> > >
> > > About the checksum. I've agreed on having the CRC checksum write
> enabled
> > by
> > > default because the benchmarks did not show significant performance
> > > penalties. See https://github.com/apache/parquet-mr/pull/647 for
> > details.
> > >
> > > About the file size change. 1.11.0 is introducing column indexes, CRC
> > > checksum, removing the statistics from the page headers and maybe other
> > > changes that impact file size. If only file size is in question I
> cannot
> > > see a breaking change here.
> > >
> > > Regards,
> > > Gabor
> > >
> > >
> > >
> > > On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <fo...@driesprong.frl
> >
> > > wrote:
> > >
> > > > Unfortunately, a -1 from my side (non-binding)
> > > >
> > > > I've updated Iceberg to Parquet 1.11.0, and found three things:
> > > >
> > > >    - We've broken backward compatibility of the constructor of
> > > >    ColumnChunkPageWriteStore
> > > >    <
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80
> > > > >.
> > > >    This required a change
> > > >    <
> > > >
> > >
> >
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176
> > > > >
> > > >    to the code. This isn't a hard blocker, but if there will be a new
> > RC,
> > > > I've
> > > >    submitted a patch: https://github.com/apache/parquet-mr/pull/699
> > > >    - Related, that we need to put in the changelog, is that checksums
> > are
> > > >    enabled by default:
> > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> > > > This
> > > >    will impact performance. I would suggest disabling it by default:
> > > >    https://github.com/apache/parquet-mr/pull/700
> > > >    <
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277
> > > > >
> > > >    - Binary compatibility. While updating Iceberg, I've noticed that
> > the
> > > >    split-test was failing:
> > > >
> > > >
> > >
> >
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> > > > The
> > > >    two records are now divided over four Spark partitions. Something
> in
> > > the
> > > >    output has changed since the files are bigger now. Has anyone any
> > idea
> > > > to
> > > >    check what's changed, or a way to check this? The only thing I can
> > > > think of
> > > >    is the checksum mentioned above.
> > > >
> > > > $ ls -lah ~/Desktop/parquet-1-1*
> > > > -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09
> > > > /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05
> > > > /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > >
> > > > $ parquet-tools cat
> > /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > id = 1
> > > > data = a
> > > >
> > > > $ parquet-tools cat
> > /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > id = 1
> > > > data = a
> > > >
> > > > A binary diff here:
> > > > https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > > >
> > > > Cheers, Fokko
> > > >
> > > > Op za 16 nov. 2019 om 04:18 schreef Junjie Chen <
> > > chenjunjied...@gmail.com
> > > > >:
> > > >
> > > > > +1
> > > > > Verified signature, checksum and ran mvn install successfully.
> > > > >
> > > > > Wang, Yuming <yumw...@ebay.com.invalid> 于2019年11月14日周四 下午2:05写道:
> > > > > >
> > > > > > +1
> > > > > > Tested Parquet 1.11.0 with Spark SQL module: build/sbt
> > > "sql/test-only"
> > > > > -Phadoop-3.2
> > > > > >
> > > > > > On 2019/11/13, 21:33, "Gabor Szadovszky" <ga...@apache.org>
> > wrote:
> > > > > >
> > > > > >     Hi everyone,
> > > > > >
> > > > > >     I propose the following RC to be released as official Apache
> > > > Parquet
> > > > > 1.11.0
> > > > > >     release.
> > > > > >
> > > > > >     The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
> > > > > >     * This corresponds to the tag: apache-parquet-1.11.0-rc7
> > > > > >     *
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Ftree%2F18519eb8e059865652eee3ff0e8593f126701da4&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=ToLFrTB9lU%2FGzH6UpXwy7PAY7kaupbyKAgdghESCfgg%3D&amp;reserved=0
> > > > > >
> > > > > >     The release tarball, signature, and checksums are here:
> > > > > >     *
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdist.apache.org%2Frepos%2Fdist%2Fdev%2Fparquet%2Fapache-parquet-1.11.0-rc7&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=MPaHiYJT7ZcqreAYUkvDvZugthUhRPrySdXpN2ytT5k%3D&amp;reserved=0
> > > > > >
> > > > > >     You can find the KEYS file here:
> > > > > >     *
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapache.org%2Fdist%2Fparquet%2FKEYS&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=IwG4MUGsP2lVzlD4bwZUEPuEAPUg%2FHXRYtxc5CQupBM%3D&amp;reserved=0
> > > > > >
> > > > > >     Binary artifacts are staged in Nexus here:
> > > > > >     *
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Frepository.apache.org%2Fcontent%2Fgroups%2Fstaging%2Forg%2Fapache%2Fparquet%2F&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=lHtqLRQqQFwsyoaLSVaJuau5gxPKsCQFFVJaY8H0tZQ%3D&amp;reserved=0
> > > > > >
> > > > > >     This release includes the changes listed at:
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Fblob%2Fapache-parquet-1.11.0-rc7%2FCHANGES.md&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=82BplI3bLAL6qArLHvVoYReZOk%2BboSP655rI8VX5Q5I%3D&amp;reserved=0
> > > > > >
> > > > > >     Please download, verify, and test.
> > > > > >
> > > > > >     Please vote in the next 72 hours.
> > > > > >
> > > > > >     [ ] +1 Release this as Apache Parquet 1.11.0
> > > > > >     [ ] +0
> > > > > >     [ ] -1 Do not release this because...
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>

Reply via email to