Hi Michael,

Unfortunately, I don't have much experience with Spark. But if Spark uses the parquet-mr library in an embedded way (that is how Hive uses it), Spark would need to be rebuilt with the 1.11 RC parquet-mr.
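One way to confirm which parquet-mr actually wins on the classpath at runtime is to print where the Parquet classes are loaded from. A minimal, purely illustrative sketch (the class name is arbitrary), run on the driver or inside a task:

    import org.apache.parquet.hadoop.ParquetOutputFormat;
    import org.apache.parquet.schema.Types;

    public class WhichParquet {
      public static void main(String[] args) {
        // Prints the jar each class was loaded from. If these point at the jars
        // bundled with the Spark distribution rather than the 1.11.0 RC artifacts,
        // the old parquet-hadoop/parquet-column classes are shadowing the new ones.
        System.out.println("parquet-hadoop: " + ParquetOutputFormat.class
            .getProtectionDomain().getCodeSource().getLocation());
        System.out.println("parquet-column: " + Types.class
            .getProtectionDomain().getCodeSource().getLocation());
      }
    }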
Regards,
Gabor

On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <[email protected]> wrote:

> It appears a provided-scope dependency on spark-sql leaking old parquet versions was causing the runtime error below. After including new parquet-column and parquet-hadoop compile-scope dependencies (in addition to parquet-avro, which we already have at compile scope), our build succeeds.
>
> https://github.com/bigdatagenomics/adam/pull/2232
>
> However, when running via spark-submit I run into a similar runtime error:
>
> Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
>         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
>         at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
>         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
>         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
>         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
>         at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
>         at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
>         at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>         at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>         at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>         at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>         at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>         at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>         at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>         at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:123)
>         at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>
> Will bumping our library dependency version to 1.11 require a new version of Spark, built against Parquet 1.11?
>
> Please accept my apologies if this is heading out-of-scope for the Parquet mailing list.
>
> michael
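The method that cannot be resolved is the LogicalTypeAnnotation overload of the schema builder that was introduced in 1.11 and that AvroSchemaConverter now relies on; parquet-column 1.10.x only offers the OriginalType overload, so an old parquet-column resolving first produces exactly this NoSuchMethodError, or a NoClassDefFoundError for LogicalTypeAnnotation as in the stack trace quoted below. A minimal sketch of that kind of call, assuming parquet-column 1.11.0 on the classpath:

    import org.apache.parquet.schema.LogicalTypeAnnotation;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Type;
    import org.apache.parquet.schema.Types;

    public class LogicalTypeAnnotationCheck {
      public static void main(String[] args) {
        // In 1.11, schemas are built with LogicalTypeAnnotation; this is roughly the
        // kind of call made when converting an Avro string field. With a 1.10.x
        // parquet-column on the classpath this overload does not exist.
        Type field = Types.optional(PrimitiveTypeName.BINARY)
            .as(LogicalTypeAnnotation.stringType())
            .named("data");
        System.out.println(field);
      }
    }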
>
> > On Nov 20, 2019, at 10:00 AM, Michael Heuer <[email protected]> wrote:
> >
> > I am willing to do some benchmarking on genomic data at scale but am not quite sure what the Spark target version for 1.11.0 might be. Will Parquet 1.11.0 be compatible with Spark 2.4.x?
> >
> > Updating from 1.10.1 to 1.11.0 breaks at runtime in our build:
> >
> > …
> > D 0, localhost, executor driver): java.lang.NoClassDefFoundError: org/apache/parquet/schema/LogicalTypeAnnotation
> >         at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> >         at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> >         at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> >         at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> >         at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> >         at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> >         at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> >         at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> >         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> >         at org.apache.spark.scheduler.Task.run(Task.scala:123)
> >         at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> >         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >         at java.lang.Thread.run(Thread.java:748)
> > Caused by: java.lang.ClassNotFoundException: org.apache.parquet.schema.LogicalTypeAnnotation
> >         at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> >         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> >         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> >         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> >
> > michael
> >
> >> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <[email protected]> wrote:
> >>
> >> Thanks, Fokko.
> >>
> >> Ryan, we did not do such measurements yet. I'm afraid I won't have enough time to do that in the next couple of weeks.
> >>
> >> Cheers,
> >> Gabor
> >>
> >> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <[email protected]> wrote:
> >>
> >>> Thanks Gabor for the explanation. I'd like to change my vote to +1 (non-binding).
> >>>
> >>> Cheers, Fokko
> >>>
> >>> On Tue, Nov 19, 2019 at 18:03, Ryan Blue <[email protected]> wrote:
> >>>
> >>>> Gabor, what I meant was: have we tried this with real data to see the effect? I think those results would be helpful.
> >>>>
> >>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <[email protected]> wrote:
> >>>>
> >>>>> Hi Ryan,
> >>>>>
> >>>>> It is not easy to calculate. For the column indexes feature we introduced two new structures saved before the footer: column indexes and offset indexes. If the min/max values are not too long, then the truncation might not decrease the file size because of the offset indexes. Moreover, we also introduced parquet.page.row.count.limit, which might increase the number of pages and therefore the file size.
> >>>>> The footer itself has also changed and we are saving more values in it: the offsets of the column/offset indexes, the new logical type structures, the CRC checksums (we might have some others).
> >>>>> So, the size of files with a small amount of data will increase (because of the larger footer). The size of files where the values can be encoded very well (RLE) will probably increase (because we will have more pages). The size of some files where the values are long (>64 bytes by default) might decrease because of truncating the min/max values.
> >>>>>
> >>>>> Regards,
> >>>>> Gabor
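The knobs mentioned above can be experimented with when writing through ParquetOutputFormat (for example from a Spark or MapReduce job) by setting them on the Hadoop Configuration. A minimal sketch; only parquet.page.row.count.limit is named in this thread, while the truncation and checksum key names and the defaults shown are assumptions to be verified against the 1.11.0 sources:

    import org.apache.hadoop.conf.Configuration;

    public class ParquetWriteTuning {
      public static void main(String[] args) {
        // Configuration that would be handed to the job writing through ParquetOutputFormat.
        Configuration conf = new Configuration();

        // Upper bound on rows per page, introduced with column indexes
        // (the 20000 value shown as a default is an assumption).
        conf.setInt("parquet.page.row.count.limit", 20000);

        // Length at which column-index min/max values are truncated
        // (key name and 64-byte default are assumptions).
        conf.setInt("parquet.columnindex.truncate.length", 64);

        // Page-level CRC checksums, written by default in 1.11.0
        // (key name is an assumption); set to false to measure their cost.
        conf.setBoolean("parquet.page.write-checksum.enabled", false);

        System.out.println("parquet.page.row.count.limit = " + conf.get("parquet.page.row.count.limit"));
      }
    }

Writing the same real dataset with different values of these settings, against a 1.10.1 baseline, would give the kind of numbers asked about in the quoted messages below.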
> >>>>>
> >>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <[email protected]> wrote:
> >>>>>
> >>>>>> Gabor, do we have an idea of the additional overhead for a non-test data file? It should be easy to validate that this doesn't introduce an unreasonable amount of overhead. In some cases, it should actually be smaller since the column indexes are truncated and page stats are not.
> >>>>>>
> >>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi Fokko,
> >>>>>>>
> >>>>>>> For the first point: the referenced constructor is private and Iceberg uses it via reflection. It is not a breaking change. I think parquet-mr should not keep private methods around only because clients might use them via reflection.
> >>>>>>>
> >>>>>>> About the checksum: I've agreed on having the CRC checksum write enabled by default because the benchmarks did not show significant performance penalties. See https://github.com/apache/parquet-mr/pull/647 for details.
> >>>>>>>
> >>>>>>> About the file size change: 1.11.0 introduces column indexes and CRC checksums, removes the statistics from the page headers, and has maybe other changes that impact file size. If only the file size is in question, I cannot see a breaking change here.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Gabor
> >>>>>>>
> >>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Unfortunately, a -1 from my side (non-binding).
> >>>>>>>>
> >>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three things:
> >>>>>>>>
> >>>>>>>> - We've broken backward compatibility of the constructor of ColumnChunkPageWriteStore
> >>>>>>>>   <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>.
> >>>>>>>>   This required a change
> >>>>>>>>   <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176>
> >>>>>>>>   to the code.
> >>>>>>>>   This isn't a hard blocker, but if there will be a new RC, I've submitted a patch:
> >>>>>>>>   https://github.com/apache/parquet-mr/pull/699
> >>>>>>>>
> >>>>>>>> - Related, and something we need to put in the changelog: checksums are enabled by default:
> >>>>>>>>   https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> >>>>>>>>   This will impact performance. I would suggest disabling it by default:
> >>>>>>>>   https://github.com/apache/parquet-mr/pull/700
> >>>>>>>>   <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
> >>>>>>>>
> >>>>>>>> - Binary compatibility. While updating Iceberg, I've noticed that the split-test was failing:
> >>>>>>>>   https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> >>>>>>>>   The two records are now divided over four Spark partitions. Something in the output has changed since the files are bigger now. Does anyone have an idea of what changed, or a way to check this? The only thing I can think of is the checksum mentioned above.
> >>>>>>>>
> >>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
> >>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09 /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> >>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05 /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> >>>>>>>>
> >>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> >>>>>>>> id = 1
> >>>>>>>> data = a
> >>>>>>>>
> >>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> >>>>>>>> id = 1
> >>>>>>>> data = a
> >>>>>>>>
> >>>>>>>> A binary diff here:
> >>>>>>>> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> >>>>>>>>
> >>>>>>>> Cheers, Fokko
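One way to see how much of the 562B-to-611B difference is footer metadata (logical type structures, column/offset index offsets) rather than page data is to read the 4-byte footer-length field that precedes the trailing PAR1 magic. A minimal, JDK-only sketch:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class FooterSize {
      public static void main(String[] args) throws IOException {
        // A Parquet file ends with <serialized footer><4-byte little-endian footer length><"PAR1">,
        // so the last 8 bytes are enough to report the serialized footer size of each file.
        for (String name : args) {
          try (RandomAccessFile file = new RandomAccessFile(name, "r")) {
            byte[] tail = new byte[8];
            file.seek(file.length() - tail.length);
            file.readFully(tail);
            int footerLength = ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
            System.out.println(name + ": footer length = " + footerLength + " bytes");
          }
        }
      }
    }

Run as, for example: java FooterSize parquet-1-10-1.parquet parquet-1-11-0.parquet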
> >>>>>>>>
> >>>>>>>> On Sat, Nov 16, 2019 at 04:18, Junjie Chen <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> +1
> >>>>>>>>> Verified signature, checksum and ran mvn install successfully.
> >>>>>>>>>
> >>>>>>>>> Wang, Yuming <[email protected]> wrote on Thu, Nov 14, 2019 at 2:05 PM:
> >>>>>>>>>
> >>>>>>>>>> +1
> >>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module: build/sbt "sql/test-only" -Phadoop-3.2
> >>>>>>>>>>
> >>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi everyone,
> >>>>>>>>>>
> >>>>>>>>>> I propose the following RC to be released as the official Apache Parquet 1.11.0 release.
> >>>>>>>>>>
> >>>>>>>>>> The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
> >>>>>>>>>> * This corresponds to the tag: apache-parquet-1.11.0-rc7
> >>>>>>>>>> * https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
> >>>>>>>>>>
> >>>>>>>>>> The release tarball, signature, and checksums are here:
> >>>>>>>>>> * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
> >>>>>>>>>>
> >>>>>>>>>> You can find the KEYS file here:
> >>>>>>>>>> * https://apache.org/dist/parquet/KEYS
> >>>>>>>>>>
> >>>>>>>>>> Binary artifacts are staged in Nexus here:
> >>>>>>>>>> * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> >>>>>>>>>>
> >>>>>>>>>> This release includes the changes listed at:
> >>>>>>>>>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
> >>>>>>>>>>
> >>>>>>>>>> Please download, verify, and test.
> >>>>>>>>>>
> >>>>>>>>>> Please vote in the next 72 hours.
> >>>>>>>>>>
> >>>>>>>>>> [ ] +1 Release this as Apache Parquet 1.11.0
> >>>>>>>>>> [ ] +0
> >>>>>>>>>> [ ] -1 Do not release this because...
> >>>>>>
> >>>>>> --
> >>>>>> Ryan Blue
> >>>>>> Software Engineer
> >>>>>> Netflix
> >>>>
> >>>> --
> >>>> Ryan Blue
> >>>> Software Engineer
> >>>> Netflix
