Gabor, shouldn't Parquet be binary compatible for public APIs? From the
stack trace, it looks like this 1.11.0 RC breaks binary compatibility in
the type builders.

Looks like this should have been caught by the binary compatibility checks.
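
For anyone who wants to see the failure mode concretely, here is a minimal,
hypothetical reproduction (the class and field names are mine, not from the
report below): compile it against parquet-column 1.11.0, then run it with a
1.10.x jar on the classpath.

    import org.apache.parquet.schema.LogicalTypeAnnotation;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Type;
    import org.apache.parquet.schema.Types;

    public class BinaryCompatRepro {
      public static void main(String[] args) {
        // Compiles against the 1.11.0 builder API, where as() takes a
        // LogicalTypeAnnotation. A 1.10.x runtime has neither that class nor
        // that overload, so the JVM fails at link time with
        // NoClassDefFoundError or NoSuchMethodError, as in the traces below.
        Type field = Types.required(PrimitiveTypeName.BINARY)
            .as(LogicalTypeAnnotation.stringType())
            .named("data");
        System.out.println(field);
      }
    }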

On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <[email protected]> wrote:

> Hi Michael,
>
> Unfortunately, I don't have much experience with Spark. But if Spark uses
> the parquet-mr library in an embedded way (that's how Hive uses it), Spark
> needs to be rebuilt with the 1.11 RC of parquet-mr.
>
> Regards,
> Gabor
>
> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <[email protected]> wrote:
>
> > It appears a provided-scope dependency on spark-sql was leaking old
> > Parquet versions, causing the runtime error below.  After including new
> > parquet-column and parquet-hadoop compile-scope dependencies (in addition
> > to parquet-avro, which we already have at compile scope), our build
> > succeeds.
> >
> > https://github.com/bigdatagenomics/adam/pull/2232
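> >
> > For reference, a sketch of the compile-scope pins (version numbers are
> > illustrative; see the PR above for the real change):
> >
> >   <dependency>
> >     <groupId>org.apache.parquet</groupId>
> >     <artifactId>parquet-column</artifactId>
> >     <version>1.11.0</version>
> >   </dependency>
> >   <dependency>
> >     <groupId>org.apache.parquet</groupId>
> >     <artifactId>parquet-hadoop</artifactId>
> >     <version>1.11.0</version>
> >   </dependency>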
> >
> > However, when running via spark-submit I run into a similar runtime error
> >
> > Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> >         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> >         at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
> >         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
> >         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
> >         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
> >         at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
> >         at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
> >         at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> >         at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> >         at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> >         at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> >         at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> >         at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> >         at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> >         at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> >         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> >         at org.apache.spark.scheduler.Task.run(Task.scala:123)
> >         at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> >         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >         at java.lang.Thread.run(Thread.java:748)
> >
> >
> > Will bumping our library dependency version to 1.11 require a new version
> > of Spark, built against Parquet 1.11?
> >
> > Please accept my apologies if this is heading out of scope for the Parquet
> > mailing list.
> >
> >    michael
> >
> >
> > > On Nov 20, 2019, at 10:00 AM, Michael Heuer <[email protected]> wrote:
> > >
> > > I am willing to do some benchmarking on genomic data at scale, but I am
> > > not quite sure what the Spark target version for 1.11.0 might be. Will
> > > Parquet 1.11.0 be compatible with Spark 2.4.x?
> > >
> > > Updating from 1.10.1 to 1.11.0 breaks at runtime in our build
> > >
> > > …
> > > D 0, localhost, executor driver): java.lang.NoClassDefFoundError: org/apache/parquet/schema/LogicalTypeAnnotation
> > >       at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > >       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > >       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > >       at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > >       at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > >       at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > >       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > >       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > >       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > >       at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > >       at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > >       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > >       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > >       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > >       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > >       at java.lang.Thread.run(Thread.java:748)
> > > Caused by: java.lang.ClassNotFoundException: org.apache.parquet.schema.LogicalTypeAnnotation
> > >       at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> > >       at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> > >       at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> > >       at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> > >
> > >   michael
> > >
> > >
> > >> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <[email protected]> wrote:
> > >>
> > >> Thanks, Fokko.
> > >>
> > >> Ryan, we haven't done such measurements yet. I'm afraid I won't have
> > >> enough time to do that in the next couple of weeks.
> > >>
> > >> Cheers,
> > >> Gabor
> > >>
> > >> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <[email protected]>
> > >> wrote:
> > >>
> > >>> Thanks Gabor for the explanation. I'd like to change my vote to +1
> > >>> (non-binding).
> > >>>
> > >>> Cheers, Fokko
> > >>>
> > >>> On Tue, Nov 19, 2019 at 18:03, Ryan Blue <[email protected]> wrote:
> > >>>
> > >>>> Gabor, what I meant was: have we tried this with real data to see the
> > >>>> effect? I think those results would be helpful.
> > >>>>
> > >>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <[email protected]>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Ryan,
> > >>>>>
> > >>>>> It is not easy to calculate. For the column indexes feature we
> > >>>>> introduced two new structures saved before the footer: column indexes
> > >>>>> and offset indexes. If the min/max values are not too long, the
> > >>>>> truncation might not decrease the file size because of the offset
> > >>>>> indexes. Moreover, we also introduced parquet.page.row.count.limit,
> > >>>>> which might increase the number of pages, which in turn increases the
> > >>>>> file size.
> > >>>>> The footer itself has also changed, and we are saving more values in
> > >>>>> it: the offsets of the column/offset indexes, the new logical type
> > >>>>> structures, and the CRC checksums (there might be some others).
> > >>>>> So, the size of files with a small amount of data will increase
> > >>>>> (because of the larger footer). The size of files whose values encode
> > >>>>> very well (RLE) will probably increase (because we will have more
> > >>>>> pages). The size of some files with long values (>64 bytes by default)
> > >>>>> might decrease because of the truncation of the min/max values.
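> > >>>>>
> > >>>>> To make that concrete, a small sketch of the knobs involved (property
> > >>>>> names as I recall them from ParquetOutputFormat; the values shown are
> > >>>>> the 1.11.0 defaults, so treat this as illustrative only):
> > >>>>>
> > >>>>>   import org.apache.hadoop.conf.Configuration;
> > >>>>>
> > >>>>>   public class PageLayoutKnobs {
> > >>>>>     public static void main(String[] args) {
> > >>>>>       Configuration conf = new Configuration();
> > >>>>>       // Cap on rows per page: a lower cap means more pages, so more
> > >>>>>       // page headers and larger column/offset indexes.
> > >>>>>       conf.setInt("parquet.page.row.count.limit", 20000);
> > >>>>>       // Min/max values longer than this are truncated in the column
> > >>>>>       // index; this is where the possible size decrease comes from.
> > >>>>>       conf.setInt("parquet.columnindex.truncate.length", 64);
> > >>>>>     }
> > >>>>>   }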
> > >>>>>
> > >>>>> Regards,
> > >>>>> Gabor
> > >>>>>
> > >>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <[email protected]>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Gabor, do we have an idea of the additional overhead for a non-test
> > >>>>>> data file? It should be easy to validate that this doesn't introduce
> > >>>>>> an unreasonable amount of overhead. In some cases, it should actually
> > >>>>>> be smaller, since the column indexes are truncated and page stats are
> > >>>>>> not.
> > >>>>>>
> > >>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
> > >>>>>> <[email protected]> wrote:
> > >>>>>>
> > >>>>>>> Hi Fokko,
> > >>>>>>>
> > >>>>>>> On the first point: the referenced constructor is private, and
> > >>>>>>> Iceberg uses it via reflection, so this is not a breaking change. I
> > >>>>>>> don't think parquet-mr should keep private methods around just
> > >>>>>>> because clients might use them via reflection.
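> > >>>>>>>
> > >>>>>>> To illustrate why reflective access is fragile, a self-contained,
> > >>>>>>> hypothetical example (the nested class below is a stand-in, not the
> > >>>>>>> real ColumnChunkPageWriteStore signature):
> > >>>>>>>
> > >>>>>>>   import java.lang.reflect.Constructor;
> > >>>>>>>
> > >>>>>>>   public class ReflectionFragility {
> > >>>>>>>     // Stand-in for a class whose private constructor changes its
> > >>>>>>>     // parameter list in a new release.
> > >>>>>>>     static class Store {
> > >>>>>>>       private Store(String path, int rowGroupSize) { }
> > >>>>>>>     }
> > >>>>>>>
> > >>>>>>>     public static void main(String[] args) throws Exception {
> > >>>>>>>       // A lookup written against the old parameter list keeps
> > >>>>>>>       // working only while that list is unchanged; after a change
> > >>>>>>>       // it throws NoSuchMethodException at runtime, with no
> > >>>>>>>       // compile-time hint.
> > >>>>>>>       Constructor<Store> ctor =
> > >>>>>>>           Store.class.getDeclaredConstructor(String.class, int.class);
> > >>>>>>>       ctor.setAccessible(true);  // bypass the private modifier
> > >>>>>>>       System.out.println(ctor.newInstance("/tmp/f.parquet", 128));
> > >>>>>>>     }
> > >>>>>>>   }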
> > >>>>>>>
> > >>>>>>> About the checksum: I agreed to having CRC checksum writing enabled
> > >>>>>>> by default because the benchmarks did not show significant
> > >>>>>>> performance penalties. See
> > >>>>>>> https://github.com/apache/parquet-mr/pull/647 for details.
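> > >>>>>>>
> > >>>>>>> For anyone whose workload does regress, writing checksums can be
> > >>>>>>> turned off per job. A minimal sketch, assuming the 1.11.0 property
> > >>>>>>> name parquet.page.write-checksum.enabled (please verify it against
> > >>>>>>> the release):
> > >>>>>>>
> > >>>>>>>   import org.apache.hadoop.conf.Configuration;
> > >>>>>>>
> > >>>>>>>   public class DisablePageChecksums {
> > >>>>>>>     public static void main(String[] args) {
> > >>>>>>>       Configuration conf = new Configuration();
> > >>>>>>>       // Skip computing page-level CRC checksums on write.
> > >>>>>>>       conf.setBoolean("parquet.page.write-checksum.enabled", false);
> > >>>>>>>     }
> > >>>>>>>   }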
> > >>>>>>>
> > >>>>>>> About the file size change: 1.11.0 introduces column indexes and CRC
> > >>>>>>> checksums, removes the statistics from the page headers, and maybe
> > >>>>>>> includes other changes that impact file size. If only the file size
> > >>>>>>> is in question, I cannot see a breaking change here.
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Gabor
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko
> > >>>>>>> <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>>> Unfortunately, a -1 from my side (non-binding)
> > >>>>>>>>
> > >>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three things:
> > >>>>>>>>
> > >>>>>>>>  - We've broken backward compatibility of the constructor of
> > >>>>>>>>  ColumnChunkPageWriteStore
> > >>>>>>>>  <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>.
> > >>>>>>>>  This required a change
> > >>>>>>>>  <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176>
> > >>>>>>>>  to the code. This isn't a hard blocker, but in case there is a new
> > >>>>>>>>  RC, I've submitted a patch: https://github.com/apache/parquet-mr/pull/699
> > >>>>>>>>  - Related, and something we need to note in the changelog:
> > >>>>>>>>  checksums are now enabled by default
> > >>>>>>>>  <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54>.
> > >>>>>>>>  This will impact performance. I would suggest disabling it by
> > >>>>>>>>  default: https://github.com/apache/parquet-mr/pull/700
> > >>>>>>>>  <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
> > >>>>>>>>  - Binary compatibility. While updating Iceberg, I've noticed that
> > >>>>>>>>  the split-test was failing:
> > >>>>>>>>  <https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199>
> > >>>>>>>>  The two records are now divided over four Spark partitions.
> > >>>>>>>>  Something in the output has changed, since the files are bigger
> > >>>>>>>>  now. Does anyone have an idea of what changed, or a way to check
> > >>>>>>>>  this? The only thing I can think of is the checksum mentioned above.
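> > >>>>>>>>
> > >>>>>>>>  (One way to dig further, untested: compare the footers of the two
> > >>>>>>>>  files, e.g.
> > >>>>>>>>
> > >>>>>>>> $ parquet-tools meta /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > >>>>>>>> $ parquet-tools meta /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > >>>>>>>>
> > >>>>>>>>  and look at where the extra bytes go: page counts, index offsets,
> > >>>>>>>>  footer size.)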
> > >>>>>>>>
> > >>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
> > >>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09
> > >>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > >>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05
> > >>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > >>>>>>>>
> > >>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > >>>>>>>> id = 1
> > >>>>>>>> data = a
> > >>>>>>>>
> > >>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > >>>>>>>> id = 1
> > >>>>>>>> data = a
> > >>>>>>>>
> > >>>>>>>> A binary diff here:
> > >>>>>>>> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > >>>>>>>>
> > >>>>>>>> Cheers, Fokko
> > >>>>>>>>
> > >>>>>>>> On Sat, Nov 16, 2019 at 04:18, Junjie Chen <[email protected]> wrote:
> > >>>>>>>>
> > >>>>>>>>> +1
> > >>>>>>>>> Verified signature, checksum and ran mvn install successfully.
> > >>>>>>>>>
> > >>>>>>>>>> On Thu, Nov 14, 2019 at 2:05 PM, Wang, Yuming <[email protected]> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> +1
> > >>>>>>>>>> Tested Parquet 1.11.0 with the Spark SQL module: build/sbt
> > >>>>>>>>>> "sql/test-only" -Phadoop-3.2
> > >>>>>>>>>>
> > >>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <[email protected]> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>   Hi everyone,
> > >>>>>>>>>>
> > >>>>>>>>>>   I propose the following RC to be released as the official
> > >>>>>>>>>>   Apache Parquet 1.11.0 release.
> > >>>>>>>>>>
> > >>>>>>>>>>   The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
> > >>>>>>>>>>   * This corresponds to the tag: apache-parquet-1.11.0-rc7
> > >>>>>>>>>>   * https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
> > >>>>>>>>>>
> > >>>>>>>>>>   The release tarball, signature, and checksums are here:
> > >>>>>>>>>>   * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
> > >>>>>>>>>>
> > >>>>>>>>>>   You can find the KEYS file here:
> > >>>>>>>>>>   * https://apache.org/dist/parquet/KEYS
> > >>>>>>>>>>
> > >>>>>>>>>>   Binary artifacts are staged in Nexus here:
> > >>>>>>>>>>   * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > >>>>>>>>>>
> > >>>>>>>>>>   This release includes the changes listed at:
> > >>>>>>>>>>   https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
> > >>>>>>>>>>
> > >>>>>>>>>>   Please download, verify, and test.
> > >>>>>>>>>>
> > >>>>>>>>>>   Please vote in the next 72 hours.
> > >>>>>>>>>>
> > >>>>>>>>>>   [ ] +1 Release this as Apache Parquet 1.11.0
> > >>>>>>>>>>   [ ] +0
> > >>>>>>>>>>   [ ] -1 Do not release this because...
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> --
> > >>>>>> Ryan Blue
> > >>>>>> Software Engineer
> > >>>>>> Netflix
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Ryan Blue
> > >>>> Software Engineer
> > >>>> Netflix
> > >>>>
> > >>>
> > >
> >
> >
>


-- 
Ryan Blue
Software Engineer
Netflix
