Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Michael Heuer Wed, 20 Nov 2019 08:45:07 -0800

It appears that a provided-scope dependency on spark-sql, which leaks old
parquet versions, was causing the runtime error below.  After adding
parquet-column and parquet-hadoop as compile-scope dependencies (in addition to
parquet-avro, which we already have at compile scope), our build succeeds.


https://github.com/bigdatagenomics/adam/pull/2232 

However, when running via spark-submit I run into a similar runtime error:

Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
        at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
        at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
        at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
        at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
        at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
        at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
        at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
        at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
        at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
        at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
        at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
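
A NoSuchMethodError like this usually means the class was resolved from an
older jar than the one the caller was compiled against. Below is a minimal
sketch for checking which jar supplies each class at runtime. The class name
WhichJar and its locate helper are hypothetical names used only for
illustration (they are not part of Parquet or Spark), and the check is only
meaningful when run on the same classpath that spark-submit assembles.

```java
import java.security.CodeSource;

public class WhichJar {
    // Returns the location (typically a jar URL) a class was loaded from.
    // Hypothetical helper for illustration; not a Parquet or Spark API.
    public static String locate(String className) {
        try {
            Class<?> cls = Class.forName(className);
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                // Classes from the bootstrap loader have no code source.
                return "(bootstrap or unknown source)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Class names taken from the stack traces in this thread.
        String[] names = {
            "org.apache.parquet.schema.Types",
            "org.apache.parquet.schema.LogicalTypeAnnotation",
            "org.apache.parquet.avro.AvroSchemaConverter"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

If org.apache.parquet.schema.Types resolves to a jar shipped with the Spark
distribution rather than to the 1.11.0 artifacts, that would be consistent
with the error above.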


Will bumping our library dependency version to 1.11 require a new version of 
Spark, built against Parquet 1.11?

Please accept my apologies if this is heading out-of-scope for the Parquet 
mailing list.

   michael


> On Nov 20, 2019, at 10:00 AM, Michael Heuer <[email protected]> wrote:
> 
> I am willing to do some benchmarking on genomic data at scale but am not 
> quite sure what the Spark target version for 1.11.0 might be.  Will Parquet 
> 1.11.0 be compatible with Spark 2.4.x?
> 
> Updating from 1.10.1 to 1.11.0 breaks at runtime in our build
> 
> …
> D 0, localhost, executor driver): java.lang.NoClassDefFoundError: org/apache/parquet/schema/LogicalTypeAnnotation
>       at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>       at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>       at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>       at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>       at org.apache.spark.scheduler.Task.run(Task.scala:123)
>       at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ClassNotFoundException: org.apache.parquet.schema.LogicalTypeAnnotation
>       at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>       at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>       at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>       at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> 
>   michael
> 
> 
>> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <[email protected]> wrote:
>> 
>> Thanks, Fokko.
>> 
>> Ryan, we have not done such measurements yet. I'm afraid I won't have enough
>> time to do that in the next couple of weeks.
>> 
>> Cheers,
>> Gabor
>> 
>> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <[email protected]>
>> wrote:
>> 
>>> Thanks Gabor for the explanation. I'd like to change my vote to +1
>>> (non-binding).
>>> 
>>> Cheers, Fokko
>>> 
>>> On Tue, Nov 19, 2019 at 6:03 PM, Ryan Blue <[email protected]> wrote:
>>> 
>>>> Gabor, what I meant was: have we tried this with real data to see the
>>>> effect? I think those results would be helpful.
>>>> 
>>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <[email protected]>
>>>> wrote:
>>>> 
>>>>> Hi Ryan,
>>>>> 
>>>>> It is not easy to calculate. For the column indexes feature we introduced
>>>>> two new structures saved before the footer: column indexes and offset
>>>>> indexes. If the min/max values are not too long, then the truncation might
>>>>> not decrease the file size because of the offset indexes. Moreover, we also
>>>>> introduced parquet.page.row.count.limit, which might increase the number of
>>>>> pages and so increase the file size.
>>>>> The footer itself has also changed and we are saving more values in it: the
>>>>> offset values to the column/offset indexes, the new logical type
>>>>> structures, the CRC checksums (we might have some others).
>>>>> So, the size of files with a small amount of data will increase
>>>>> (because of the larger footer). The size of files where the values can
>>>>> be encoded very well (RLE) will probably increase (because we will have
>>>>> more pages). The size of some files where the values are long (>64 bytes by
>>>>> default) might decrease because of truncating the min/max values.
>>>>> 
>>>>> Regards,
>>>>> Gabor
>>>>> 
>>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Gabor, do we have an idea of the additional overhead for a non-test data
>>>>>> file? It should be easy to validate that this doesn't introduce an
>>>>>> unreasonable amount of overhead. In some cases, it should actually be
>>>>>> smaller since the column indexes are truncated and page stats are not.
>>>>>> 
>>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
>>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>>> Hi Fokko,
>>>>>>> 
>>>>>>> For the first point. The referenced constructor is private and Iceberg uses
>>>>>>> it via reflection. It is not a breaking change. I think parquet-mr should
>>>>>>> not keep private methods only because clients might use them via
>>>>>>> reflection.
>>>>>>> 
>>>>>>> About the checksum. I agreed to have CRC checksum writing enabled by
>>>>>>> default because the benchmarks did not show significant performance
>>>>>>> penalties. See https://github.com/apache/parquet-mr/pull/647 for details.
>>>>>>> 
>>>>>>> About the file size change. 1.11.0 is introducing column indexes, CRC
>>>>>>> checksum, removing the statistics from the page headers and maybe other
>>>>>>> changes that impact file size. If only file size is in question I cannot
>>>>>>> see a breaking change here.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Gabor
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <[email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Unfortunately, a -1 from my side (non-binding)
>>>>>>>> 
>>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three things:
>>>>>>>> 
>>>>>>>>  - We've broken backward compatibility of the constructor of
>>>>>>>>  ColumnChunkPageWriteStore
>>>>>>>>  <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>.
>>>>>>>>  This required a change
>>>>>>>>  <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176>
>>>>>>>>  to the code. This isn't a hard blocker, but in case there is a new RC,
>>>>>>>>  I've submitted a patch: https://github.com/apache/parquet-mr/pull/699
>>>>>>>>  - Related, and something we need to put in the changelog: checksums are
>>>>>>>>  enabled by default:
>>>>>>>>  https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
>>>>>>>>  This will impact performance. I would suggest disabling it by default:
>>>>>>>>  https://github.com/apache/parquet-mr/pull/700
>>>>>>>>  <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
>>>>>>>>  - Binary compatibility. While updating Iceberg, I've noticed that the
>>>>>>>>  split-test was failing:
>>>>>>>>  https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
>>>>>>>>  The two records are now divided over four Spark partitions. Something in the
>>>>>>>>  output has changed since the files are bigger now. Does anyone have an idea
>>>>>>>>  of what has changed, or a way to check this? The only thing I can think of
>>>>>>>>  is the checksum mentioned above.
>>>>>>>> 
>>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09 /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05 /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>>> 
>>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>>> id = 1
>>>>>>>> data = a
>>>>>>>> 
>>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>>> id = 1
>>>>>>>> data = a
>>>>>>>> 
>>>>>>>> A binary diff here:
>>>>>>>> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
>>>>>>>> 
>>>>>>>> Cheers, Fokko
>>>>>>>> 
>>>>>>>> On Sat, Nov 16, 2019 at 04:18, Junjie Chen <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> +1
>>>>>>>>> Verified signature, checksum and ran mvn install successfully.
>>>>>>>>> 
>>>>>>>>> On Thu, Nov 14, 2019 at 2:05 PM, Wang, Yuming <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>> +1
>>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module: build/sbt "sql/test-only" -Phadoop-3.2
>>>>>>>>>> 
>>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>>   Hi everyone,
>>>>>>>>>> 
>>>>>>>>>>   I propose the following RC to be released as official Apache Parquet
>>>>>>>>>>   1.11.0 release.
>>>>>>>>>> 
>>>>>>>>>>   The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
>>>>>>>>>>   * This corresponds to the tag: apache-parquet-1.11.0-rc7
>>>>>>>>>>   * https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
>>>>>>>>>> 
>>>>>>>>>>   The release tarball, signature, and checksums are here:
>>>>>>>>>>   * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
>>>>>>>>>> 
>>>>>>>>>>   You can find the KEYS file here:
>>>>>>>>>>   * https://apache.org/dist/parquet/KEYS
>>>>>>>>>> 
>>>>>>>>>>   Binary artifacts are staged in Nexus here:
>>>>>>>>>>   * https://repository.apache.org/content/groups/staging/org/apache/parquet/
>>>>>>>>>> 
>>>>>>>>>>   This release includes the changes listed at:
>>>>>>>>>>   https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
>>>>>>>>>> 
>>>>>>>>>>   Please download, verify, and test.
>>>>>>>>>> 
>>>>>>>>>>   Please vote in the next 72 hours.
>>>>>>>>>> 
>>>>>>>>>>   [ ] +1 Release this as Apache Parquet 1.11.0
>>>>>>>>>>   [ ] +0
>>>>>>>>>>   [ ] -1 Do not release this because...
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>> 
>>> 
>
