I am willing to do some benchmarking on genomic data at scale, but I am not 
sure what the target Spark version for 1.11.0 is. Will Parquet 1.11.0 be 
compatible with Spark 2.4.x?

Updating from 1.10.1 to 1.11.0 breaks our build at runtime:

…
D 0, localhost, executor driver): java.lang.NoClassDefFoundError: org/apache/parquet/schema/LogicalTypeAnnotation
        at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
        at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
        at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
        at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.parquet.schema.LogicalTypeAnnotation
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
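
From the trace, this looks like a mixed classpath rather than an API problem:
Spark 2.4.x bundles Parquet 1.10.x, while parquet-avro 1.11.0 references
org.apache.parquet.schema.LogicalTypeAnnotation, which only exists in
parquet-column 1.11.0. A minimal probe (my own sketch, not part of any project)
to see which parquet-column jar wins at runtime:

    // Probe the classpath: LogicalTypeAnnotation exists only in parquet-column
    // 1.11.0+, so ClassNotFoundException means an older jar (e.g. the one
    // bundled with Spark 2.4.x) shadows the 1.11.0 artifacts.
    public class ParquetClasspathProbe {
        public static void main(String[] args) {
            String name = "org.apache.parquet.schema.LogicalTypeAnnotation";
            try {
                Class<?> cls = Class.forName(name);
                System.out.println(name + " loaded from "
                    + cls.getProtectionDomain().getCodeSource().getLocation());
            } catch (ClassNotFoundException e) {
                System.out.println(name + " not found: parquet-column predates 1.11.0");
            }
        }
    }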

   michael


> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <[email protected]> wrote:
> 
> Thanks, Fokko.
> 
> Ryan, we did not do such measurements yet. I'm afraid I won't have enough
> time to do that in the next couple of weeks.
> 
> Cheers,
> Gabor
> 
> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <[email protected]> wrote:
> 
>> Thanks Gabor for the explanation. I'd like to change my vote to +1
>> (non-binding).
>> 
>> Cheers, Fokko
>> 
>> On Tue, Nov 19, 2019 at 18:03, Ryan Blue <[email protected]> wrote:
>> 
>>> Gabor, what I meant was: have we tried this with real data to see the
>>> effect? I think those results would be helpful.
>>> 
>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <[email protected]> wrote:
>>> 
>>>> Hi Ryan,
>>>> 
>>>> It is not easy to calculate. For the column indexes feature we introduced
>>>> two new structures saved before the footer: column indexes and offset
>>>> indexes. If the min/max values are not too long, the truncation might not
>>>> decrease the file size, because of the added offset indexes. Moreover, we
>>>> also introduced parquet.page.row.count.limit, which might increase the
>>>> number of pages and therefore the file size. The footer itself has also
>>>> changed, and we save more values in it: the offsets of the column/offset
>>>> indexes, the new logical type structures, and the CRC checksums (there
>>>> might be others). So, files with a small amount of data will grow (because
>>>> of the larger footer). Files whose values encode very well (RLE) will
>>>> probably grow (because there will be more pages). Some files with long
>>>> values (>64 bytes by default) might shrink, because the min/max values are
>>>> truncated.
>>>> 
>>>> Regards,
>>>> Gabor
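
For the benchmarking I mentioned, these all map to writer settings. A minimal
sketch of setting them through the Hadoop configuration; note that only
"parquet.page.row.count.limit" is named above, while the other two key names
are my assumption of the 1.11.0 names and should be checked against
ParquetOutputFormat before use:

    import org.apache.hadoop.conf.Configuration;

    // Sketch: the 1.11.0 write-path knobs Gabor describes, set via Hadoop config.
    public class ParquetWriteTuning {
        public static Configuration tunedConf() {
            Configuration conf = new Configuration();
            // Named above: caps the rows per page, so well-encoded (RLE) columns
            // get more, smaller pages -- one reason 1.11.0 files can grow.
            conf.setInt("parquet.page.row.count.limit", 20000);
            // Assumed key: min/max values longer than this are truncated in the
            // column indexes, which can shrink files with long values.
            conf.setInt("parquet.columnindex.truncate.length", 64);
            // Assumed key: page-level CRC checksums, written by default in 1.11.0.
            conf.setBoolean("parquet.page.write-checksum.enabled", true);
            return conf;
        }
    }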
>>>> 
>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <[email protected]> wrote:
>>>> 
>>>>> Gabor, do we have an idea of the additional overhead for a non-test data
>>>>> file? It should be easy to validate that this doesn't introduce an
>>>>> unreasonable amount of overhead. In some cases, it should actually be
>>>>> smaller, since the column indexes are truncated and page stats are not.
>>>>> 
>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky <[email protected]> wrote:
>>>>> 
>>>>>> Hi Fokko,
>>>>>> 
>>>>>> For the first point: the referenced constructor is private, and Iceberg
>>>>>> uses it via reflection. It is not a breaking change. I don't think
>>>>>> parquet-mr should keep private methods only because clients might use
>>>>>> them via reflection.
>>>>>> 
>>>>>> About the checksum: I agreed to having the CRC checksum write enabled by
>>>>>> default because the benchmarks did not show significant performance
>>>>>> penalties. See https://github.com/apache/parquet-mr/pull/647 for details.
>>>>>> 
>>>>>> About the file size change: 1.11.0 introduces column indexes and CRC
>>>>>> checksums, removes the statistics from the page headers, and possibly
>>>>>> includes other changes that impact file size. If only the file size is
>>>>>> in question, I cannot see a breaking change here.
>>>>>> 
>>>>>> Regards,
>>>>>> Gabor
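
To make the reflection point concrete, here is a minimal sketch of the fragile
pattern being discussed (illustrative only, not Iceberg's actual code). The
compiler never sees a reflective lookup, so a changed private signature fails
at runtime with NoSuchMethodException instead of at build time:

    import java.lang.reflect.Constructor;

    // Sketch: invoking a private constructor via reflection. The signature is
    // only checked at runtime, so upstream changes surface as
    // NoSuchMethodException rather than as a compile error.
    public class PrivateCtorAccess {
        static Object newInstance(String className, Class<?>[] argTypes, Object... args)
                throws ReflectiveOperationException {
            Constructor<?> ctor = Class.forName(className).getDeclaredConstructor(argTypes);
            ctor.setAccessible(true); // bypass the private modifier
            return ctor.newInstance(args);
        }
    }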
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <[email protected]> wrote:
>>>>>> 
>>>>>>> Unfortunately, a -1 from my side (non-binding)
>>>>>>> 
>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three things:
>>>>>>> 
>>>>>>>   - We've broken backward compatibility of the constructor of
>>>>>>>   ColumnChunkPageWriteStore
>>>>>>>   <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>.
>>>>>>>   This required a change
>>>>>>>   <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176>
>>>>>>>   to the code. This isn't a hard blocker, but if there will be a new RC,
>>>>>>>   I've submitted a patch: https://github.com/apache/parquet-mr/pull/699
>>>>>>>   - Related, and worth noting in the changelog: checksums are enabled by
>>>>>>>   default
>>>>>>>   <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54>.
>>>>>>>   This will impact performance. I would suggest disabling it by default:
>>>>>>>   https://github.com/apache/parquet-mr/pull/700
>>>>>>>   <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
>>>>>>>   - Binary compatibility. While updating Iceberg, I noticed that the
>>>>>>>   split-test was failing:
>>>>>>>   https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
>>>>>>>   The two records are now divided over four Spark partitions. Something
>>>>>>>   in the output has changed, since the files are bigger now. Does anyone
>>>>>>>   have an idea what changed, or a way to check this? The only thing I
>>>>>>>   can think of is the checksum mentioned above.
>>>>>>> 
>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09 /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05 /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>> 
>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>> id = 1
>>>>>>> data = a
>>>>>>> 
>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>> id = 1
>>>>>>> data = a
>>>>>>> 
>>>>>>> A binary diff here:
>>>>>>> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
>>>>>>> 
>>>>>>> Cheers, Fokko
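
One way to answer Fokko's question without a binary diff is to compare the two
footers. A rough sketch against the parquet-mr API (file names taken from the
listing above; per Gabor's explanation above, the 1.11.0 footer carries the
column/offset index offsets, the new logical type structures, and more):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    // Sketch: dump both footers as JSON and diff them by eye.
    public class FooterDiff {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            for (String f : new String[] {"parquet-1-10-1.parquet", "parquet-1-11-0.parquet"}) {
                try (ParquetFileReader reader = ParquetFileReader.open(
                        HadoopInputFile.fromPath(new Path(f), conf))) {
                    System.out.println(f + ":\n" + ParquetMetadata.toPrettyJSON(reader.getFooter()));
                }
            }
        }
    }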
>>>>>>> 
>>>>>>> On Sat, Nov 16, 2019 at 04:18, Junjie Chen <[email protected]> wrote:
>>>>>>> 
>>>>>>>> +1
>>>>>>>> Verified signature, checksum and ran mvn install successfully.
>>>>>>>> 
>>>>>>>>> Wang, Yuming <[email protected]> wrote on Thu, Nov 14, 2019 at 2:05 PM:
>>>>>>>>> 
>>>>>>>>> +1
>>>>>>>>> Tested Parquet 1.11.0 with the Spark SQL module: build/sbt "sql/test-only" -Phadoop-3.2
>>>>>>>>> 
>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>    Hi everyone,
>>>>>>>>> 
>>>>>>>>>    I propose the following RC to be released as the official Apache
>>>>>>>>>    Parquet 1.11.0 release.
>>>>>>>>> 
>>>>>>>>>    The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
>>>>>>>>>    * This corresponds to the tag: apache-parquet-1.11.0-rc7
>>>>>>>>>    * https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
>>>>>>>>> 
>>>>>>>>>    The release tarball, signature, and checksums are here:
>>>>>>>>>    * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
>>>>>>>>> 
>>>>>>>>>    You can find the KEYS file here:
>>>>>>>>>    * https://apache.org/dist/parquet/KEYS
>>>>>>>>> 
>>>>>>>>>    Binary artifacts are staged in Nexus here:
>>>>>>>>>    * https://repository.apache.org/content/groups/staging/org/apache/parquet/
>>>>>>>>> 
>>>>>>>>>    This release includes the changes listed at:
>>>>>>>>>    https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
>>>>>>>>> 
>>>>>>>>>    Please download, verify, and test.
>>>>>>>>> 
>>>>>>>>>    Please vote in the next 72 hours.
>>>>>>>>> 
>>>>>>>>>    [ ] +1 Release this as Apache Parquet 1.11.0
>>>>>>>>>    [ ] +0
>>>>>>>>>    [ ] -1 Do not release this because...
>>>>>>>>> 
>>>>>>>>> 
>>>>> 
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>> 
>> 
