Hi Michael,

Unfortunately, I don't have much experience with Spark. But if Spark uses the parquet-mr library in an embedded way (that is how Hive uses it), Spark would need to be rebuilt with the 1.11 RC parquet-mr.
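One way to confirm which parquet-mr actually wins on the classpath at runtime is to print where the Parquet classes are loaded from. A minimal, purely illustrative sketch (the class name is arbitrary), run on the driver or inside a task:

    import org.apache.parquet.hadoop.ParquetOutputFormat;
    import org.apache.parquet.schema.Types;

    public class WhichParquet {
      public static void main(String[] args) {
        // Prints the jar each class was loaded from. If these point at the jars
        // bundled with the Spark distribution rather than the 1.11.0 RC artifacts,
        // the old parquet-hadoop/parquet-column classes are shadowing the new ones.
        System.out.println("parquet-hadoop: " + ParquetOutputFormat.class
            .getProtectionDomain().getCodeSource().getLocation());
        System.out.println("parquet-column: " + Types.class
            .getProtectionDomain().getCodeSource().getLocation());
      }
    }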
Regards,
Gabor

On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <[email protected]> wrote:

> It appears a provided-scope dependency on spark-sql leaking old parquet versions was causing the runtime error below. After including new parquet-column and parquet-hadoop compile-scope dependencies (in addition to parquet-avro, which we already have at compile scope), our build succeeds.
>
> https://github.com/bigdatagenomics/adam/pull/2232
>
> However, when running via spark-submit I run into a similar runtime error:
>
> Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
>         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
>         at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
>         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
>         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
>         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
>         at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
>         at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
>         at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>         at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>         at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>         at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>         at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>         at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>         at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>         at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:123)
>         at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>
> Will bumping our library dependency version to 1.11 require a new version of Spark, built against Parquet 1.11?
>
> Please accept my apologies if this is heading out-of-scope for the Parquet mailing list.
>
> michael
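The method that cannot be resolved is the LogicalTypeAnnotation overload of the schema builder that was introduced in 1.11 and that AvroSchemaConverter now relies on; parquet-column 1.10.x only offers the OriginalType overload, so an old parquet-column resolving first produces exactly this NoSuchMethodError, or a NoClassDefFoundError for LogicalTypeAnnotation as in the stack trace quoted below. A minimal sketch of that kind of call, assuming parquet-column 1.11.0 on the classpath:

    import org.apache.parquet.schema.LogicalTypeAnnotation;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Type;
    import org.apache.parquet.schema.Types;

    public class LogicalTypeAnnotationCheck {
      public static void main(String[] args) {
        // In 1.11, schemas are built with LogicalTypeAnnotation; this is roughly the
        // kind of call made when converting an Avro string field. With a 1.10.x
        // parquet-column on the classpath this overload does not exist.
        Type field = Types.optional(PrimitiveTypeName.BINARY)
            .as(LogicalTypeAnnotation.stringType())
            .named("data");
        System.out.println(field);
      }
    }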
>
> > On Nov 20, 2019, at 10:00 AM, Michael Heuer <[email protected]> wrote:
> >
> > I am willing to do some benchmarking on genomic data at scale but am not quite sure what the Spark target version for 1.11.0 might be. Will Parquet 1.11.0 be compatible with Spark 2.4.x?
> >
> > Updating from 1.10.1 to 1.11.0 breaks at runtime in our build:
> >
> > …
> > D 0, localhost, executor driver): java.lang.NoClassDefFoundError: org/apache/parquet/schema/LogicalTypeAnnotation
> >         at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> >         at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> >         at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> >         at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> >         at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> >         at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> >         at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> >         at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> >         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> >         at org.apache.spark.scheduler.Task.run(Task.scala:123)
> >         at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> >         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >         at java.lang.Thread.run(Thread.java:748)
> > Caused by: java.lang.ClassNotFoundException: org.apache.parquet.schema.LogicalTypeAnnotation
> >         at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> >         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> >         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> >         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> >
> > michael
> >
> >> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <[email protected]> wrote:
> >>
> >> Thanks, Fokko.
> >>
> >> Ryan, we did not do such measurements yet. I'm afraid I won't have enough time to do that in the next couple of weeks.
> >>
> >> Cheers,
> >> Gabor
> >>
> >> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <[email protected]> wrote:
> >>
> >>> Thanks Gabor for the explanation. I'd like to change my vote to +1 (non-binding).
> >>>
> >>> Cheers, Fokko
> >>>
> >>> On Tue, Nov 19, 2019 at 18:03, Ryan Blue <[email protected]> wrote:
> >>>
> >>>> Gabor, what I meant was: have we tried this with real data to see the effect? I think those results would be helpful.
> >>>>
> >>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <[email protected]> wrote:
> >>>>
> >>>>> Hi Ryan,
> >>>>>
> >>>>> It is not easy to calculate. For the column indexes feature we introduced two new structures saved before the footer: column indexes and offset indexes. If the min/max values are not too long, then the truncation might not decrease the file size because of the offset indexes. Moreover, we also introduced parquet.page.row.count.limit, which might increase the number of pages and therefore the file size.
> >>>>> The footer itself has also changed and we are saving more values in it: the offsets of the column/offset indexes, the new logical type structures, the CRC checksums (we might have some others).
> >>>>> So, the size of files with a small amount of data will increase (because of the larger footer). The size of files where the values can be encoded very well (RLE) will probably increase (because we will have more pages). The size of some files where the values are long (>64 bytes by default) might decrease because of truncating the min/max values.
> >>>>>
> >>>>> Regards,
> >>>>> Gabor
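The knobs mentioned above can be experimented with when writing through ParquetOutputFormat (for example from a Spark or MapReduce job) by setting them on the Hadoop Configuration. A minimal sketch; only parquet.page.row.count.limit is named in this thread, while the truncation and checksum key names and the defaults shown are assumptions to be verified against the 1.11.0 sources:

    import org.apache.hadoop.conf.Configuration;

    public class ParquetWriteTuning {
      public static void main(String[] args) {
        // Configuration that would be handed to the job writing through ParquetOutputFormat.
        Configuration conf = new Configuration();

        // Upper bound on rows per page, introduced with column indexes
        // (the 20000 value shown as a default is an assumption).
        conf.setInt("parquet.page.row.count.limit", 20000);

        // Length at which column-index min/max values are truncated
        // (key name and 64-byte default are assumptions).
        conf.setInt("parquet.columnindex.truncate.length", 64);

        // Page-level CRC checksums, written by default in 1.11.0
        // (key name is an assumption); set to false to measure their cost.
        conf.setBoolean("parquet.page.write-checksum.enabled", false);

        System.out.println("parquet.page.row.count.limit = " + conf.get("parquet.page.row.count.limit"));
      }
    }

Writing the same real dataset with different values of these settings, against a 1.10.1 baseline, would give the kind of numbers asked about in the quoted messages below.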
> >>>>>
> >>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <[email protected]> wrote:
> >>>>>
> >>>>>> Gabor, do we have an idea of the additional overhead for a non-test data file? It should be easy to validate that this doesn't introduce an unreasonable amount of overhead. In some cases, it should actually be smaller since the column indexes are truncated and page stats are not.
> >>>>>>
> >>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi Fokko,
> >>>>>>>
> >>>>>>> For the first point: the referenced constructor is private and Iceberg uses it via reflection. It is not a breaking change. I think parquet-mr should not keep private methods around only because clients might use them via reflection.
> >>>>>>>
> >>>>>>> About the checksum: I've agreed on having the CRC checksum write enabled by default because the benchmarks did not show significant performance penalties. See https://github.com/apache/parquet-mr/pull/647 for details.
> >>>>>>>
> >>>>>>> About the file size change: 1.11.0 introduces column indexes and CRC checksums, removes the statistics from the page headers, and has maybe other changes that impact file size. If only the file size is in question, I cannot see a breaking change here.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Gabor
> >>>>>>>
> >>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Unfortunately, a -1 from my side (non-binding).
> >>>>>>>>
> >>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three things:
> >>>>>>>>
> >>>>>>>> - We've broken backward compatibility of the constructor of ColumnChunkPageWriteStore
> >>>>>>>>   <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>.
> >>>>>>>>   This required a change
> >>>>>>>>   <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176>
> >>>>>>>>   to the code.
> >>>>>>>>   This isn't a hard blocker, but if there will be a new RC, I've submitted a patch:
> >>>>>>>>   https://github.com/apache/parquet-mr/pull/699
> >>>>>>>>
> >>>>>>>> - Related, and something we need to put in the changelog: checksums are enabled by default:
> >>>>>>>>   https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> >>>>>>>>   This will impact performance. I would suggest disabling it by default:
> >>>>>>>>   https://github.com/apache/parquet-mr/pull/700
> >>>>>>>>   <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
> >>>>>>>>
> >>>>>>>> - Binary compatibility. While updating Iceberg, I've noticed that the split-test was failing:
> >>>>>>>>   https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> >>>>>>>>   The two records are now divided over four Spark partitions. Something in the output has changed since the files are bigger now. Does anyone have an idea of what changed, or a way to check this? The only thing I can think of is the checksum mentioned above.
> >>>>>>>>
> >>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
> >>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09 /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> >>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05 /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> >>>>>>>>
> >>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> >>>>>>>> id = 1
> >>>>>>>> data = a
> >>>>>>>>
> >>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> >>>>>>>> id = 1
> >>>>>>>> data = a
> >>>>>>>>
> >>>>>>>> A binary diff here:
> >>>>>>>> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> >>>>>>>>
> >>>>>>>> Cheers, Fokko
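One way to see how much of the 562B-to-611B difference is footer metadata (logical type structures, column/offset index offsets) rather than page data is to read the 4-byte footer-length field that precedes the trailing PAR1 magic. A minimal, JDK-only sketch:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class FooterSize {
      public static void main(String[] args) throws IOException {
        // A Parquet file ends with <serialized footer><4-byte little-endian footer length><"PAR1">,
        // so the last 8 bytes are enough to report the serialized footer size of each file.
        for (String name : args) {
          try (RandomAccessFile file = new RandomAccessFile(name, "r")) {
            byte[] tail = new byte[8];
            file.seek(file.length() - tail.length);
            file.readFully(tail);
            int footerLength = ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
            System.out.println(name + ": footer length = " + footerLength + " bytes");
          }
        }
      }
    }

Run as, for example: java FooterSize parquet-1-10-1.parquet parquet-1-11-0.parquet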
> >>>>>>>>
> >>>>>>>> On Sat, Nov 16, 2019 at 04:18, Junjie Chen <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> +1
> >>>>>>>>> Verified signature, checksum and ran mvn install successfully.
> >>>>>>>>>
> >>>>>>>>> Wang, Yuming <[email protected]> wrote on Thu, Nov 14, 2019 at 2:05 PM:
> >>>>>>>>>
> >>>>>>>>>> +1
> >>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module: build/sbt "sql/test-only" -Phadoop-3.2
> >>>>>>>>>>
> >>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi everyone,
> >>>>>>>>>>
> >>>>>>>>>> I propose the following RC to be released as the official Apache Parquet 1.11.0 release.
> >>>>>>>>>>
> >>>>>>>>>> The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
> >>>>>>>>>> * This corresponds to the tag: apache-parquet-1.11.0-rc7
> >>>>>>>>>> * https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
> >>>>>>>>>>
> >>>>>>>>>> The release tarball, signature, and checksums are here:
> >>>>>>>>>> * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
> >>>>>>>>>>
> >>>>>>>>>> You can find the KEYS file here:
> >>>>>>>>>> * https://apache.org/dist/parquet/KEYS
> >>>>>>>>>>
> >>>>>>>>>> Binary artifacts are staged in Nexus here:
> >>>>>>>>>> * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> >>>>>>>>>>
> >>>>>>>>>> This release includes the changes listed at:
> >>>>>>>>>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
> >>>>>>>>>>
> >>>>>>>>>> Please download, verify, and test.
> >>>>>>>>>>
> >>>>>>>>>> Please vote in the next 72 hours.
> >>>>>>>>>>
> >>>>>>>>>> [ ] +1 Release this as Apache Parquet 1.11.0
> >>>>>>>>>> [ ] +0
> >>>>>>>>>> [ ] -1 Do not release this because...
> >>>>>>
> >>>>>> --
> >>>>>> Ryan Blue
> >>>>>> Software Engineer
> >>>>>> Netflix
> >>>>
> >>>> --
> >>>> Ryan Blue
> >>>> Software Engineer
> >>>> Netflix
