Perhaps not strictly necessary to say, but if this particular compatibility break between 1.10 and 1.11 was intentional, and no other compatibility breaks are found, I would vote -1 (non-binding) on this RC so that we might go back and revisit the changes to preserve compatibility.
I am not sure there is presently enough motivation in the Spark project for a release after 2.4.4 and before 3.0 in which to bump the Parquet dependency version to 1.11.x.

michael

On Nov 21, 2019, at 11:01 AM, Ryan Blue <[email protected]> wrote:

Gabor, shouldn't Parquet be binary compatible for public APIs? From the stack trace, it looks like this 1.11.0 RC breaks binary compatibility in the type builders.

Looks like this should have been caught by the binary compatibility checks.

--
Ryan Blue
Software Engineer
Netflix

On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <[email protected]> wrote:

Hi Michael,

Unfortunately, I don't have much experience with Spark. But if Spark uses the parquet-mr library in an embedded way (that is how Hive uses it), Spark must be re-built with the 1.11 RC parquet-mr.

Regards,
Gabor

On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <[email protected]> wrote:

It appears a provided-scope dependency on spark-sql was leaking old Parquet versions, causing the runtime error below. After including new parquet-column and parquet-hadoop compile-scope dependencies (in addition to parquet-avro, which we already have at compile scope), our build succeeds.

https://github.com/bigdatagenomics/adam/pull/2232

However, when running via spark-submit I run into a similar runtime error:

Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
        at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
        at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
        at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
        at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
        at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
        at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
        at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
        at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
        at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
        at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
        at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Will bumping our library dependency version to 1.11 require a new version of Spark, built against Parquet 1.11?

Please accept my apologies if this is heading out-of-scope for the Parquet mailing list.

michael

On Nov 20, 2019, at 10:00 AM, Michael Heuer <[email protected]> wrote:

I am willing to do some benchmarking on genomic data at scale but am not quite sure what the Spark target version for 1.11.0 might be. Will Parquet 1.11.0 be compatible with Spark 2.4.x?

Updating from 1.10.1 to 1.11.0 breaks at runtime in our build:

…D 0, localhost, executor driver): java.lang.NoClassDefFoundError: org/apache/parquet/schema/LogicalTypeAnnotation
        at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
        at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
        at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
        at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.parquet.schema.LogicalTypeAnnotation
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

michael

On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <[email protected]> wrote:

Thanks, Fokko.

Ryan, we did not do such measurements yet. I'm afraid I won't have enough time to do that in the next couple of weeks.

Cheers,
Gabor

On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <[email protected]> wrote:

Thanks Gabor for the explanation. I'd like to change my vote to +1 (non-binding).

Cheers, Fokko
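To connect the traces above with Ryan's point about the type builders, here is a minimal, hypothetical Java sketch (not code from the ADAM build) of the kind of schema construction AvroSchemaConverter performs. Compiled against parquet-column 1.11.0, the call site is linked to Types$Builder.as(LogicalTypeAnnotation) — a method and class that do not exist in 1.10.x, where only the OriginalType-based as() is available — so it fails whenever an older parquet-column jar wins on the runtime classpath.

import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class LogicalTypeSmokeTest {
  public static void main(String[] args) {
    // 1.11-style logical type annotation. Running this class with a 1.10.x
    // parquet-column jar on the classpath reproduces the NoSuchMethodError /
    // NoClassDefFoundError seen in the traces above.
    MessageType schema = Types.buildMessage()
        .required(PrimitiveTypeName.BINARY)
        .as(LogicalTypeAnnotation.stringType())
        .named("data")
        .named("record");
    System.out.println(schema);
  }
}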
On Tue, Nov 19, 2019 at 18:03, Ryan Blue <[email protected]> wrote:

Gabor, what I meant was: have we tried this with real data to see the effect? I think those results would be helpful.

On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <[email protected]> wrote:

Hi Ryan,

It is not easy to calculate. For the column indexes feature we introduced two new structures saved before the footer: column indexes and offset indexes. If the min/max values are not too long, then the truncation might not decrease the file size, because of the offset indexes. Moreover, we also introduced parquet.page.row.count.limit, which might increase the number of pages, which in turn increases the file size.
The footer itself has also changed, and we are saving more values in it: the offsets of the column/offset indexes, the new logical type structures, and the CRC checksums (we might have some others).
So, the size of files with a small amount of data will increase (because of the larger footer). The size of files whose values encode very well (RLE) will probably increase (because we will have more pages). The size of some files whose values are long (>64 bytes by default) might decrease, because of the truncation of the min/max values.

Regards,
Gabor

On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <[email protected]> wrote:

Gabor, do we have an idea of the additional overhead for a non-test data file? It should be easy to validate that this doesn't introduce an unreasonable amount of overhead. In some cases, it should actually be smaller, since the column indexes are truncated and page stats are not.

On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky <[email protected]> wrote:

Hi Fokko,

For the first point: the referenced constructor is private, and Iceberg uses it via reflection. It is not a breaking change. I think parquet-mr should not keep private methods around only because clients might use them via reflection.

About the checksum: I agreed to having the CRC checksum write enabled by default because the benchmarks did not show significant performance penalties. See https://github.com/apache/parquet-mr/pull/647 for details.

About the file size change: 1.11.0 introduces column indexes and CRC checksums, removes the statistics from the page headers, and makes perhaps other changes that impact file size. If only file size is in question, I cannot see a breaking change here.

Regards,
Gabor
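To illustrate Gabor's first point, here is a hedged sketch of the reflective pattern in question (the lookup code and parameter handling are invented for illustration; Iceberg's actual code is in the linked PR). Reflection binds to one exact constructor signature, so adding or reordering a parameter of a private constructor breaks reflective callers even though no public API changed.

import java.lang.reflect.Constructor;

public class ReflectiveCtorLookup {
  // Hypothetical: resolve a private constructor by its exact parameter list.
  static Object instantiate(String className, Class<?>[] paramTypes, Object... args)
      throws ReflectiveOperationException {
    Class<?> clazz = Class.forName(className);
    // getDeclaredConstructor matches the exact signature; if 1.11.0 changed
    // the parameter list, this throws NoSuchMethodException at runtime.
    Constructor<?> ctor = clazz.getDeclaredConstructor(paramTypes);
    ctor.setAccessible(true); // private: the library makes no promises here
    return ctor.newInstance(args);
  }
}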
On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <[email protected]> wrote:

Unfortunately, a -1 from my side (non-binding).

I've updated Iceberg to Parquet 1.11.0 and found three things:

- We've broken backward compatibility of the constructor of ColumnChunkPageWriteStore (https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80). This required a change to the code: https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176. This isn't a hard blocker, but if there will be a new RC, I've submitted a patch: https://github.com/apache/parquet-mr/pull/699
- Related, and something we need to put in the changelog, is that checksums are enabled by default: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54 (https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277). This will impact performance. I would suggest disabling it by default: https://github.com/apache/parquet-mr/pull/700
- Binary compatibility. While updating Iceberg, I noticed that the split-test was failing: https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199. The two records are now divided over four Spark partitions. Something in the output has changed, since the files are bigger now. Does anyone have an idea what's changed, or a way to check this? The only thing I can think of is the checksum mentioned above. See the config sketch after this message for the write-side knobs involved.

$ ls -lah ~/Desktop/parquet-1-1*
-rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09 /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
-rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05 /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet

$ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
id = 1
data = a

$ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
id = 1
data = a

A binary diff here: https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8

Cheers, Fokko

On Sat, Nov 16, 2019 at 04:18, Junjie Chen <[email protected]> wrote:

+1
Verified signature and checksum, and ran mvn install successfully.
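For anyone reproducing the file-size and checksum observations above, here is a sketch of the relevant write-side properties. The property names reflect a reading of the 1.11.0 sources (ParquetProperties/ParquetOutputFormat) and should be treated as assumptions to verify against the RC; only parquet.page.row.count.limit is named explicitly in this thread. The values shown are believed to be the 1.11.0 defaults.

import org.apache.hadoop.conf.Configuration;

public class WriteKnobs {
  public static Configuration parquet111Defaults() {
    Configuration conf = new Configuration();
    // Assumed name. New per-page row-count cap: more pages (and page
    // headers) for columns that RLE-encode well, one reason files can grow.
    conf.setInt("parquet.page.row.count.limit", 20000);
    // Assumed name. Column-index min/max truncation length (Gabor's
    // ">64 bytes by default"); long binary stats get truncated, which can
    // shrink files.
    conf.setInt("parquet.columnindex.truncate.length", 64);
    // Assumed name. Page-level CRC checksums, on by default in 1.11
    // (Fokko's PR #700 proposes flipping this default).
    conf.setBoolean("parquet.page.write-checksum.enabled", true);
    return conf;
  }
}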
On Thu, Nov 14, 2019 at 2:05 PM, Wang, Yuming <[email protected]> wrote:

+1
Tested Parquet 1.11.0 with the Spark SQL module: build/sbt "sql/test-only" -Phadoop-3.2

On 2019/11/13, 21:33, "Gabor Szadovszky" <[email protected]> wrote:

Hi everyone,

I propose the following RC to be released as the official Apache Parquet 1.11.0 release.

The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
* This corresponds to the tag: apache-parquet-1.11.0-rc7
* https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4

The release tarball, signature, and checksums are here:
* https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7

You can find the KEYS file here:
* https://apache.org/dist/parquet/KEYS

Binary artifacts are staged in Nexus here:
* https://repository.apache.org/content/groups/staging/org/apache/parquet/

This release includes the changes listed at:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md

Please download, verify, and test.

Please vote in the next 72 hours.

[ ] +1 Release this as Apache Parquet 1.11.0
[ ] +0
[ ] -1 Do not release this because...
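For the "download, verify, and test" step: signature checking is normally done with gpg against the KEYS file above, and checksum checking with a command-line sha512sum. As a small JVM-side sketch (assuming the dist area publishes SHA-512 sums alongside the tarball), the checksum can also be recomputed like this:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class Sha512Check {
  public static void main(String[] args) throws Exception {
    // args[0]: path to the downloaded release tarball.
    byte[] data = Files.readAllBytes(Paths.get(args[0]));
    byte[] digest = MessageDigest.getInstance("SHA-512").digest(data);
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b));
    }
    // Compare against the published checksum file from the dist URL above.
    System.out.println(hex);
  }
}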
