Gabor, good point about not being able to check new APIs. Updating the previous version used by the compatibility check would also let us get rid of temporary exclusions, like the one you pointed out for the schema package. It would be great to improve what we catch there.
On Mon, Nov 25, 2019 at 1:56 AM Gabor Szadovszky <[email protected]> wrote:

> Hi Ryan,
>
> It is a different topic, but I would like to reflect on it shortly. I understand that 1.7.0 was the first Apache release. The problem with doing the compatibility checks against 1.7.0 is that we can easily add incompatibilities in APIs that were added after 1.7.0. For example: adding a new class for public use in 1.8.0, then removing it in 1.9.0. The compatibility check would not discover this breaking change. So I think a better approach would be to always compare to the previous minor release (e.g. comparing 1.9.0 to 1.8.0, etc.).
> As I've written before, even org/apache/parquet/schema/** is excluded from the compatibility check. As far as I know, this is public API. So I am not sure that only packages that are not part of the public API are excluded.
>
> Let's discuss this at the next Parquet sync.
>
> Regards,
> Gabor
>
> On Fri, Nov 22, 2019 at 6:20 PM Ryan Blue <[email protected]> wrote:
>
>> Gabor,
>>
>> 1.7.0 was the first version using the org.apache.parquet packages, so that's the correct base version for compatibility checks. The exclusions in the POM are classes that the Parquet community does not consider public. We rely on these checks to highlight binary incompatibilities, and then we discuss them on this list or in the dev sync. If the class is internal, we add an exclusion for it.
>>
>> I know you're familiar with this process since we've talked about it before. I also know that you'd rather have stricter binary compatibility, but until we have someone with the time to do some maintenance and build a public API module, I'm afraid that's what we have to work with.
>>
>> Michael,
>>
>> I hope the context above is helpful and explains why running a binary compatibility check tool will find incompatible changes: we allow binary-incompatible changes to internal classes and modules, like parquet-common.
>>
>> On Fri, Nov 22, 2019 at 12:23 AM Gabor Szadovszky <[email protected]> wrote:
>>
>>> Ryan,
>>> I would not trust our compatibility checks (semver) too much. Currently, they are configured to compare our current version to 1.7.0. That means anything added since 1.7.0 and then broken in a later release won't be caught. In addition, many packages are excluded from the check for different reasons. For example, org/apache/parquet/schema/** is excluded, so if this really were an API compatibility issue we certainly wouldn't catch it.
>>>
>>> Michael,
>>> It fails because of a NoSuchMethodError pointing to a method that is newly introduced in 1.11. Both the caller and the callee are shipped by parquet-mr, so I'm quite sure it is a classpath issue. It seems that the 1.11 version of the parquet-column jar is not on the classpath.
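For readers hitting the same error, a generic JVM check (not from the thread; the helper class name is only illustrative) can confirm a diagnosis like Gabor's by printing which jar a Parquet class is actually loaded from. Run it on the driver, or inside a task to see an executor's classpath:

    import org.apache.parquet.schema.Types;

    // Hypothetical helper, shown only to illustrate the check.
    public class WhichParquetColumn {
      public static void main(String[] args) {
        // The jar that provides the Types builders, e.g. .../parquet-column-1.10.1.jar
        // versus .../parquet-column-1.11.0.jar.
        System.out.println(Types.class.getProtectionDomain().getCodeSource().getLocation());
        // Implementation version from the jar manifest, if present (may be null).
        System.out.println(Types.class.getPackage().getImplementationVersion());
      }
    }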
>>> On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer <[email protected]> wrote:
>>>
>>>> The dependency versions are consistent in our artifact
>>>>
>>>> $ mvn dependency:tree | grep parquet
>>>> [INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
>>>> [INFO] |     \- org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
>>>> [INFO] |  +- org.apache.parquet:parquet-column:jar:1.11.0:compile
>>>> [INFO] |  |  +- org.apache.parquet:parquet-common:jar:1.11.0:compile
>>>> [INFO] |  |  \- org.apache.parquet:parquet-encoding:jar:1.11.0:compile
>>>> [INFO] |  +- org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
>>>> [INFO] |  |  +- org.apache.parquet:parquet-jackson:jar:1.11.0:compile
>>>>
>>>> The latter error
>>>>
>>>> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
>>>>     at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
>>>>
>>>> occurs when I attempt to run via spark-submit on Spark 2.4.4
>>>>
>>>> $ spark-submit --version
>>>> Welcome to
>>>>       ____              __
>>>>      / __/__  ___ _____/ /__
>>>>     _\ \/ _ \/ _ `/ __/  '_/
>>>>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
>>>>       /_/
>>>>
>>>> Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_191
>>>> Branch
>>>> Compiled by user on 2019-08-27T21:21:38Z
>>>> Revision
>>>> Url
>>>> Type --help for more information.
>>>>
>>>> On Nov 21, 2019, at 6:06 PM, Ryan Blue <[email protected]> wrote:
>>>>
>>>>> Thanks for looking into it, Nandor. That doesn't sound like a problem with Parquet, but a problem with the test environment since parquet-avro depends on a newer API method.
>>>>>
>>>>> On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar <[email protected]> wrote:
>>>>>
>>>>>> I'm not sure that this is a binary compatibility issue. The missing builder method was recently added in 1.11.0 with the introduction of the new logical type API, while the original version of this method (one with a single OriginalType input parameter, called before by AvroSchemaConverter) is kept untouched. It seems to me that the Parquet version on the Spark executor mismatches: parquet-avro is on 1.11.0, but parquet-column is still on an older version.
>>>>>>
>>>>>> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <[email protected]> wrote:
>>>>>>
>>>>>>> Perhaps not strictly necessary to say, but if this particular compatibility break between 1.10 and 1.11 was intentional, and no other compatibility breaks are found, I would vote -1 (non-binding) on this RC such that we might go back and revisit the changes to preserve compatibility.
>>>>>>>
>>>>>>> I am not sure there is presently enough motivation in the Spark project for a release after 2.4.4 and before 3.0 in which to bump the Parquet dependency version to 1.11.x.
>>>>>>>
>>>>>>> michael
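As an aside, a minimal sketch (not from the thread; the class name is illustrative) of the two builder overloads Nandor describes above. With parquet-column 1.11.0 on the classpath both calls work; with an older parquet-column at runtime, the LogicalTypeAnnotation overload does not exist and fails with exactly the NoSuchMethodError Michael reports:

    import org.apache.parquet.schema.LogicalTypeAnnotation;
    import org.apache.parquet.schema.OriginalType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Type;
    import org.apache.parquet.schema.Types;

    // Illustrative class, not part of parquet-mr.
    public class BuilderOverloads {
      public static void main(String[] args) {
        // Overload added in 1.11.0 with the logical type API; this is the method
        // AvroSchemaConverter now calls, and the one missing from older parquet-column jars.
        Type newStyle = Types.required(PrimitiveTypeName.BINARY)
            .as(LogicalTypeAnnotation.stringType())
            .named("name");

        // Pre-1.11 overload, kept untouched in 1.11.0.
        Type oldStyle = Types.required(PrimitiveTypeName.BINARY)
            .as(OriginalType.UTF8)
            .named("name");

        System.out.println(newStyle);
        System.out.println(oldStyle);
      }
    }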
>>>>>>>
>>>>>>> On Nov 21, 2019, at 11:01 AM, Ryan Blue <[email protected]> wrote:
>>>>>>>
>>>>>>>> Gabor, shouldn't Parquet be binary compatible for public APIs? From the stack trace, it looks like this 1.11.0 RC breaks binary compatibility in the type builders.
>>>>>>>>
>>>>>>>> Looks like this should have been caught by the binary compatibility checks.
>>>>>>>>
>>>>>>>> On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Michael,
>>>>>>>>>
>>>>>>>>> Unfortunately, I don't have too much experience with Spark. But if Spark uses the parquet-mr library in an embedded way (that's how Hive uses it), it is required to re-build Spark with the 1.11 RC parquet-mr.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Gabor
>>>>>>>>>
>>>>>>>>> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> It appears a provided-scope dependency on spark-sql leaking old Parquet versions was causing the runtime error below. After including new parquet-column and parquet-hadoop compile-scope dependencies (in addition to parquet-avro, which we already have at compile scope), our build succeeds.
>>>>>>>>>>
>>>>>>>>>> https://github.com/bigdatagenomics/adam/pull/2232
>>>>>>>>>>
>>>>>>>>>> However, when running via spark-submit I run into a similar runtime error
>>>>>>>>>>
>>>>>>>>>> Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
>>>>>>>>>>     at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
>>>>>>>>>>     at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
>>>>>>>>>>     at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
>>>>>>>>>>     at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
>>>>>>>>>>     at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
>>>>>>>>>>     at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
>>>>>>>>>>     at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
>>>>>>>>>>     at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>>>>>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>>>>>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>>>>>>>>>>     at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>>>>>>>>>>     at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>>>>>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>>>>>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>>>>>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>>>>>>>>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>>>>>>>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
>>>>>>>>>>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>>>>>>>>>>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>>>>>>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>>>>>>>
>>>>>>>>>> Will bumping our library dependency version to 1.11 require a new version of Spark, built against Parquet 1.11?
>>>>>>>>>>
>>>>>>>>>> Please accept my apologies if this is heading out-of-scope for the Parquet mailing list.
>>>>>>>>>>
>>>>>>>>>> michael
>>>>>>>>>>
>>>>>>>>>> On Nov 20, 2019, at 10:00 AM, Michael Heuer <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I am willing to do some benchmarking on genomic data at scale but am not quite sure what the Spark target version for 1.11.0 might be. Will Parquet 1.11.0 be compatible in Spark 2.4.x?
>>>>>>>>>>>
>>>>>>>>>>> Updating from 1.10.1 to 1.11.0 breaks at runtime in our build
>>>>>>>>>>>
>>>>>>>>>>> …
>>>>>>>>>>> D 0, localhost, executor driver): java.lang.NoClassDefFoundError: org/apache/parquet/schema/LogicalTypeAnnotation
>>>>>>>>>>>     at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>>>>>>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>>>>>>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>>>>>>>>>>>     at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>>>>>>>>>>>     at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>>>>>>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>>>>>>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>>>>>>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>>>>>>>>>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>>>>>>>>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
>>>>>>>>>>>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>>>>>>>>>>>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>>>>>>>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>>>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>>>>>>>> Caused by: java.lang.ClassNotFoundException: org.apache.parquet.schema.LogicalTypeAnnotation
>>>>>>>>>>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>>>>>>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>>>>>>>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>>>>>>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>>>>>>>>>
>>>>>>>>>>> michael
>>>>>>>>>>>
>>>>>>>>>>> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Fokko.
>>>>>>>>>>>>
>>>>>>>>>>>> Ryan, we did not do such measurements yet. I'm afraid I won't have enough time to do that in the next couple of weeks.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Gabor
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Gabor for the explanation. I'd like to change my vote to +1 (non-binding).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers, Fokko
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Nov 19, 2019 at 18:03, Ryan Blue <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Gabor, what I meant was: have we tried this with real data to see the effect? I think those results would be helpful.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It is not easy to calculate. For the column indexes feature we introduced two new structures saved before the footer: column indexes and offset indexes. If the min/max values are not too long, then the truncation might not decrease the file size because of the offset indexes. Moreover, we also introduced parquet.page.row.count.limit, which might increase the number of pages, which leads to increasing the file size.
>>>>>>>>>>>>>>> The footer itself has also changed and we are saving more values in it: the offset values to the column/offset indexes, the new logical type structures, the CRC checksums (we might have some others).
>>>>>>>>>>>>>>> So, the size of files with a small amount of data will be increased (because of the larger footer). The size of files where the values can be encoded very well (RLE) will probably be increased (because we will have more pages). The size of some files where the values are long (>64 bytes by default) might be decreased because of truncating the min/max values.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Gabor
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Gabor, do we have an idea of the additional overhead for a non-test data file? It should be easy to validate that this doesn't introduce an unreasonable amount of overhead. In some cases, it should actually be smaller since the column indexes are truncated and page stats are not.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Fokko,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For the first point: the referenced constructor is private and Iceberg uses it via reflection. It is not a breaking change. I think parquet-mr should not keep private methods only because clients might use them via reflection.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> About the checksum: I've agreed on having the CRC checksum write enabled by default because the benchmarks did not show significant performance penalties. See https://github.com/apache/parquet-mr/pull/647 for details.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> About the file size change: 1.11.0 is introducing column indexes, CRC checksums, removing the statistics from the page headers, and maybe other changes that impact file size. If only file size is in question, I cannot see a breaking change here.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Gabor
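For anyone who wants to experiment with the size and overhead trade-offs Gabor describes above, a small sketch of the writer knobs involved. Only parquet.page.row.count.limit is named in the thread; the other two property keys and the values shown are assumptions from memory and should be checked against the 1.11.0 sources before use:

    import org.apache.hadoop.conf.Configuration;

    // Illustrative helper only; property keys other than parquet.page.row.count.limit are assumed.
    public class WriterKnobs {
      public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // Named in the thread: caps the number of rows per page; more pages can mean
        // larger files, as Gabor notes for well-encoded (RLE) data.
        conf.setInt("parquet.page.row.count.limit", 20000);
        // Assumed key: page-level CRC checksums, write-enabled by default in 1.11.0.
        conf.setBoolean("parquet.page.write-checksum.enabled", false);
        // Assumed key: truncation length for column-index min/max values
        // (the thread mentions a 64-byte default).
        conf.setInt("parquet.columnindex.truncate.length", 64);
        return conf;
      }
    }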
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Unfortunately, a -1 from my side (non-binding)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three things:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - We've broken backward compatibility of the constructor of ColumnChunkPageWriteStore <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>. This required a change <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176> to the code. This isn't a hard blocker, but if there will be a new RC, I've submitted a patch: https://github.com/apache/parquet-mr/pull/699
>>>>>>>>>>>>>>>>>> - Related, and something we need to put in the changelog, is that checksums are enabled by default: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54 This will impact performance. I would suggest disabling it by default: https://github.com/apache/parquet-mr/pull/700 <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
>>>>>>>>>>>>>>>>>> - Binary compatibility. While updating Iceberg, I've noticed that the split-test was failing: https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199 The two records are now divided over four Spark partitions. Something in the output has changed since the files are bigger now. Has anyone any idea what's changed, or a way to check this? The only thing I can think of is the checksum mentioned above.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
>>>>>>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09 /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05 /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>>>>>>>>>>>>> id = 1
>>>>>>>>>>>>>>>>>> data = a
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>>>>>>>>>>>>> id = 1
>>>>>>>>>>>>>>>>>> data = a
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> A binary diff here: https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers, Fokko
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, Nov 16, 2019 at 04:18, Junjie Chen <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>>>>> Verified signature, checksum and ran mvn install successfully.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Nov 14, 2019 at 2:05 PM, Wang, Yuming <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module: build/sbt "sql/test-only" -Phadoop-3.2
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I propose the following RC to be released as official Apache Parquet 1.11.0 release.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
>>>>>>>>>>>>>>>>>>>>> * This corresponds to the tag: apache-parquet-1.11.0-rc7
>>>>>>>>>>>>>>>>>>>>> * https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The release tarball, signature, and checksums are here:
>>>>>>>>>>>>>>>>>>>>> * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> You can find the KEYS file here:
>>>>>>>>>>>>>>>>>>>>> * https://apache.org/dist/parquet/KEYS
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Binary artifacts are staged in Nexus here:
>>>>>>>>>>>>>>>>>>>>> * https://repository.apache.org/content/groups/staging/org/apache/parquet/
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> This release includes the changes listed at:
>>>>>>>>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Please download, verify, and test.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Please vote in the next 72 hours.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> [ ] +1 Release this as Apache Parquet 1.11.0
>>>>>>>>>>>>>>>>>>>>> [ ] +0
>>>>>>>>>>>>>>>>>>>>> [ ] -1 Do not release this because...

--
Ryan Blue
Software Engineer
Netflix
