Created https://issues.apache.org/jira/browse/PARQUET-1703 to track this.

Back to the RC. Anyone from the PMC willing to vote?

Cheers,
Gabor

On Mon, Nov 25, 2019 at 6:45 PM Ryan Blue <[email protected]> wrote:

> Gabor, good point about not being able to check new APIs. Comparing
> against the previous version would also allow us to get rid of temporary
> exclusions, like the one you pointed out for schema. It would be great to
> improve what we catch there.
>
> On Mon, Nov 25, 2019 at 1:56 AM Gabor Szadovszky <[email protected]> wrote:
>
> > Hi Ryan,
> >
> > It is a different topic, but I would like to reflect on it briefly.
> > I understand that 1.7.0 was the first Apache release. The problem with
> > doing the compatibility checks against 1.7.0 is that we can easily
> > introduce incompatibilities in APIs that were added after 1.7.0. For
> > example: adding a new class for public use in 1.8.0 and then removing it
> > in 1.9.0. The compatibility check would not discover this breaking
> > change. So I think a better approach would be to always compare against
> > the previous minor release (e.g. comparing 1.9.0 to 1.8.0, etc.).
> > As I've written before, even org/apache/parquet/schema/** is excluded
> > from the compatibility check. As far as I know, this is public API. So I
> > am not sure that only packages that are not part of the public API are
> > excluded.
> >
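> > For illustration, comparing against the previous minor release could look
> > roughly like the following japicmp-maven-plugin sketch (the plugin
> > coordinates are real, but the configuration below is illustrative and
> > not a copy of parquet-mr's actual semver setup):
> >
> > <plugin>
> >   <groupId>com.github.siom79.japicmp</groupId>
> >   <artifactId>japicmp-maven-plugin</artifactId>
> >   <version>0.14.3</version>
> >   <configuration>
> >     <oldVersion>
> >       <dependency>
> >         <groupId>org.apache.parquet</groupId>
> >         <artifactId>parquet-column</artifactId>
> >         <!-- baseline: the previous minor release, not a fixed 1.7.0 -->
> >         <version>1.10.1</version>
> >       </dependency>
> >     </oldVersion>
> >     <parameter>
> >       <excludes>
> >         <!-- only truly internal packages should be listed here -->
> >         <exclude>shaded.parquet.*</exclude>
> >       </excludes>
> >     </parameter>
> >   </configuration>
> >   <executions>
> >     <execution>
> >       <goals>
> >         <goal>cmp</goal>
> >       </goals>
> >     </execution>
> >   </executions>
> > </plugin>
> >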
> > Let's discuss this on the next parquet sync.
> >
> > Regards,
> > Gabor
> >
> > On Fri, Nov 22, 2019 at 6:20 PM Ryan Blue <[email protected]> wrote:
> >
> > > Gabor,
> > >
> > > 1.7.0 was the first version using the org.apache.parquet packages, so
> > > that's the correct base version for compatibility checks. The
> > > exclusions in the POM are classes that the Parquet community does not
> > > consider public. We rely on these checks to highlight binary
> > > incompatibilities, and then we discuss them on this list or in the dev
> > > sync. If the class is internal, we add an exclusion for it.
> > >
> > > I know you're familiar with this process since we've talked about it
> > > before. I also know that you'd rather have stricter binary
> > > compatibility, but until we have someone with the time to do some
> > > maintenance and build a public API module, I'm afraid that's what we
> > > have to work with.
> > >
> > > Michael,
> > >
> > > I hope the context above is helpful and explains why running a binary
> > > compatibility check tool will find incompatible changes. We allow
> > > binary incompatible changes to internal classes and modules, like
> > > parquet-common.
> > >
> > > On Fri, Nov 22, 2019 at 12:23 AM Gabor Szadovszky <[email protected]> wrote:
> > >
> > > > Ryan,
> > > > I would not trust our compatibility checks (semver) too much.
> > > > Currently, it is configured to compare our current version to 1.7.0.
> > > > It means anything that was added after 1.7.0 and then broken in a
> > > > later release won't be caught. In addition, many packages are
> > > > excluded from the check for various reasons. For example,
> > > > org/apache/parquet/schema/** is excluded, so if this really were an
> > > > API compatibility issue, we certainly wouldn't catch it.
> > > >
> > > > Michael,
> > > > It fails because of a NoSuchMethodError pointing to a method that is
> > > > newly introduced in 1.11. Both the caller and the callee are shipped
> > > > by parquet-mr. So, I'm quite sure it is a classpath issue. It seems
> > > > that the 1.11 version of the parquet-column jar is not on the
> > > > classpath.
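> > > >
> > > > If Spark's bundled jars are shadowing the application's, one possible
> > > > workaround (a sketch; jar names, paths, and the main class below are
> > > > illustrative) is to force the application classpath first:
> > > >
> > > > $ spark-submit \
> > > >     --conf spark.driver.userClassPathFirst=true \
> > > >     --conf spark.executor.userClassPathFirst=true \
> > > >     --jars parquet-column-1.11.0.jar,parquet-common-1.11.0.jar,parquet-avro-1.11.0.jar \
> > > >     --class com.example.MyApp my-app.jar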
> > > >
> > > >
> > > > On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer <[email protected]> wrote:
> > > >
> > > > > The dependency versions are consistent in our artifact
> > > > >
> > > > > $ mvn dependency:tree | grep parquet
> > > > > [INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
> > > > > [INFO] |     \- org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
> > > > > [INFO] |  +- org.apache.parquet:parquet-column:jar:1.11.0:compile
> > > > > [INFO] |  |  +- org.apache.parquet:parquet-common:jar:1.11.0:compile
> > > > > [INFO] |  |  \- org.apache.parquet:parquet-encoding:jar:1.11.0:compile
> > > > > [INFO] |  +- org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
> > > > > [INFO] |  |  +- org.apache.parquet:parquet-jackson:jar:1.11.0:compile
> > > > >
> > > > > The latter error
> > > > >
> > > > > Caused by: org.apache.spark.SparkException: Job aborted due to stage
> > > > > failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
> > > > > Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver):
> > > > > java.lang.NoSuchMethodError:
> > > > > org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > > >         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > > >
> > > > > occurs when I attempt to run via spark-submit on Spark 2.4.4
> > > > >
> > > > > $ spark-submit --version
> > > > > Welcome to
> > > > >       ____              __
> > > > >      / __/__  ___ _____/ /__
> > > > >     _\ \/ _ \/ _ `/ __/  '_/
> > > > >    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
> > > > >       /_/
> > > > >
> > > > > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM,
> > > 1.8.0_191
> > > > > Branch
> > > > > Compiled by user  on 2019-08-27T21:21:38Z
> > > > > Revision
> > > > > Url
> > > > > Type --help for more information.
> > > > >
> > > > >
> > > > >
> > > > > > On Nov 21, 2019, at 6:06 PM, Ryan Blue <[email protected]> wrote:
> > > > > >
> > > > > > Thanks for looking into it, Nandor. That doesn't sound like a
> > > > > > problem with Parquet, but a problem with the test environment,
> > > > > > since parquet-avro depends on a newer API method.
> > > > > >
> > > > > > On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar <[email protected]> wrote:
> > > > > >
> > > > > >> I'm not sure that this is a binary compatibility issue. The
> > > > > >> missing builder method was recently added in 1.11.0 with the
> > > > > >> introduction of the new logical type API, while the original
> > > > > >> version of this method (the one with a single OriginalType input
> > > > > >> parameter, called before by AvroSchemaConverter) is kept
> > > > > >> untouched. It seems to me that the Parquet versions on the Spark
> > > > > >> executor are mismatched: parquet-avro is on 1.11.0, but
> > > > > >> parquet-column is still on an older version.
> > > > > >>
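> > > > > >> A quick way to verify this (assuming a stock Spark 2.4.x
> > > > > >> install, which bundles Parquet 1.10.x jars) is to list what
> > > > > >> Spark itself ships:
> > > > > >>
> > > > > >> $ ls $SPARK_HOME/jars | grep parquet-
> > > > > >> parquet-column-1.10.1.jar
> > > > > >> parquet-common-1.10.1.jar
> > > > > >> ...
> > > > > >>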
> > > > > >> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <[email protected]> wrote:
> > > > > >>
> > > > > >>> Perhaps not strictly necessary to say, but if this particular
> > > > > >>> compatibility break between 1.10 and 1.11 was intentional, and
> > > > > >>> no other compatibility breaks are found, I would vote -1
> > > > > >>> (non-binding) on this RC so that we might go back and revisit
> > > > > >>> the changes to preserve compatibility.
> > > > > >>>
> > > > > >>> I am not sure there is presently enough motivation in the Spark
> > > > > >>> project for a release after 2.4.4 and before 3.0 in which to
> > > > > >>> bump the Parquet dependency version to 1.11.x.
> > > > > >>>
> > > > > >>>   michael
> > > > > >>>
> > > > > >>>
> > > > > >>>> On Nov 21, 2019, at 11:01 AM, Ryan Blue <[email protected]> wrote:
> > > > > >>>>
> > > > > >>>> Gabor, shouldn't Parquet be binary compatible for public APIs?
> > > > > >>>> From the stack trace, it looks like this 1.11.0 RC breaks
> > > > > >>>> binary compatibility in the type builders.
> > > > > >>>>
> > > > > >>>> Looks like this should have been caught by the binary
> > > > > >>>> compatibility checks.
> > > > > >>>>
> > > > > >>>> On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <[email protected]> wrote:
> > > > > >>>>
> > > > > >>>>> Hi Michael,
> > > > > >>>>>
> > > > > >>>>> Unfortunately, I don't have much experience with Spark. But
> > > > > >>>>> if Spark uses the parquet-mr library in an embedded way
> > > > > >>>>> (that's how Hive uses it), it is required to re-build Spark
> > > > > >>>>> with the 1.11 RC parquet-mr.
> > > > > >>>>>
> > > > > >>>>> Regards,
> > > > > >>>>> Gabor
> > > > > >>>>>
> > > > > >>>>> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <[email protected]> wrote:
> > > > > >>>>>
> > > > > >>>>>> It appears a provided-scope dependency on spark-sql that
> > > > > >>>>>> leaks old Parquet versions was causing the runtime error
> > > > > >>>>>> below. After including new parquet-column and parquet-hadoop
> > > > > >>>>>> compile-scope dependencies (in addition to parquet-avro,
> > > > > >>>>>> which we already have at compile scope), our build succeeds.
> > > > > >>>>>>
> > > > > >>>>>> https://github.com/bigdatagenomics/adam/pull/2232
> > > > > >>>>>>
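> > > > > >>>>>> (For reference, the fix amounts to pinning the Parquet
> > > > > >>>>>> modules explicitly; a sketch of the added dependencies, not
> > > > > >>>>>> a verbatim copy of our POM:)
> > > > > >>>>>>
> > > > > >>>>>> <dependency>
> > > > > >>>>>>   <groupId>org.apache.parquet</groupId>
> > > > > >>>>>>   <artifactId>parquet-column</artifactId>
> > > > > >>>>>>   <version>1.11.0</version>
> > > > > >>>>>> </dependency>
> > > > > >>>>>> <dependency>
> > > > > >>>>>>   <groupId>org.apache.parquet</groupId>
> > > > > >>>>>>   <artifactId>parquet-hadoop</artifactId>
> > > > > >>>>>>   <version>1.11.0</version>
> > > > > >>>>>> </dependency>
> > > > > >>>>>>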
> > > > > >>>>>> However, when running via spark-submit I run into a similar
> > > > > >>>>>> runtime error
> > > > > >>>>>>
> > > > > >>>>>> Caused by: java.lang.NoSuchMethodError:
> > > > > >>>>>> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
> > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
> > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
> > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
> > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
> > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
> > > > > >>>>>>       at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > > > >>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > > > >>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > > > >>>>>>       at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > > > >>>>>>       at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > > > >>>>>>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > > > >>>>>>       at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > > > >>>>>>       at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > > > >>>>>>       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > > > >>>>>>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > > > >>>>>>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > > > >>>>>>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > > > >>>>>>       at java.lang.Thread.run(Thread.java:748)
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>> Will bumping our library dependency version to 1.11 require
> > > > > >>>>>> a new version of Spark, built against Parquet 1.11?
> > > > > >>>>>>
> > > > > >>>>>> Please accept my apologies if this is heading out-of-scope
> > > > > >>>>>> for the Parquet mailing list.
> > > > > >>>>>>
> > > > > >>>>>>  michael
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>>> On Nov 20, 2019, at 10:00 AM, Michael Heuer <[email protected]> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>> I am willing to do some benchmarking on genomic data at
> > > > > >>>>>>> scale but am not quite sure what the Spark target version
> > > > > >>>>>>> for 1.11.0 might be. Will Parquet 1.11.0 be compatible with
> > > > > >>>>>>> Spark 2.4.x?
> > > > > >>>>>>>
> > > > > >>>>>>> Updating from 1.10.1 to 1.11.0 breaks at runtime in our
> > > > > >>>>>>> build
> > > > > >>>>>>>
> > > > > >>>>>>> …
> > > > > >>>>>>> (TID 0, localhost, executor driver): java.lang.NoClassDefFoundError:
> > > > > >>>>>>> org/apache/parquet/schema/LogicalTypeAnnotation
> > > > > >>>>>>>     at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > > > >>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > > > >>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > > > >>>>>>>     at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > > > >>>>>>>     at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > > > >>>>>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > > > >>>>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > > > >>>>>>>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > > > >>>>>>>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > > > >>>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > > > >>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > > > >>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > > > >>>>>>>     at java.lang.Thread.run(Thread.java:748)
> > > > > >>>>>>> Caused by: java.lang.ClassNotFoundException:
> > > > > >>>>>>> org.apache.parquet.schema.LogicalTypeAnnotation
> > > > > >>>>>>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> > > > > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> > > > > >>>>>>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> > > > > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> > > > > >>>>>>>
> > > > > >>>>>>> michael
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <[email protected]> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>> Thanks, Fokko.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Ryan, we did not do such measurements yet. I'm afraid I
> > > > > >>>>>>>> won't have enough time to do that in the next couple of
> > > > > >>>>>>>> weeks.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Cheers,
> > > > > >>>>>>>> Gabor
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <[email protected]> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> Thanks Gabor for the explanation. I'd like to change my
> > > > > >>>>>>>>> vote to +1 (non-binding).
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Cheers, Fokko
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On Tue, Nov 19, 2019 at 18:03, Ryan Blue <[email protected]> wrote:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>> Gabor, what I meant was: have we tried this with real
> > > > > >>>>>>>>>> data to see the effect? I think those results would be
> > > > > >>>>>>>>>> helpful.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <[email protected]> wrote:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> Hi Ryan,
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> It is not easy to calculate. For the column indexes
> > > > > >>>>>>>>>>> feature we introduced two new structures saved before
> > > > > >>>>>>>>>>> the footer: column indexes and offset indexes. If the
> > > > > >>>>>>>>>>> min/max values are not too long, then the truncation
> > > > > >>>>>>>>>>> might not decrease the file size because of the offset
> > > > > >>>>>>>>>>> indexes. Moreover, we also introduced
> > > > > >>>>>>>>>>> parquet.page.row.count.limit, which might increase the
> > > > > >>>>>>>>>>> number of pages, which in turn increases the file size.
> > > > > >>>>>>>>>>> The footer itself has also changed, and we are saving
> > > > > >>>>>>>>>>> more values in it: the offsets of the column/offset
> > > > > >>>>>>>>>>> indexes, the new logical type structures, the CRC
> > > > > >>>>>>>>>>> checksums (we might have some others).
> > > > > >>>>>>>>>>> So, the size of files with a small amount of data will
> > > > > >>>>>>>>>>> increase (because of the larger footer). The size of
> > > > > >>>>>>>>>>> files where the values can be encoded very well (RLE)
> > > > > >>>>>>>>>>> will probably increase (because we will have more
> > > > > >>>>>>>>>>> pages). The size of some files where the values are
> > > > > >>>>>>>>>>> long (>64 bytes by default) might decrease because of
> > > > > >>>>>>>>>>> truncating the min/max values.
> > > > > >>>>>>>>>>>
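> > > > > >>>>>>>>>>> (For workloads where the extra pages matter, the limit
> > > > > >>>>>>>>>>> is tunable, for example through the Hadoop configuration
> > > > > >>>>>>>>>>> that Spark forwards; the value shown is just the 1.11.0
> > > > > >>>>>>>>>>> default, not a recommendation:)
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> $ spark-submit \
> > > > > >>>>>>>>>>>     --conf spark.hadoop.parquet.page.row.count.limit=20000 \
> > > > > >>>>>>>>>>>     ...
> > > > > >>>>>>>>>>>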
> > > > > >>>>>>>>>>> Regards,
> > > > > >>>>>>>>>>> Gabor
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <[email protected]> wrote:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>> Gabor, do we have an idea of the additional overhead
> > > > > >>>>>>>>>>>> for a non-test data file? It should be easy to
> > > > > >>>>>>>>>>>> validate that this doesn't introduce an unreasonable
> > > > > >>>>>>>>>>>> amount of overhead. In some cases, it should actually
> > > > > >>>>>>>>>>>> be smaller since the column indexes are truncated and
> > > > > >>>>>>>>>>>> page stats are not.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky <[email protected]> wrote:
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Hi Fokko,
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> For the first point: the referenced constructor is
> > > > > >>>>>>>>>>>>> private, and Iceberg uses it via reflection. It is
> > > > > >>>>>>>>>>>>> not a breaking change. I think parquet-mr shall not
> > > > > >>>>>>>>>>>>> keep private methods only because clients might use
> > > > > >>>>>>>>>>>>> them via reflection.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> About the checksum: I've agreed on having the CRC
> > > > > >>>>>>>>>>>>> checksum write enabled by default because the
> > > > > >>>>>>>>>>>>> benchmarks did not show significant performance
> > > > > >>>>>>>>>>>>> penalties. See
> > > > > >>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/647 for
> > > > > >>>>>>>>>>>>> details.
> > > > > >>>>>>>>>>>>>
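> > > > > >>>>>>>>>>>>> (For anyone who wants to measure the impact
> > > > > >>>>>>>>>>>>> themselves, the write-side checksums can be switched
> > > > > >>>>>>>>>>>>> off per job; a sketch, assuming the
> > > > > >>>>>>>>>>>>> parquet.page.write-checksum.enabled property from the
> > > > > >>>>>>>>>>>>> CRC feature and Spark's spark.hadoop.* forwarding:)
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> $ spark-submit \
> > > > > >>>>>>>>>>>>>     --conf spark.hadoop.parquet.page.write-checksum.enabled=false \
> > > > > >>>>>>>>>>>>>     ...
> > > > > >>>>>>>>>>>>>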
> > > > > >>>>>>>>>>>>> About the file size change: 1.11.0 is introducing
> > > > > >>>>>>>>>>>>> column indexes and CRC checksums, removing the
> > > > > >>>>>>>>>>>>> statistics from the page headers, and maybe making
> > > > > >>>>>>>>>>>>> other changes that impact file size. If only file
> > > > > >>>>>>>>>>>>> size is in question, I cannot see a breaking change
> > > > > >>>>>>>>>>>>> here.
> > > > > >>>>>>>>>>>>>
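> > > > > >>>>>>>>>>>>> To see where the extra bytes in your two test files
> > > > > >>>>>>>>>>>>> go, comparing the footers is usually more telling
> > > > > >>>>>>>>>>>>> than a raw binary diff, e.g.:
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> $ parquet-tools meta /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > >>>>>>>>>>>>> $ parquet-tools meta /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > >>>>>>>>>>>>>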
> > > > > >>>>>>>>>>>>> Regards,
> > > > > >>>>>>>>>>>>> Gabor
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <[email protected]> wrote:
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Unfortunately, a -1 from my side (non-binding).
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> I've updated Iceberg to Parquet 1.11.0 and found three things:
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> - We've broken backward compatibility of the constructor of
> > > > > >>>>>>>>>>>>>> ColumnChunkPageWriteStore
> > > > > >>>>>>>>>>>>>> <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>.
> > > > > >>>>>>>>>>>>>> This required a change
> > > > > >>>>>>>>>>>>>> <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176>
> > > > > >>>>>>>>>>>>>> to the code. This isn't a hard blocker, but if there will be a
> > > > > >>>>>>>>>>>>>> new RC, I've submitted a patch:
> > > > > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/699
> > > > > >>>>>>>>>>>>>> - Related, and something we need to put in the changelog:
> > > > > >>>>>>>>>>>>>> checksums are enabled by default:
> > > > > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> > > > > >>>>>>>>>>>>>> This will impact performance. I would suggest disabling it by
> > > > > >>>>>>>>>>>>>> default: https://github.com/apache/parquet-mr/pull/700
> > > > > >>>>>>>>>>>>>> <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
> > > > > >>>>>>>>>>>>>> - Binary compatibility. While updating Iceberg, I've noticed
> > > > > >>>>>>>>>>>>>> that the split-test was failing:
> > > > > >>>>>>>>>>>>>> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> > > > > >>>>>>>>>>>>>> The two records are now divided over four Spark partitions.
> > > > > >>>>>>>>>>>>>> Something in the output has changed, since the files are bigger
> > > > > >>>>>>>>>>>>>> now. Does anyone have an idea what's changed, or a way to check
> > > > > >>>>>>>>>>>>>> this? The only thing I can think of is the checksum mentioned
> > > > > >>>>>>>>>>>>>> above.
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
> > > > > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09 /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05 /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > >>>>>>>>>>>>>> id = 1
> > > > > >>>>>>>>>>>>>> data = a
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > >>>>>>>>>>>>>> id = 1
> > > > > >>>>>>>>>>>>>> data = a
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> A binary diff here:
> > > > > >>>>>>>>>>>>>> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Cheers, Fokko
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> On Sat, Nov 16, 2019 at 04:18, Junjie Chen <[email protected]> wrote:
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> +1
> > > > > >>>>>>>>>>>>>>> Verified signature, checksum and ran mvn install
> > > > > >>>>>>>>>>>>>>> successfully.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> On Thu, Nov 14, 2019 at 2:05 PM, Wang, Yuming <[email protected]> wrote:
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> +1
> > > > > >>>>>>>>>>>>>>>> Tested Parquet 1.11.0 with the Spark SQL module:
> > > > > >>>>>>>>>>>>>>>> build/sbt "sql/test-only" -Phadoop-3.2
> > > > > >>>>>>>>>>>>>>> -Phadoop-3.2
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <[email protected]> wrote:
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Hi everyone,
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> I propose the following RC to be released as the
> > > > > >>>>>>>>>>>>>>>> official Apache Parquet 1.11.0 release.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
> > > > > >>>>>>>>>>>>>>>> * This corresponds to the tag: apache-parquet-1.11.0-rc7
> > > > > >>>>>>>>>>>>>>>> * https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> The release tarball, signature, and checksums are here:
> > > > > >>>>>>>>>>>>>>>> * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> You can find the KEYS file here:
> > > > > >>>>>>>>>>>>>>>> * https://apache.org/dist/parquet/KEYS
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Binary artifacts are staged in Nexus here:
> > > > > >>>>>>>>>>>>>>>> * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> This release includes the changes listed at:
> > > > > >>>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Please download, verify, and test.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Please vote in the next 72 hours.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> [ ] +1 Release this as Apache Parquet 1.11.0
> > > > > >>>>>>>>>>>>>>>> [ ] +0
> > > > > >>>>>>>>>>>>>>>> [ ] -1 Do not release this because...
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> --
> > > > > >>>>>>>>>>>> Ryan Blue
> > > > > >>>>>>>>>>>> Software Engineer
> > > > > >>>>>>>>>>>> Netflix
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> --
> > > > > >>>>>>>>>> Ryan Blue
> > > > > >>>>>>>>>> Software Engineer
> > > > > >>>>>>>>>> Netflix
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> --
> > > > > >>>> Ryan Blue
> > > > > >>>> Software Engineer
> > > > > >>>> Netflix
> > > > > >>>
> > > > > >>>
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Ryan Blue
> > > > > > Software Engineer
> > > > > > Netflix
> > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
