Clirr fails the binary compatibility check against 1.10.1

parquet-mr (HEAD detached at apache-parquet-1.11.0-rc7)
$ mvn clirr:check -DcomparisonArtifacts=1.10.1
…
[INFO] --- clirr-maven-plugin:2.6.1:check (default-cli) @ parquet-common ---
[INFO] artifact org.apache.parquet:parquet-common: checking for updates from jitpack.io
[INFO] artifact org.apache.parquet:parquet-common: checking for updates from central
[INFO] Comparing to version: 1.10.1
[ERROR] 7009: org.apache.parquet.bytes.ByteBufferInputStream: Accessibility of method 'public ByteBufferInputStream()' has been decreased from public to package
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Parquet MR 1.11.0:
[INFO]
[INFO] Apache Parquet MR .................................. SUCCESS [  2.052 s]
[INFO] Apache Parquet Format Structures ................... SUCCESS [  7.035 s]
[INFO] Apache Parquet Generator ........................... SUCCESS [  1.872 s]
[INFO] Apache Parquet Common .............................. FAILURE [  1.478 s]
...
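
(For reference, clirr error 7009 flags a visibility reduction: code compiled
against 1.10.1 may call the public no-arg constructor directly and will only
fail at runtime. A minimal probe sketch, with a hypothetical class name,
assuming only the 1.10.1 signature reported by clirr above:)

import org.apache.parquet.bytes.ByteBufferInputStream;

public class Probe {
  public static void main(String[] args) {
    // Compiles against parquet-common 1.10.1, where this constructor is
    // public; running against a jar that reduced it to package-private
    // fails with java.lang.IllegalAccessError.
    ByteBufferInputStream in = new ByteBufferInputStream();
    System.out.println(in);
  }
}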


> On Nov 22, 2019, at 2:23 AM, Gabor Szadovszky <[email protected]> wrote:
> 
> Ryan,
> I would not trust our compatibility checks (semver) too much. Currently, the
> check is configured to compare our current version to 1.7.0, which means
> anything that was added after 1.7.0 and then broken in a later release won't
> be caught. In addition, many packages are excluded from the check for
> various reasons. For example, org/apache/parquet/schema/** is excluded, so
> if this really were an API compatibility issue, we certainly wouldn't
> catch it.
> 
> Michael,
> It fails because of a NoSuchMethodError pointing to a method that is newly
> introduced in 1.11. Both the caller and the callee are shipped by
> parquet-mr, so I'm quite sure it is a classpath issue. It seems that the
> 1.11 version of the parquet-column jar is not on the classpath.
> 
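
(For illustration, the two overloads involved, per the stack traces below and
Nandor's description: a hedged sketch that compiles only against 1.11 and,
if an older parquet-column jar wins on the runtime classpath, fails exactly
as reported. The class name is hypothetical:)

import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.Types;

public class OverloadProbe {
  public static void main(String[] args) {
    // Old overload, kept untouched in 1.11: as(OriginalType)
    PrimitiveType oldStyle = Types.required(PrimitiveType.PrimitiveTypeName.BINARY)
        .as(OriginalType.UTF8)
        .named("old_style");
    // New overload, introduced in 1.11: as(LogicalTypeAnnotation).
    // Resolving this call against an older parquet-column throws
    // java.lang.NoSuchMethodError at runtime.
    PrimitiveType newStyle = Types.required(PrimitiveType.PrimitiveTypeName.BINARY)
        .as(LogicalTypeAnnotation.stringType())
        .named("new_style");
    System.out.println(oldStyle + " / " + newStyle);
  }
}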
> 
> On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer <[email protected]> wrote:
> 
>> The dependency versions are consistent in our artifact:
>> 
>> $ mvn dependency:tree | grep parquet
>> [INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
>> [INFO] |     \- org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
>> [INFO] |  +- org.apache.parquet:parquet-column:jar:1.11.0:compile
>> [INFO] |  |  +- org.apache.parquet:parquet-common:jar:1.11.0:compile
>> [INFO] |  |  \- org.apache.parquet:parquet-encoding:jar:1.11.0:compile
>> [INFO] |  +- org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
>> [INFO] |  |  +- org.apache.parquet:parquet-jackson:jar:1.11.0:compile
>> 
>> The latter error
>> 
>> Caused by: org.apache.spark.SparkException: Job aborted due to stage
>> failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task
>> 0.0 in stage 0.0 (TID 0, localhost, executor driver):
>> java.lang.NoSuchMethodError:
>> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
>>        at
>> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
>> 
>> occurs when I attempt to run via spark-submit on Spark 2.4.4
>> 
>> $ spark-submit --version
>> Welcome to
>>      ____              __
>>     / __/__  ___ _____/ /__
>>    _\ \/ _ \/ _ `/ __/  '_/
>>   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
>>      /_/
>> 
>> Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_191
>> Branch
>> Compiled by user  on 2019-08-27T21:21:38Z
>> Revision
>> Url
>> Type --help for more information.
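
(A quick way to confirm which jar actually serves the Parquet classes at
runtime, using only standard JDK calls; the class name is hypothetical:)

import org.apache.parquet.schema.Types;

public class WhichJar {
  public static void main(String[] args) {
    // Prints the location of the jar that loaded parquet-column's Types;
    // an old version here would explain the NoSuchMethodError above.
    System.out.println(
        Types.class.getProtectionDomain().getCodeSource().getLocation());
  }
}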
>> 
>> 
>> 
>>> On Nov 21, 2019, at 6:06 PM, Ryan Blue <[email protected]>
>>> wrote:
>>> 
>>> Thanks for looking into it, Nandor. That doesn't sound like a problem
>>> with Parquet, but a problem with the test environment, since parquet-avro
>>> depends on a newer API method.
>>> 
>>> On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar <[email protected]>
>>> wrote:
>>> 
>>>> I'm not sure that this is a binary compatibility issue. The missing
>>>> builder method was recently added in 1.11.0 with the introduction of the
>>>> new logical type API, while the original version of this method (the one
>>>> with a single OriginalType input parameter, called before by
>>>> AvroSchemaConverter) is kept untouched. It seems to me that the Parquet
>>>> versions on the Spark executor mismatch: parquet-avro is on 1.11.0, but
>>>> parquet-column is still on an older version.
>>>> 
>>>> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <[email protected]>
>>>> wrote:
>>>> 
>>>>> Perhaps not strictly necessary to say, but if this particular
>>>>> compatibility break between 1.10 and 1.11 was intentional, and no other
>>>>> compatibility breaks are found, I would vote -1 (non-binding) on this RC
>>>>> so that we might go back and revisit the changes to preserve
>>>>> compatibility.
>>>>> 
>>>>> I am not sure there is presently enough motivation in the Spark project
>>>>> for a release after 2.4.4 and before 3.0 in which to bump the Parquet
>>>>> dependency version to 1.11.x.
>>>>> 
>>>>>  michael
>>>>> 
>>>>> 
>>>>>> On Nov 21, 2019, at 11:01 AM, Ryan Blue <[email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>> Gabor, shouldn't Parquet be binary compatible for public APIs? From the
>>>>>> stack trace, it looks like this 1.11.0 RC breaks binary compatibility
>>>>>> in the type builders.
>>>>>>
>>>>>> Looks like this should have been caught by the binary compatibility
>>>>>> checks.
>>>>>> 
>>>>>> On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <[email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Michael,
>>>>>>> 
>>>>>>> Unfortunately, I don't have too much experience with Spark. But if
>>>>>>> Spark uses the parquet-mr library in an embedded way (that's how Hive
>>>>>>> uses it), it is required to re-build Spark with the 1.11 RC parquet-mr.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Gabor
>>>>>>> 
>>>>>>> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <[email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> It appears a provided-scope dependency on spark-sql that leaks old
>>>>>>>> parquet versions was causing the runtime error below. After including
>>>>>>>> new parquet-column and parquet-hadoop compile-scope dependencies (in
>>>>>>>> addition to parquet-avro, which we already have at compile scope),
>>>>>>>> our build succeeds.
>>>>>>>>
>>>>>>>> https://github.com/bigdatagenomics/adam/pull/2232
>>>>>>>>
>>>>>>>> However, when running via spark-submit I run into a similar runtime
>>>>>>>> error:
>>>>>>>>
>>>>>>>> Caused by: java.lang.NoSuchMethodError:
>>>>>>>> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
>>>>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
>>>>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
>>>>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
>>>>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
>>>>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
>>>>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
>>>>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
>>>>>>>>       at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>>>>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>>>>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>>>>>>>>       at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>>>>>>>>       at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>>>>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>>>>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>>>>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>>>>>>>>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>>>>>>>>       at org.apache.spark.scheduler.Task.run(Task.scala:123)
>>>>>>>>       at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>>>>>>>>       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>>>>>>>>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>>>>>>>>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>>>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>>>       at java.lang.Thread.run(Thread.java:748)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Will bumping our library dependency version to 1.11 require a new
>>>>>>>> version of Spark, built against Parquet 1.11?
>>>>>>>>
>>>>>>>> Please accept my apologies if this is heading out of scope for the
>>>>>>>> Parquet mailing list.
>>>>>>>> 
>>>>>>>> michael
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Nov 20, 2019, at 10:00 AM, Michael Heuer <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I am willing to do some benchmarking on genomic data at scale but am
>>>>>>>>> not quite sure what the Spark target version for 1.11.0 might be.
>>>>>>>>> Will Parquet 1.11.0 be compatible with Spark 2.4.x?
>>>>>>>>>
>>>>>>>>> Updating from 1.10.1 to 1.11.0 breaks at runtime in our build:
>>>>>>>>>
>>>>>>>>> …
>>>>>>>>> D 0, localhost, executor driver): java.lang.NoClassDefFoundError:
>>>>>>>>> org/apache/parquet/schema/LogicalTypeAnnotation
>>>>>>>>>     at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>>>>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>>>>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>>>>>>>>>     at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>>>>>>>>>     at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>>>>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>>>>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>>>>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>>>>>>>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>>>>>>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
>>>>>>>>>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>>>>>>>>>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>>>>>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>>>>>> Caused by: java.lang.ClassNotFoundException:
>>>>>>>>> org.apache.parquet.schema.LogicalTypeAnnotation
>>>>>>>>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>>>>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>>>>>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>>>>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>>>>>>> 
>>>>>>>>> michael
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Thanks, Fokko.
>>>>>>>>>> 
>>>>>>>>>> Ryan, we did not do such measurements yet. I'm afraid I won't have
>>>>>>>>>> enough time to do that in the next couple of weeks.
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> Gabor
>>>>>>>>>> 
>>>>>>>>>> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Thanks Gabor for the explanation. I'd like to change my vote to +1
>>>>>>>>>>> (non-binding).
>>>>>>>>>>> 
>>>>>>>>>>> Cheers, Fokko
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, 19 Nov 2019 at 18:03, Ryan Blue <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Gabor, what I meant was: have we tried this with real data to see
>>>>>>>>>>>> the effect? I think those results would be helpful.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It is not easy to calculate. For the column indexes feature we
>>>>>>>>>>>>> introduced two new structures saved before the footer: column
>>>>>>>>>>>>> indexes and offset indexes. If the min/max values are not too
>>>>>>>>>>>>> long, then the truncation might not decrease the file size,
>>>>>>>>>>>>> because of the offset indexes. Moreover, we also introduced
>>>>>>>>>>>>> parquet.page.row.count.limit, which might increase the number of
>>>>>>>>>>>>> pages, which leads to an increased file size.
>>>>>>>>>>>>> The footer itself has also changed and we are saving more values
>>>>>>>>>>>>> in it: the offset values to the column/offset indexes, the new
>>>>>>>>>>>>> logical type structures, the CRC checksums (we might have some
>>>>>>>>>>>>> others).
>>>>>>>>>>>>> So, the size of files with a small amount of data will be
>>>>>>>>>>>>> increased (because of the larger footer). The size of files
>>>>>>>>>>>>> where the values can be encoded very well (RLE) will probably be
>>>>>>>>>>>>> increased (because we will have more pages). The size of some
>>>>>>>>>>>>> files where the values are long (>64 bytes by default) might be
>>>>>>>>>>>>> decreased because of truncating the min/max values.
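
(A sketch of tuning the page row count limit Gabor mentions, assuming the
1.11 ParquetProperties builder method behind parquet.page.row.count.limit;
the chosen value is only an example:)

import org.apache.parquet.column.ParquetProperties;

public class PageLimitExample {
  public static void main(String[] args) {
    // Raising the per-page row limit trades column-index granularity for
    // fewer, larger pages, which can reduce the file-size overhead for
    // well-compressed (e.g. RLE) columns.
    ParquetProperties props = ParquetProperties.builder()
        .withPageRowCountLimit(100_000)
        .build();
    System.out.println(props.getPageRowCountLimit());
  }
}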
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Gabor
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Gabor, do we have an idea of the additional overhead for a
>>>>>>>>>>>>>> non-test data file? It should be easy to validate that this
>>>>>>>>>>>>>> doesn't introduce an unreasonable amount of overhead. In some
>>>>>>>>>>>>>> cases, it should actually be smaller, since the column indexes
>>>>>>>>>>>>>> are truncated and page stats are not.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Fokko,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> For the first point: the referenced constructor is private and
>>>>>>>>>>>>>>> Iceberg uses it via reflection. It is not a breaking change. I
>>>>>>>>>>>>>>> think parquet-mr should not keep private methods only because
>>>>>>>>>>>>>>> clients might use them via reflection.
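
(For context, the fragile pattern in question looks roughly like this; a
hypothetical sketch, not Iceberg's actual code. It compiles without
referencing the private signature at all, so any change to it surfaces only
at runtime:)

import java.lang.reflect.Constructor;

public class ReflectiveAccess {
  public static void main(String[] args) throws Exception {
    Class<?> cls = Class.forName(
        "org.apache.parquet.hadoop.ColumnChunkPageWriteStore");
    // Grabs a declared (non-public) constructor by position; callers that
    // look one up by explicit parameter types instead will break at
    // runtime whenever the private signature changes.
    Constructor<?> ctor = cls.getDeclaredConstructors()[0];
    ctor.setAccessible(true);
    System.out.println(ctor);
  }
}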
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> About the checksum: I agreed on having the CRC checksum write
>>>>>>>>>>>>>>> enabled by default because the benchmarks did not show
>>>>>>>>>>>>>>> significant performance penalties. See
>>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/647 for details.
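
(If the write-time CRC turns out to matter for a particular job, it should
be switchable per configuration; a sketch assuming the Hadoop key
parquet.page.write-checksum.enabled added with that feature:)

import org.apache.hadoop.conf.Configuration;

public class DisableChecksums {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // 1.11 writes page-level CRC32 checksums by default; setting the key
    // to false restores the 1.10 behavior for writers using this conf.
    conf.setBoolean("parquet.page.write-checksum.enabled", false);
    System.out.println(conf.get("parquet.page.write-checksum.enabled"));
  }
}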
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> About the file size change: 1.11.0 introduces column indexes
>>>>>>>>>>>>>>> and CRC checksums, removes the statistics from the page
>>>>>>>>>>>>>>> headers, and maybe makes other changes that impact file size.
>>>>>>>>>>>>>>> If only the file size is in question, I cannot see a breaking
>>>>>>>>>>>>>>> change here.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Gabor
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Unfortunately, a -1 from my side (non-binding).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three things:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - We've broken backward compatibility of the constructor of
>>>>>>>>>>>>>>>> ColumnChunkPageWriteStore
>>>>>>>>>>>>>>>> <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>.
>>>>>>>>>>>>>>>> This required a change
>>>>>>>>>>>>>>>> <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176>
>>>>>>>>>>>>>>>> to the code. This isn't a hard blocker, but if there will be
>>>>>>>>>>>>>>>> a new RC, I've submitted a patch:
>>>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/699
>>>>>>>>>>>>>>>> - Related, and something we need to put in the changelog, is
>>>>>>>>>>>>>>>> that checksums are enabled by default:
>>>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
>>>>>>>>>>>>>>>> This will impact performance. I would suggest disabling it by
>>>>>>>>>>>>>>>> default: https://github.com/apache/parquet-mr/pull/700
>>>>>>>>>>>>>>>> <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
>>>>>>>>>>>>>>>> - Binary compatibility. While updating Iceberg, I've noticed
>>>>>>>>>>>>>>>> that the split test was failing:
>>>>>>>>>>>>>>>> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
>>>>>>>>>>>>>>>> The two records are now divided over four Spark partitions.
>>>>>>>>>>>>>>>> Something in the output has changed, since the files are
>>>>>>>>>>>>>>>> bigger now. Does anyone have an idea of what's changed, or a
>>>>>>>>>>>>>>>> way to check this? The only thing I can think of is the
>>>>>>>>>>>>>>>> checksum mentioned above.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
>>>>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09
>>>>>>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05
>>>>>>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>>>>>>>>>>> id = 1
>>>>>>>>>>>>>>>> data = a
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>>>>>>>>>>> id = 1
>>>>>>>>>>>>>>>> data = a
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> A binary diff here:
>>>>>>>>>>>>>>>> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
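
(One way to attribute the 562B vs 611B difference would be to dump both
footers and diff them; a sketch using the parquet-hadoop reader API, with
the file path passed as the first argument:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class FooterDump {
  public static void main(String[] args) throws Exception {
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
      // The footer carries row-group and column-chunk metadata; in 1.11 it
      // additionally stores column/offset index offsets and the new
      // logical type structures.
      ParquetMetadata footer = reader.getFooter();
      System.out.println(ParquetMetadata.toPrettyJSON(footer));
    }
  }
}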
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Cheers, Fokko
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Sat, 16 Nov 2019 at 04:18, Junjie Chen <[email protected]>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>>> Verified signature, checksum and ran mvn install successfully.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Thu, Nov 14, 2019 at 2:05 PM, Wang, Yuming <[email protected]>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>>>> Tested Parquet 1.11.0 with the Spark SQL module:
>>>>>>>>>>>>>>>>>> build/sbt "sql/test-only" -Phadoop-3.2
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <[email protected]>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I propose the following RC to be released as the official
>>>>>>>>>>>>>>>>>> Apache Parquet 1.11.0 release.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
>>>>>>>>>>>>>>>>>> * This corresponds to the tag: apache-parquet-1.11.0-rc7
>>>>>>>>>>>>>>>>>> * https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The release tarball, signature, and checksums are here:
>>>>>>>>>>>>>>>>>> * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You can find the KEYS file here:
>>>>>>>>>>>>>>>>>> * https://apache.org/dist/parquet/KEYS
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Binary artifacts are staged in Nexus here:
>>>>>>>>>>>>>>>>>> * https://repository.apache.org/content/groups/staging/org/apache/parquet/
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This release includes the changes listed at:
>>>>>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please download, verify, and test.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please vote in the next 72 hours.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [ ] +1 Release this as Apache Parquet 1.11.0
>>>>>>>>>>>>>>>>>> [ ] +0
>>>>>>>>>>>>>>>>>> [ ] -1 Do not release this because...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>> Netflix
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>> 
>> 
