Re: [VOTE] Release Apache Parquet 1.11.0 RC7

2019-11-22 Thread Nandor Kollar
Michael,

Indeed, it seems that the Parquet versions are consistent at compile time.
However, the exception happens when one Parquet module calls a
method in another Parquet module: parquet-avro calls builder methods in
parquet-column. I can't see how this call could break with consistent
Parquet versions; Parquet wouldn't even build if that were the case.
Could you please check the classpath of the failing task? If you're running
Spark on YARN, you can get the logs via yarn logs -applicationId , and
you'll find the classpath near the beginning of the log file. Are the
Parquet artifact versions consistent there too?
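For what it's worth, a minimal sketch of that check from inside the job itself (this assumes the standard `java.class.path` system property reflects the task's effective classpath, which may not hold for every deployment mode):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ClasspathFilter {
    // Return the classpath entries whose path contains the given keyword.
    public static List<String> filter(String classpath, String keyword) {
        List<String> hits = new ArrayList<>();
        for (String entry : classpath.split(File.pathSeparator)) {
            if (entry.contains(keyword)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Logging this from a Spark task would show which parquet jars the
        // executor JVM actually sees.
        for (String e : filter(System.getProperty("java.class.path"), "parquet")) {
            System.out.println(e);
        }
    }
}
```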

Nandor

On Fri, Nov 22, 2019 at 6:20 PM Ryan Blue  wrote:

> Gabor,
>
> 1.7.0 was the first version using the org.apache.parquet packages, so
> that's the correct base version for compatibility checks. The exclusions in
> the POM are classes that the Parquet community does not consider public. We
> rely on these checks to highlight binary incompatibilities, and then we
> discuss them on this list or in the dev sync. If the class is internal, we
> add an exclusion for it.
>
> I know you're familiar with this process since we've talked about it
> before. I also know that you'd rather have more strict binary
> compatibility, but until we have someone with the time to do some
> maintenance and build a public API module, I'm afraid that's what we have
> to work with.
>
> Michael,
>
> I hope the context above is helpful and explains why running a binary
> compatibility check tool will find incompatible changes. We allow binary
> incompatible changes to internal classes and modules, like parquet-common.
>
> On Fri, Nov 22, 2019 at 12:23 AM Gabor Szadovszky 
> wrote:
>
> > Ryan,
> > I would not trust our compatibility checks (semver) too much. Currently,
> > they are configured to compare our current version to 1.7.0, which means
> > anything added since 1.7.0 and then broken in a later release won't be
> > caught. In addition, many packages are excluded from the check for various
> > reasons. For example, org/apache/parquet/schema/** is excluded, so if this
> > really were an API compatibility issue, we certainly wouldn't catch it.
> >
> > Michael,
> > It fails because of a NoSuchMethodError pointing to a method that is newly
> > introduced in 1.11. Both the caller and the callee are shipped by
> > parquet-mr, so I'm quite sure it is a classpath issue. It seems that the
> > 1.11 version of the parquet-column jar is not on the classpath.
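A generic JVM sketch for confirming which jar actually supplies a class at runtime (plain reflection, not a Parquet API; the class name passed in is just an example):

```java
import java.security.CodeSource;

public class JarLocator {
    // Return the location of the jar that loaded the named class, or a
    // marker when the class is missing or came from the bootstrap loader.
    public static String jarOf(String className) {
        try {
            CodeSource src = Class.forName(className)
                    .getProtectionDomain().getCodeSource();
            return src == null ? "(bootstrap or unknown)"
                               : src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not found)";
        }
    }

    public static void main(String[] args) {
        // Run inside the failing task, this would reveal whether an old
        // parquet-column jar shadows the 1.11.0 one.
        System.out.println(jarOf("org.apache.parquet.schema.Types"));
    }
}
```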
> >
> >
> > On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer  wrote:
> >
> > > The dependency versions are consistent in our artifact
> > >
> > > $ mvn dependency:tree | grep parquet
> > > [INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
> > > [INFO] | \-
> > > org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
> > > [INFO] |  +- org.apache.parquet:parquet-column:jar:1.11.0:compile
> > > [INFO] |  |  +- org.apache.parquet:parquet-common:jar:1.11.0:compile
> > > [INFO] |  |  \- org.apache.parquet:parquet-encoding:jar:1.11.0:compile
> > > [INFO] |  +- org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
> > > [INFO] |  |  +- org.apache.parquet:parquet-jackson:jar:1.11.0:compile
> > >
> > > The latter error
> > >
> > > Caused by: org.apache.spark.SparkException: Job aborted due to stage
> > > failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost
> > > task 0.0 in stage 0.0 (TID 0, localhost, executor driver):
> > > java.lang.NoSuchMethodError:
> > > org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > >     at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > >
> > > occurs when I attempt to run via spark-submit on Spark 2.4.4
> > >
> > > $ spark-submit --version
> > > Welcome to
> > >       ____              __
> > >      / __/__  ___ _____/ /__
> > >     _\ \/ _ \/ _ `/ __/  '_/
> > >    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
> > >       /_/
> > >
> > > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_191
> > > Branch
> > > Compiled by user  on 2019-08-27T21:21:38Z
> > > Revision
> > > Url
> > > Type --help for more information.
> > >
> > >
> > >
> > > > On Nov 21, 2019, at 6:06 PM, Ryan Blue  wrote:

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

2019-11-21 Thread Nandor Kollar
I'm not sure this is a binary compatibility issue. The missing builder
method was recently added in 1.11.0 with the introduction of the new
logical type API, while the original version of this method (the one with a
single OriginalType input parameter, called previously by AvroSchemaConverter)
was kept untouched. It seems to me that the Parquet versions on the Spark
executor are mismatched: parquet-avro is on 1.11.0, but parquet-column is
still on an older version.
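The failure mode can be shown in miniature: a binary that only has the old overload simply contains no method with the new descriptor, so a caller compiled against 1.11 fails at link time with NoSuchMethodError (reflection reports the same absence as NoSuchMethodException). The class and method names below are simplified stand-ins, not the real Parquet API:

```java
// Illustrative stand-ins only -- NOT the real Parquet classes.
class OriginalType {}
class LogicalTypeAnnotation {}

// Models the *old* (pre-1.11) parquet-column binary: only one `as` overload.
class OldPrimitiveBuilder {
    public OldPrimitiveBuilder as(OriginalType type) { return this; }
}

public class OverloadDemo {
    // True when the class has a public method `as` taking the given parameter.
    static boolean hasAsOverload(Class<?> builder, Class<?> param) {
        try {
            builder.getMethod("as", param);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The old overload resolves; the 1.11-only overload is absent, which
        // is what the JVM reports as NoSuchMethodError when parquet-avro 1.11
        // runs against an older parquet-column jar.
        System.out.println(hasAsOverload(OldPrimitiveBuilder.class, OriginalType.class));
        System.out.println(hasAsOverload(OldPrimitiveBuilder.class, LogicalTypeAnnotation.class));
    }
}
```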

On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer  wrote:

> Perhaps not strictly necessary to say, but if this particular
> compatibility break between 1.10 and 1.11 was intentional, and no other
> compatibility breaks are found, I would vote -1 (non-binding) on this RC
> such that we might go back and revisit the changes to preserve
> compatibility.
>
> I am not sure there is presently enough motivation in the Spark project
> for a release after 2.4.4 and before 3.0 in which to bump the Parquet
> dependency version to 1.11.x.
>
>michael
>
>
> > On Nov 21, 2019, at 11:01 AM, Ryan Blue 
> wrote:
> >
> > Gabor, shouldn't Parquet be binary compatible for public APIs? From the
> > stack trace, it looks like this 1.11.0 RC breaks binary compatibility in
> > the type builders.
> >
> > Looks like this should have been caught by the binary compatibility
> checks.
> >
> > On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky 
> wrote:
> >
> >> Hi Michael,
> >>
> >> Unfortunately, I don't have much experience with Spark. But if Spark uses
> >> the parquet-mr library in an embedded way (that's how Hive uses it), it is
> >> required to rebuild Spark with the 1.11 RC parquet-mr.
> >>
> >> Regards,
> >> Gabor
> >>
> >> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer 
> wrote:
> >>
> >>> It appears that a provided-scope dependency on spark-sql leaking old
> >>> Parquet versions was causing the runtime error below.  After including new
> >>> parquet-column and parquet-hadoop compile-scope dependencies (in addition
> >>> to parquet-avro, which we already have at compile scope), our build
> >>> succeeds.
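For reference, a POM fragment of the kind described above might look like the following (coordinates taken from this thread; whether compile scope is appropriate depends on how the artifact is deployed):

```xml
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-column</artifactId>
  <version>1.11.0</version>
</dependency>
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-hadoop</artifactId>
  <version>1.11.0</version>
</dependency>
```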
> >>>
> >>> https://github.com/bigdatagenomics/adam/pull/2232
> >>>
> >>> However, when running via spark-submit I run into a similar runtime
> error
> >>>
> >>> Caused by: java.lang.NoSuchMethodError:
> >>> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> >>>     at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> >>>     at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
> >>>     at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
> >>>     at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
> >>>     at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
> >>>     at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
> >>>     at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
> >>>     at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> >>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> >>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> >>>     at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> >>>     at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> >>>     at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> >>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> >>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> >>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> >>>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
> >>>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> >>>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> >>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> >>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >>>     at java.lang.Thread.run(Thread.java:748)
> 

[jira] [Resolved] (PARQUET-1445) Remove Files.java

2019-09-05 Thread Nandor Kollar (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1445.

Resolution: Fixed

> Remove Files.java
> -
>
> Key: PARQUET-1445
> URL: https://issues.apache.org/jira/browse/PARQUET-1445
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
> Attachments: PARQUET-1445.1.patch
>
>
> bq. TODO: Use java.nio.file.Files when Parquet is updated to Java 7
> https://github.com/apache/parquet-mr/blob/dc61e510126aaa1a95a46fe39bf1529f394147e9/parquet-common/src/main/java/org/apache/parquet/Files.java#L31



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (PARQUET-1445) Remove Files.java

2019-09-05 Thread Nandor Kollar (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1445:
--

Assignee: David Mollitor

> Remove Files.java
> -
>
> Key: PARQUET-1445
> URL: https://issues.apache.org/jira/browse/PARQUET-1445
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
> Attachments: PARQUET-1445.1.patch
>
>
> bq. TODO: Use java.nio.file.Files when Parquet is updated to Java 7
> https://github.com/apache/parquet-mr/blob/dc61e510126aaa1a95a46fe39bf1529f394147e9/parquet-common/src/main/java/org/apache/parquet/Files.java#L31





[jira] [Commented] (PARQUET-247) Add DATE mapping in ValidTypeMap of filter2

2019-09-05 Thread Nandor Kollar (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923149#comment-16923149
 ] 

Nandor Kollar commented on PARQUET-247:
---

Since PARQUET-201 was resolved by removing OriginalType from the type check, I 
don't think this Jira is an outstanding issue or a blocker for Hive Parquet PPD 
any more, hence I'm resolving it. Feel free to reopen if I was wrong.

> Add DATE mapping in ValidTypeMap of filter2
> ---
>
> Key: PARQUET-247
> URL: https://issues.apache.org/jira/browse/PARQUET-247
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Dong Chen
>Assignee: Dong Chen
>Priority: Major
>
> When Hive uses a Parquet filter predicate, the Date type is converted to 
> Integer. {{ValidTypeMap}} maps the class to the Parquet type, and it throws 
> an exception when checking the Date data type.
> We should add the mapping to support Date.





[jira] [Resolved] (PARQUET-1530) Remove Dependency on commons-codec

2019-09-05 Thread Nandor Kollar (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1530.

Resolution: Fixed

> Remove Dependency on commons-codec
> --
>
> Key: PARQUET-1530
> URL: https://issues.apache.org/jira/browse/PARQUET-1530
> Project: Parquet
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Assigned] (PARQUET-1641) Parquet pages for different columns cannot be read in parallel

2019-08-29 Thread Nandor Kollar (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1641:
--

Assignee: Samarth Jain

> Parquet pages for different columns cannot be read in parallel 
> ---
>
> Key: PARQUET-1641
> URL: https://issues.apache.org/jira/browse/PARQUET-1641
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Samarth Jain
>Assignee: Samarth Jain
>Priority: Major
>  Labels: pull-request-available
>
> All ColumnChunkPageReader instances use the same decompressor. 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1286]
> {code:java}
> BytesInputDecompressor decompressor =
>     options.getCodecFactory().getDecompressor(descriptor.metadata.getCodec());
> return new ColumnChunkPageReader(decompressor, pagesInChunk, dictionaryPage);
> {code}
> The CodecFactory caches the decompressors for every codec type, returning the 
> same instance on every getDecompressor(codecName) call. See the caching 
> happening here:
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L197]
> {code:java}
> @Override
> public BytesDecompressor getDecompressor(CompressionCodecName codecName) {
>   BytesDecompressor decomp = decompressors.get(codecName);
>   if (decomp == null) {
>     decomp = createDecompressor(codecName);
>     decompressors.put(codecName, decomp);
>   }
>   return decomp;
> }
> {code}
>  
> If multiple threads try to read pages belonging to different columns, they 
> run into thread-safety issues. This issue prevents increasing the throughput 
> at which applications can read Parquet data by parallelizing page reads.
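As a sketch of one possible direction (names are simplified stand-ins, not the parquet-mr API), the check-then-act race in the quoted cache can at least be made atomic with ConcurrentHashMap.computeIfAbsent, though sharing stateful decompressor instances across threads would still be unsafe:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Simplified stand-in for a codec-keyed decompressor cache (illustrative names).
public class DecompressorCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> factory;

    public DecompressorCache(Function<K, V> factory) {
        this.factory = factory;
    }

    // Atomic get-or-create: computeIfAbsent installs at most one value per
    // key even under concurrent calls, unlike the check-then-put sequence
    // quoted above.
    public V get(K key) {
        return cache.computeIfAbsent(key, factory);
    }
}
```

Even with a race-free cache, this only helps if the cached instances are stateless; per-thread or per-reader instances are the more likely fix for the issue described here.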





[jira] [Assigned] (PARQUET-1643) Use airlift non-native implementations for GZIP, LZ0 and LZ4 codecs

2019-08-29 Thread Nandor Kollar (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1643:
--

Assignee: Samarth Jain

> Use airlift non-native implementations for GZIP, LZ0 and LZ4 codecs
> ---
>
> Key: PARQUET-1643
> URL: https://issues.apache.org/jira/browse/PARQUET-1643
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Samarth Jain
>Assignee: Samarth Jain
>Priority: Major
>  Labels: pull-request-available
>
> [~rdblue] pointed me to [https://github.com/airlift/aircompressor] which 
> provides non-native implementations of compression codecs. It claims to be 
> much faster than the native wrappers that Parquet uses. This Jira is to track 
> the work needed to explore these codecs, get benchmark results, and make the 
> required changes, including no longer needing to pool compressors and 
> decompressors. Note that this doesn't include SNAPPY, since Parquet already 
> has its own non-Hadoop implementation for it.





[jira] [Assigned] (PARQUET-1597) Fix parquet-cli's wrong or missing usage examples

2019-08-22 Thread Nandor Kollar (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1597:
--

Assignee: Kengo Seki

> Fix parquet-cli's wrong or missing usage examples
> -
>
> Key: PARQUET-1597
> URL: https://issues.apache.org/jira/browse/PARQUET-1597
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Reporter: Kengo Seki
>Assignee: Kengo Seki
>Priority: Minor
>  Labels: pull-request-available
>
> 1. The following parquet-cli's {{to-avro}} usage examples fail due to the 
> lack of {{-o}} options.
>In addition, "sample.parquet" in the second example should be 
> "sample.avro".
> {code}
>   Examples:
> # Create an Avro file from a Parquet file
> parquet to-avro sample.parquet sample.avro
> # Create an Avro file in HDFS from a local JSON file
> parquet to-avro path/to/sample.json hdfs:/user/me/sample.parquet
> # Create an Avro file from data in S3
> parquet to-avro s3:/data/path/sample.parquet sample.avro
> {code}
> 2. The above is the same for convert-csv.
> {code}
>   Examples:
> # Create a Parquet file from a CSV file
> parquet convert-csv sample.csv sample.parquet --schema schema.avsc
> # Create a Parquet file in HDFS from local CSV
> parquet convert-csv path/to/sample.csv hdfs:/user/me/sample.parquet 
> --schema schema.avsc
> # Create an Avro file from CSV data in S3
> parquet convert-csv s3:/data/path/sample.csv sample.avro --format avro 
> --schema s3:/schemas/schema.avsc
> {code}
> 3. The meta command has an "Examples:" heading but lacks its content.
> {code}
> $ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main help 
> meta
> Usage: parquet [general options] meta  [command options]
>   Description:
> Print a Parquet file's metadata
>   Examples:
> {code}





[jira] [Resolved] (PARQUET-1637) Builds are failing because default jdk changed to openjdk11 on Travis

2019-08-15 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1637.

Resolution: Fixed

> Builds are failing because default jdk changed to openjdk11 on Travis
> -
>
> Key: PARQUET-1637
> URL: https://issues.apache.org/jira/browse/PARQUET-1637
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>    Reporter: Nandor Kollar
>    Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
>
> The default distribution on Travis recently changed from Trusty to Xenial. It 
> appears that the default JDK also changed from JDK8 to JDK11: although the doc 
> [says|https://docs.travis-ci.com/user/reference/xenial/#jvm-clojure-groovy-java-scala-support]
>  the default is openjdk8, that doesn't appear to be correct (see the related 
> [discussion|https://travis-ci.community/t/default-jdk-on-xenial-openjdk8-or-openjdk11/4542]).
> Since Parquet still doesn't support Java 11 (PARQUET-1551), we should 
> explicitly tell Travis in its config which JDK to use, at least as long as 
> PARQUET-1551 is still open.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (PARQUET-1637) Builds are failing because default jdk changed to openjdk11 on Travis

2019-08-10 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1637:
--

 Summary: Builds are failing because default jdk changed to 
openjdk11 on Travis
 Key: PARQUET-1637
 URL: https://issues.apache.org/jira/browse/PARQUET-1637
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Reporter: Nandor Kollar
Assignee: Nandor Kollar


The default distribution on Travis recently changed from Trusty to Xenial. It 
appears that the default JDK also changed from JDK8 to JDK11: although the doc 
[says|https://docs.travis-ci.com/user/reference/xenial/#jvm-clojure-groovy-java-scala-support]
 the default is openjdk8, that doesn't appear to be correct (see the related 
[discussion|https://travis-ci.community/t/default-jdk-on-xenial-openjdk8-or-openjdk11/4542]).

Since Parquet still doesn't support Java 11 (PARQUET-1551), we should 
explicitly tell Travis in its config which JDK to use, at least as long as 
PARQUET-1551 is still open.





[jira] [Updated] (PARQUET-1303) Avro reflect @Stringable field write error if field not instanceof CharSequence

2019-07-25 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1303:
---
Fix Version/s: 1.12.0

> Avro reflect @Stringable field write error if field not instanceof 
> CharSequence
> ---
>
> Key: PARQUET-1303
> URL: https://issues.apache.org/jira/browse/PARQUET-1303
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Zack Behringer
>Assignee: Zack Behringer
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> Annotate a field in a pojo with org.apache.avro.reflect.Stringable and the 
> schema will consider it to be a String field. AvroWriteSupport.fromAvroString 
> assumes the field is either a Utf8 or CharSequence and does not attempt to 
> use the field class' toString method if it is not.





[jira] [Resolved] (PARQUET-1303) Avro reflect @Stringable field write error if field not instanceof CharSequence

2019-07-25 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1303.

Resolution: Fixed

> Avro reflect @Stringable field write error if field not instanceof 
> CharSequence
> ---
>
> Key: PARQUET-1303
> URL: https://issues.apache.org/jira/browse/PARQUET-1303
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Zack Behringer
>Assignee: Zack Behringer
>Priority: Minor
>  Labels: pull-request-available
>
> Annotate a field in a pojo with org.apache.avro.reflect.Stringable and the 
> schema will consider it to be a String field. AvroWriteSupport.fromAvroString 
> assumes the field is either a Utf8 or CharSequence and does not attempt to 
> use the field class' toString method if it is not.





[jira] [Assigned] (PARQUET-1303) Avro reflect @Stringable field write error if field not instanceof CharSequence

2019-07-25 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1303:
--

Assignee: Zack Behringer

> Avro reflect @Stringable field write error if field not instanceof 
> CharSequence
> ---
>
> Key: PARQUET-1303
> URL: https://issues.apache.org/jira/browse/PARQUET-1303
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Zack Behringer
>Assignee: Zack Behringer
>Priority: Minor
>
> Annotate a field in a pojo with org.apache.avro.reflect.Stringable and the 
> schema will consider it to be a String field. AvroWriteSupport.fromAvroString 
> assumes the field is either a Utf8 or CharSequence and does not attempt to 
> use the field class' toString method if it is not.





[jira] [Resolved] (PARQUET-1605) Bump maven-javadoc-plugin from 2.9 to 3.1.0

2019-07-24 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1605.

Resolution: Fixed

> Bump maven-javadoc-plugin from 2.9 to 3.1.0
> ---
>
> Key: PARQUET-1605
> URL: https://issues.apache.org/jira/browse/PARQUET-1605
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.11.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>






[jira] [Resolved] (PARQUET-1606) Fix invalid tests scope

2019-07-23 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1606.

Resolution: Fixed

> Fix invalid tests scope
> ---
>
> Key: PARQUET-1606
> URL: https://issues.apache.org/jira/browse/PARQUET-1606
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.11.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>






[jira] [Resolved] (PARQUET-1600) Fix shebang in parquet-benchmarks/run.sh

2019-07-23 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1600.

Resolution: Fixed

> Fix shebang in parquet-benchmarks/run.sh
> 
>
> Key: PARQUET-1600
> URL: https://issues.apache.org/jira/browse/PARQUET-1600
> Project: Parquet
>  Issue Type: Bug
>Reporter: Kengo Seki
>Assignee: Kengo Seki
>Priority: Minor
>  Labels: pull-request-available
>
> The following shebang does not work as expected since it's not on the first 
> line and there's a space between # and !.
> {code:title=parquet-benchmarks/run.sh}
> (snip)
> # !/usr/bin/env bash
> {code}
> For example, if users use tcsh, it fails as follows:
> {code}
> > parquet-benchmarks/run.sh
> Illegal variable name.
> {code}





[jira] [Resolved] (PARQUET-1552) upgrade protoc-jar-maven-plugin to 3.8.0

2019-07-10 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1552.

Resolution: Fixed

> upgrade protoc-jar-maven-plugin to 3.8.0
> 
>
> Key: PARQUET-1552
> URL: https://issues.apache.org/jira/browse/PARQUET-1552
> Project: Parquet
>  Issue Type: Wish
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Junjie Chen
>Assignee: Junjie Chen
>Priority: Minor
>  Labels: pull-request-available
>
> The current protoc-jar-maven-plugin has a problem when building the project 
> behind a proxy. The latest release, version 3.8.0, fixes this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: New PMC member: Gabor Szadovszky

2019-06-28 Thread Nandor Kollar
Congratulations Gabor!

On Fri, Jun 28, 2019 at 4:51 PM Zoltan Ivanfi 
wrote:

> Hi,
>
> The Project Management Committee (PMC) for Apache Parquet has invited Gabor
> Szadovszky to become a member of the PMC and we are pleased to announce
> that he has accepted.
>
> Congratulations, Gabor!
>
> Br,
>
> Zoltan
>


Re: New committer: Fokko Driesprong

2019-06-26 Thread Nandor Kollar
Congratulations Fokko!

On Tue, Jun 25, 2019 at 7:32 PM Xinli shang  wrote:

> Congratulations Fokko!
>
> On Tue, Jun 25, 2019 at 8:55 AM Lars Volker 
> wrote:
>
> > Congratulations Fokko!
> >
> > On Tue, Jun 25, 2019 at 8:52 AM Tim Armstrong
> >  wrote:
> >
> > > Congratulations!
> > >
> > > On Tue, Jun 25, 2019 at 7:12 AM 俊杰陈  wrote:
> > >
> > > > Congrats Fokko!
> > > >
> > > > On Tue, Jun 25, 2019 at 7:08 PM Zoltan Ivanfi
>  > >
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > The Project Management Committee (PMC) for Apache Parquet has
> invited
> > > > Fokko
> > > > > Driesprong to become a committer and we are pleased to announce
> that
> > he
> > > > has
> > > > > accepted.
> > > > >
> > > > > Congratulations and welcome, Fokko!
> > > > >
> > > > > Br,
> > > > >
> > > > > Zoltan
> > > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> > > >
> > >
> >
>
>
> --
> Xinli Shang
>


[jira] [Comment Edited] (PARQUET-1496) [Java] Update Scala for JDK 11 compatibility

2019-05-03 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832559#comment-16832559
 ] 

Nandor Kollar edited comment on PARQUET-1496 at 5/3/19 2:55 PM:


Scrooge recently released 19.4.0 with a fix for 
[Scrooge#304|https://github.com/twitter/scrooge/issues/304], so we can now 
upgrade to this version. The other problem (Scrooge#303) is fixed by 
[replacing|https://github.com/apache/parquet-format/commit/84165d0a4f46106a96d68ed831965123294a5196]
 the problematic comment in parquet-format; I'm afraid we'll need a release 
from format.


was (Author: nkollar):
Scrooge recently release 19.4.0 with a fix for 
[Scrooge#303|https://github.com/twitter/scrooge/issues/303], we can now upgrade 
to this version. The other problem (Scrooge#304) is fixed by 
[replacing|https://github.com/apache/parquet-format/commit/84165d0a4f46106a96d68ed831965123294a5196]
 the problematic comment in parquet-format, I'm afraid we'll need a release 
from format.

> [Java] Update Scala for JDK 11 compatibility
> 
>
> Key: PARQUET-1496
> URL: https://issues.apache.org/jira/browse/PARQUET-1496
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
>
> When trying to build the parquet-mr code on OSX Mojave with OpenJDK 10 and 9, 
> the build fails for me in {{parquet-scala}} with:
> {code:java}
> [INFO] --- maven-scala-plugin:2.15.2:compile (default) @ parquet-scala_2.10 
> ---
> [INFO] Checking for multiple versions of scala
> [INFO] includes = [**/*.java,**/*.scala,]
> [INFO] excludes = []
> [INFO] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/src/main/scala:-1: 
> info: compiling

[jira] [Comment Edited] (PARQUET-1496) [Java] Update Scala for JDK 11 compatibility

2019-05-03 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832559#comment-16832559
 ] 

Nandor Kollar edited comment on PARQUET-1496 at 5/3/19 2:54 PM:


Scrooge recently released 19.4.0 with a fix for 
[Scrooge#303|https://github.com/twitter/scrooge/issues/303], so we can now 
upgrade to this version. The other problem (Scrooge#304) is fixed by 
[replacing|https://github.com/apache/parquet-format/commit/84165d0a4f46106a96d68ed831965123294a5196]
 the problematic comment in parquet-format, I'm afraid we'll need a release 
from format.


was (Author: nkollar):
Scrooge recently release 19.4.0 with a fix for 
[Scrooge#304|https://github.com/twitter/scrooge/issues/303], we can now upgrade 
to this version. The other problem (Scrooge#303) is fixed by 
[replacing|https://github.com/apache/parquet-format/commit/84165d0a4f46106a96d68ed831965123294a5196]
 the problematic comment in parquet-format, I'm afraid we'll need a release 
from format.

> [Java] Update Scala for JDK 11 compatibility
> 
>
> Key: PARQUET-1496
> URL: https://issues.apache.org/jira/browse/PARQUET-1496
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
>
> When trying to build the parquet-mr code on OSX Mojave with OpenJDK 10 and 9, 
> the build fails for me in {{parquet-scala}} with:
> {code:java}
> [INFO] --- maven-scala-plugin:2.15.2:compile (default) @ parquet-scala_2.10 
> ---
> [INFO] Checking for multiple versions of scala
> [INFO] includes = [**/*.java,**/*.scala,]
> [INFO] excludes = []
> [INFO] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/src/main/scala:-1: 
> info: compiling
> [INFO] Compiling 1 source files to 
> /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/target/classes at 
> 1547922718010
> [ERROR] error: error while loading package, Missing dependency 'object 
> java.lang.Object in compiler mirror', required by 
> /Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/package.class)
> [ERROR] error: error while loading package, Missing dependency 'object 
> java.lang.Object in compiler mirror', required by 
> /Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/runtime/package.class)
> [ERROR] error: scala.reflect.internal.MissingRequirementError: object 
> java.lang.Object in compiler mirror not found.
> [ERROR] at 
> scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
> [ERROR] at 
> scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getClassByName(Mirrors.scala:99)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getRequiredClass(Mirrors.scala:102)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass$lzycompute(Definitions.scala:264)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass(Definitions.scala:264)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass$lzycompute(Definitions.scala:263)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass(Definitions.scala:263)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.specialPolyClass(Definitions.scala:1120)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass$lzycompute(Definitions.scala:407)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass(Definitions.scala:407)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1154)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1196)
> [INFO] at 
> scala.ref

[jira] [Commented] (PARQUET-1496) [Java] Update Scala for JDK 11 compatibility

2019-05-03 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832559#comment-16832559
 ] 

Nandor Kollar commented on PARQUET-1496:


Scrooge recently released 19.4.0 with a fix for 
[Scrooge#303|https://github.com/twitter/scrooge/issues/303], so we can now 
upgrade to this version. The other problem (Scrooge#304) is fixed by 
[replacing|https://github.com/apache/parquet-format/commit/84165d0a4f46106a96d68ed831965123294a5196]
 the problematic comment in parquet-format, I'm afraid we'll need a release 
from format.

> [Java] Update Scala for JDK 11 compatibility
> 
>
> Key: PARQUET-1496
> URL: https://issues.apache.org/jira/browse/PARQUET-1496
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
>
> When trying to build the parquet-mr code on OSX Mojave with OpenJDK 10 and 9, 
> the build fails for me in {{parquet-scala}} with:
> {code:java}
> [INFO] --- maven-scala-plugin:2.15.2:compile (default) @ parquet-scala_2.10 
> ---
> [INFO] Checking for multiple versions of scala
> [INFO] includes = [**/*.java,**/*.scala,]
> [INFO] excludes = []
> [INFO] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/src/main/scala:-1: 
> info: compiling
> [INFO] Compiling 1 source files to 
> /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/target/classes at 
> 1547922718010
> [ERROR] error: error while loading package, Missing dependency 'object 
> java.lang.Object in compiler mirror', required by 
> /Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/package.class)
> [ERROR] error: error while loading package, Missing dependency 'object 
> java.lang.Object in compiler mirror', required by 
> /Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/runtime/package.class)
> [ERROR] error: scala.reflect.internal.MissingRequirementError: object 
> java.lang.Object in compiler mirror not found.
> [ERROR] at 
> scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
> [ERROR] at 
> scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getClassByName(Mirrors.scala:99)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getRequiredClass(Mirrors.scala:102)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass$lzycompute(Definitions.scala:264)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass(Definitions.scala:264)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass$lzycompute(Definitions.scala:263)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass(Definitions.scala:263)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.specialPolyClass(Definitions.scala:1120)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass$lzycompute(Definitions.scala:407)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass(Definitions.scala:407)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1154)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1196)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1261)
> [INFO] at scala.tools.nsc.Global$Run.(Global.scala:1290)
> [INFO] at scala.tools.nsc.Driver.doCompile(Driver.scala:32)
> [INFO] at scala.tools.nsc.Main$.doCompile(Main.scala:79)
> [INFO] at scala.tools.nsc.Driver.process(Driver.scala:54)
> [INFO] at scala.tools.nsc.Driver.main(Driver.scala:67)
> [INFO] at scala.tools.nsc.Main.main

[jira] [Commented] (PARQUET-1556) Instructions are missing for configuring twitter maven repo for hadoop-lzo dependency

2019-04-03 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808837#comment-16808837
 ] 

Nandor Kollar commented on PARQUET-1556:


This additional repository should be in the POM instead of adding it to the 
settings. Looks strange to me; I couldn't reproduce the failure. My settings 
file doesn't have this additional Twitter repository, and I couldn't see it in 
the output of {{mvn help:effective-pom}} either.

> Instructions are missing for configuring twitter maven repo for hadoop-lzo 
> dependency
> -
>
> Key: PARQUET-1556
> URL: https://issues.apache.org/jira/browse/PARQUET-1556
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.11.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.12.0
>
>
> Running mvn verify based on the instructions in the README results in this 
> error
> {code:java}
> Could not resolve dependencies for project 
> org.apache.parquet:parquet-thrift:jar:1.11.0: Could not find artifact 
> com.hadoop.gplcompression:hadoop-lzo:jar:0.4.16{code}
> To fix this, it was necessary to configure my local ~/.m2/settings.xml to 
> include the twitter maven repo:
> {code:java}
> <repository>
>   <id>twitter</id>
>   <name>twitter</name>
>   <url>http://maven.twttr.com</url>
> </repository>
> {code}
> After adding this, mvn verify worked.
> We should add these instructions to the README.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1554) Compilation error when upgrading Scrooge version

2019-04-02 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1554:
--

 Summary: Compilation error when upgrading Scrooge version
 Key: PARQUET-1554
 URL: https://issues.apache.org/jira/browse/PARQUET-1554
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Nandor Kollar
Assignee: Nandor Kollar


When upgrading Scrooge version to 19.1.0, the build fails with
{code}
[510.1] failure: string matching regex `[A-Za-z_][A-Za-z0-9\._]*' expected but 
`}' found
{code}

This is due to a Javadoc-style comment in the IndexPageHeader struct. Changing 
the style of the comment would fix the failure.
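As a purely hypothetical illustration of the kind of change described (the real struct body and comment text in parquet-format differ):

```thrift
/** A Javadoc-style doc comment like this tripped the Scrooge 19.1.0 parser. */
struct IndexPageHeaderBefore {
}

/* Rewriting it as a plain block comment (or a // line comment) avoids the failure. */
struct IndexPageHeaderAfter {
}
```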





[jira] [Commented] (PARQUET-1548) Meta data is lost when writing avro union types to parquet

2019-03-21 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16798073#comment-16798073
 ] 

Nandor Kollar commented on PARQUET-1548:


Would you mind sharing more details on how this happens and how to reproduce 
it? A unit test or steps to reproduce would be really useful.

> Meta data is lost when writing avro union types to parquet
> --
>
> Key: PARQUET-1548
> URL: https://issues.apache.org/jira/browse/PARQUET-1548
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
> Environment: macOS -mojave
>Reporter: Michael O'Shea
>Priority: Major
>






[jira] [Resolved] (PARQUET-1545) Logical Type for timezone-naive timestamps

2019-03-20 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1545.

Resolution: Fixed

> Logical Type for timezone-naive timestamps
> --
>
> Key: PARQUET-1545
> URL: https://issues.apache.org/jira/browse/PARQUET-1545
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Tim Swast
>Priority: Major
>
> {{In many systems there is a difference between a timezone-naive timestamp 
> column (called DATETIME in BigQuery, 'logicalType': 'datetime' in Avro) and a 
> timezone-aware timestamp (called TIMESTAMP in BigQuery and always stored in 
> UTC). It seems from [this 
> discussion|https://github.com/apache/parquet-format/pull/51#discussion_r119911623]
>  and the [list of logical 
> types|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] 
> that parquet only has the timezone-aware version, as all timestamps are 
> stored according to UTC.}}





[jira] [Commented] (PARQUET-1545) Logical Type for timezone-naive timestamps

2019-03-20 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796990#comment-16796990
 ] 

Nandor Kollar commented on PARQUET-1545:


Great, closing this Jira as done then. By the way, as I mentioned, if you're 
using parquet-mr, this is only supported since 1.11.0.

> Logical Type for timezone-naive timestamps
> --
>
> Key: PARQUET-1545
> URL: https://issues.apache.org/jira/browse/PARQUET-1545
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Tim Swast
>Priority: Major
>
> {{In many systems there is a difference between a timezone-naive timestamp 
> column (called DATETIME in BigQuery, 'logicalType': 'datetime' in Avro) and a 
> timezone-aware timestamp (called TIMESTAMP in BigQuery and always stored in 
> UTC). It seems from [this 
> discussion|https://github.com/apache/parquet-format/pull/51#discussion_r119911623]
>  and the [list of logical 
> types|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] 
> that parquet only has the timezone-aware version, as all timestamps are 
> stored according to UTC.}}





[jira] [Commented] (PARQUET-1545) Logical Type for timezone-naive timestamps

2019-03-18 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795001#comment-16795001
 ] 

Nandor Kollar commented on PARQUET-1545:


[~swast] actually Parquet already supports both timezone-naive and 
timezone-aware timestamps. The logical type API was improved in 1.11.0 
(based on the recent improvements in parquet-format), and the timestamp logical 
type introduced an additional {{isAdjustedToUTC}} parameter, which tells the 
semantics of the given timestamp field: {{false}} means timezone-naive, 
and {{true}} means timezone-aware.
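A stdlib analogy for the two semantics (this is java.time, not the parquet-mr API; in parquet-mr the annotation itself is built via LogicalTypeAnnotation.timestampType(isAdjustedToUTC, unit)):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// java.time analogy for the two timestamp semantics:
//  - isAdjustedToUTC = false -> a local, timezone-naive wall-clock value
//  - isAdjustedToUTC = true  -> an instant normalized to UTC
public class TimestampSemantics {
    public static void main(String[] args) {
        LocalDateTime naive = LocalDateTime.of(2019, 3, 20, 12, 0); // no zone attached
        Instant utc = naive.toInstant(ZoneOffset.UTC);              // pinned to the UTC time line
        System.out.println(naive); // 2019-03-20T12:00
        System.out.println(utc);   // 2019-03-20T12:00:00Z
    }
}
```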

> Logical Type for timezone-naive timestamps
> --
>
> Key: PARQUET-1545
> URL: https://issues.apache.org/jira/browse/PARQUET-1545
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Tim Swast
>Priority: Major
>
> {{In many systems there is a difference between a timezone-naive timestamp 
> column (called DATETIME in BigQuery, 'logicalType': 'datetime' in Avro) and a 
> timezone-aware timestamp (called TIMESTAMP in BigQuery and always stored in 
> UTC). It seems from [this 
> discussion|https://github.com/apache/parquet-format/pull/51#discussion_r119911623]
>  and the [list of logical 
> types|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] 
> that parquet only has the timezone-aware version, as all timestamps are 
> stored according to UTC.}}





Re: [VOTE] Release Apache Parquet 1.10.1 RC0

2019-01-29 Thread Nandor Kollar
+1 (non-binding)

Verified signature and checksum, ran unit tests, all passed.

Cheers,
Nandor

On Tue, Jan 29, 2019 at 1:21 PM Gabor Szadovszky  wrote:

> Hi Ryan,
>
> Checked the tarball: checksum/signature are correct. Content is correct
> based on the release tag. Unit tests pass.
>
> +1 (non-binding)
>
> Cheers,
> Gabor
>
>
> On Mon, Jan 28, 2019 at 11:08 PM Ryan Blue 
> wrote:
>
> > Hi everyone,
> >
> > I propose the following RC to be released as official Apache Parquet Java
> > 1.10.1 release.
> >
> > The commit id is a89df8f9932b6ef6633d06069e50c9b7970bebd1
> >
> >- This corresponds to the tag: apache-parquet-1.10.1
> >- https://github.com/apache/parquet-mr/commit/a89df8f
> >- https://github.com/apache/parquet-mr/tree/apache-parquet-1.10.1
> >
> > The release tarball, signature, and checksums are here:
> >
> >-
> >
> >
> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.10.1-rc0/
> >
> > You can find the KEYS file here:
> >
> >- https://dist.apache.org/repos/dist/dev/parquet/KEYS
> >
> > Binary artifacts are staged in Nexus here:
> >
> >-
> >
> >
> https://repository.apache.org/content/repositories/orgapacheparquet-1022/
> >
> > This release is a patch release for Parquet 1.10.0. It includes the
> > following bug fixes:
> >
> >- PARQUET-1309: Properties to disable stats and dictionary filtering
> are
> >swapped
> >- PARQUET-1510: Dictionary filter bug skips null for notEq with
> >dictionary of one value
> >
> > Please download, verify, and test.
> >
> > Please vote in the next 72 hours:
> >
> > [ ] +1 Release this as Apache Parquet Java 1.10.1
> > [ ] +0
> > [ ] -1 Do not release this because…
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>


Re: [DISCUSS] Remove old modules?

2019-01-29 Thread Nandor Kollar
Removing parquet-hive-* is a great idea; the code in Parquet is not
maintained any more, it is just a burden there.

As for parquet-pig, I'd prefer moving it to Pig (if the Pig community accepts it
as it is) instead of dropping it or moving it to a separate project. I know
people who still use Pig with Parquet.

Regards,
Nandor

On Mon, Jan 28, 2019 at 6:29 PM Ryan Blue  wrote:

> Hi everyone,
>
> I’m working on the 1.10.1 build and I’ve noticed that we will have several
> modules that are not maintained or are very old. This includes all of the
> Hive modules that moved into Hive years ago and also modules like
> parquet-scrooge and parquet-scala that are based on Scala 2.10 that has
> been EOL for years.
>
> We also have 2 command-line utilities, parquet-tools and parquet-cli. The
> parquet-cli version is friendlier to use, but I’m clearly biased. In any
> case, I don’t think we need to maintain both and it is confusing for users
> to have two modules that do the same thing.
>
> I propose we remove the following modules:
>
>- parquet-hive-*
>- parquet-scrooge
>- parquet-scala
>- parquet-tools
>- parquet-hadoop-bundle (shaded deps)
>- parquet-cascading (in favor of parquet-cascading3, if we keep it)
>
> There are also modules that I'm not sure about. Does anyone use these?
>
>- parquet-thrift
>- parquet-pig
>- parquet-cascading3
>
> Pig hasn’t had an update (other than project-wide changes) since Oct 2017.
> I think it may be time to drop support in Pig and allow that to exist as a
> separate project if anyone is still interested in it.
>
> In the last few years, we’ve moved more to a model where processing
> frameworks and engines maintain their own integration. Spark, Presto,
> Iceberg, and Hive fall into this category. So I would prefer to drop Pig
> and Cascading3. I’m fine keeping thrift if people think it is useful.
>
> Thoughts?
>
> rb
> --
> Ryan Blue
> Software Engineer
> Netflix
>


[jira] [Updated] (PARQUET-409) InternalParquetRecordWriter doesn't use min/max row counts

2019-01-18 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-409:
--
Fix Version/s: (was: 1.9.0)

> InternalParquetRecordWriter doesn't use min/max row counts
> --
>
> Key: PARQUET-409
> URL: https://issues.apache.org/jira/browse/PARQUET-409
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.1
>Reporter: Ryan Blue
>Priority: Major
>  Labels: pull-request-available
>
> PARQUET-99 added settings to control the min and max number of rows between 
> size checks when flushing pages, and a setting to control whether to always 
> use a static size (the min). The [InternalParquetRecordWriter has similar 
> checks|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L143]
>  that don't use those settings. We should determine if it should update it to 
> use those settings or similar.
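A hedged sketch of the bounded check-interval idea referenced above (names are illustrative, not the actual parquet-mr fields): estimate when the size limit will be hit, check again halfway there, and clamp the interval between the configured min and max row counts.

```java
public class SizeCheckSchedule {
    // Next record count at which to re-check buffered size:
    // halfway to the estimated limit, clamped to [minInterval, maxInterval].
    static long nextSizeCheck(long recordsWritten, long estimatedRecordsToLimit,
                              long minInterval, long maxInterval) {
        long halfway = estimatedRecordsToLimit / 2;
        return recordsWritten + Math.min(maxInterval, Math.max(minInterval, halfway));
    }

    public static void main(String[] args) {
        System.out.println(nextSizeCheck(100, 50, 10, 10000));      // 125
        System.out.println(nextSizeCheck(100, 4, 10, 10000));       // 110 (min floor applies)
        System.out.println(nextSizeCheck(100, 1000000, 10, 10000)); // 10100 (max cap applies)
    }
}
```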





[jira] [Updated] (PARQUET-182) FilteredRecordReader skips rows it shouldn't for schema with optional columns

2019-01-18 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-182:
--
Fix Version/s: (was: 1.9.0)

> FilteredRecordReader skips rows it shouldn't for schema with optional columns
> -
>
> Key: PARQUET-182
> URL: https://issues.apache.org/jira/browse/PARQUET-182
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.5.0, 1.6.0, 1.7.0
> Environment: Linux, Java7/Java8
>Reporter: Steven Mellinger
>Priority: Blocker
>
> When using UnboundRecordFilter with nested AND/OR filters over OPTIONAL 
> columns, there seems to be a case with a mismatch between the current 
> record's column value and the value read during filtering.
> The structure of my filter predicate that results in incorrect filtering is: 
> (x && (y || z))
> When I step through it with a debugger I can see that the value being read 
> from the ColumnReader inside my Predicate is different than the value for 
> that row.
> Looking deeper there seems to be a buffer with dictionary keys in 
> RunLengthBitPackingHybridDecoder (I am using RLE). There are only two 
> different keys in this array, [0,1], whereas my optional column has three 
> different values, [null,0,1]. If I had a column with values 5,10,10,null,10, 
> and keys 0 -> 5 and 1 -> 10, the buffer would hold 0,1,1,1,0, and in the case 
> that it reads the last row, would return 0 -> 5.
> So it seems that nothing is keeping track of where nulls appear.
> Hope someone can take a look, as it is a blocker for my project.
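A small illustration of the reporter's 5,10,10,null,10 example (assumed setup, not parquet-mr code): the dictionary-key buffer has entries only for non-null positions, so indexing it by raw row number without consulting definition levels returns a shifted value.

```java
public class DictionaryNullSkew {
    public static void main(String[] args) {
        Integer[] column = {5, 10, 10, null, 10}; // logical rows, one is null
        int[] dictionary = {5, 10};               // key 0 -> 5, key 1 -> 10
        int[] keys = {0, 1, 1, 1};                // one key per NON-null value only

        // Reading row 4 correctly requires subtracting the nulls before it.
        int nullsBefore = 0;
        for (int row = 0; row < 4; row++) {
            if (column[row] == null) nullsBefore++;
        }
        int correct = dictionary[keys[4 - nullsBefore]]; // keys[3] -> 10

        // Indexing keys[] by raw row number reads the wrong slot: row 3 is
        // null, yet keys[3] yields a real value belonging to a later row.
        int naiveRow3 = dictionary[keys[3]]; // 10, although column[3] is null

        System.out.println(correct + " " + naiveRow3); // 10 10
    }
}
```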





[jira] [Updated] (PARQUET-1006) ColumnChunkPageWriter uses only heap memory.

2019-01-18 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1006:
---
Fix Version/s: (was: 1.9.0)

> ColumnChunkPageWriter uses only heap memory.
> 
>
> Key: PARQUET-1006
> URL: https://issues.apache.org/jira/browse/PARQUET-1006
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Priority: Major
>
> After PARQUET-160 was resolved, ColumnChunkPageWriter started using 
> ConcatenatingByteArrayCollector, where all data is collected in a List of 
> byte[] before the page is written. There is no way to use direct memory for 
> allocating buffers. ByteBufferAllocator is present in the 
> [ColumnChunkPageWriter|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L73]
>  class, but never used.
> Using only Java heap space can in some cases cause OOM exceptions or GC 
> overhead. 
> ByteBufferAllocator should be used in the ConcatenatingByteArrayCollector or 
> OutputStream classes.
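A stdlib sketch of the distinction at stake: an allocator abstraction such as ByteBufferAllocator exists so callers can opt into off-heap buffers, which a byte[]-backed collector can never provide.

```java
import java.nio.ByteBuffer;

public class BufferKinds {
    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(1024);         // backed by a byte[] on the JVM heap
        ByteBuffer direct = ByteBuffer.allocateDirect(1024); // native memory, outside the GC heap
        System.out.println(heap.isDirect() + " " + direct.isDirect()); // false true
    }
}
```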





[jira] [Resolved] (PARQUET-1278) [Java] parquet-arrow is broken due to different JDK version

2019-01-11 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1278.

Resolution: Duplicate

[~andygrove] I resolved this Jira since it looked like it is a duplicate of 
PARQUET-1390

Feel free to reopen if I was wrong.

> [Java] parquet-arrow is broken due to different JDK version
> ---
>
> Key: PARQUET-1278
> URL: https://issues.apache.org/jira/browse/PARQUET-1278
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Andy Grove
>Priority: Major
>
> Parquet was recently updated to use Arrow 0.8.0 to resolve this JIRA: 
> https://issues.apache.org/jira/browse/PARQUET-1128 but unfortunately the 
> issue is actually not resolved because Arrow 0.8.0 uses JDK 7 and Parquet 
> uses JDK 8.
> I have filed a Jira against Arrow to upgrade to JDK 8: 
> https://issues.apache.org/jira/browse/ARROW-2498
> Once this is done, we will need to update parquet-arrow to use the new 
> version of arrow.





[jira] [Assigned] (PARQUET-1478) Can't read spec compliant, 3-level lists via parquet-proto

2019-01-03 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1478:
--

Assignee: Nandor Kollar

> Can't read spec compliant, 3-level lists via parquet-proto
> --
>
> Key: PARQUET-1478
> URL: https://issues.apache.org/jira/browse/PARQUET-1478
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
>    Reporter: Nandor Kollar
>    Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> I noticed that ProtoInputOutputFormatTest doesn't test the following case 
> properly: when lists are written using the spec compliant 3-level structure. 
> The test actually doesn't write 3-level list, because the passed 
> configuration is not used at all, a new one is created each time. See 
> attached PR.
> When I fixed this test, it turned out that it is failing: now it writes the 
> correct 3-level structure, but looks like the read path is broken. Is it 
> indeed a bug, or I'm doing something wrong?





[jira] [Created] (PARQUET-1478) Can't read spec compliant, 3-level lists via parquet-proto

2018-12-14 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1478:
--

 Summary: Can't read spec compliant, 3-level lists via parquet-proto
 Key: PARQUET-1478
 URL: https://issues.apache.org/jira/browse/PARQUET-1478
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Reporter: Nandor Kollar


I noticed that ProtoInputOutputFormatTest doesn't test the following case 
properly: when lists are written using the spec compliant 3-level structure. 
The test actually doesn't write a 3-level list, because the passed configuration 
is not used at all, a new one is created each time. See attached PR.

When I fixed this test, it turned out that it is failing: now it writes the 
correct 3-level structure, but looks like the read path is broken. Is it indeed 
a bug, or I'm doing something wrong?





[jira] [Created] (PARQUET-1476) Don't emit a warning message for files without new logical type

2018-12-13 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1476:
--

 Summary: Don't emit a warning message for files without new 
logical type
 Key: PARQUET-1476
 URL: https://issues.apache.org/jira/browse/PARQUET-1476
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Nandor Kollar
Assignee: Nandor Kollar


When only the old logical type representation is present in the file, the 
converter emits a warning that the two types mismatch. This creates unwanted 
noise in the logs; the metadata converter should only emit a warning if the new 
logical type is not null.
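A minimal sketch of the guard described (illustrative names, not the actual metadata-converter code): warn only when the new representation is actually present and disagrees with the old one.

```java
import java.util.Objects;

public class LogicalTypeWarning {
    // Warn only if the file carries the new logical type and it disagrees
    // with the old converted representation; old-only files stay silent.
    static boolean shouldWarn(String oldType, String newType) {
        return newType != null && !Objects.equals(oldType, newType);
    }

    public static void main(String[] args) {
        System.out.println(shouldWarn("DECIMAL", null));     // false: old-only file, no log noise
        System.out.println(shouldWarn("DECIMAL", "STRING")); // true: genuine mismatch
    }
}
```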





[jira] [Commented] (PARQUET-1434) Release parquet-mr 1.11.0

2018-12-12 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718669#comment-16718669
 ] 

Nandor Kollar commented on PARQUET-1434:


[~yumwang] the release is not yet finished, voting is not yet closed.

> Release parquet-mr 1.11.0
> -
>
> Key: PARQUET-1434
> URL: https://issues.apache.org/jira/browse/PARQUET-1434
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>    Reporter: Nandor Kollar
>Priority: Major
>






Re: Regarding Apache Parquet Project

2018-12-11 Thread Nandor Kollar
Hi Arjit,

I'd also recommend you to have a look at Parquet website:
https://parquet.apache.org/

You can find a couple of old but great presentations there; I recommend
watching them to understand the basics (although Parquet has gained additional
features over the years, the basics haven't changed, and these presentations
explain them well). You can also find the links to the Git repositories there,
which I'd recommend having a look at as well.

If you're interested in the latest ongoing development efforts, have a look
at the Jira: https://issues.apache.org/jira/projects/PARQUET/ and have a
look at the open pull requests attached to these Jiras.

Regards,
Nandor

On Tue, Dec 11, 2018 at 9:41 AM Hatem Helal 
wrote:

> Hi Arjit,
>
> I'm new around here too but interested to hear what the others on this
> list have to say.  For C++ development, I'd recommend reading through the
> examples:
>
> https://github.com/apache/arrow/tree/master/cpp/examples/parquet
>
> and the command-line tools:
>
> https://github.com/apache/arrow/tree/master/cpp/tools/parquet
>
> Both were helpful for getting up to speed on the main APIs.  I use an IDE
> (Xcode but doesn't matter which) to debug and step through the code and try
> to understand the internal dependencies.  The setup for Xcode was a bit
> manual but let me know if there is interest and I can investigate
> automation so that I can share it with others.
>
> Hope this helps,
>
> Hatem
>
> On 12/11/18, 5:39 AM, "Arjit Yadav"  wrote:
>
> Hi all,
>
> I am new to this project. While I have used parquet in the past, I
> want to
> know how it works internally and look up relevant documentation and
> code
> in order to start contributing to the project.
>
> Please let me know any available resources in this regard.
>
> Regards,
> Arjit Yadav
>
>
>


[jira] [Commented] (PARQUET-1470) Inputstream leakage in ParquetFileWriter.appendFile

2018-12-04 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708702#comment-16708702
 ] 

Nandor Kollar commented on PARQUET-1470:


Sounds like a reasonable improvement. Would you mind opening a PR?

> Inputstream leakage in ParquetFileWriter.appendFile
> ---
>
> Key: PARQUET-1470
> URL: https://issues.apache.org/jira/browse/PARQUET-1470
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Arnaud Linz
>Priority: Major
>
> Current implementation of ParquetFileWriter.appendFile is:
> {code:java}
> public void appendFile(InputFile file) throws IOException {
>   ParquetFileReader.open(file).appendTo(this);
> }
> {code}
> This method never closes the input stream created when the file is opened in 
> the ParquetFileReader constructor.
> This leads, for instance, to too-many-open-files exceptions when large merges 
> are made with the parquet tools.
> Something like
> {code:java}
> try (ParquetFileReader reader = ParquetFileReader.open(file)) {
>   reader.appendTo(this);
> }
> {code}
> would be cleaner.
>  





[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

2018-11-28 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701851#comment-16701851
 ] 

Nandor Kollar commented on PARQUET-1441:


Ok, so looks like there is a more fundamental problem related to this. Parquet 
allows using the same name in nested structures, while Avro doesn't allow it 
for records. For example, a file with this Parquet schema
{code}
message Message {
optional group a1 {
required float a2;
optional group a1 {
required float a4;
}
}
}
{code}
is not readable via AvroParquetReader. Of course this could be easily solved by 
renaming the inner a1 to something else, but for lists, this doesn't work. I 
think using Avro namespaces during schema conversion could fix this bug.
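
The namespace idea can be sketched without Parquet at all. Below is a minimal, hypothetical helper (not the actual converter code): it derives an Avro namespace from the dotted path of enclosing groups, so the two nested groups named a1 in the schema above get distinct Avro full names.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class NamespaceDemo {
    // Hypothetical helper: build an Avro full name for a nested Parquet group
    // by using the dotted parent path as the namespace.
    static String fullName(Deque<String> path, String name) {
        return path.isEmpty() ? name : String.join(".", path) + "." + name;
    }

    public static void main(String[] args) {
        Deque<String> path = new ArrayDeque<>();
        path.addLast("Message");
        String outer = fullName(path, "a1"); // outer group: Message.a1
        path.addLast("a1");
        String inner = fullName(path, "a1"); // inner group: Message.a1.a1
        System.out.println(outer + " != " + inner);
    }
}
```

With distinct full names, Avro's record-name uniqueness rule is satisfied even though the simple names collide.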

> SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
> 
>
> Key: PARQUET-1441
> URL: https://issues.apache.org/jira/browse/PARQUET-1441
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Reporter: Michael Heuer
>Priority: Major
>  Labels: pull-request-available
>
> The following unit test added to TestAvroSchemaConverter fails
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>   "  optional group annotation {\n" +
>   "optional group transcriptEffects (LIST) {\n" +
>   "  repeated group list {\n" +
>   "optional group element {\n" +
>   "  optional group effects (LIST) {\n" +
>   "repeated group list {\n" +
>   "  optional binary element (UTF8);\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n";
>   Configuration conf = new Configuration(false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = 
> avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> while this one succeeds
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>   "  optional group annotation {\n" +
>   "optional group transcriptEffects (LIST) {\n" +
>   "  repeated group list {\n" +
>   "optional group element {\n" +
>   "  optional group effects (LIST) {\n" +
>   "repeated group list {\n" +
>   "  optional binary element (UTF8);\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n";
>  
>   Configuration conf = new Configuration(false);
>   conf.setBoolean("parquet.avro.add-list-element-records", false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = 
> avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> I don't see a way to influence the code path in AvroIndexedRecordConverter to 
> respect this configuration, resulting in the following stack trace downstream
> {noformat}
>   Cause: org.apache.avro.SchemaParseException: Can't redefine: list
>   at org.apache.avro.Schema$Names.put(Schema.java:1128)
>   at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
>   at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema.toString(Schema.java:324)
>   at 
> org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(Sche

[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

2018-11-27 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700913#comment-16700913
 ] 

Nandor Kollar commented on PARQUET-1441:


The compatibility check introduced by PARQUET-651 in AvroRecordConverter, 
[this|https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L866].
 I was referring to the Parquet version: 1.8.1 doesn't have this change, while 
1.8.2 already does.


[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

2018-11-27 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700502#comment-16700502
 ] 

Nandor Kollar commented on PARQUET-1441:


Looks like this compatibility check broke this scenario. The commit wasn't 
included when 1.8.1 was released, but it was backported to all later release 
branches (including 1.8.x, so if 1.8.2 gets released, this case will break 
there too).


[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

2018-11-27 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700445#comment-16700445
 ] 

Nandor Kollar commented on PARQUET-1441:


After a bit more investigation, I think this is probably a Parquet issue; 
however, it isn't clear to me how this can be a regression. The failing code 
path in Parquet and in Avro was committed long ago. One possibility I can 
think of is that Spark might have recently moved from the 2-level list 
structure to 3-level lists.

The unit test attached to this PR doesn't reflect the problem, because I think 
it tests the correct behaviour: in the converter one can switch between 2- and 
3-level lists with the {{parquet.avro.add-list-element-records}} property. The 
test for the Spark Jira is a lot more informative.

I think the problem is that 
[AvroRecordConverter|https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L865]
 tries to decide between 3- and 2-level lists by first interpreting the 
schema as 2-level and checking its compatibility with the expected Avro schema. 
Normally the two are incompatible (if the file was written as 3-level), so 
Parquet concludes that it is a 3-level list. This works fine when lists are not 
nested in other lists, but if we try to represent a 3-level nested-list Parquet 
structure as 2-level, the resulting 2-level Avro schema is not even a valid 
Avro schema!
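
Why the 2-level interpretation is invalid can be reproduced without Parquet: both the outer and the inner LIST wrapper groups would convert to an Avro record named "list" (with no namespace) but with different fields, and Avro refuses to bind one name to two definitions. The sketch below is a simplified analogue of that name registry, not Avro's actual Schema.Names class, though it mirrors the "Can't redefine" check seen in the stack trace.

```java
import java.util.HashMap;
import java.util.Map;

public class NameRegistryDemo {
    // Simplified analogue of Avro's name registry: full name -> definition.
    static final Map<String, String> names = new HashMap<>();

    // Registering the same name twice is only allowed for identical definitions.
    static void define(String name, String definition) {
        String existing = names.get(name);
        if (existing != null && !existing.equals(definition)) {
            throw new IllegalStateException("Can't redefine: " + name);
        }
        names.put(name, definition);
    }

    public static void main(String[] args) {
        // Outer and inner LIST wrappers both map to a record named "list",
        // but with different fields -- the second registration fails.
        define("list", "record list { array<record> element }");
        try {
            define("list", "record list { string element }");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // Can't redefine: list
        }
    }
}
```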


[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

2018-11-26 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699061#comment-16699061
 ] 

Nandor Kollar commented on PARQUET-1441:


Well, I think the correct solution is setting 
{{parquet.avro.add-list-element-records}} to false, as in the second test case 
in the attached PR.


[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

2018-11-26 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698733#comment-16698733
 ] 

Nandor Kollar commented on PARQUET-1441:


The workaround proposed above should work; however, it is not 100% compliant 
with the Parquet [LogicalTypes 
spec|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md]. It 
states that "The middle level, named list, must be a repeated group with a 
single field named element." Fortunately, due to the backward-compatibility 
rules for nested types, this doesn't cause an error.


Re: [VOTE] Release Apache Parquet 1.11.0 RC0

2018-11-22 Thread Nandor Kollar
+1 (non-binding)

Verified checksums, all matched. Executed unit tests, each passed.

Regards,
Nandor
On Thu, Nov 22, 2018 at 9:07 AM Gabor Szadovszky  wrote:
>
> Hi,
>
> Verified source tarball checksums and content. All are correct. Unit tests
> pass.
>
> +1 (non-binding)
>
> Cheers,
> Gabor
>
> On Wed, Nov 21, 2018 at 7:11 PM Zoltan Ivanfi 
> wrote:
>
> > Dear Parquet Users and Developers,
> >
> > I propose the following RC to be released as the official Apache
> > Parquet 1.11.0 release:
> >
> > The commit id is b873a0ab31da570bb615ab2253cf90a2f451b0e4
> > * This corresponds to the tag: apache-parquet-1.11.0
> > *
> > https://github.com/apache/parquet-mr/tree/b873a0ab31da570bb615ab2253cf90a2f451b0e4
> >
> > The release tarball, signature, and checksums are here:
> > *
> > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc0/
> >
> > You can find the KEYS file here:
> > * https://dist.apache.org/repos/dist/dev/parquet/KEYS
> >
> > Binary artifacts are staged in Nexus here:
> > *
> > https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet/1.11.0/
> >
> > This release includes many bug fixes and the following new features:
> >
> > - PARQUET-1201 - Column indexes
> > - PARQUET-1253 - Support for new logical type representation
> > - PARQUET-1381 - Add merge blocks command to parquet-tools
> > - PARQUET-1388 - Nanosecond precision time and timestamp - parquet-mr
> >
> > Please download, verify, and test. The vote will be open for at least 72
> > hours.
> >
> > Thanks,
> >
> > Zoltan
> >


[jira] [Commented] (PARQUET-1407) Data loss on duplicate values with AvroParquetWriter/Reader

2018-11-19 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692014#comment-16692014
 ] 

Nandor Kollar commented on PARQUET-1407:


[~rdblue] [~scottcarey] [~jackytan] I added a unit test to Ryan's fix. Could 
you please review PR #552

> Data loss on duplicate values with AvroParquetWriter/Reader
> ---
>
> Key: PARQUET-1407
> URL: https://issues.apache.org/jira/browse/PARQUET-1407
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.9.0, 1.10.0, 1.8.3
>Reporter: Scott Carey
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> {code:java}
> public class Blah {
>   private static Path parquetFile = new Path("oops");
>   private static Schema schema = SchemaBuilder.record("spark_schema")
>   .fields().optionalBytes("value").endRecord();
>   private static GenericData.Record recordFor(String value) {
> return new GenericRecordBuilder(schema)
> .set("value", value.getBytes()).build();
>   }
>   public static void main(String ... args) throws IOException {
> try (ParquetWriter writer = AvroParquetWriter
>   .builder(parquetFile)
>   .withSchema(schema)
>   .build()) {
>   writer.write(recordFor("one"));
>   writer.write(recordFor("two"));
>   writer.write(recordFor("three"));
>   writer.write(recordFor("three"));
>   writer.write(recordFor("two"));
>   writer.write(recordFor("one"));
>   writer.write(recordFor("zero"));
> }
> try (ParquetReader reader = AvroParquetReader
> .builder(parquetFile)
> .withConf(new Configuration()).build()) {
>   GenericRecord rec;
>   int i = 0;
>   while ((rec = reader.read()) != null) {
> ByteBuffer buf = (ByteBuffer) rec.get("value");
> byte[] bytes = new byte[buf.remaining()];
> buf.get(bytes);
> System.out.println("rec " + i++ + ": " + new String(bytes));
>   }
> }
>   }
> }
> {code}
> Expected output:
> {noformat}
> rec 0: one
> rec 1: two
> rec 2: three
> rec 3: three
> rec 4: two
> rec 5: one
> rec 6: zero{noformat}
> Actual:
> {noformat}
> rec 0: one
> rec 1: two
> rec 2: three
> rec 3: 
> rec 4: 
> rec 5: 
> rec 6: zero{noformat}
>  
> This was found when we started getting empty byte[] values back in spark 
> unexpectedly.  (Spark 2.3.1 and Parquet 1.8.3).   I have not tried to 
> reproduce with parquet 1.9.0, but its a bad enough bug that I would like a 
> 1.8.4 release that I can drop-in replace 1.8.3 without any binary 
> compatibility issues.
>  Duplicate byte[] values are lost.
>  
> A few clues: 
> If I do not call ByteBuffer.get, the size of ByteBuffer.remaining does not go 
> to zero.  I suspect a ByteBuffer is being recycled, but the call to 
> ByteBuffer.get mutates it.  I wonder if an appropriately placed 
> ByteBuffer.duplicate() would fix it.
>  
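
The {{ByteBuffer.duplicate()}} idea in the last clue can be shown in isolation. This is a minimal standalone sketch of the buffer semantics only, not the Parquet fix itself: a duplicate shares the backing bytes but has its own position, so draining the duplicate leaves the original buffer readable for a later (possibly duplicate-valued) record.

```java
import java.nio.ByteBuffer;

public class DuplicateDemo {
    public static void main(String[] args) {
        ByteBuffer shared = ByteBuffer.wrap("three".getBytes());

        // Reading through a duplicate advances the duplicate's position only;
        // the shared buffer can be handed out again with its bytes intact.
        ByteBuffer dup = shared.duplicate();
        byte[] bytes = new byte[dup.remaining()];
        dup.get(bytes);

        System.out.println(new String(bytes));  // three
        System.out.println(shared.remaining()); // 5 -- position untouched
    }
}
```

Calling get() directly on the shared buffer instead would leave remaining() at zero, which matches the empty values observed for repeated records.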





[jira] [Comment Edited] (PARQUET-1407) Data loss on duplicate values with AvroParquetWriter/Reader

2018-11-19 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692014#comment-16692014
 ] 

Nandor Kollar edited comment on PARQUET-1407 at 11/19/18 5:09 PM:
--

[~rdblue] [~scottcarey] [~jackytan] I added a unit test to Ryan's fix. Could 
you please review PR #552?


was (Author: nkollar):
[~rdblue] [~scottcarey] [~jackytan] I added a unit test to Ryan's fix. Could 
you please review PR #552

> Data loss on duplicate values with AvroParquetWriter/Reader
> ---
>
> Key: PARQUET-1407
> URL: https://issues.apache.org/jira/browse/PARQUET-1407
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.9.0, 1.10.0, 1.8.3
>Reporter: Scott Carey
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> {code:java}
> public class Blah {
>   private static Path parquetFile = new Path("oops");
>   private static Schema schema = SchemaBuilder.record("spark_schema")
>   .fields().optionalBytes("value").endRecord();
>   private static GenericData.Record recordFor(String value) {
> return new GenericRecordBuilder(schema)
> .set("value", value.getBytes()).build();
>   }
>   public static void main(String ... args) throws IOException {
> try (ParquetWriter writer = AvroParquetWriter
>   .builder(parquetFile)
>   .withSchema(schema)
>   .build()) {
>   writer.write(recordFor("one"));
>   writer.write(recordFor("two"));
>   writer.write(recordFor("three"));
>   writer.write(recordFor("three"));
>   writer.write(recordFor("two"));
>   writer.write(recordFor("one"));
>   writer.write(recordFor("zero"));
> }
> try (ParquetReader reader = AvroParquetReader
> .builder(parquetFile)
> .withConf(new Configuration()).build()) {
>   GenericRecord rec;
>   int i = 0;
>   while ((rec = reader.read()) != null) {
> ByteBuffer buf = (ByteBuffer) rec.get("value");
> byte[] bytes = new byte[buf.remaining()];
> buf.get(bytes);
> System.out.println("rec " + i++ + ": " + new String(bytes));
>   }
> }
>   }
> }
> {code}
> Expected output:
> {noformat}
> rec 0: one
> rec 1: two
> rec 2: three
> rec 3: three
> rec 4: two
> rec 5: one
> rec 6: zero{noformat}
> Actual:
> {noformat}
> rec 0: one
> rec 1: two
> rec 2: three
> rec 3: 
> rec 4: 
> rec 5: 
> rec 6: zero{noformat}
>  
> This was found when we started getting empty byte[] values back in Spark 
> unexpectedly (Spark 2.3.1 and Parquet 1.8.3). I have not tried to 
> reproduce with Parquet 1.9.0, but it's a bad enough bug that I would like a 
> 1.8.4 release that I can drop in to replace 1.8.3 without any binary 
> compatibility issues.
>  Duplicate byte[] values are lost.
>  
> A few clues: 
> If I do not call ByteBuffer.get, the value of ByteBuffer.remaining() does not 
> go to zero.  I suspect a ByteBuffer is being recycled, but the call to 
> ByteBuffer.get mutates it.  I wonder if an appropriately placed 
> ByteBuffer.duplicate() would fix it.
>  
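The recycling hypothesis above can be demonstrated in isolation. The following is a minimal, self-contained sketch (not parquet-avro's actual code) of why a shared ByteBuffer yields empty duplicates, and why a well-placed duplicate() would help:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferRecycleDemo {
    public static void main(String[] args) {
        // Stands in for the suspected recycled buffer (illustrative only).
        ByteBuffer shared = ByteBuffer.wrap("three".getBytes(StandardCharsets.UTF_8));

        // First read drains the buffer: get() advances position to the limit.
        byte[] first = new byte[shared.remaining()];
        shared.get(first);
        System.out.println("first=" + new String(first, StandardCharsets.UTF_8)
                + " remaining=" + shared.remaining());

        // A second read of the SAME buffer object now sees zero bytes left,
        // matching the empty duplicate values in the bug report.
        byte[] second = new byte[shared.remaining()];
        shared.get(second);
        System.out.println("second=" + new String(second, StandardCharsets.UTF_8)
                + " remaining=" + shared.remaining());

        // duplicate() creates an independent position/limit over the same bytes,
        // so consuming the duplicate does not mutate the original.
        ByteBuffer original = ByteBuffer.wrap("three".getBytes(StandardCharsets.UTF_8));
        ByteBuffer dup = original.duplicate();
        byte[] copy = new byte[dup.remaining()];
        dup.get(copy);
        System.out.println("viaDuplicate=" + new String(copy, StandardCharsets.UTF_8)
                + " originalRemaining=" + original.remaining());
    }
}
```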





[jira] [Updated] (PARQUET-1451) Deprecate old logical types API

2018-11-06 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1451:
---
Fix Version/s: (was: 1.11.0)

> Deprecate old logical types API
> ---
>
> Key: PARQUET-1451
> URL: https://issues.apache.org/jira/browse/PARQUET-1451
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Zoltan Ivanfi
>    Assignee: Nandor Kollar
>Priority: Major
>
> Now that the new logical types API is ready, we should deprecate the old one 
> because new types will not support it (in fact, nano precision has already 
> been added without support in the old API).





[jira] [Updated] (PARQUET-1309) Parquet Java uses incorrect stats and dictionary filter properties

2018-11-06 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1309:
---
Fix Version/s: (was: 1.10.1)

> Parquet Java uses incorrect stats and dictionary filter properties
> --
>
> Key: PARQUET-1309
> URL: https://issues.apache.org/jira/browse/PARQUET-1309
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Ryan Blue
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.11.0
>
>
> In SPARK-24251, we found that the changes to use HadoopReadOptions 
> accidentally switched the [properties that enable stats and dictionary 
> filters|https://github.com/apache/parquet-mr/blob/8bbc6cb95fd9b4b9e86c924ca1e40fd555ecac1d/parquet-hadoop/src/main/java/org/apache/parquet/HadoopReadOptions.java#L83].
>  Both are enabled by default so it is unlikely that anyone will need to turn 
> them off and there is an easy work-around, but we should fix the properties 
> for 1.10.1. This doesn't affect the 1.8.x or 1.9.x releases (Spark 2.3.x is 
> on 1.8.x).
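The class of bug described above can be sketched in a self-contained way. The property keys below are the ones parquet-mr uses, but the Hadoop Configuration is replaced by a plain Map and the swapped lookups are illustrative, not the actual HadoopReadOptions code:

```java
import java.util.HashMap;
import java.util.Map;

public class SwappedFilterProps {
    // Simulates the mix-up: each feature reads the OTHER feature's key.
    static boolean buggyStatsEnabled(Map<String, String> conf) {
        return Boolean.parseBoolean(
                conf.getOrDefault("parquet.filter.dictionary.enabled", "true")); // wrong key
    }

    static boolean buggyDictionaryEnabled(Map<String, String> conf) {
        return Boolean.parseBoolean(
                conf.getOrDefault("parquet.filter.stats.enabled", "true")); // wrong key
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // The user disables stats filtering...
        conf.put("parquet.filter.stats.enabled", "false");
        // ...but the swapped lookups disable dictionary filtering instead.
        System.out.println("stats=" + buggyStatsEnabled(conf)
                + " dictionary=" + buggyDictionaryEnabled(conf));
    }
}
```

Because both features default to enabled, the swap goes unnoticed until someone tries to turn one of them off, which is why the impact was considered low.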





[jira] [Updated] (PARQUET-1325) High-level flexible and fine-grained column level access control through encryption with pluggable key access

2018-11-06 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1325:
---
Fix Version/s: (was: 1.10.1)

> High-level flexible and fine-grained column level access control through 
> encryption with pluggable key access
> -
>
> Key: PARQUET-1325
> URL: https://issues.apache.org/jira/browse/PARQUET-1325
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Xinli Shang
>Priority: Major
>  Labels: None
>   Original Estimate: 1,440h
>  Remaining Estimate: 1,440h
>
> This JIRA is an extension to the Parquet Modular Encryption JIRA 
> (PARQUET-1178), which provides the basic building blocks and APIs for the 
> encryption support. On top of PARQUET-1178, this feature will create a high-level layer 
> that enables fine-grained and flexible column level access control, with 
> pluggable key access module, without a need to use the low level encryption 
> APIs. Also this feature will enable seamless integration with existing 
> clients.
> A detailed design doc will follow soon.





[jira] [Updated] (PARQUET-1341) Null count is suppressed when columns have no min or max and use unsigned sort order

2018-11-06 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1341:
---
Fix Version/s: (was: 1.10.1)

> Null count is suppressed when columns have no min or max and use unsigned 
> sort order
> 
>
> Key: PARQUET-1341
> URL: https://issues.apache.org/jira/browse/PARQUET-1341
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>






[jira] [Updated] (PARQUET-1435) Benchmark filtering column-indexes

2018-11-06 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1435:
---
Fix Version/s: (was: 1.11.0)

> Benchmark filtering column-indexes
> --
>
> Key: PARQUET-1435
> URL: https://issues.apache.org/jira/browse/PARQUET-1435
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Benchmark the improvements or drawbacks of filtering with and without using 
> column-indexes. We shall also benchmark the overhead of reading and using the 
> column-indexes in case they are not useful (e.g. completely randomized data).





[jira] [Updated] (PARQUET-1292) Add constructors to ProtoParquetWriter to write specs compliant Parquet

2018-11-06 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1292:
---
Fix Version/s: (was: 1.11.0)

> Add constructors to ProtoParquetWriter to write specs compliant Parquet
> ---
>
> Key: PARQUET-1292
> URL: https://issues.apache.org/jira/browse/PARQUET-1292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Kunal Chawla
>Assignee: Kunal Chawla
>Priority: Minor
>






Re: When the next release with page index release?

2018-11-05 Thread Nandor Kollar
Hi Aron Tao,

We plan to have a release with page indexes and a bunch of new features
and bug fixes soon. Can't give an ETA yet, but we'll discuss it on
the next Parquet sync tomorrow.

Regards,
Nandor
On Thu, Nov 1, 2018 at 3:53 PM Tao JiaTao  wrote:
>
>
> Hi
>
> Recently I’m reading Parquet’s page index in branch master, and it looks 
> good. I notice it has been a while since the previous release, so I’m 
> wondering when the next release will be?
>
> --
> Regards!
> Aron Tao
>


-- 
Currently I'm out of office from 22nd October until 4th November. If
you have any urgent issue related to Avro/Parquet/File Formats, please
send an email to fileformats-develop...@cloudera.com. With Pig related
questions/problems, contact with eng-...@cloudera.com


[jira] [Updated] (PARQUET-1419) Enable old readers to access unencrypted columns in files with plaintext footer

2018-10-18 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1419:
---
Fix Version/s: encryption-feature-branch

> Enable old readers to access unencrypted columns in files with plaintext 
> footer
> ---
>
> Key: PARQUET-1419
> URL: https://issues.apache.org/jira/browse/PARQUET-1419
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format, parquet-mr
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: encryption-feature-branch
>
>






[jira] [Updated] (PARQUET-1232) Document the modular encryption in parquet-format

2018-10-18 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1232:
---
Fix Version/s: encryption-feature-branch

> Document the modular encryption in parquet-format
> -
>
> Key: PARQUET-1232
> URL: https://issues.apache.org/jira/browse/PARQUET-1232
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: encryption-feature-branch
>
>
> Create Encryption.md from the design googledoc.





[jira] [Updated] (PARQUET-1401) RowGroup offset and total compressed size fields

2018-10-18 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1401:
---
Fix Version/s: encryption-feature-branch

> RowGroup offset and total compressed size fields
> 
>
> Key: PARQUET-1401
> URL: https://issues.apache.org/jira/browse/PARQUET-1401
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: encryption-feature-branch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Spark uses the filterFileMetaData* methods in the ParquetMetadataConverter 
> class that calculate the offset and total compressed size of a RowGroup's data.
> The offset calculation is done by extracting the ColumnMetaData of the first 
> column, and using its offset fields.
> The total compressed size calculation is done by running a loop over all 
> column chunks in the RowGroup, and summing up the size values from each 
> chunk's ColumnMetaData.
> If one or more columns are hidden (encrypted with a key unavailable to the 
> reader), these calculations can't be performed, because the column metadata 
> is protected. 
>  
> But: these calculations don't really need the individual column values. The 
> results pertain to the whole RowGroup, not specific columns. 
> Therefore, we will define two new optional fields in the RowGroup Thrift 
> structure:
>  
> _optional i64 file_offset_
> _optional i64 total_compressed_size_
>  
> and calculate/set them upon file writing. Then, Spark will be able to query a 
> file with hidden columns (of course, only if the query itself doesn't need 
> the hidden columns - works with a masked version of them, or reads columns 
> with available keys).
>  
> These values can be set only for encrypted files (or for all files, to skip 
> the loop upon reading). I've tested this, works fine in Spark writers and 
> readers.
>  
> I've also checked other references to ColumnMetaData fields in parquet-mr. 
> There are none; therefore, it's the only change we need in parquet.thrift to 
> handle hidden columns.
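The proposal above amounts to two optional fields on the RowGroup Thrift struct. A sketch of how this could look in parquet.thrift follows; the field IDs and the abbreviated surrounding fields are illustrative, not the authoritative definition:

```thrift
struct RowGroup {
  1: required list<ColumnChunk> columns
  2: required i64 total_byte_size
  3: required i64 num_rows
  4: optional list<SortingColumn> sorting_columns

  /** Byte offset of the first page of the first column chunk in this row
   *  group, set by the writer so readers need not open any (possibly
   *  encrypted) ColumnMetaData. */
  5: optional i64 file_offset

  /** Sum of the compressed sizes of all column chunks, replacing the
   *  per-chunk summation loop described above. */
  6: optional i64 total_compressed_size
}
```

Since both fields are optional, old readers simply ignore them, and writers for unencrypted files may set them anyway to skip the summation loop on read.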





[jira] [Assigned] (PARQUET-1331) Use new logical types API in parquet-mr

2018-10-15 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1331:
--

Assignee: (was: Nandor Kollar)

> Use new logical types API in parquet-mr
> ---
>
> Key: PARQUET-1331
> URL: https://issues.apache.org/jira/browse/PARQUET-1331
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>    Reporter: Nandor Kollar
>Priority: Minor
>
> PARQUET-1253 introduced a new API for logical types, making OriginalTypes 
> deprecated. Parquet-mr makes several decisions based on OriginalTypes; this 
> logic should be replaced with the new logical types API.





[jira] [Resolved] (PARQUET-1331) Use new logical types API in parquet-mr

2018-10-15 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1331.

Resolution: Duplicate

> Use new logical types API in parquet-mr
> ---
>
> Key: PARQUET-1331
> URL: https://issues.apache.org/jira/browse/PARQUET-1331
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>    Reporter: Nandor Kollar
>Priority: Minor
>
> PARQUET-1253 introduced a new API for logical types, making OriginalTypes 
> deprecated. Parquet-mr makes several decisions based on OriginalTypes; this 
> logic should be replaced with the new logical types API.





[jira] [Updated] (PARQUET-1433) Parquet-format doesn't compile with Thrift 0.10.0

2018-10-15 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1433:
---
Fix Version/s: format-2.7.0

> Parquet-format doesn't compile with Thrift 0.10.0
> -
>
> Key: PARQUET-1433
> URL: https://issues.apache.org/jira/browse/PARQUET-1433
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>    Reporter: Nandor Kollar
>    Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.7.0
>
>
> Compilation of parquet-format fails with Thrift 0.10.0:
> [ERROR] thrift failed error: [FAILURE:generation:1] Error: unknown
> option java:hashcode





[jira] [Assigned] (PARQUET-1436) TimestampMicrosStringifier shows wrong microseconds for timestamps before 1970

2018-10-15 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1436:
--

Assignee: Nandor Kollar

> TimestampMicrosStringifier shows wrong microseconds for timestamps before 1970
> --
>
> Key: PARQUET-1436
> URL: https://issues.apache.org/jira/browse/PARQUET-1436
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Reporter: Zoltan Ivanfi
>    Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> testTimestampMicrosStringifier takes the timestamp 1848-03-15T09:23:59.765 
> and subtracts 1 microsecond from it. The result (both expected and actual) 
> is 1848-03-15T09:23:59.765001, but it should be 1848-03-15T09:23:59.764999 
> instead.
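A symptom like this is characteristic of truncating division on negative epoch values: splitting microseconds into seconds plus a fraction with `/` and `%` rounds toward zero for pre-1970 values, putting the sub-second digits on the wrong side of the second. A self-contained sketch of the pitfall (illustrative, not the actual stringifier code):

```java
public class NegativeMicrosDemo {
    public static void main(String[] args) {
        long micros = -1_001L; // 1001 microseconds before the epoch

        // Truncating division rounds toward zero for negative inputs,
        // producing a fraction on the wrong side of the second boundary.
        long badSeconds = micros / 1_000_000;   // 0
        long badFraction = micros % 1_000_000;  // -1001

        // Floor-based splitting keeps the fraction in [0, 1_000_000):
        long seconds = Math.floorDiv(micros, 1_000_000);   // -1
        long fraction = Math.floorMod(micros, 1_000_000);  // 998999

        System.out.println("bad: " + badSeconds + "s " + badFraction + "us");
        System.out.println("good: " + seconds + "s " + fraction + "us");
    }
}
```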





[jira] [Updated] (PARQUET-1383) Parquet tools should indicate UTC parameter for time/timestamp types

2018-10-11 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1383:
---
Fix Version/s: 1.11.0

> Parquet tools should indicate UTC parameter for time/timestamp types
> 
>
> Key: PARQUET-1383
> URL: https://issues.apache.org/jira/browse/PARQUET-1383
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>    Reporter: Nandor Kollar
>    Assignee: Nandor Kollar
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Parquet-tools should indicate whether a time/timestamp is UTC-adjusted or 
> timezone-agnostic; the values written by the tools should take the UTC 
> normalized parameter into account. Right now, every time and timestamp value 
> is adjusted to UTC when printed via parquet-tools.
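The distinction the tools should surface can be sketched with java.time (the stored value and the rendering choices are illustrative, not parquet-tools' actual code): an isAdjustedToUTC=true timestamp denotes an instant and may carry a UTC marker, while a timezone-agnostic one should be printed as a plain local date-time with no zone conversion or suffix.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class TimestampRendering {
    public static void main(String[] args) {
        long micros = 1_536_000_000_000_000L; // a stored TIMESTAMP(MICROS) value

        long seconds = Math.floorDiv(micros, 1_000_000);
        long nanos = Math.floorMod(micros, 1_000_000) * 1_000;

        // isAdjustedToUTC = true: the value is an instant; a "Z" marker is correct.
        System.out.println(Instant.ofEpochSecond(seconds, nanos));

        // isAdjustedToUTC = false: timezone-agnostic semantics; print the raw
        // local date-time with no zone suffix and no conversion.
        System.out.println(LocalDateTime.ofEpochSecond(seconds, (int) nanos, ZoneOffset.UTC));
    }
}
```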





[jira] [Created] (PARQUET-1434) Release parquet-mr 1.11.0

2018-10-03 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1434:
--

 Summary: Release parquet-mr 1.11.0
 Key: PARQUET-1434
 URL: https://issues.apache.org/jira/browse/PARQUET-1434
 Project: Parquet
  Issue Type: Task
  Components: parquet-mr
Reporter: Nandor Kollar








[jira] [Updated] (PARQUET-1388) Nanosecond precision time and timestamp - parquet-mr

2018-10-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1388:
---
Fix Version/s: 1.11.0

> Nanosecond precision time and timestamp - parquet-mr
> 
>
> Key: PARQUET-1388
> URL: https://issues.apache.org/jira/browse/PARQUET-1388
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>    Reporter: Nandor Kollar
>    Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>






[jira] [Updated] (PARQUET-1201) Column indexes

2018-10-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1201:
---
Fix Version/s: 1.11.0

> Column indexes
> --
>
> Key: PARQUET-1201
> URL: https://issues.apache.org/jira/browse/PARQUET-1201
> Project: Parquet
>  Issue Type: New Feature
>Affects Versions: 1.10.0
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.5.0, 1.11.0
>
>
> Write the column indexes described in PARQUET-922.
>  This is the first phase of implementing the whole feature. The 
> implementation is done in the following steps:
>  * Utility to read/write indexes in parquet-format
>  * Writing indexes in the parquet file
>  * Extend parquet-tools and parquet-cli to show the indexes
>  * Limit index size based on parquet properties
>  * Trim min/max values where possible based on parquet properties
>  * Filtering based on column indexes
> The work is done on the feature branch {{column-indexes}}. This JIRA will be 
> resolved after the branch has been merged to {{master}}.
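The filtering step above relies on per-page min/max statistics. A self-contained sketch of the idea (class and method names are illustrative, not parquet-mr's actual column-index API):

```java
import java.util.ArrayList;
import java.util.List;

public class PagePruningDemo {
    // Illustrative per-page min/max, like the entries a column index stores.
    static final class PageStats {
        final int page; final long min; final long max;
        PageStats(int page, long min, long max) {
            this.page = page; this.min = min; this.max = max;
        }
    }

    // For an equality predicate, only pages whose [min, max] range can
    // contain the value need to be read; the rest are skipped entirely.
    static List<Integer> candidatePages(List<PageStats> index, long value) {
        List<Integer> keep = new ArrayList<>();
        for (PageStats p : index) {
            if (value >= p.min && value <= p.max) keep.add(p.page);
        }
        return keep;
    }

    public static void main(String[] args) {
        List<PageStats> index = List.of(
                new PageStats(0, 1, 100),
                new PageStats(1, 101, 200),
                new PageStats(2, 201, 300));
        // Only page 1 can contain the value 150.
        System.out.println(candidatePages(index, 150));
    }
}
```

This is also why the benchmark mentioned elsewhere in this thread matters: on completely randomized data every page's range may overlap the predicate, so the index is read but prunes nothing.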





[jira] [Updated] (PARQUET-1368) ParquetFileReader should close its input stream for the failure in constructor

2018-10-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1368:
---
Fix Version/s: 1.11.0

> ParquetFileReader should close its input stream for the failure in constructor
> --
>
> Key: PARQUET-1368
> URL: https://issues.apache.org/jira/browse/PARQUET-1368
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> I was trying to replace the deprecated {{readFooter}} usage with 
> {{ParquetFileReader.open}} according to the note:
> {code}
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:368:
>  method readFooter in object ParquetFileReader is deprecated: see 
> corresponding Javadoc for more information.
> [warn] ParquetFileReader.readFooter(sharedConf, filePath, 
> SKIP_ROW_GROUPS).getFileMetaData
> [warn]   ^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:545:
>  method readFooter in object ParquetFileReader is deprecated: see 
> corresponding Javadoc for more information.
> [warn] ParquetFileReader.readFooter(
> [warn]   ^
> {code}
> Then, I realised some test suites report resource leaks:
> {code}
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
>   at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:687)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:595)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.createParquetReader(ParquetUtils.scala:67)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.readFooter(ParquetUtils.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:544)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:539)
>   at 
> scala.collection.parallel.AugmentedIterableIterator$class.flatmap2combiner(RemainsIterator.scala:132)
>   at 
> scala.collection.parallel.immutable.ParVector$ParVectorIterator.flatmap2combiner(ParVector.scala:62)
>   at 
> scala.collection.parallel.ParIterableLike$FlatMap.leaf(ParIterableLike.scala:1072)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
>   at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
>   at 
> scala.collection.parallel.ParIterableLike$FlatMap.tryLeaf(ParIterableLike.scala:1068)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:159)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:443)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:149)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
>   at 
> scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at scala.concurrent.forkjoin.ForkJoinTask.doJoin(ForkJoinTask.java:341)
>   at scala.concurrent.forkjoin.ForkJoinTask.join(ForkJoinTask.java:673)
>   at 
> scala.collection.parallel.ForkJoinTasks$WrappedTask$class.sync(Tasks.scala:378)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.sync(Tasks.scala:443)
>   at 
> scala.collection.parallel.ForkJoinTasks$class.executeAndWaitResult(Tasks.scala:426)
>   at 
> scala.collectio

[jira] [Resolved] (PARQUET-1428) Move columnar encryption into its feature branch

2018-10-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1428.

Resolution: Fixed

> Move columnar encryption into its feature branch
> 
>
> Key: PARQUET-1428
> URL: https://issues.apache.org/jira/browse/PARQUET-1428
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>    Reporter: Nandor Kollar
>    Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Assigned] (PARQUET-1428) Move columnar encryption into its feature branch

2018-10-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1428:
--

Assignee: Nandor Kollar

> Move columnar encryption into its feature branch
> 
>
> Key: PARQUET-1428
> URL: https://issues.apache.org/jira/browse/PARQUET-1428
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>    Reporter: Nandor Kollar
>    Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Resolved] (PARQUET-1424) Release parquet-format 2.6.0

2018-10-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1424.

Resolution: Fixed

> Release parquet-format 2.6.0
> 
>
> Key: PARQUET-1424
> URL: https://issues.apache.org/jira/browse/PARQUET-1424
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>    Reporter: Nandor Kollar
>    Assignee: Nandor Kollar
>Priority: Major
>
> Release parquet-format 2.6.0
> The release requires reverting the merged PRs related to columnar 
> encryption, since there's no signed spec yet. Those PRs should be developed 
> on a feature branch instead and merged to master once the spec is signed and 
> the format changes are ready to be released.





[jira] [Assigned] (PARQUET-1424) Release parquet-format 2.6.0

2018-10-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1424:
--

Assignee: Nandor Kollar

> Release parquet-format 2.6.0
> 
>
> Key: PARQUET-1424
> URL: https://issues.apache.org/jira/browse/PARQUET-1424
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>    Reporter: Nandor Kollar
>    Assignee: Nandor Kollar
>Priority: Major
>
> Release parquet-format 2.6.0
> The release requires reverting the merged PRs related to columnar 
> encryption, since there's no signed spec yet. Those PRs should be developed 
> on a feature branch instead and merged to master once the spec is signed and 
> the format changes are ready to be released.





[jira] [Updated] (PARQUET-1399) Move parquet-mr related code from parquet-format

2018-10-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1399:
---
Fix Version/s: 1.11.0

> Move parquet-mr related code from parquet-format
> 
>
> Key: PARQUET-1399
> URL: https://issues.apache.org/jira/browse/PARQUET-1399
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> There are java classes in the 
> [parquet-format|https://github.com/apache/parquet-format] repo that shall be 
> in the [parquet-mr|https://github.com/apache/parquet-mr] repo instead: [java 
> classes|https://github.com/apache/parquet-format/tree/master/src/main] and 
> [test classes|https://github.com/apache/parquet-format/tree/master/src/test]
> The idea is to create a separate module in 
> [parquet-mr|https://github.com/apache/parquet-mr] and depend on it instead of 
> depending on [parquet-format|https://github.com/apache/parquet-format]. Only 
> this separate module would depend on 
> [parquet-format|https://github.com/apache/parquet-format] directly.





Re: [VOTE] Release Apache Parquet format 2.6.0 RC0

2018-10-01 Thread Nandor Kollar
Hi All,

The vote for this parquet-format release has passed with
3   "+1" votes (and 1 non-binding)
0   "0" votes
0   "-1" votes

With 3 binding “+1” votes this vote PASSES. We’ll release the
artifacts and send an announcement soon.

Regards,
Nandor
On Sun, Sep 30, 2018 at 11:28 PM Ryan Blue  wrote:
>
> +1 (binding)
>
> On Sat, Sep 29, 2018 at 2:11 AM Wes McKinney  wrote:
>
> > +1 (binding)
> >
> > * Checked checksums, signature
> > * Ran unit tests
> >
> > Note that `mvn test` fails if Apache Thrift 0.10.0 or higher is
> > installed. It looks like this is a problem with the Maven Thrift
> > plugin and not a problem with parquet-format, but definitely a rough
> > edge that will affect users
> >
> > [ERROR] thrift failed output:
> >
> > [WARNING:/home/wesm/Downloads/apache-parquet-format-2.6.0/src/main/thrift/parquet.thrift:295]
> > The "byte" type is a compatibility alias for "i8". Use "i8" to
> > emphasize the signedness of this type.
> >
> > [ERROR] thrift failed error: [FAILURE:generation:1] Error: unknown
> > option java:hashcode
> >
> > - Wes
> > On Fri, Sep 28, 2018 at 2:52 AM Gabor Szadovszky
> >  wrote:
> > >
> > > +1 (non-binding)
> > >
> > > - Checked source tarball content
> > > - Checked checksums, signature
> > >
> > > Cheers,
> > > Gabor
> > >
> > > On Thu, Sep 27, 2018 at 5:10 PM Zoltan Ivanfi 
> > > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > - contents look good
> > > > - units tests pass
> > > > - checksums match
> > > > - signature matches
> > > >
> > > > Thanks,
> > > >
> > > > Zoltan
> > > >
> > > > On Thu, Sep 27, 2018 at 5:02 PM Nandor Kollar
> >  > > > >
> > > > wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I propose the following RC to be released as official Apache Parquet
> > > > > Format 2.6.0 release.
> > > > >
> > > > > The commit id is df6132b94f273521a418a74442085fdd5a0aa009
> > > > > * This corresponds to the tag: apache-parquet-format-2.6.0
> > > > > *
> > > > >
> > > >
> > https://github.com/apache/parquet-format/tree/df6132b94f273521a418a74442085fdd5a0aa009
> > > > > *
> > > > >
> > > >
> > https://gitbox.apache.org/repos/asf?p=parquet-format.git;a=commit;h=df6132b94f273521a418a74442085fdd5a0aa009
> > > > >
> > > > > The release tarball, signature, and checksums are here:
> > > > > *
> > > > >
> > > >
> > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-format-2.6.0-rc0
> > > > >
> > > > > You can find the KEYS file here:
> > > > > * https://dist.apache.org/repos/dist/dev/parquet/KEYS
> > > > >
> > > > > Binary artifacts are staged in Nexus here:
> > > > > *
> > > > >
> > > >
> > https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet-format/2.6.0
> > > > >
> > > > > This release includes following changes:
> > > > >
> > > > > PARQUET-1266 - LogicalTypes union in parquet-format doesn't include
> > UUID
> > > > > PARQUET-1290 - Clarify maximum run lengths for RLE encoding
> > > > > PARQUET-1387 - Nanosecond precision time and timestamp -
> > parquet-format
> > > > > PARQUET-1400 - Deprecate parquet-mr related code in parquet-format
> > > > > PARQUET-1429 - Turn off DocLint on parquet-format
> > > > >
> > > > > Please download, verify, and test.
> > > > >
> > > > > The voting will be open at least for 72 hour from now.
> > > > >
> > > > > [ ] +1 Release this as Apache Parquet Format 2.6.0
> > > > > [ ] +0
> > > > > [ ] -1 Do not release this because...
> > > > >
> > > > > Thanks,
> > > > > Nandor
> > > > >
> > > >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix


[jira] [Assigned] (PARQUET-1433) Parquet-format doesn't compile with Thrift 0.10.0

2018-10-01 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1433:
--

Assignee: Nandor Kollar

> Parquet-format doesn't compile with Thrift 0.10.0
> -
>
> Key: PARQUET-1433
> URL: https://issues.apache.org/jira/browse/PARQUET-1433
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>    Reporter: Nandor Kollar
>    Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
>
> Compilation of parquet-format fails with Thrift 0.10.0:
> [ERROR] thrift failed error: [FAILURE:generation:1] Error: unknown
> option java:hashcode



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Release Apache Parquet format 2.6.0 RC0

2018-10-01 Thread Nandor Kollar
Wes, I created a Jira for the failure with Thrift 0.10.0:
PARQUET-1433. We should address it in the upcoming format release.
On Sat, Sep 29, 2018 at 11:11 AM Wes McKinney  wrote:
>
> +1 (binding)
>
> * Checked checksums, signature
> * Ran unit tests
>
> Note that `mvn test` fails if Apache Thrift 0.10.0 or higher is
> installed. It looks like this is a problem with the Maven Thrift
> plugin and not a problem with parquet-format, but definitely a rough
> edge that will affect users
>
> [ERROR] thrift failed output:
> [WARNING:/home/wesm/Downloads/apache-parquet-format-2.6.0/src/main/thrift/parquet.thrift:295]
> The "byte" type is a compatibility alias for "i8". Use "i8" to
> emphasize the signedness of this type.
>
> [ERROR] thrift failed error: [FAILURE:generation:1] Error: unknown
> option java:hashcode
>
> - Wes
> On Fri, Sep 28, 2018 at 2:52 AM Gabor Szadovszky
>  wrote:
> >
> > +1 (non-binding)
> >
> > - Checked source tarball content
> > - Checked checksums, signature
> >
> > Cheers,
> > Gabor
> >
> > On Thu, Sep 27, 2018 at 5:10 PM Zoltan Ivanfi 
> > wrote:
> >
> > > +1 (binding)
> > >
> > > - contents look good
> > > - units tests pass
> > > - checksums match
> > > - signature matches
> > >
> > > Thanks,
> > >
> > > Zoltan
> > >
> > > On Thu, Sep 27, 2018 at 5:02 PM Nandor Kollar 
> > >  > > >
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I propose the following RC to be released as official Apache Parquet
> > > > Format 2.6.0 release.
> > > >
> > > > The commit id is df6132b94f273521a418a74442085fdd5a0aa009
> > > > * This corresponds to the tag: apache-parquet-format-2.6.0
> > > > *
> > > >
> > > https://github.com/apache/parquet-format/tree/df6132b94f273521a418a74442085fdd5a0aa009
> > > > *
> > > >
> > > https://gitbox.apache.org/repos/asf?p=parquet-format.git;a=commit;h=df6132b94f273521a418a74442085fdd5a0aa009
> > > >
> > > > The release tarball, signature, and checksums are here:
> > > > *
> > > >
> > > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-format-2.6.0-rc0
> > > >
> > > > You can find the KEYS file here:
> > > > * https://dist.apache.org/repos/dist/dev/parquet/KEYS
> > > >
> > > > Binary artifacts are staged in Nexus here:
> > > > *
> > > >
> > > https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet-format/2.6.0
> > > >
> > > > This release includes the following changes:
> > > >
> > > > PARQUET-1266 - LogicalTypes union in parquet-format doesn't include UUID
> > > > PARQUET-1290 - Clarify maximum run lengths for RLE encoding
> > > > PARQUET-1387 - Nanosecond precision time and timestamp - parquet-format
> > > > PARQUET-1400 - Deprecate parquet-mr related code in parquet-format
> > > > PARQUET-1429 - Turn off DocLint on parquet-format
> > > >
> > > > Please download, verify, and test.
> > > >
> > > > The voting will be open for at least 72 hours from now.
> > > >
> > > > [ ] +1 Release this as Apache Parquet Format 2.6.0
> > > > [ ] +0
> > > > [ ] -1 Do not release this because...
> > > >
> > > > Thanks,
> > > > Nandor
> > > >
> > >


[jira] [Created] (PARQUET-1433) Parquet-format doesn't compile with Thrift 0.10.0

2018-10-01 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1433:
--

 Summary: Parquet-format doesn't compile with Thrift 0.10.0
 Key: PARQUET-1433
 URL: https://issues.apache.org/jira/browse/PARQUET-1433
 Project: Parquet
  Issue Type: Task
  Components: parquet-format
Reporter: Nandor Kollar


Compilation of parquet-format fails with Thrift 0.10.0:

[ERROR] thrift failed error: [FAILURE:generation:1] Error: unknown
option java:hashcode
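For context, Thrift 0.10.0 removed the `java:hashcode` generator option (hashCode generation became unconditional), so a build that still passes it fails with the error above. A hedged sketch of the kind of maven-thrift-plugin configuration change involved; the actual plugin coordinates and the eventual fix in parquet-format may differ:

```xml
<!-- Hypothetical pom.xml fragment: the "java:hashcode" generator option was
     removed in Thrift 0.10.0, where hashCode generation is always on. -->
<plugin>
  <groupId>org.apache.thrift.tools</groupId>
  <artifactId>maven-thrift-plugin</artifactId>
  <configuration>
    <!-- pre-0.10.0 compilers needed: <generator>java:hashcode</generator> -->
    <generator>java</generator>
  </configuration>
</plugin>
```

Note this trades off compatibility: a POM hard-coded to either generator string will break with the opposite range of Thrift compiler versions.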





[VOTE] Release Apache Parquet format 2.6.0 RC0

2018-09-27 Thread Nandor Kollar
Hi everyone,

I propose the following RC to be released as official Apache Parquet
Format 2.6.0 release.

The commit id is df6132b94f273521a418a74442085fdd5a0aa009
* This corresponds to the tag: apache-parquet-format-2.6.0
* 
https://github.com/apache/parquet-format/tree/df6132b94f273521a418a74442085fdd5a0aa009
* 
https://gitbox.apache.org/repos/asf?p=parquet-format.git;a=commit;h=df6132b94f273521a418a74442085fdd5a0aa009

The release tarball, signature, and checksums are here:
* https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-format-2.6.0-rc0

You can find the KEYS file here:
* https://dist.apache.org/repos/dist/dev/parquet/KEYS

Binary artifacts are staged in Nexus here:
* 
https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet-format/2.6.0

This release includes the following changes:

PARQUET-1266 - LogicalTypes union in parquet-format doesn't include UUID
PARQUET-1290 - Clarify maximum run lengths for RLE encoding
PARQUET-1387 - Nanosecond precision time and timestamp - parquet-format
PARQUET-1400 - Deprecate parquet-mr related code in parquet-format
PARQUET-1429 - Turn off DocLint on parquet-format

Please download, verify, and test.

The voting will be open for at least 72 hours from now.

[ ] +1 Release this as Apache Parquet Format 2.6.0
[ ] +0
[ ] -1 Do not release this because...

Thanks,
Nandor


[jira] [Resolved] (PARQUET-1429) Turn off DocLint on parquet-format

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1429.

Resolution: Fixed

> Turn off DocLint on parquet-format
> --
>
> Key: PARQUET-1429
> URL: https://issues.apache.org/jira/browse/PARQUET-1429
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>    Reporter: Nandor Kollar
>    Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.6.0
>
>
> DocLint was introduced in Java 8, and since the generated code in 
> parquet-format has several issues found by DocLint, the attach-javadocs goal 
> will fail.





[jira] [Created] (PARQUET-1429) Turn off DocLint on parquet-format

2018-09-27 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1429:
--

 Summary: Turn off DocLint on parquet-format
 Key: PARQUET-1429
 URL: https://issues.apache.org/jira/browse/PARQUET-1429
 Project: Parquet
  Issue Type: Task
  Components: parquet-format
Reporter: Nandor Kollar
Assignee: Nandor Kollar
 Fix For: format-2.6.0


DocLint was introduced in Java 8, and since the generated code in parquet-format 
has several issues found by DocLint, the attach-javadocs goal will fail.
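The usual way to disable DocLint in a Maven build is through the maven-javadoc-plugin configuration; a hedged sketch (the exact plugin version and option used in parquet-format's POM may differ, and maven-javadoc-plugin 3.x offers a dedicated `<doclint>` element instead):

```xml
<!-- Hypothetical pom.xml fragment: silence DocLint so attach-javadocs
     succeeds despite issues in Thrift-generated sources. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-javadoc-plugin</artifactId>
  <configuration>
    <additionalparam>-Xdoclint:none</additionalparam>
  </configuration>
</plugin>
```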





[jira] [Updated] (PARQUET-1227) Thrift crypto metadata structures

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1227:
---
Fix Version/s: format-encryption-feature-branch

> Thrift crypto metadata structures
> -
>
> Key: PARQUET-1227
> URL: https://issues.apache.org/jira/browse/PARQUET-1227
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0, format-encryption-feature-branch
>
>
> New Thrift structures for Parquet modular encryption





[jira] [Updated] (PARQUET-1398) Separate iv_prefix for GCM and CTR modes

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1398:
---
Fix Version/s: format-encryption-feature-branch

> Separate iv_prefix for GCM and CTR modes
> 
>
> Key: PARQUET-1398
> URL: https://issues.apache.org/jira/browse/PARQUET-1398
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Minor
>  Labels: pull-request-available
> Fix For: format-encryption-feature-branch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is an ambiguity in what the iv_prefix applies to - GCM or CTR or both. 
> This parameter will be moved to the Algorithms structures (from the 
> FileCryptoMetaData structure).





[jira] [Updated] (PARQUET-1424) Release parquet-format 2.6.0

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1424:
---
Fix Version/s: (was: format-2.6.0)

> Release parquet-format 2.6.0
> 
>
> Key: PARQUET-1424
> URL: https://issues.apache.org/jira/browse/PARQUET-1424
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>    Reporter: Nandor Kollar
>Priority: Major
>
> Release parquet-format 2.6.0
> The release requires reverting the merged PRs related to columnar 
> encryption, since there's no signed spec yet. Those PRs should be developed 
> on a feature branch instead and merged to master once the spec is signed and 
> the format changes are ready to be released.





[jira] [Updated] (PARQUET-1227) Thrift crypto metadata structures

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1227:
---
Fix Version/s: (was: format-2.6.0)

> Thrift crypto metadata structures
> -
>
> Key: PARQUET-1227
> URL: https://issues.apache.org/jira/browse/PARQUET-1227
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> New Thrift structures for Parquet modular encryption





[jira] [Updated] (PARQUET-1398) Separate iv_prefix for GCM and CTR modes

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1398:
---
Fix Version/s: (was: format-2.6.0)

> Separate iv_prefix for GCM and CTR modes
> 
>
> Key: PARQUET-1398
> URL: https://issues.apache.org/jira/browse/PARQUET-1398
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is an ambiguity in what the iv_prefix applies to - GCM or CTR or both. 
> This parameter will be moved to the Algorithms structures (from the 
> FileCryptoMetaData structure).





[jira] [Created] (PARQUET-1428) Move columnar encryption into its feature branch

2018-09-27 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1428:
--

 Summary: Move columnar encryption into its feature branch
 Key: PARQUET-1428
 URL: https://issues.apache.org/jira/browse/PARQUET-1428
 Project: Parquet
  Issue Type: Task
  Components: parquet-format
Reporter: Nandor Kollar








[jira] [Updated] (PARQUET-1425) [Format] Fix Thrift compiler warning

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1425:
---
Fix Version/s: (was: format-2.6.0)

> [Format] Fix Thrift compiler warning
> 
>
> Key: PARQUET-1425
> URL: https://issues.apache.org/jira/browse/PARQUET-1425
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Wes McKinney
>Priority: Major
>
> I see this warning frequently
> {code}
> [1/127] Running thrift compiler on parquet.thrift
> [WARNING:/home/wesm/code/arrow/cpp/src/parquet/parquet.thrift:295] The "byte" 
> type is a compatibility alias for "i8". Use "i8" to emphasize the signedness 
> of this type.
> {code}





[jira] [Commented] (PARQUET-1425) [Format] Fix Thrift compiler warning

2018-09-26 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628950#comment-16628950
 ] 

Nandor Kollar commented on PARQUET-1425:


[~wesmckinn] would you like this to be fixed in the 2.6.0 format release, or is 
it fine for you to get this done in a later release?

Currently parquet-format depends on Thrift 0.9.3, and we also use this older 
version downstream. i8 was introduced in 0.10.0 (according to [Thrift 
Jira|https://issues.apache.org/jira/browse/THRIFT-3393]), and if we change byte 
to i8, the 0.9.3 Thrift compiler will fail.

> [Format] Fix Thrift compiler warning
> 
>
> Key: PARQUET-1425
> URL: https://issues.apache.org/jira/browse/PARQUET-1425
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: format-2.6.0
>
>
> I see this warning frequently
> {code}
> [1/127] Running thrift compiler on parquet.thrift
> [WARNING:/home/wesm/code/arrow/cpp/src/parquet/parquet.thrift:295] The "byte" 
> type is a compatibility alias for "i8". Use "i8" to emphasize the signedness 
> of this type.
> {code}





[jira] [Updated] (PARQUET-1387) Nanosecond precision time and timestamp - parquet-format

2018-09-26 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1387:
---
Fix Version/s: format-2.6.0

> Nanosecond precision time and timestamp - parquet-format
> 
>
> Key: PARQUET-1387
> URL: https://issues.apache.org/jira/browse/PARQUET-1387
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>    Reporter: Nandor Kollar
>    Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.6.0
>
>






[jira] [Updated] (PARQUET-1266) LogicalTypes union in parquet-format doesn't include UUID

2018-09-26 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1266:
---
Fix Version/s: format-2.6.0

> LogicalTypes union in parquet-format doesn't include UUID
> -
>
> Key: PARQUET-1266
> URL: https://issues.apache.org/jira/browse/PARQUET-1266
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>    Reporter: Nandor Kollar
>    Assignee: Nandor Kollar
>Priority: Minor
> Fix For: format-2.6.0
>
>
> parquet-format new logical type representation doesn't include UUID type





[jira] [Updated] (PARQUET-1290) Clarify maximum run lengths for RLE encoding

2018-09-26 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1290:
---
Fix Version/s: format-2.6.0

> Clarify maximum run lengths for RLE encoding
> 
>
> Key: PARQUET-1290
> URL: https://issues.apache.org/jira/browse/PARQUET-1290
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
> Fix For: format-2.6.0
>
>
> The Parquet spec isn't clear about what the upper bound on run lengths in the 
> RLE encoding is - 
> https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
>  .
> It sounds like in practice the major implementations don't support run 
> lengths > (2^31 - 1) - see 
> https://lists.apache.org/thread.html/6731a94a98b790ad24a9a5bb4e1bf9bb799d729e948e046efb40014f@%3Cdev.parquet.apache.org%3E
> I propose that we limit {{bit-pack-count}} and {{number of times repeated}} 
> to being <= 2^31.
> It seems unlikely that there are parquet files in existence with larger run 
> lengths, given that it requires huge numbers of values per page and major 
> implementations can't write or read such files without overflowing integers. 
> Maybe it would be possible if all the columns in a file were extremely 
> compressible, but it seems like in practice most implementations will hit 
> page or file size limits before producing a very-large run.
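The proposed cap follows from the hybrid encoding's varint run header combined with the 32-bit signed counters used by the major implementations. A minimal Python sketch of writing one RLE run, illustrative only (Encodings.md in parquet-format is authoritative; `encode_rle_run` is a hypothetical helper, not Parquet library code):

```python
def encode_rle_run(value: int, run_length: int, bit_width: int) -> bytes:
    """Sketch of one Parquet RLE run: a ULEB128 header holding
    (run_length << 1) with the low bit 0 marking an RLE (not bit-packed)
    run, followed by the repeated value in ceil(bit_width / 8) bytes."""
    if not 0 < run_length <= 2**31 - 1:
        # Major implementations keep run lengths in 32-bit signed ints,
        # hence the 2^31 - 1 cap proposed in PARQUET-1290.
        raise ValueError("run length out of range")
    header = run_length << 1  # low bit 0 => RLE run; 1 would mean bit-packed
    out = bytearray()
    while True:  # ULEB128 varint encoding of the header
        byte = header & 0x7F
        header >>= 7
        if header:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)
            break
    out += value.to_bytes((bit_width + 7) // 8, "little")
    return bytes(out)
```

For example, a run of three `1` bits at bit width 1 encodes as the two bytes `0x06 0x01`, while a run length of 2^31 is rejected by the guard.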





[jira] [Created] (PARQUET-1424) Release parquet-format 2.6.0

2018-09-25 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1424:
--

 Summary: Release parquet-format 2.6.0
 Key: PARQUET-1424
 URL: https://issues.apache.org/jira/browse/PARQUET-1424
 Project: Parquet
  Issue Type: Task
  Components: parquet-format
Reporter: Nandor Kollar
 Fix For: format-2.6.0


Release parquet-format 2.6.0

The release requires reverting the merged PRs related to columnar 
encryption, since there's no signed spec yet. Those PRs should be developed on 
a feature branch instead and merged to master once the spec is signed and the 
format changes are ready to be released.





Re: Date and time for next Parquet sync

2018-09-18 Thread Nandor Kollar
Hi All,

Since it seems that, apart from you, several other community members
can't attend the meeting tomorrow, would anyone mind if we
rescheduled it for next Tuesday at the same time?

Thanks,
Nandor

On Tue, Sep 18, 2018 at 9:51 AM, Zoltan Ivanfi  wrote:
> Hi,
>
> It seems that I won't be able to attend after all, sorry for the late
> decline.
>
> Zoltan
>
> On Mon, Sep 10, 2018 at 7:21 PM Ryan Blue  wrote:
>>
>> Sorry, looks like I was wrong on the dates. Thanks, Nandor.
>>
>> On Mon, Sep 10, 2018 at 5:15 AM Nandor Kollar 
>> wrote:
>>
>> > Ryan, I was aware of Strata, actually I wanted to schedule it to 18th
>> > September, but forgot to change 'next week' in the email. So in fact I
>> > already pushed it out one week, sorry for the confusion.
>> >
>> > Gidon, 19th is fine for me, if there's no objection against it, then
>> > we can have it then!
>> >
>> > Thanks,
>> > Nandor
>> >
>> > On Fri, Sep 7, 2018 at 9:21 PM, Ryan Blue 
>> > wrote:
>> > > We may want to push this out another week because it also conflicts
>> > > with
>> > > Strata NY. I think a few of us will be travelling Tuesday and both
>> > > Julien
>> > > and I have talks on Wednesday.
>> > >
>> > > On Fri, Sep 7, 2018 at 6:24 AM Gidon Gershinsky 
>> > wrote:
>> > >
>> > >> Hi Nandor,
>> > >>
>> > >> Can we make it Wed this time, Sept 19? Or any of Tue/Wed on another
>> > week.
>> > >> Sept 18 is the Yom Kippur eve - this basically means I won't have a
>> > >> technical ability to join a call.
>> > >>
>> > >> Regarding the Google doc vs reviewed PR + .md file - it indeed
>> > >> becomes
>> > >> difficult and unnecessary to maintain two
>> > >> versions of the same documentation. Following your last mail, there
>> > >> was a
>> > >> high volume of review
>> > >> activity at the google doc, but now the spike is winding down, I'll
>> > >> be
>> > >> removing the duplicate part from the google doc
>> > >> (keeping the samples), with new comments to go to PRs (md and code).
>> > I'll
>> > >> send a detailed mail early next week.
>> > >>
>> > >>
>> > >> Cheers, Gidon.
>> > >>
>> > >> On Fri, Sep 7, 2018 at 3:42 PM Nandor Kollar
>> > > > >> >
>> > >> wrote:
>> > >>
>> > >> > Hi All,
>> > >> >
>> > >> > I'd like to propose to have a Parquet Sync next week Tuesday
>> > >> > (September
>> > >> > 18th) at 6pm CEST / 9 am PST.
>> > >> >
>> > >> > Some of the topics which would be nice to discuss:
>> > >> > - review column indexes (PRs and feature branch)
>> > >> > - move Java code from format to mr (PR #517)
>> > >> > - Bloom filter spec
>> > >> > - columnar encryption spec (and general question, where to track
>> > >> > specs, Google doc vs reviewed PR + .md file)
>> > >> > - Refactor modules to use the new logical type API (PR under
>> > >> > review)
>> > >> > - new format release scope (nano precision timestamp, bloom filter?,
>> > >> > columnar encryption?)
>> > >> >
>> > >> > I'll send the meeting invite shortly. Feel free to propose other
>> > >> > time
>> > >> > slot if it is not suitable for you, and bring any additional topic
>> > >> > you'd like to discuss.
>> > >> >
>> > >> > Regards,
>> > >> > Nandor
>> > >> >
>> > >>
>> > >
>> > >
>> > > --
>> > > Ryan Blue
>> > > Software Engineer
>> > > Netflix
>> >
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix


[jira] [Updated] (PARQUET-1383) Parquet tools should indicate UTC parameter for time/timestamp types

2018-09-12 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1383:
---
Description: Parquet-tools should indicate if a time/timestamp is UTC 
adjusted or timezone agnostic, the values written by the tools should take UTC 
normalized parameters into account. Right now, every time and timestamp value 
is adjusted to UTC when printed via parquet-tools  (was: Currently, 
parquet-tools should print original type. Since the new logical type API is 
introduced, it would be better to print it instead of, or besides the original 
type.

Also, the values written by the tools should take UTC normalized parameters 
into account. Right now, every time and timestamp value is adjusted to UTC when 
printed via parquet-tools)

> Parquet tools should indicate UTC parameter for time/timestamp types
> 
>
> Key: PARQUET-1383
> URL: https://issues.apache.org/jira/browse/PARQUET-1383
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>    Reporter: Nandor Kollar
>    Assignee: Nandor Kollar
>Priority: Minor
>  Labels: pull-request-available
>
> Parquet-tools should indicate if a time/timestamp is UTC adjusted or timezone 
> agnostic, the values written by the tools should take UTC normalized 
> parameters into account. Right now, every time and timestamp value is 
> adjusted to UTC when printed via parquet-tools
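The distinction the tools should surface is the `isAdjustedToUTC` parameter of the TIME/TIMESTAMP logical types. A minimal Python sketch of the intended rendering behavior; this is not parquet-tools code, and `render_timestamp_micros` is a hypothetical helper:

```python
from datetime import datetime, timezone

def render_timestamp_micros(micros: int, is_adjusted_to_utc: bool) -> str:
    """Print a TIMESTAMP(MICROS) value according to its isAdjustedToUTC
    logical-type parameter (illustrative sketch only)."""
    dt = datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)
    if is_adjusted_to_utc:
        # Instant semantics: the stored value is a UTC-normalized point
        # in time, so showing it with a UTC offset is correct.
        return dt.isoformat()
    # Timezone-agnostic ("local") semantics: the value should be shown
    # as-is, with no zone suffix implied.
    return dt.replace(tzinfo=None).isoformat()
```

For epoch value 0 this yields `1970-01-01T00:00:00+00:00` in the adjusted case and the zone-free `1970-01-01T00:00:00` otherwise, which is the difference the tools currently fail to make.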





[jira] [Updated] (PARQUET-1383) Parquet tools should indicate UTC parameter for time/timestamp types

2018-09-12 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1383:
---
Summary: Parquet tools should indicate UTC parameter for time/timestamp 
types  (was: Parquet tools should print logical type instead of (or besides) 
original type)

> Parquet tools should indicate UTC parameter for time/timestamp types
> 
>
> Key: PARQUET-1383
> URL: https://issues.apache.org/jira/browse/PARQUET-1383
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>    Reporter: Nandor Kollar
>    Assignee: Nandor Kollar
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, parquet-tools should print original type. Since the new logical 
> type API is introduced, it would be better to print it instead of, or besides 
> the original type.
> Also, the values written by the tools should take UTC normalized parameters 
> into account. Right now, every time and timestamp value is adjusted to UTC 
> when printed via parquet-tools





Re: Date and time for next Parquet sync

2018-09-10 Thread Nandor Kollar
Ryan, I was aware of Strata, actually I wanted to schedule it to 18th
September, but forgot to change 'next week' in the email. So in fact I
already pushed it out one week, sorry for the confusion.

Gidon, 19th is fine for me, if there's no objection against it, then
we can have it then!

Thanks,
Nandor

On Fri, Sep 7, 2018 at 9:21 PM, Ryan Blue  wrote:
> We may want to push this out another week because it also conflicts with
> Strata NY. I think a few of us will be travelling Tuesday and both Julien
> and I have talks on Wednesday.
>
> On Fri, Sep 7, 2018 at 6:24 AM Gidon Gershinsky  wrote:
>
>> Hi Nandor,
>>
>> Can we make it Wed this time, Sept 19? Or any of Tue/Wed on another week.
>> Sept 18 is the Yom Kippur eve - this basically means I won't have a
>> technical ability to join a call.
>>
>> Regarding the Google doc vs reviewed PR + .md file - it indeed becomes
>> difficult and unnecessary to maintain two
>> versions of the same documentation. Following your last mail, there was a
>> high volume of review
>> activity at the google doc, but now the spike is winding down, I'll be
>> removing the duplicate part from the google doc
>> (keeping the samples), with new comments to go to PRs (md and code). I'll
>> send a detailed mail early next week.
>>
>>
>> Cheers, Gidon.
>>
>> On Fri, Sep 7, 2018 at 3:42 PM Nandor Kollar > >
>> wrote:
>>
>> > Hi All,
>> >
>> > I'd like to propose to have a Parquet Sync next week Tuesday (September
>> > 18th) at 6pm CEST / 9 am PST.
>> >
>> > Some of the topics which would be nice to discuss:
>> > - review column indexes (PRs and feature branch)
>> > - move Java code from format to mr (PR #517)
>> > - Bloom filter spec
>> > - columnar encryption spec (and general question, where to track
>> > specs, Google doc vs reviewed PR + .md file)
>> > - Refactor modules to use the new logical type API (PR under review)
>> > - new format release scope (nano precision timestamp, bloom filter?,
>> > columnar encryption?)
>> >
>> > I'll send the meeting invite shortly. Feel free to propose other time
>> > slot if it is not suitable for you, and bring any additional topic
>> > you'd like to discuss.
>> >
>> > Regards,
>> > Nandor
>> >
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix


Date and time for next Parquet sync

2018-09-07 Thread Nandor Kollar
Hi All,

I'd like to propose to have a Parquet Sync next week Tuesday (September
18th) at 6pm CEST / 9 am PST.

Some of the topics which would be nice to discuss:
- review column indexes (PRs and feature branch)
- move Java code from format to mr (PR #517)
- Bloom filter spec
- columnar encryption spec (and general question, where to track
specs, Google doc vs reviewed PR + .md file)
- Refactor modules to use the new logical type API (PR under review)
- new format release scope (nano precision timestamp, bloom filter?,
columnar encryption?)

I'll send the meeting invite shortly. Feel free to propose other time
slot if it is not suitable for you, and bring any additional topic
you'd like to discuss.

Regards,
Nandor


[jira] [Updated] (PARQUET-1371) Time/Timestamp UTC normalization parameter doesn't work

2018-09-06 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1371:
---
Fix Version/s: 1.11.0

> Time/Timestamp UTC normalization parameter doesn't work
> ---
>
> Key: PARQUET-1371
> URL: https://issues.apache.org/jira/browse/PARQUET-1371
> Project: Parquet
>  Issue Type: Bug
>    Reporter: Nandor Kollar
>    Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> After creating a Parquet file with a non-UTC-normalized logical type, when 
> reading it back with the API, the result shows it is UTC normalized. Looks 
> like the read path incorrectly reads the actual logical type (with the new API).





[jira] [Created] (PARQUET-1410) Refactor modules to use the new logical type API

2018-08-31 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1410:
--

 Summary: Refactor modules to use the new logical type API
 Key: PARQUET-1410
 URL: https://issues.apache.org/jira/browse/PARQUET-1410
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Nandor Kollar
Assignee: Nandor Kollar


Refactor parquet-mr modules to use the new logical type API for internal 
decisions (e.g. replace the OriginalType-based switch cases with a more flexible 
solution, for example in the type builder when checking whether the proper 
annotation is present on the physical type)




