[jira] [Resolved] (PARQUET-1445) Remove Files.java

2019-09-05 Thread Nandor Kollar (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1445.

Resolution: Fixed

> Remove Files.java
> -
>
> Key: PARQUET-1445
> URL: https://issues.apache.org/jira/browse/PARQUET-1445
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
> Attachments: PARQUET-1445.1.patch
>
>
> bq. TODO: Use java.nio.file.Files when Parquet is updated to Java 7
> https://github.com/apache/parquet-mr/blob/dc61e510126aaa1a95a46fe39bf1529f394147e9/parquet-common/src/main/java/org/apache/parquet/Files.java#L31
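For reference, the JDK utility that replaces the helper; a minimal sketch of a
post-removal call site (the file name is hypothetical):
{code:java}
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;

public class ReadLinesSketch {
  public static void main(String[] args) throws Exception {
    // java.nio.file.Files (Java 7+) covers what the custom
    // org.apache.parquet.Files helper provided.
    File file = new File("example.txt");
    List<String> lines = Files.readAllLines(file.toPath(), StandardCharsets.UTF_8);
    lines.forEach(System.out::println);
  }
}
{code}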



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (PARQUET-1445) Remove Files.java

2019-09-05 Thread Nandor Kollar (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1445:
--

Assignee: David Mollitor

> Remove Files.java
> -
>
> Key: PARQUET-1445
> URL: https://issues.apache.org/jira/browse/PARQUET-1445
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
> Attachments: PARQUET-1445.1.patch
>
>
> bq. TODO: Use java.nio.file.Files when Parquet is updated to Java 7
> https://github.com/apache/parquet-mr/blob/dc61e510126aaa1a95a46fe39bf1529f394147e9/parquet-common/src/main/java/org/apache/parquet/Files.java#L31



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (PARQUET-247) Add DATE mapping in ValidTypeMap of filter2

2019-09-05 Thread Nandor Kollar (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923149#comment-16923149
 ] 

Nandor Kollar commented on PARQUET-247:
---

Since PARQUET-201 was resolved by removing OriginalType from the type check, I 
don't think this Jira is an outstanding issue or a blocker for Hive Parquet PPD 
any more, hence I'm resolving it. Feel free to reopen if I was wrong.

> Add DATE mapping in ValidTypeMap of filter2
> ---
>
> Key: PARQUET-247
> URL: https://issues.apache.org/jira/browse/PARQUET-247
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Dong Chen
>Assignee: Dong Chen
>Priority: Major
>
> When Hive uses a Parquet filter predicate, the Date type is converted to 
> Integer. {{ValidTypeMap}} maps Java classes to Parquet types, and it throws an 
> exception when checking the Date data type.
> We should add the mapping to support Date.
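For context, a sketch of a predicate built on the underlying INT32
representation with the filter2 API (the column name and epoch-day value are
hypothetical):
{code:java}
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;

public class DateFilterSketch {
  public static void main(String[] args) {
    // Hive stores DATE as INT32 (days since epoch); until ValidTypeMap has a
    // Date mapping, the predicate has to target the integer value directly.
    FilterPredicate datePredicate =
        FilterApi.eq(FilterApi.intColumn("event_date"), 16500);
    System.out.println(datePredicate);
  }
}
{code}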



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (PARQUET-1530) Remove Dependency on commons-codec

2019-09-05 Thread Nandor Kollar (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1530.

Resolution: Fixed

> Remove Dependency on commons-codec
> --
>
> Key: PARQUET-1530
> URL: https://issues.apache.org/jira/browse/PARQUET-1530
> Project: Parquet
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (PARQUET-1641) Parquet pages for different columns cannot be read in parallel

2019-08-29 Thread Nandor Kollar (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1641:
--

Assignee: Samarth Jain

> Parquet pages for different columns cannot be read in parallel 
> ---
>
> Key: PARQUET-1641
> URL: https://issues.apache.org/jira/browse/PARQUET-1641
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Samarth Jain
>Assignee: Samarth Jain
>Priority: Major
>  Labels: pull-request-available
>
> All ColumnChunkPageReader instances use the same decompressor. 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1286]
> {code:java}
> BytesInputDecompressor decompressor =
>     options.getCodecFactory().getDecompressor(descriptor.metadata.getCodec());
> return new ColumnChunkPageReader(decompressor, pagesInChunk, dictionaryPage);
> {code}
> The CodecFactory caches the decompressors for every codec type, returning the 
> same instance on every getDecompressor(codecName) call. See the caching 
> happening here:
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L197]
> {code:java}
> @Override
> public BytesDecompressor getDecompressor(CompressionCodecName codecName) {
>   BytesDecompressor decomp = decompressors.get(codecName);
>   if (decomp == null) {
>     decomp = createDecompressor(codecName);
>     decompressors.put(codecName, decomp);
>   }
>   return decomp;
> }
> {code}
>  
> If multiple threads try to read the pages belonging to different columns, 
> they run into thread-safety issues. This prevents applications from 
> increasing read throughput by parallelizing page reads.
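One way to make the cached factory safe for concurrent readers is to hand out
per-thread instances; a minimal sketch of the idea (not the committed fix, and
with the Parquet types reduced to placeholders):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ThreadSafeDecompressorFactory {
  // Placeholder for Parquet's BytesDecompressor: stands in for the per-codec
  // state that must not be shared across threads.
  interface BytesDecompressor {}

  private final Map<String, ThreadLocal<BytesDecompressor>> decompressors =
      new ConcurrentHashMap<>();

  public BytesDecompressor getDecompressor(String codecName) {
    // computeIfAbsent is atomic on ConcurrentHashMap, and the ThreadLocal
    // gives every reader thread its own decompressor instance.
    return decompressors
        .computeIfAbsent(codecName, c -> ThreadLocal.withInitial(() -> createDecompressor(c)))
        .get();
  }

  private BytesDecompressor createDecompressor(String codecName) {
    return new BytesDecompressor() {}; // the real factory builds the codec here
  }
}
{code}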



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (PARQUET-1643) Use airlift non-native implementations for GZIP, LZ0 and LZ4 codecs

2019-08-29 Thread Nandor Kollar (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1643:
--

Assignee: Samarth Jain

> Use airlift non-native implementations for GZIP, LZ0 and LZ4 codecs
> ---
>
> Key: PARQUET-1643
> URL: https://issues.apache.org/jira/browse/PARQUET-1643
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Samarth Jain
>Assignee: Samarth Jain
>Priority: Major
>  Labels: pull-request-available
>
> [~rdblue] pointed me to [https://github.com/airlift/aircompressor] which 
> provides non-native implementations of compression codecs. It claims to be 
> much faster than the native wrappers that Parquet uses. This Jira is to track 
> the work needed to explore these codecs, get benchmark results, and make the 
> necessary changes, including no longer needing to pool compressors and 
> decompressors. Note that this doesn't include SNAPPY, since Parquet already 
> has its own non-Hadoop implementation for it.
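For a feel of the library, a round-trip sketch with aircompressor's pure-Java
LZ4, following the project's published Compressor/Decompressor interface
(sketch only, not benchmark code):
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import io.airlift.compress.lz4.Lz4Compressor;
import io.airlift.compress.lz4.Lz4Decompressor;

public class AircompressorRoundTrip {
  public static void main(String[] args) {
    byte[] data = "hello parquet hello parquet hello parquet".getBytes(StandardCharsets.UTF_8);

    Lz4Compressor compressor = new Lz4Compressor();
    byte[] compressed = new byte[compressor.maxCompressedLength(data.length)];
    int compressedLength =
        compressor.compress(data, 0, data.length, compressed, 0, compressed.length);

    byte[] restored = new byte[data.length];
    new Lz4Decompressor().decompress(compressed, 0, compressedLength, restored, 0, restored.length);

    System.out.println(Arrays.equals(data, restored)); // true
  }
}
{code}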



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (PARQUET-1597) Fix parquet-cli's wrong or missing usage examples

2019-08-22 Thread Nandor Kollar (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1597:
--

Assignee: Kengo Seki

> Fix parquet-cli's wrong or missing usage examples
> -
>
> Key: PARQUET-1597
> URL: https://issues.apache.org/jira/browse/PARQUET-1597
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Reporter: Kengo Seki
>Assignee: Kengo Seki
>Priority: Minor
>  Labels: pull-request-available
>
> 1. The following usage examples for parquet-cli's {{to-avro}} fail due to the 
> lack of {{-o}} options.
>In addition, "sample.parquet" in the second example should be 
> "sample.avro".
> {code}
>   Examples:
> # Create an Avro file from a Parquet file
> parquet to-avro sample.parquet sample.avro
> # Create an Avro file in HDFS from a local JSON file
> parquet to-avro path/to/sample.json hdfs:/user/me/sample.parquet
> # Create an Avro file from data in S3
> parquet to-avro s3:/data/path/sample.parquet sample.avro
> {code}
> 2. The same applies to {{convert-csv}}.
> {code}
>   Examples:
> # Create a Parquet file from a CSV file
> parquet convert-csv sample.csv sample.parquet --schema schema.avsc
> # Create a Parquet file in HDFS from local CSV
> parquet convert-csv path/to/sample.csv hdfs:/user/me/sample.parquet 
> --schema schema.avsc
> # Create an Avro file from CSV data in S3
> parquet convert-csv s3:/data/path/sample.csv sample.avro --format avro 
> --schema s3:/schemas/schema.avsc
> {code}
> 3. The {{meta}} command has an "Examples:" heading but no content under it.
> {code}
> $ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main help 
> meta
> Usage: parquet [general options] meta  [command options]
>   Description:
> Print a Parquet file's metadata
>   Examples:
> {code}
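Based on the report, the corrected examples would pass the output file through
{{-o}} (exact flag placement assumed):
{code}
# Create an Avro file from a Parquet file
parquet to-avro sample.parquet -o sample.avro

# Create an Avro file in HDFS from a local JSON file
parquet to-avro path/to/sample.json -o hdfs:/user/me/sample.avro

# Create a Parquet file from a CSV file
parquet convert-csv sample.csv -o sample.parquet --schema schema.avsc
{code}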



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (PARQUET-1637) Builds are failing because default jdk changed to openjdk11 on Travis

2019-08-15 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1637.

Resolution: Fixed

> Builds are failing because default jdk changed to openjdk11 on Travis
> -
>
> Key: PARQUET-1637
> URL: https://issues.apache.org/jira/browse/PARQUET-1637
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
>
> The default distribution on Travis recently changed from Trusty to Xenial. It 
> appears that the default JDK also changed from JDK8 to JDK11: although the doc 
> [says|https://docs.travis-ci.com/user/reference/xenial/#jvm-clojure-groovy-java-scala-support]
>  the default is openjdk8, that doesn't appear to be correct (see the related 
> [discussion|https://travis-ci.community/t/default-jdk-on-xenial-openjdk8-or-openjdk11/4542]).
> Since Parquet still doesn't support Java 11 (PARQUET-1551), we should 
> explicitly tell Travis in its config which JDK to use, at least as long as 
> PARQUET-1551 is still open.
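A minimal sketch of the pinning in .travis.yml (assumed shape, not necessarily
the merged change):
{code}
language: java
dist: xenial
jdk: openjdk8
{code}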



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (PARQUET-1637) Builds are failing because default jdk changed to openjdk11 on Travis

2019-08-10 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1637:
--

 Summary: Builds are failing because default jdk changed to 
openjdk11 on Travis
 Key: PARQUET-1637
 URL: https://issues.apache.org/jira/browse/PARQUET-1637
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Reporter: Nandor Kollar
Assignee: Nandor Kollar


The default distribution on Travis recently changed from Trusty to Xenial. It 
appears that the default JDK also changed from JDK8 to JDK11: although the doc 
[says|https://docs.travis-ci.com/user/reference/xenial/#jvm-clojure-groovy-java-scala-support]
 the default is openjdk8, that doesn't appear to be correct (see the related 
[discussion|https://travis-ci.community/t/default-jdk-on-xenial-openjdk8-or-openjdk11/4542]).

Since Parquet still doesn't support Java 11 (PARQUET-1551), we should 
explicitly tell Travis in its config which JDK to use, at least as long as 
PARQUET-1551 is still open.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (PARQUET-1303) Avro reflect @Stringable field write error if field not instanceof CharSequence

2019-07-25 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1303:
---
Fix Version/s: 1.12.0

> Avro reflect @Stringable field write error if field not instanceof 
> CharSequence
> ---
>
> Key: PARQUET-1303
> URL: https://issues.apache.org/jira/browse/PARQUET-1303
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Zack Behringer
>Assignee: Zack Behringer
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> Annotate a field in a pojo with org.apache.avro.reflect.Stringable and the 
> schema will consider it to be a String field. AvroWriteSupport.fromAvroString 
> assumes the field is either a Utf8 or CharSequence and does not attempt to 
> use the field class' toString method if it is not.
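A minimal pojo that reproduces the report (the BigDecimal field is an assumed
example, not taken from the ticket):
{code:java}
import java.math.BigDecimal;

import org.apache.avro.reflect.Stringable;

public class Payment {
  // Avro reflect maps this field to a "string" schema, but the runtime value
  // is a BigDecimal, not a Utf8/CharSequence, so AvroWriteSupport.fromAvroString
  // fails instead of falling back to toString().
  @Stringable
  public BigDecimal amount = new BigDecimal("19.99");
}
{code}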



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (PARQUET-1303) Avro reflect @Stringable field write error if field not instanceof CharSequence

2019-07-25 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1303.

Resolution: Fixed

> Avro reflect @Stringable field write error if field not instanceof 
> CharSequence
> ---
>
> Key: PARQUET-1303
> URL: https://issues.apache.org/jira/browse/PARQUET-1303
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Zack Behringer
>Assignee: Zack Behringer
>Priority: Minor
>  Labels: pull-request-available
>
> Annotate a field in a pojo with org.apache.avro.reflect.Stringable and the 
> schema will consider it to be a String field. AvroWriteSupport.fromAvroString 
> assumes the field is either a Utf8 or CharSequence and does not attempt to 
> use the field class' toString method if it is not.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (PARQUET-1303) Avro reflect @Stringable field write error if field not instanceof CharSequence

2019-07-25 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1303:
--

Assignee: Zack Behringer

> Avro reflect @Stringable field write error if field not instanceof 
> CharSequence
> ---
>
> Key: PARQUET-1303
> URL: https://issues.apache.org/jira/browse/PARQUET-1303
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Zack Behringer
>Assignee: Zack Behringer
>Priority: Minor
>
> Annotate a field in a pojo with org.apache.avro.reflect.Stringable and the 
> schema will consider it to be a String field. AvroWriteSupport.fromAvroString 
> assumes the field is either a Utf8 or CharSequence and does not attempt to 
> use the field class' toString method if it is not.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (PARQUET-1605) Bump maven-javadoc-plugin from 2.9 to 3.1.0

2019-07-24 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1605.

Resolution: Fixed

> Bump maven-javadoc-plugin from 2.9 to 3.1.0
> ---
>
> Key: PARQUET-1605
> URL: https://issues.apache.org/jira/browse/PARQUET-1605
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.11.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (PARQUET-1606) Fix invalid tests scope

2019-07-23 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1606.

Resolution: Fixed

> Fix invalid tests scope
> ---
>
> Key: PARQUET-1606
> URL: https://issues.apache.org/jira/browse/PARQUET-1606
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.11.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (PARQUET-1600) Fix shebang in parquet-benchmarks/run.sh

2019-07-23 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1600.

Resolution: Fixed

> Fix shebang in parquet-benchmarks/run.sh
> 
>
> Key: PARQUET-1600
> URL: https://issues.apache.org/jira/browse/PARQUET-1600
> Project: Parquet
>  Issue Type: Bug
>Reporter: Kengo Seki
>Assignee: Kengo Seki
>Priority: Minor
>  Labels: pull-request-available
>
> The following shebang does not work as expected since it's not on the first 
> line and there's a space between # and !.
> {code:title=parquet-benchmarks/run.sh}
> (snip)
> # !/usr/bin/env bash
> {code}
> For example, if users use tcsh, it fails as follows:
> {code}
> > parquet-benchmarks/run.sh
> Illegal variable name.
> {code}
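The fix is to move the shebang to the first line and remove the space:
{code:title=parquet-benchmarks/run.sh}
#!/usr/bin/env bash
{code}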



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (PARQUET-1552) upgrade protoc-jar-maven-plugin to 3.8.0

2019-07-10 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1552.

Resolution: Fixed

> upgrade protoc-jar-maven-plugin to 3.8.0
> 
>
> Key: PARQUET-1552
> URL: https://issues.apache.org/jira/browse/PARQUET-1552
> Project: Parquet
>  Issue Type: Wish
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Junjie Chen
>Assignee: Junjie Chen
>Priority: Minor
>  Labels: pull-request-available
>
> The current protoc-jar-maven-plugin has a problem when building the project 
> behind a proxy. The latest release, version 3.8.0, fixes this issue.
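The bump itself is a one-line version change; a sketch of the plugin
declaration, assuming the usual com.github.os72 coordinates:
{code:xml}
<plugin>
  <groupId>com.github.os72</groupId>
  <artifactId>protoc-jar-maven-plugin</artifactId>
  <version>3.8.0</version>
</plugin>
{code}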



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (PARQUET-1496) [Java] Update Scala for JDK 11 compatibility

2019-05-03 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832559#comment-16832559
 ] 

Nandor Kollar edited comment on PARQUET-1496 at 5/3/19 2:55 PM:


Scrooge recently released 19.4.0 with a fix for 
[Scrooge#304|https://github.com/twitter/scrooge/issues/304], so we can now 
upgrade to this version. The other problem (Scrooge#303) is fixed by 
[replacing|https://github.com/apache/parquet-format/commit/84165d0a4f46106a96d68ed831965123294a5196]
 the problematic comment in parquet-format; I'm afraid we'll need a release 
from parquet-format.


was (Author: nkollar):
Scrooge recently released 19.4.0 with a fix for 
[Scrooge#303|https://github.com/twitter/scrooge/issues/303], so we can now 
upgrade to this version. The other problem (Scrooge#304) is fixed by 
[replacing|https://github.com/apache/parquet-format/commit/84165d0a4f46106a96d68ed831965123294a5196]
 the problematic comment in parquet-format; I'm afraid we'll need a release 
from parquet-format.

> [Java] Update Scala for JDK 11 compatibility
> 
>
> Key: PARQUET-1496
> URL: https://issues.apache.org/jira/browse/PARQUET-1496
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
>
> When trying to build the parquet-mr code on OSX Mojave with OpenJDK 10 and 9, 
> the build fails for me in {{parquet-scala}} with:
> {code:java}
> [INFO] --- maven-scala-plugin:2.15.2:compile (default) @ parquet-scala_2.10 
> ---
> [INFO] Checking for multiple versions of scala
> [INFO] includes = [**/*.java,**/*.scala,]
> [INFO] excludes = []
> [INFO] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/src/main/scala:-1: 
> info: compiling
> [INFO] Compiling 1 source files to 
> /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/target/classes at 
> 1547922718010
> [ERROR] error: error while loading package, Missing dependency 'object 
> java.lang.Object in compiler mirror', required by 
> /Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/package.class)
> [ERROR] error: error while loading package, Missing dependency 'object 
> java.lang.Object in compiler mirror', required by 
> /Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/runtime/package.class)
> [ERROR] error: scala.reflect.internal.MissingRequirementError: object 
> java.lang.Object in compiler mirror not found.
> [ERROR] at 
> scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
> [ERROR] at 
> scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getClassByName(Mirrors.scala:99)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getRequiredClass(Mirrors.scala:102)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass$lzycompute(Definitions.scala:264)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass(Definitions.scala:264)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass$lzycompute(Definitions.scala:263)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass(Definitions.scala:263)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.specialPolyClass(Definitions.scala:1120)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass$lzycompute(Definitions.scala:407)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass(Definitions.scala:407)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1154)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1196)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1261)
> [INFO] at scala.tools.nsc.Global$Run.<init>(Global.scala:1290)
> [INFO] at scala.tools.nsc.Driver.doCompile(Driver.scala:32)
> [INFO] at 

[jira] [Comment Edited] (PARQUET-1496) [Java] Update Scala for JDK 11 compatibility

2019-05-03 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832559#comment-16832559
 ] 

Nandor Kollar edited comment on PARQUET-1496 at 5/3/19 2:54 PM:


Scrooge recently released 19.4.0 with a fix for 
[Scrooge#303|https://github.com/twitter/scrooge/issues/303], so we can now 
upgrade to this version. The other problem (Scrooge#304) is fixed by 
[replacing|https://github.com/apache/parquet-format/commit/84165d0a4f46106a96d68ed831965123294a5196]
 the problematic comment in parquet-format; I'm afraid we'll need a release 
from parquet-format.


was (Author: nkollar):
Scrooge recently released 19.4.0 with a fix for 
[Scrooge#304|https://github.com/twitter/scrooge/issues/303], so we can now 
upgrade to this version. The other problem (Scrooge#303) is fixed by 
[replacing|https://github.com/apache/parquet-format/commit/84165d0a4f46106a96d68ed831965123294a5196]
 the problematic comment in parquet-format; I'm afraid we'll need a release 
from parquet-format.

> [Java] Update Scala for JDK 11 compatibility
> 
>
> Key: PARQUET-1496
> URL: https://issues.apache.org/jira/browse/PARQUET-1496
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
>
> When trying to build the parquet-mr code on OSX Mojave with OpenJDK 10 and 9, 
> the build fails for me in {{parquet-scala}} with:
> {code:java}
> [INFO] --- maven-scala-plugin:2.15.2:compile (default) @ parquet-scala_2.10 
> ---
> [INFO] Checking for multiple versions of scala
> [INFO] includes = [**/*.java,**/*.scala,]
> [INFO] excludes = []
> [INFO] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/src/main/scala:-1: 
> info: compiling
> [INFO] Compiling 1 source files to 
> /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/target/classes at 
> 1547922718010
> [ERROR] error: error while loading package, Missing dependency 'object 
> java.lang.Object in compiler mirror', required by 
> /Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/package.class)
> [ERROR] error: error while loading package, Missing dependency 'object 
> java.lang.Object in compiler mirror', required by 
> /Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/runtime/package.class)
> [ERROR] error: scala.reflect.internal.MissingRequirementError: object 
> java.lang.Object in compiler mirror not found.
> [ERROR] at 
> scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
> [ERROR] at 
> scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getClassByName(Mirrors.scala:99)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getRequiredClass(Mirrors.scala:102)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass$lzycompute(Definitions.scala:264)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass(Definitions.scala:264)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass$lzycompute(Definitions.scala:263)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass(Definitions.scala:263)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.specialPolyClass(Definitions.scala:1120)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass$lzycompute(Definitions.scala:407)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass(Definitions.scala:407)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1154)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1196)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1261)
> [INFO] at scala.tools.nsc.Global$Run.<init>(Global.scala:1290)
> [INFO] at scala.tools.nsc.Driver.doCompile(Driver.scala:32)
> [INFO] at 

[jira] [Commented] (PARQUET-1496) [Java] Update Scala for JDK 11 compatibility

2019-05-03 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832559#comment-16832559
 ] 

Nandor Kollar commented on PARQUET-1496:


Scrooge recently released 19.4.0 with a fix for 
[Scrooge#304|https://github.com/twitter/scrooge/issues/303], so we can now 
upgrade to this version. The other problem (Scrooge#303) is fixed by 
[replacing|https://github.com/apache/parquet-format/commit/84165d0a4f46106a96d68ed831965123294a5196]
 the problematic comment in parquet-format; I'm afraid we'll need a release 
from parquet-format.

> [Java] Update Scala for JDK 11 compatibility
> 
>
> Key: PARQUET-1496
> URL: https://issues.apache.org/jira/browse/PARQUET-1496
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
>
> When trying to build the parquet-mr code on OSX Mojave with OpenJDK 10 and 9, 
> the build fails for me in {{parquet-scala}} with:
> {code:java}
> [INFO] --- maven-scala-plugin:2.15.2:compile (default) @ parquet-scala_2.10 
> ---
> [INFO] Checking for multiple versions of scala
> [INFO] includes = [**/*.java,**/*.scala,]
> [INFO] excludes = []
> [INFO] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/src/main/scala:-1: 
> info: compiling
> [INFO] Compiling 1 source files to 
> /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/target/classes at 
> 1547922718010
> [ERROR] error: error while loading package, Missing dependency 'object 
> java.lang.Object in compiler mirror', required by 
> /Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/package.class)
> [ERROR] error: error while loading package, Missing dependency 'object 
> java.lang.Object in compiler mirror', required by 
> /Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/runtime/package.class)
> [ERROR] error: scala.reflect.internal.MissingRequirementError: object 
> java.lang.Object in compiler mirror not found.
> [ERROR] at 
> scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
> [ERROR] at 
> scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getClassByName(Mirrors.scala:99)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getRequiredClass(Mirrors.scala:102)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass$lzycompute(Definitions.scala:264)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass(Definitions.scala:264)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass$lzycompute(Definitions.scala:263)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass(Definitions.scala:263)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.specialPolyClass(Definitions.scala:1120)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass$lzycompute(Definitions.scala:407)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass(Definitions.scala:407)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1154)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1196)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1261)
> [INFO] at scala.tools.nsc.Global$Run.<init>(Global.scala:1290)
> [INFO] at scala.tools.nsc.Driver.doCompile(Driver.scala:32)
> [INFO] at scala.tools.nsc.Main$.doCompile(Main.scala:79)
> [INFO] at scala.tools.nsc.Driver.process(Driver.scala:54)
> [INFO] at scala.tools.nsc.Driver.main(Driver.scala:67)
> [INFO] at scala.tools.nsc.Main.main(Main.scala)
> [INFO] at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [INFO] at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> [INFO] at 
> 

[jira] [Commented] (PARQUET-1556) Instructions are missing for configuring twitter maven repo for hadoop-lzo dependency

2019-04-03 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808837#comment-16808837
 ] 

Nandor Kollar commented on PARQUET-1556:


This additional repository should be in the POM instead of being added to the 
settings. This looks strange to me: I couldn't reproduce the failure. My 
settings file doesn't have this additional Twitter repository, and I don't see 
it in the output of {{mvn help:effective-pom}} either.

> Instructions are missing for configuring twitter maven repo for hadoop-lzo 
> dependency
> -
>
> Key: PARQUET-1556
> URL: https://issues.apache.org/jira/browse/PARQUET-1556
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.11.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.12.0
>
>
> Running mvn verify based on the instructions in the README results in this 
> error
> {code:java}
> Could not resolve dependencies for project 
> org.apache.parquet:parquet-thrift:jar:1.11.0: Could not find artifact 
> com.hadoop.gplcompression:hadoop-lzo:jar:0.4.16{code}
> To fix this, it was necessary to configure my local ~/.m2/settings.xml to 
> include the twitter maven repo:
> {code:xml}
> <repository>
>   <id>twitter</id>
>   <name>twitter</name>
>   <url>http://maven.twttr.com</url>
> </repository>
> {code}
> After adding this, mvn verify worked.
> We should add these instructions to the README.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1554) Compilation error when upgrading Scrooge version

2019-04-02 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1554:
--

 Summary: Compilation error when upgrading Scrooge version
 Key: PARQUET-1554
 URL: https://issues.apache.org/jira/browse/PARQUET-1554
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Nandor Kollar
Assignee: Nandor Kollar


When upgrading the Scrooge version to 19.1.0, the build fails with
{code}
[510.1] failure: string matching regex `[A-Za-z_][A-Za-z0-9\._]*' expected but 
`}' found
{code}

This is due to a Javadoc-style comment in the IndexPageHeader struct. Changing 
the style of the comment solves the failure, as sketched below.
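A sketch of the change in the Thrift IDL (the comment text is assumed; the
actual edit is in parquet-format):
{code}
// Description of the index page, written as a plain line comment instead of a
// Javadoc-style /** ... */ block, which Scrooge 19.1.0 fails to parse.
struct IndexPageHeader {
}
{code}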



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1548) Meta data is lost when writing avro union types to parquet

2019-03-21 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16798073#comment-16798073
 ] 

Nandor Kollar commented on PARQUET-1548:


Would you mind sharing more details on how this happens and how to reproduce 
it? A unit test or steps to reproduce would be really useful.

> Meta data is lost when writing avro union types to parquet
> --
>
> Key: PARQUET-1548
> URL: https://issues.apache.org/jira/browse/PARQUET-1548
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
> Environment: macOS Mojave
>Reporter: Michael O'Shea
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1545) Logical Type for timezone-naive timestamps

2019-03-20 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1545.

Resolution: Fixed

> Logical Type for timezone-naive timestamps
> --
>
> Key: PARQUET-1545
> URL: https://issues.apache.org/jira/browse/PARQUET-1545
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Tim Swast
>Priority: Major
>
> In many systems there is a difference between a timezone-naive timestamp 
> column (called DATETIME in BigQuery, 'logicalType': 'datetime' in Avro) and a 
> timezone-aware timestamp (called TIMESTAMP in BigQuery and always stored in 
> UTC). It seems from [this 
> discussion|https://github.com/apache/parquet-format/pull/51#discussion_r119911623]
>  and the [list of logical 
> types|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] 
> that parquet only has the timezone-aware version, as all timestamps are 
> stored according to UTC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1545) Logical Type for timezone-naive timestamps

2019-03-20 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796990#comment-16796990
 ] 

Nandor Kollar commented on PARQUET-1545:


Great, closing this Jira as done then. By the way, as I mentioned, if you're 
using parquet-mr, this is only supported since 1.11.0.

> Logical Type for timezone-naive timestamps
> --
>
> Key: PARQUET-1545
> URL: https://issues.apache.org/jira/browse/PARQUET-1545
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Tim Swast
>Priority: Major
>
> In many systems there is a difference between a timezone-naive timestamp 
> column (called DATETIME in BigQuery, 'logicalType': 'datetime' in Avro) and a 
> timezone-aware timestamp (called TIMESTAMP in BigQuery and always stored in 
> UTC). It seems from [this 
> discussion|https://github.com/apache/parquet-format/pull/51#discussion_r119911623]
>  and the [list of logical 
> types|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] 
> that parquet only has the timezone-aware version, as all timestamps are 
> stored according to UTC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1545) Logical Type for timezone-naive timestamps

2019-03-18 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795001#comment-16795001
 ] 

Nandor Kollar commented on PARQUET-1545:


[~swast] actually Parquet already supports both timezone-naive and 
timezone-aware timestamps. The logical type API was improved in 1.11.0 (based 
on the recent improvements in parquet-format), and the timestamp logical type 
introduced an additional {{isAdjustedToUTC}} parameter, which tells the 
semantics of the given timestamp field: {{false}} means timezone-naive, and 
{{true}} means timezone-aware.
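With the 1.11.0 API, declaring a timezone-naive INT64 timestamp column looks
roughly like this (field name and time unit chosen for illustration):
{code:java}
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;

public class NaiveTimestampSketch {
  public static void main(String[] args) {
    // isAdjustedToUTC = false marks the field as timezone-naive (local
    // semantics); true would mean timezone-aware, normalized to UTC.
    Type naiveTs = Types.required(PrimitiveTypeName.INT64)
        .as(LogicalTypeAnnotation.timestampType(false, TimeUnit.MICROS))
        .named("created_at");
    System.out.println(naiveTs);
  }
}
{code}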

> Logical Type for timezone-naive timestamps
> --
>
> Key: PARQUET-1545
> URL: https://issues.apache.org/jira/browse/PARQUET-1545
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Tim Swast
>Priority: Major
>
> In many systems there is a difference between a timezone-naive timestamp 
> column (called DATETIME in BigQuery, 'logicalType': 'datetime' in Avro) and a 
> timezone-aware timestamp (called TIMESTAMP in BigQuery and always stored in 
> UTC). It seems from [this 
> discussion|https://github.com/apache/parquet-format/pull/51#discussion_r119911623]
>  and the [list of logical 
> types|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] 
> that parquet only has the timezone-aware version, as all timestamps are 
> stored according to UTC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-409) InternalParquetRecordWriter doesn't use min/max row counts

2019-01-18 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-409:
--
Fix Version/s: (was: 1.9.0)

> InternalParquetRecordWriter doesn't use min/max row counts
> --
>
> Key: PARQUET-409
> URL: https://issues.apache.org/jira/browse/PARQUET-409
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.1
>Reporter: Ryan Blue
>Priority: Major
>  Labels: pull-request-available
>
> PARQUET-99 added settings to control the min and max number of rows between 
> size checks when flushing pages, and a setting to control whether to always 
> use a static size (the min). The [InternalParquetRecordWriter has similar 
> checks|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L143]
>  that don't use those settings. We should determine if it should update it to 
> use those settings or similar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-182) FilteredRecordReader skips rows it shouldn't for schema with optional columns

2019-01-18 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-182:
--
Fix Version/s: (was: 1.9.0)

> FilteredRecordReader skips rows it shouldn't for schema with optional columns
> -
>
> Key: PARQUET-182
> URL: https://issues.apache.org/jira/browse/PARQUET-182
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.5.0, 1.6.0, 1.7.0
> Environment: Linux, Java7/Java8
>Reporter: Steven Mellinger
>Priority: Blocker
>
> When using UnboundRecordFilter with nested AND/OR filters over OPTIONAL 
> columns, there seems to be a case with a mismatch between the current 
> record's column value and the value read during filtering.
> The structure of my filter predicate that results in incorrect filtering is: 
> (x && (y || z))
> When I step through it with a debugger I can see that the value being read 
> from the ColumnReader inside my Predicate is different than the value for 
> that row.
> Looking deeper, there seems to be a buffer with dictionary keys in 
> RunLengthBitPackingHybridDecoder (I am using RLE). There are only two 
> different keys in this array, [0,1], whereas my optional column has three 
> different values, [null,0,1]. If I had a column with values 5,10,10,null,10, 
> and keys 0 -> 5 and 1 -> 10, the buffer would hold 0,1,1,1,0, and in the case 
> that it reads the last row, would return 0 -> 5.
> So it seems that nothing is keeping track of where nulls appear.
> Hope someone can take a look, as it is a blocker for my project.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1006) ColumnChunkPageWriter uses only heap memory.

2019-01-18 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1006:
---
Fix Version/s: (was: 1.9.0)

> ColumnChunkPageWriter uses only heap memory.
> 
>
> Key: PARQUET-1006
> URL: https://issues.apache.org/jira/browse/PARQUET-1006
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Priority: Major
>
> After PARQUET-160 was resolved, ColumnChunkPageWriter started using 
> ConcatenatingByteArrayCollector: all data is collected in a List of byte[] 
> before writing the page, with no way to use direct memory for allocating 
> buffers. ByteBufferAllocator is present in the 
> [ColumnChunkPageWriter|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L73]
>  class, but never used.
> Using Java heap space can in some cases cause OOM exceptions or GC overhead.
> ByteBufferAllocator should be used in the ConcatenatingByteArrayCollector or 
> OutputStream classes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1278) [Java] parquet-arrow is broken due to different JDK version

2019-01-11 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1278.

Resolution: Duplicate

[~andygrove] I resolved this Jira since it looked like it is a duplicate of 
PARQUET-1390

Feel free to reopen if I was wrong.

> [Java] parquet-arrow is broken due to different JDK version
> ---
>
> Key: PARQUET-1278
> URL: https://issues.apache.org/jira/browse/PARQUET-1278
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Andy Grove
>Priority: Major
>
> Parquet was recently updated to use Arrow 0.8.0 to resolve this JIRA: 
> https://issues.apache.org/jira/browse/PARQUET-1128 but unfortunately the 
> issue is actually not resolved, because Arrow 0.8.0 uses JDK 7 while Parquet 
> uses JDK 8.
> I have filed a Jira against Arrow to upgrade to JDK 8: 
> https://issues.apache.org/jira/browse/ARROW-2498
> Once this is done, we will need to update parquet-arrow to use the new 
> version of arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1478) Can't read spec compliant, 3-level lists via parquet-proto

2019-01-03 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1478:
--

Assignee: Nandor Kollar

> Can't read spec compliant, 3-level lists via parquet-proto
> --
>
> Key: PARQUET-1478
> URL: https://issues.apache.org/jira/browse/PARQUET-1478
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> I noticed that ProtoInputOutputFormatTest doesn't test the following case 
> properly: when lists are written using the spec compliant 3-level structure. 
> The test actually doesn't write 3-level lists, because the passed 
> configuration is not used at all; a new one is created each time. See the 
> attached PR.
> When I fixed this test, it turned out that it is failing: now it writes the 
> correct 3-level structure, but it looks like the read path is broken. Is it 
> indeed a bug, or am I doing something wrong?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1478) Can't read spec compliant, 3-level lists via parquet-proto

2018-12-14 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1478:
--

 Summary: Can't read spec compliant, 3-level lists via parquet-proto
 Key: PARQUET-1478
 URL: https://issues.apache.org/jira/browse/PARQUET-1478
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Reporter: Nandor Kollar


I noticed that ProtoInputOutputFormatTest doesn't test the following case 
properly: when lists are written using the spec compliant 3-level structure. 
The test actually doesn't write 3-level lists, because the passed configuration 
is not used at all; a new one is created each time. See the attached PR.

When I fixed this test, it turned out that it is failing: now it writes the 
correct 3-level structure, but it looks like the read path is broken. Is it 
indeed a bug, or am I doing something wrong?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1476) Don't emit a warning message for files without new logical type

2018-12-13 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1476:
--

 Summary: Don't emit a warning message for files without new 
logical type
 Key: PARQUET-1476
 URL: https://issues.apache.org/jira/browse/PARQUET-1476
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Nandor Kollar
Assignee: Nandor Kollar


When only the old logical type representation is present in the file, the 
converter emits a warning that the two types mismatch. This creates unwanted 
noise in the logs; the metadata converter should only emit a warning if the 
new logical type is not null.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1434) Release parquet-mr 1.11.0

2018-12-12 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718669#comment-16718669
 ] 

Nandor Kollar commented on PARQUET-1434:


[~yumwang] the release is not yet finished; voting is not yet closed.

> Release parquet-mr 1.11.0
> -
>
> Key: PARQUET-1434
> URL: https://issues.apache.org/jira/browse/PARQUET-1434
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Reporter: Nandor Kollar
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1470) Inputstream leakage in ParquetFileWriter.appendFile

2018-12-04 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708702#comment-16708702
 ] 

Nandor Kollar commented on PARQUET-1470:


Sounds like a reasonable improvement. Would you mind opening a PR?

> Inputstream leakage in ParquetFileWriter.appendFile
> ---
>
> Key: PARQUET-1470
> URL: https://issues.apache.org/jira/browse/PARQUET-1470
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Arnaud Linz
>Priority: Major
>
> Current implementation of ParquetFileWriter.appendFile is:
> {code:java}
> public void appendFile(InputFile file) throws IOException {
>   ParquetFileReader.open(file).appendTo(this);
> }
> {code}
> this method never closes the inputstream created when the file is opened in 
> the ParquetFileReader constructor.
> This leads, for instance, to TooManyFilesOpened exceptions when large merges 
> are made with the parquet tools.
> something like:
> {code:java}
> try (ParquetFileReader reader = ParquetFileReader.open(file)) {
>   reader.appendTo(this);
> }
> {code}
> would be cleaner.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

2018-11-28 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701851#comment-16701851
 ] 

Nandor Kollar commented on PARQUET-1441:


OK, so it looks like there is a more fundamental problem related to this. 
Parquet allows using the same name in nested structures, while Avro doesn't 
allow it for records. For example, a file with this Parquet schema
{code}
message Message {
optional group a1 {
required float a2;
optional group a1 {
required float a4;
}
}
}
{code}
is not readable via AvroParquetReader. Of course this could be easily solved by 
renaming the inner a1 to something else, but for lists this doesn't work. I 
think using Avro namespaces during schema conversion could fix this bug, as 
sketched below.
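A sketch of the namespace idea applied to the a1 example above (the namespace
value is hypothetical, not the committed fix): giving the inner record a
distinct namespace makes its Avro full name unique, so the converted schema no
longer redefines a1.
{code}
{
  "type": "record", "name": "a1",
  "fields": [
    {"name": "a2", "type": "float"},
    {"name": "a1", "type": ["null", {
      "type": "record", "name": "a1", "namespace": "a1_inner",
      "fields": [{"name": "a4", "type": "float"}]
    }], "default": null}
  ]
}
{code}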

> SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
> 
>
> Key: PARQUET-1441
> URL: https://issues.apache.org/jira/browse/PARQUET-1441
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Reporter: Michael Heuer
>Priority: Major
>  Labels: pull-request-available
>
> The following unit test added to TestAvroSchemaConverter fails
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>   "  optional group annotation {\n" +
>   "optional group transcriptEffects (LIST) {\n" +
>   "  repeated group list {\n" +
>   "optional group element {\n" +
>   "  optional group effects (LIST) {\n" +
>   "repeated group list {\n" +
>   "  optional binary element (UTF8);\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n";
>   Configuration conf = new Configuration(false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = 
> avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> while this one succeeds
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>   "  optional group annotation {\n" +
>   "optional group transcriptEffects (LIST) {\n" +
>   "  repeated group list {\n" +
>   "optional group element {\n" +
>   "  optional group effects (LIST) {\n" +
>   "repeated group list {\n" +
>   "  optional binary element (UTF8);\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n";
>  
>   Configuration conf = new Configuration(false);
>   conf.setBoolean("parquet.avro.add-list-element-records", false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = 
> avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> I don't see a way to influence the code path in AvroIndexedRecordConverter to 
> respect this configuration, resulting in the following stack trace downstream
> {noformat}
>   Cause: org.apache.avro.SchemaParseException: Can't redefine: list
>   at org.apache.avro.Schema$Names.put(Schema.java:1128)
>   at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
>   at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema.toString(Schema.java:324)
>   at 
> org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(SchemaCompatibility.java:68)
>   at 
> org.apache.parquet.avro.AvroRecordConverter.isElementType(AvroRecordConverter.java:866)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter$AvroArrayConverter.<init>(AvroIndexedRecordConverter.java:333)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:172)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:94)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:168)
>   at 
> 

[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

2018-11-27 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700913#comment-16700913
 ] 

Nandor Kollar commented on PARQUET-1441:


The compatibility check introduced by PARQUET-651 in AvroRecordConverter, 
[this|https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L866]
 one. I was referring to the Parquet version: 1.8.1 doesn't have this change, 
while 1.8.2 already does.

> SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
> 
>
> Key: PARQUET-1441
> URL: https://issues.apache.org/jira/browse/PARQUET-1441
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Reporter: Michael Heuer
>Priority: Major
>  Labels: pull-request-available
>
> The following unit test added to TestAvroSchemaConverter fails
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>   "  optional group annotation {\n" +
>   "optional group transcriptEffects (LIST) {\n" +
>   "  repeated group list {\n" +
>   "optional group element {\n" +
>   "  optional group effects (LIST) {\n" +
>   "repeated group list {\n" +
>   "  optional binary element (UTF8);\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n";
>   Configuration conf = new Configuration(false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = 
> avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> while this one succeeds
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>   "  optional group annotation {\n" +
>   "optional group transcriptEffects (LIST) {\n" +
>   "  repeated group list {\n" +
>   "optional group element {\n" +
>   "  optional group effects (LIST) {\n" +
>   "repeated group list {\n" +
>   "  optional binary element (UTF8);\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n";
>  
>   Configuration conf = new Configuration(false);
>   conf.setBoolean("parquet.avro.add-list-element-records", false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = 
> avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> I don't see a way to influence the code path in AvroIndexedRecordConverter to 
> respect this configuration, resulting in the following stack trace downstream
> {noformat}
>   Cause: org.apache.avro.SchemaParseException: Can't redefine: list
>   at org.apache.avro.Schema$Names.put(Schema.java:1128)
>   at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
>   at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema.toString(Schema.java:324)
>   at 
> org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(SchemaCompatibility.java:68)
>   at 
> org.apache.parquet.avro.AvroRecordConverter.isElementType(AvroRecordConverter.java:866)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter$AvroArrayConverter.(AvroIndexedRecordConverter.java:333)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:172)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:94)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:168)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:94)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:66)
>   at 
> org.apache.parquet.avro.AvroCompatRecordMaterializer.(AvroCompatRecordMaterializer.java:34)
>   at 
> 

[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

2018-11-27 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700502#comment-16700502
 ] 

Nandor Kollar commented on PARQUET-1441:


Looks like this compatibility check broke this scenario. The change hadn't been 
committed yet when 1.8.1 was released, but it has been backported to all later 
branches (including 1.8.x, so if 1.8.2 gets released, this case will break 
there too).

> SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
> 
>
> Key: PARQUET-1441
> URL: https://issues.apache.org/jira/browse/PARQUET-1441
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Reporter: Michael Heuer
>Priority: Major
>  Labels: pull-request-available
>
> The following unit test added to TestAvroSchemaConverter fails
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>   "  optional group annotation {\n" +
>   "optional group transcriptEffects (LIST) {\n" +
>   "  repeated group list {\n" +
>   "optional group element {\n" +
>   "  optional group effects (LIST) {\n" +
>   "repeated group list {\n" +
>   "  optional binary element (UTF8);\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n";
>   Configuration conf = new Configuration(false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = 
> avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> while this one succeeds
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>   "  optional group annotation {\n" +
>   "optional group transcriptEffects (LIST) {\n" +
>   "  repeated group list {\n" +
>   "optional group element {\n" +
>   "  optional group effects (LIST) {\n" +
>   "repeated group list {\n" +
>   "  optional binary element (UTF8);\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n";
>  
>   Configuration conf = new Configuration(false);
>   conf.setBoolean("parquet.avro.add-list-element-records", false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = 
> avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> I don't see a way to influence the code path in AvroIndexedRecordConverter to 
> respect this configuration, resulting in the following stack trace downstream
> {noformat}
>   Cause: org.apache.avro.SchemaParseException: Can't redefine: list
>   at org.apache.avro.Schema$Names.put(Schema.java:1128)
>   at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
>   at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema.toString(Schema.java:324)
>   at 
> org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(SchemaCompatibility.java:68)
>   at 
> org.apache.parquet.avro.AvroRecordConverter.isElementType(AvroRecordConverter.java:866)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter$AvroArrayConverter.(AvroIndexedRecordConverter.java:333)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:172)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:94)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:168)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:94)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:66)
>   at 
> org.apache.parquet.avro.AvroCompatRecordMaterializer.(AvroCompatRecordMaterializer.java:34)
>   at 
> org.apache.parquet.avro.AvroReadSupport.newCompatMaterializer(AvroReadSupport.java:144)
>   at 
> 

[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

2018-11-27 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700445#comment-16700445
 ] 

Nandor Kollar commented on PARQUET-1441:


After a bit more investigation, I think this is probably a Parquet issue; 
however, it isn't clear to me how this can be a regression. The failing code 
path in Parquet and in Avro was committed long ago. One possibility I can think 
of is that Spark might have recently moved from the 2-level list structure to 
3-level lists.

The unit test attached to this PR doesn't reflect the problem, because I think 
it tests the correct behaviour: in the converter one can switch between 2- and 
3-level lists with the {{parquet.avro.add-list-element-records}} property. The 
test for the Spark Jira is a lot more informative.

I think the problem is that 
[AvroRecordConverter|https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L865]
 tries to decide between a 3-level and a 2-level list by first interpreting the 
schema as 2-level and checking its compatibility with the expected Avro schema. 
Normally the two are incompatible (if the file was written as 3-level), so 
Parquet concludes that it is a 3-level list. This works fine when lists are not 
nested into other lists, but if we try to represent the 3-level nested-list 
Parquet structure as 2-level, the resulting 2-level Avro schema is not even a 
valid Avro schema!
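
To see why the 2-level reinterpretation of a nested list cannot be a valid Avro 
schema, here is a minimal, self-contained illustration (my own sketch, not code 
from parquet-mr or the attached PR): both levels of the nested list end up as 
records with the same full name {{list}}, and Avro refuses to define the same 
name twice - exactly the {{Can't redefine: list}} failure in the stack traces 
above.

{code:java}
import org.apache.avro.Schema;

public class CantRedefineListDemo {
  public static void main(String[] args) {
    // The 2-level reading of the nested list yields a record named "list"
    // whose array items are another record also named "list":
    String json =
        "{\"type\":\"record\",\"name\":\"list\",\"fields\":[{\"name\":\"element\"," +
        "\"type\":{\"type\":\"array\",\"items\":{\"type\":\"record\",\"name\":\"list\"," +
        "\"fields\":[{\"name\":\"element\",\"type\":\"string\"}]}}}]}";
    // Throws org.apache.avro.SchemaParseException: Can't redefine: list
    new Schema.Parser().parse(json);
  }
}
{code}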

> SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
> 
>
> Key: PARQUET-1441
> URL: https://issues.apache.org/jira/browse/PARQUET-1441
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Reporter: Michael Heuer
>Priority: Major
>  Labels: pull-request-available
>
> The following unit test added to TestAvroSchemaConverter fails
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>   "  optional group annotation {\n" +
>   "optional group transcriptEffects (LIST) {\n" +
>   "  repeated group list {\n" +
>   "optional group element {\n" +
>   "  optional group effects (LIST) {\n" +
>   "repeated group list {\n" +
>   "  optional binary element (UTF8);\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n";
>   Configuration conf = new Configuration(false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = 
> avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> while this one succeeds
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>   "  optional group annotation {\n" +
>   "optional group transcriptEffects (LIST) {\n" +
>   "  repeated group list {\n" +
>   "optional group element {\n" +
>   "  optional group effects (LIST) {\n" +
>   "repeated group list {\n" +
>   "  optional binary element (UTF8);\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n";
>  
>   Configuration conf = new Configuration(false);
>   conf.setBoolean("parquet.avro.add-list-element-records", false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = 
> avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> I don't see a way to influence the code path in AvroIndexedRecordConverter to 
> respect this configuration, resulting in the following stack trace downstream
> {noformat}
>   Cause: org.apache.avro.SchemaParseException: Can't redefine: list
>   at org.apache.avro.Schema$Names.put(Schema.java:1128)
>   at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
>   at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema.toString(Schema.java:324)
>   at 
> 

[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

2018-11-26 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699061#comment-16699061
 ] 

Nandor Kollar commented on PARQUET-1441:


Well, I think the correct solution is setting 
{{parquet.avro.add-list-element-records}} to false, like in the second test 
case in the attached PR.
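
For reference, a sketch of applying that property on the read path (my own 
illustration; note the reporter's point that the {{AvroIndexedRecordConverter}} 
code path may not honor it, which is what this issue is about):

{code:java}
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class ReadWithThreeLevelLists {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Treat lists as spec-compliant 3-level structures instead of adding
    // synthetic element records for the 2-level representation.
    conf.setBoolean("parquet.avro.add-list-element-records", false);

    try (ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(new Path("example.parquet"))
        .withConf(conf)
        .build()) {
      for (GenericRecord rec; (rec = reader.read()) != null; ) {
        System.out.println(rec);
      }
    }
  }
}
{code}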

> SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
> 
>
> Key: PARQUET-1441
> URL: https://issues.apache.org/jira/browse/PARQUET-1441
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Reporter: Michael Heuer
>Priority: Major
>  Labels: pull-request-available
>
> The following unit test added to TestAvroSchemaConverter fails
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>   "  optional group annotation {\n" +
>   "optional group transcriptEffects (LIST) {\n" +
>   "  repeated group list {\n" +
>   "optional group element {\n" +
>   "  optional group effects (LIST) {\n" +
>   "repeated group list {\n" +
>   "  optional binary element (UTF8);\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n";
>   Configuration conf = new Configuration(false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = 
> avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> while this one succeeds
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>   "  optional group annotation {\n" +
>   "optional group transcriptEffects (LIST) {\n" +
>   "  repeated group list {\n" +
>   "optional group element {\n" +
>   "  optional group effects (LIST) {\n" +
>   "repeated group list {\n" +
>   "  optional binary element (UTF8);\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n";
>  
>   Configuration conf = new Configuration(false);
>   conf.setBoolean("parquet.avro.add-list-element-records", false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = 
> avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> I don't see a way to influence the code path in AvroIndexedRecordConverter to 
> respect this configuration, resulting in the following stack trace downstream
> {noformat}
>   Cause: org.apache.avro.SchemaParseException: Can't redefine: list
>   at org.apache.avro.Schema$Names.put(Schema.java:1128)
>   at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
>   at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema.toString(Schema.java:324)
>   at 
> org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(SchemaCompatibility.java:68)
>   at 
> org.apache.parquet.avro.AvroRecordConverter.isElementType(AvroRecordConverter.java:866)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter$AvroArrayConverter.(AvroIndexedRecordConverter.java:333)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:172)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:94)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:168)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:94)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:66)
>   at 
> org.apache.parquet.avro.AvroCompatRecordMaterializer.(AvroCompatRecordMaterializer.java:34)
>   at 
> org.apache.parquet.avro.AvroReadSupport.newCompatMaterializer(AvroReadSupport.java:144)
>   at 
> org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:136)
>   at 
> 

[jira] [Commented] (PARQUET-1441) SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

2018-11-26 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698733#comment-16698733
 ] 

Nandor Kollar commented on PARQUET-1441:


The workaround proposed above should work; however, it is not 100% compliant 
with the Parquet [LogicalTypes 
spec|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md]. It 
states that "The middle level, named list, must be a repeated group with a 
single field named element.", but fortunately, due to the backward-compatibility 
rules for nested types, it doesn't cause an error.
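
For context, this is the spec-compliant 3-level LIST structure from the 
LogicalTypes document referenced above:

{noformat}
<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}
{noformat}

The workaround mentioned above deviates from this shape, which the 
backward-compatibility rules for nested types still tolerate.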

> SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
> 
>
> Key: PARQUET-1441
> URL: https://issues.apache.org/jira/browse/PARQUET-1441
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Reporter: Michael Heuer
>Priority: Major
>  Labels: pull-request-available
>
> The following unit test added to TestAvroSchemaConverter fails
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>   "  optional group annotation {\n" +
>   "optional group transcriptEffects (LIST) {\n" +
>   "  repeated group list {\n" +
>   "optional group element {\n" +
>   "  optional group effects (LIST) {\n" +
>   "repeated group list {\n" +
>   "  optional binary element (UTF8);\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n";
>   Configuration conf = new Configuration(false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = 
> avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> while this one succeeds
> {code:java}
> @Test
> public void testConvertedSchemaToStringCantRedefineList() throws Exception {
>   String parquet = "message spark_schema {\n" +
>   "  optional group annotation {\n" +
>   "optional group transcriptEffects (LIST) {\n" +
>   "  repeated group list {\n" +
>   "optional group element {\n" +
>   "  optional group effects (LIST) {\n" +
>   "repeated group list {\n" +
>   "  optional binary element (UTF8);\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n" +
>   "  }\n" +
>   "}\n";
>  
>   Configuration conf = new Configuration(false);
>   conf.setBoolean("parquet.avro.add-list-element-records", false);
>   AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
>   Schema schema = 
> avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
>   schema.toString();
> }
> {code}
> I don't see a way to influence the code path in AvroIndexedRecordConverter to 
> respect this configuration, resulting in the following stack trace downstream
> {noformat}
>   Cause: org.apache.avro.SchemaParseException: Can't redefine: list
>   at org.apache.avro.Schema$Names.put(Schema.java:1128)
>   at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
>   at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema.toString(Schema.java:324)
>   at 
> org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(SchemaCompatibility.java:68)
>   at 
> org.apache.parquet.avro.AvroRecordConverter.isElementType(AvroRecordConverter.java:866)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter$AvroArrayConverter.(AvroIndexedRecordConverter.java:333)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:172)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:94)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:168)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:94)
>   at 
> org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:66)
>   at 
> 

[jira] [Commented] (PARQUET-1407) Data loss on duplicate values with AvroParquetWriter/Reader

2018-11-19 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692014#comment-16692014
 ] 

Nandor Kollar commented on PARQUET-1407:


[~rdblue] [~scottcarey] [~jackytan] I added a unit test to Ryan's fix. Could 
you please review PR #552

> Data loss on duplicate values with AvroParquetWriter/Reader
> ---
>
> Key: PARQUET-1407
> URL: https://issues.apache.org/jira/browse/PARQUET-1407
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.9.0, 1.10.0, 1.8.3
>Reporter: Scott Carey
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> {code:java}
> public class Blah {
>   private static Path parquetFile = new Path("oops");
>   private static Schema schema = SchemaBuilder.record("spark_schema")
>   .fields().optionalBytes("value").endRecord();
>   private static GenericData.Record recordFor(String value) {
> return new GenericRecordBuilder(schema)
> .set("value", value.getBytes()).build();
>   }
>   public static void main(String ... args) throws IOException {
> try (ParquetWriter writer = AvroParquetWriter
>   .builder(parquetFile)
>   .withSchema(schema)
>   .build()) {
>   writer.write(recordFor("one"));
>   writer.write(recordFor("two"));
>   writer.write(recordFor("three"));
>   writer.write(recordFor("three"));
>   writer.write(recordFor("two"));
>   writer.write(recordFor("one"));
>   writer.write(recordFor("zero"));
> }
> try (ParquetReader reader = AvroParquetReader
> .builder(parquetFile)
> .withConf(new Configuration()).build()) {
>   GenericRecord rec;
>   int i = 0;
>   while ((rec = reader.read()) != null) {
> ByteBuffer buf = (ByteBuffer) rec.get("value");
> byte[] bytes = new byte[buf.remaining()];
> buf.get(bytes);
> System.out.println("rec " + i++ + ": " + new String(bytes));
>   }
> }
>   }
> }
> {code}
> Expected output:
> {noformat}
> rec 0: one
> rec 1: two
> rec 2: three
> rec 3: three
> rec 4: two
> rec 5: one
> rec 6: zero{noformat}
> Actual:
> {noformat}
> rec 0: one
> rec 1: two
> rec 2: three
> rec 3: 
> rec 4: 
> rec 5: 
> rec 6: zero{noformat}
>  
> This was found when we started getting empty byte[] values back in Spark 
> unexpectedly (Spark 2.3.1 and Parquet 1.8.3). I have not tried to reproduce 
> with Parquet 1.9.0, but it's a bad enough bug that I would like a 1.8.4 
> release that can be dropped in to replace 1.8.3 without any binary 
> compatibility issues.
>  Duplicate byte[] values are lost.
>  
> A few clues: 
> If I do not call ByteBuffer.get, the size of ByteBuffer.remaining does not go 
> to zero.  I suspect a ByteBuffer is being recycled, but the call to 
> ByteBuffer.get mutates it.  I wonder if an appropriately placed 
> ByteBuffer.duplicate() would fix it.
>  
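
Along the lines of the last clue above, here is a minimal reader-side sketch 
(my own illustration based on the reproducer, not the actual fix that landed): 
{{ByteBuffer.duplicate()}} shares the buffer's content but has an independent 
position/limit, so draining the copy leaves the original (possibly recycled) 
buffer untouched.

{code:java}
// Variation of the read loop in the reproducer: consume a duplicate so the
// shared backing buffer's position is not mutated by get().
ByteBuffer buf = ((ByteBuffer) rec.get("value")).duplicate();
byte[] bytes = new byte[buf.remaining()];
buf.get(bytes); // advances only the duplicate's position
System.out.println("rec " + i++ + ": " + new String(bytes));
{code}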



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (PARQUET-1407) Data loss on duplicate values with AvroParquetWriter/Reader

2018-11-19 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692014#comment-16692014
 ] 

Nandor Kollar edited comment on PARQUET-1407 at 11/19/18 5:09 PM:
--

[~rdblue] [~scottcarey] [~jackytan] I added a unit test to Ryan's fix. Could 
you please review PR #552?


was (Author: nkollar):
[~rdblue] [~scottcarey] [~jackytan] I added a unit test to Ryan's fix. Could 
you please review PR #552

> Data loss on duplicate values with AvroParquetWriter/Reader
> ---
>
> Key: PARQUET-1407
> URL: https://issues.apache.org/jira/browse/PARQUET-1407
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.9.0, 1.10.0, 1.8.3
>Reporter: Scott Carey
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> {code:java}
> public class Blah {
>   private static Path parquetFile = new Path("oops");
>   private static Schema schema = SchemaBuilder.record("spark_schema")
>   .fields().optionalBytes("value").endRecord();
>   private static GenericData.Record recordFor(String value) {
> return new GenericRecordBuilder(schema)
> .set("value", value.getBytes()).build();
>   }
>   public static void main(String ... args) throws IOException {
> try (ParquetWriter writer = AvroParquetWriter
>   .builder(parquetFile)
>   .withSchema(schema)
>   .build()) {
>   writer.write(recordFor("one"));
>   writer.write(recordFor("two"));
>   writer.write(recordFor("three"));
>   writer.write(recordFor("three"));
>   writer.write(recordFor("two"));
>   writer.write(recordFor("one"));
>   writer.write(recordFor("zero"));
> }
> try (ParquetReader reader = AvroParquetReader
> .builder(parquetFile)
> .withConf(new Configuration()).build()) {
>   GenericRecord rec;
>   int i = 0;
>   while ((rec = reader.read()) != null) {
> ByteBuffer buf = (ByteBuffer) rec.get("value");
> byte[] bytes = new byte[buf.remaining()];
> buf.get(bytes);
> System.out.println("rec " + i++ + ": " + new String(bytes));
>   }
> }
>   }
> }
> {code}
> Expected output:
> {noformat}
> rec 0: one
> rec 1: two
> rec 2: three
> rec 3: three
> rec 4: two
> rec 5: one
> rec 6: zero{noformat}
> Actual:
> {noformat}
> rec 0: one
> rec 1: two
> rec 2: three
> rec 3: 
> rec 4: 
> rec 5: 
> rec 6: zero{noformat}
>  
> This was found when we started getting empty byte[] values back in Spark 
> unexpectedly (Spark 2.3.1 and Parquet 1.8.3). I have not tried to reproduce 
> with Parquet 1.9.0, but it's a bad enough bug that I would like a 1.8.4 
> release that can be dropped in to replace 1.8.3 without any binary 
> compatibility issues.
>  Duplicate byte[] values are lost.
>  
> A few clues: 
> If I do not call ByteBuffer.get, the size of ByteBuffer.remaining does not go 
> to zero.  I suspect a ByteBuffer is being recycled, but the call to 
> ByteBuffer.get mutates it.  I wonder if an appropriately placed 
> ByteBuffer.duplicate() would fix it.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1451) Deprecate old logical types API

2018-11-06 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1451:
---
Fix Version/s: (was: 1.11.0)

> Deprecate old logical types API
> ---
>
> Key: PARQUET-1451
> URL: https://issues.apache.org/jira/browse/PARQUET-1451
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Zoltan Ivanfi
>Assignee: Nandor Kollar
>Priority: Major
>
> Now that the new logical types API is ready, we should deprecate the old one 
> because new types will not support it (in fact, nano precision has already 
> been added without support in the old API).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1309) Parquet Java uses incorrect stats and dictionary filter properties

2018-11-06 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1309:
---
Fix Version/s: (was: 1.10.1)

> Parquet Java uses incorrect stats and dictionary filter properties
> --
>
> Key: PARQUET-1309
> URL: https://issues.apache.org/jira/browse/PARQUET-1309
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Ryan Blue
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.11.0
>
>
> In SPARK-24251, we found that the changes to use HadoopReadOptions 
> accidentally switched the [properties that enable stats and dictionary 
> filters|https://github.com/apache/parquet-mr/blob/8bbc6cb95fd9b4b9e86c924ca1e40fd555ecac1d/parquet-hadoop/src/main/java/org/apache/parquet/HadoopReadOptions.java#L83].
>  Both are enabled by default, so it is unlikely that anyone will need to turn 
> them off, and there is an easy work-around, but we should fix the properties 
> for 1.10.1. This doesn't affect the 1.8.x or 1.9.x releases (Spark 2.3.x is 
> on 1.8.x).
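
A sketch of the easy work-around (my reading of it; the property names are the 
ones used by ParquetInputFormat): since the two property names are read 
swapped, pinning both to the same value makes the mix-up harmless.

{code:java}
import org.apache.hadoop.conf.Configuration;

// Pin both filter switches to the same (default) value, so it does not
// matter that the read options pick them up swapped.
Configuration conf = new Configuration();
conf.setBoolean("parquet.filter.stats.enabled", true);
conf.setBoolean("parquet.filter.dictionary.enabled", true);
{code}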



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1325) High-level flexible and fine-grained column level access control through encryption with pluggable key access

2018-11-06 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1325:
---
Fix Version/s: (was: 1.10.1)

> High-level flexible and fine-grained column level access control through 
> encryption with pluggable key access
> -
>
> Key: PARQUET-1325
> URL: https://issues.apache.org/jira/browse/PARQUET-1325
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Xinli Shang
>Priority: Major
>  Labels: None
>   Original Estimate: 1,440h
>  Remaining Estimate: 1,440h
>
> This JIRA is an extension to the Parquet Modular Encryption Jira (PARQUET-1178) 
> that provides the basic building blocks and APIs for the encryption 
> support. On top of PARQUET-1178, this feature will create a high-level layer 
> that enables fine-grained and flexible column-level access control, with a 
> pluggable key access module, without a need to use the low-level encryption 
> APIs. Also, this feature will enable seamless integration with existing 
> clients.
> A detailed design doc will follow soon.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1341) Null count is suppressed when columns have no min or max and use unsigned sort order

2018-11-06 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1341:
---
Fix Version/s: (was: 1.10.1)

> Null count is suppressed when columns have no min or max and use unsigned 
> sort order
> 
>
> Key: PARQUET-1341
> URL: https://issues.apache.org/jira/browse/PARQUET-1341
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1435) Benchmark filtering column-indexes

2018-11-06 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1435:
---
Fix Version/s: (was: 1.11.0)

> Benchmark filtering column-indexes
> --
>
> Key: PARQUET-1435
> URL: https://issues.apache.org/jira/browse/PARQUET-1435
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Benchmark the improvements or drawbacks of filtering with and without using 
> column-indexes. We shall also benchmark the overhead of reading and using the 
> column-indexes in cases where they are not useful (e.g. completely randomized 
> data).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1292) Add constructors to ProtoParquetWriter to write specs compliant Parquet

2018-11-06 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1292:
---
Fix Version/s: (was: 1.11.0)

> Add constructors to ProtoParquetWriter to write specs compliant Parquet
> ---
>
> Key: PARQUET-1292
> URL: https://issues.apache.org/jira/browse/PARQUET-1292
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Kunal Chawla
>Assignee: Kunal Chawla
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1419) Enable old readers to access unencrypted columns in files with plaintext footer

2018-10-18 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1419:
---
Fix Version/s: encryption-feature-branch

> Enable old readers to access unencrypted columns in files with plaintext 
> footer
> ---
>
> Key: PARQUET-1419
> URL: https://issues.apache.org/jira/browse/PARQUET-1419
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format, parquet-mr
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: encryption-feature-branch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1232) Document the modular encryption in parquet-format

2018-10-18 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1232:
---
Fix Version/s: encryption-feature-branch

> Document the modular encryption in parquet-format
> -
>
> Key: PARQUET-1232
> URL: https://issues.apache.org/jira/browse/PARQUET-1232
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: encryption-feature-branch
>
>
> Create Encryption.md from the design googledoc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1401) RowGroup offset and total compressed size fields

2018-10-18 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1401:
---
Fix Version/s: encryption-feature-branch

> RowGroup offset and total compressed size fields
> 
>
> Key: PARQUET-1401
> URL: https://issues.apache.org/jira/browse/PARQUET-1401
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: encryption-feature-branch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Spark uses the filterFileMetaData* methods in the ParquetMetadataConverter 
> class, which calculate the offset and total compressed size of a RowGroup's data.
> The offset calculation is done by extracting the ColumnMetaData of the first 
> column, and using its offset fields.
> The total compressed size calculation is done by running a loop over all 
> column chunks in the RowGroup, and summing up the size values from each 
> chunk's ColumnMetaData.
> If one or more columns are hidden (encrypted with a key unavailable to the 
> reader), these calculations can't be performed, because the column metadata 
> is protected. 
>  
> But: these calculations don't really need the individual column values. The 
> results pertain to the whole RowGroup, not specific columns. 
> Therefore, we will define two new optional fields in the RowGroup Thrift 
> structure:
>  
> _optional i64 file_offset_
> _optional i64 total_compressed_size_
>  
> and calculate/set them upon file writing. Then, Spark will be able to query a 
> file with hidden columns (of course, only if the query itself doesn't need 
> the hidden columns - it works with a masked version of them, or reads columns 
> with available keys).
>  
> These values can be set only for encrypted files (or for all files, to skip 
> the loop upon reading). I've tested this; it works fine in Spark writers and 
> readers.
>  
> I've also checked other references to ColumnMetaData fields in parquet-mr. 
> There are none - therefore, it's the only change we need in parquet.thrift to 
> handle hidden columns.
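
For illustration, the per-chunk loop the new field would make unnecessary looks 
roughly like this (a sketch against the Thrift-generated classes in 
org.apache.parquet.format, not the actual parquet-mr code):

{code:java}
import org.apache.parquet.format.ColumnChunk;
import org.apache.parquet.format.RowGroup;

// Sum the sizes from every chunk's ColumnMetaData. With hidden (encrypted)
// columns this fails, because their metadata cannot be read - a single
// RowGroup-level total_compressed_size field avoids touching the columns.
static long totalCompressedSize(RowGroup rowGroup) {
  long total = 0;
  for (ColumnChunk chunk : rowGroup.getColumns()) {
    total += chunk.getMeta_data().getTotal_compressed_size();
  }
  return total;
}
{code}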



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1331) Use new logical types API in parquet-mr

2018-10-15 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1331:
--

Assignee: (was: Nandor Kollar)

> Use new logical types API in parquet-mr
> ---
>
> Key: PARQUET-1331
> URL: https://issues.apache.org/jira/browse/PARQUET-1331
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Nandor Kollar
>Priority: Minor
>
> PARQUET-1253 introduced a new API for logical types, making OriginalTypes 
> deprecated. Parquet-mr makes several decisions based on OriginalTypes; this 
> logic should be replaced with the new logical types API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1331) Use new logical types API in parquet-mr

2018-10-15 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1331.

Resolution: Duplicate

> Use new logical types API in parquet-mr
> ---
>
> Key: PARQUET-1331
> URL: https://issues.apache.org/jira/browse/PARQUET-1331
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Nandor Kollar
>Priority: Minor
>
> PARQUET-1253 introduced a new API for logical types, making OriginalTypes 
> deprecated. Parquet-mr makes several decisions based on OriginalTypes; this 
> logic should be replaced with the new logical types API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1433) Parquet-format doesn't compile with Thrift 0.10.0

2018-10-15 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1433:
---
Fix Version/s: format-2.7.0

> Parquet-format doesn't compile with Thrift 0.10.0
> -
>
> Key: PARQUET-1433
> URL: https://issues.apache.org/jira/browse/PARQUET-1433
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.7.0
>
>
> Compilation of parquet-format fails with Thrift 0.10.0:
> [ERROR] thrift failed error: [FAILURE:generation:1] Error: unknown
> option java:hashcode



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1436) TimestampMicrosStringifier shows wrong microseconds for timestamps before 1970

2018-10-15 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1436:
--

Assignee: Nandor Kollar

> TimestampMicrosStringifier shows wrong microseconds for timestamps before 1970
> --
>
> Key: PARQUET-1436
> URL: https://issues.apache.org/jira/browse/PARQUET-1436
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Reporter: Zoltan Ivanfi
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> testTimestampMicrosStringifier takes the timestamp 1848-03-15T09:23:59.765 
> and subtracts 1 microsecond from it. The result (both expected and actual) 
> is 1848-03-15T09:23:59.765001, but it should be 1848-03-15T09:23:59.764999 
> instead.
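
The off-by-two-microseconds pattern is typical of truncating division on 
negative epoch values; a small sketch of the usual remedy (my assumption about 
the cause, not the actual patch):

{code:java}
// '/' and '%' truncate toward zero, so micros before the epoch get a negative
// remainder; floorDiv/floorMod keep the sub-second part in [0, 1_000_000).
long micros = -1L;                                      // 1 us before 1970-01-01T00:00:00Z
long seconds = Math.floorDiv(micros, 1_000_000L);       // -1
long microOfSecond = Math.floorMod(micros, 1_000_000L); // 999_999 -> ...:59.999999
{code}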



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1383) Parquet tools should indicate UTC parameter for time/timestamp types

2018-10-11 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1383:
---
Fix Version/s: 1.11.0

> Parquet tools should indicate UTC parameter for time/timestamp types
> 
>
> Key: PARQUET-1383
> URL: https://issues.apache.org/jira/browse/PARQUET-1383
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> Parquet-tools should indicate if a time/timestamp is UTC-adjusted or timezone 
> agnostic, and the values written by the tools should take the UTC-normalized 
> parameter into account. Right now, every time and timestamp value is adjusted 
> to UTC when printed via parquet-tools.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1434) Release parquet-mr 1.11.0

2018-10-03 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1434:
--

 Summary: Release parquet-mr 1.11.0
 Key: PARQUET-1434
 URL: https://issues.apache.org/jira/browse/PARQUET-1434
 Project: Parquet
  Issue Type: Task
  Components: parquet-mr
Reporter: Nandor Kollar






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1388) Nanosecond precision time and timestamp - parquet-mr

2018-10-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1388:
---
Fix Version/s: 1.11.0

> Nanosecond precision time and timestamp - parquet-mr
> 
>
> Key: PARQUET-1388
> URL: https://issues.apache.org/jira/browse/PARQUET-1388
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1201) Column indexes

2018-10-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1201:
---
Fix Version/s: 1.11.0

> Column indexes
> --
>
> Key: PARQUET-1201
> URL: https://issues.apache.org/jira/browse/PARQUET-1201
> Project: Parquet
>  Issue Type: New Feature
>Affects Versions: 1.10.0
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.5.0, 1.11.0
>
>
> Write the column indexes described in PARQUET-922.
>  This is the first phase of implementing the whole feature. The 
> implementation is done in the following steps:
>  * Utility to read/write indexes in parquet-format
>  * Writing indexes in the parquet file
>  * Extend parquet-tools and parquet-cli to show the indexes
>  * Limit index size based on parquet properties
>  * Trim min/max values where possible based on parquet properties
>  * Filtering based on column indexes
> The work is done on the feature branch {{column-indexes}}. This JIRA will be 
> resolved after the branch has been merged to {{master}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1368) ParquetFileReader should close its input stream for the failure in constructor

2018-10-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1368:
---
Fix Version/s: 1.11.0

> ParquetFileReader should close its input stream for the failure in constructor
> --
>
> Key: PARQUET-1368
> URL: https://issues.apache.org/jira/browse/PARQUET-1368
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> I was trying to replace the deprecated usage of {{readFooter}} with 
> {{ParquetFileReader.open}}, according to the note:
> {code}
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:368:
>  method readFooter in object ParquetFileReader is deprecated: see 
> corresponding Javadoc for more information.
> [warn] ParquetFileReader.readFooter(sharedConf, filePath, 
> SKIP_ROW_GROUPS).getFileMetaData
> [warn]   ^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:545:
>  method readFooter in object ParquetFileReader is deprecated: see 
> corresponding Javadoc for more information.
> [warn] ParquetFileReader.readFooter(
> [warn]   ^
> {code}
> Then, I realised some test suites report a resource leak:
> {code}
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
>   at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:687)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:595)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.createParquetReader(ParquetUtils.scala:67)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.readFooter(ParquetUtils.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:544)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:539)
>   at 
> scala.collection.parallel.AugmentedIterableIterator$class.flatmap2combiner(RemainsIterator.scala:132)
>   at 
> scala.collection.parallel.immutable.ParVector$ParVectorIterator.flatmap2combiner(ParVector.scala:62)
>   at 
> scala.collection.parallel.ParIterableLike$FlatMap.leaf(ParIterableLike.scala:1072)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
>   at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
>   at 
> scala.collection.parallel.ParIterableLike$FlatMap.tryLeaf(ParIterableLike.scala:1068)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:159)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:443)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:149)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
>   at 
> scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at scala.concurrent.forkjoin.ForkJoinTask.doJoin(ForkJoinTask.java:341)
>   at scala.concurrent.forkjoin.ForkJoinTask.join(ForkJoinTask.java:673)
>   at 
> scala.collection.parallel.ForkJoinTasks$WrappedTask$class.sync(Tasks.scala:378)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.sync(Tasks.scala:443)
>   at 
> scala.collection.parallel.ForkJoinTasks$class.executeAndWaitResult(Tasks.scala:426)
>   at 
> scala.collection.parallel.ForkJoinTaskSupport.executeAndWaitResult(TaskSupport.scala:56)
>   at 
> scala.collection.parallel.ParIterableLike$ResultMapping.leaf(ParIterableLike.scala:958)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
> 
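
The fix this issue asks for amounts to the usual close-on-failure pattern; a 
sketch of the idea (my illustration, not the merged patch - {{initialize}} 
stands for whatever fallible work follows opening the stream):

{code:java}
// Close the freshly opened stream before rethrowing, so a constructor
// failure does not leak a SeekableInputStream.
SeekableInputStream in = file.newStream();
try {
  initialize(in); // e.g. reading the footer; any failure here used to leak 'in'
} catch (Exception e) {
  in.close();
  throw e;
}
{code}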

[jira] [Resolved] (PARQUET-1428) Move columnar encryption into its feature branch

2018-10-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1428.

Resolution: Fixed

> Move columnar encryption into its feature branch
> 
>
> Key: PARQUET-1428
> URL: https://issues.apache.org/jira/browse/PARQUET-1428
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1428) Move columnar encryption into its feature branch

2018-10-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1428:
--

Assignee: Nandor Kollar

> Move columnar encryption into its feature branch
> 
>
> Key: PARQUET-1428
> URL: https://issues.apache.org/jira/browse/PARQUET-1428
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1424) Release parquet-format 2.6.0

2018-10-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1424.

Resolution: Fixed

> Release parquet-format 2.6.0
> 
>
> Key: PARQUET-1424
> URL: https://issues.apache.org/jira/browse/PARQUET-1424
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>
> Release parquet-format 2.6.0
> The release requires reverting the merged PRs related to columnar 
> encryption, since there's no signed spec yet. Those PRs should be developed 
> on a feature branch instead and merged to master once the spec is signed and 
> the format changes are ready to be released.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1424) Release parquet-format 2.6.0

2018-10-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1424:
--

Assignee: Nandor Kollar

> Release parquet-format 2.6.0
> 
>
> Key: PARQUET-1424
> URL: https://issues.apache.org/jira/browse/PARQUET-1424
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>
> Release parquet-format 2.6.0
> The release requires reverting the merged PRs related to columnar 
> encryption, since there's no signed spec yet. Those PRs should be developed 
> on a feature branch instead and merged to master once the spec is signed and 
> the format changes are ready to be released.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1399) Move parquet-mr related code from parquet-format

2018-10-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1399:
---
Fix Version/s: 1.11.0

> Move parquet-mr related code from parquet-format
> 
>
> Key: PARQUET-1399
> URL: https://issues.apache.org/jira/browse/PARQUET-1399
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> There are Java classes in the 
> [parquet-format|https://github.com/apache/parquet-format] repo that should be 
> in the [parquet-mr|https://github.com/apache/parquet-mr] repo instead: [java 
> classes|https://github.com/apache/parquet-format/tree/master/src/main] and 
> [test classes|https://github.com/apache/parquet-format/tree/master/src/test]
> The idea is to create a separate module in 
> [parquet-mr|https://github.com/apache/parquet-mr] and depend on it instead of 
> depending on [parquet-format|https://github.com/apache/parquet-format]. Only 
> this separate module would depend on 
> [parquet-format|https://github.com/apache/parquet-format] directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1433) Parquet-format doesn't compile with Thrift 0.10.0

2018-10-01 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1433:
--

Assignee: Nandor Kollar

> Parquet-format doesn't compile with Thrift 0.10.0
> -
>
> Key: PARQUET-1433
> URL: https://issues.apache.org/jira/browse/PARQUET-1433
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
>
> Compilation of parquet-format fails with Thrift 0.10.0:
> [ERROR] thrift failed error: [FAILURE:generation:1] Error: unknown
> option java:hashcode



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1433) Parquet-format doesn't compile with Thrift 0.10.0

2018-10-01 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1433:
--

 Summary: Parquet-format doesn't compile with Thrift 0.10.0
 Key: PARQUET-1433
 URL: https://issues.apache.org/jira/browse/PARQUET-1433
 Project: Parquet
  Issue Type: Task
  Components: parquet-format
Reporter: Nandor Kollar


Compilation of parquet-format fails with Thrift 0.10.0:

[ERROR] thrift failed error: [FAILURE:generation:1] Error: unknown
option java:hashcode



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1429) Turn off DocLint on parquet-format

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1429.

Resolution: Fixed

> Turn off DocLint on parquet-format
> --
>
> Key: PARQUET-1429
> URL: https://issues.apache.org/jira/browse/PARQUET-1429
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.6.0
>
>
> DocLint was introduced in Java 8, and since the generated code in 
> parquet-format has several issues found by DocLint, the attach-javadocs goal 
> will fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1429) Turn off DocLint on parquet-format

2018-09-27 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1429:
--

 Summary: Turn off DocLint on parquet-format
 Key: PARQUET-1429
 URL: https://issues.apache.org/jira/browse/PARQUET-1429
 Project: Parquet
  Issue Type: Task
  Components: parquet-format
Reporter: Nandor Kollar
Assignee: Nandor Kollar
 Fix For: format-2.6.0


DocLint was introduced in Java 8, and since the generated code in parquet-format 
has several issues found by DocLint, the attach-javadocs goal will fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1227) Thrift crypto metadata structures

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1227:
---
Fix Version/s: format-encryption-feature-branch

> Thrift crypto metadata structures
> -
>
> Key: PARQUET-1227
> URL: https://issues.apache.org/jira/browse/PARQUET-1227
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0, format-encryption-feature-branch
>
>
> New Thrift structures for Parquet modular encryption



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1398) Separate iv_prefix for GCM and CTR modes

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1398:
---
Fix Version/s: format-encryption-feature-branch

> Separate iv_prefix for GCM and CTR modes
> 
>
> Key: PARQUET-1398
> URL: https://issues.apache.org/jira/browse/PARQUET-1398
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Minor
>  Labels: pull-request-available
> Fix For: format-encryption-feature-branch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is an ambiguity in what the iv_prefix applies to - GCM or CTR or both. 
> This parameter will be moved to the Algorithms structures (from the 
> FileCryptoMetaData structure).
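
For background, a minimal JDK-crypto sketch of why the two modes take 
differently sized IVs; this is illustrative context only, not Parquet code:

{code:java}
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class IvModes {
  public static void main(String[] args) throws Exception {
    SecretKeySpec key = new SecretKeySpec(new byte[16], "AES");

    // AES-GCM conventionally takes a 12-byte IV (nonce) plus an auth tag length
    Cipher gcm = Cipher.getInstance("AES/GCM/NoPadding");
    gcm.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, new byte[12]));

    // AES-CTR takes a full 16-byte counter block as its IV
    Cipher ctr = Cipher.getInstance("AES/CTR/NoPadding");
    ctr.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(new byte[16]));
  }
}
{code}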



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1424) Release parquet-format 2.6.0

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1424:
---
Fix Version/s: (was: format-2.6.0)

> Release parquet-format 2.6.0
> 
>
> Key: PARQUET-1424
> URL: https://issues.apache.org/jira/browse/PARQUET-1424
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Nandor Kollar
>Priority: Major
>
> Release parquet-format 2.6.0
> The release requires reverting the merged PRs related to columnar 
> encryption, since there's no signed spec yet. Those PRs should be developed 
> on a feature branch instead and merged to master once the spec is signed and 
> the format changes are ready to be released.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1227) Thrift crypto metadata structures

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1227:
---
Fix Version/s: (was: format-2.6.0)

> Thrift crypto metadata structures
> -
>
> Key: PARQUET-1227
> URL: https://issues.apache.org/jira/browse/PARQUET-1227
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> New Thrift structures for Parquet modular encryption



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1398) Separate iv_prefix for GCM and CTR modes

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1398:
---
Fix Version/s: (was: format-2.6.0)

> Separate iv_prefix for GCM and CTR modes
> 
>
> Key: PARQUET-1398
> URL: https://issues.apache.org/jira/browse/PARQUET-1398
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is an ambiguity in what the iv_prefix applies to - GCM or CTR or both. 
> This parameter will be moved to the Algorithms structures (from the 
> FileCryptoMetaData structure).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1428) Move columnar encryption into its feature branch

2018-09-27 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1428:
--

 Summary: Move columnar encryption into its feature branch
 Key: PARQUET-1428
 URL: https://issues.apache.org/jira/browse/PARQUET-1428
 Project: Parquet
  Issue Type: Task
  Components: parquet-format
Reporter: Nandor Kollar






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1425) [Format] Fix Thrift compiler warning

2018-09-27 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1425:
---
Fix Version/s: (was: format-2.6.0)

> [Format] Fix Thrift compiler warning
> 
>
> Key: PARQUET-1425
> URL: https://issues.apache.org/jira/browse/PARQUET-1425
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Wes McKinney
>Priority: Major
>
> I see this warning frequently
> {code}
> [1/127] Running thrift compiler on parquet.thrift
> [WARNING:/home/wesm/code/arrow/cpp/src/parquet/parquet.thrift:295] The "byte" 
> type is a compatibility alias for "i8". Use "i8" to emphasize the signedness 
> of this type.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1425) [Format] Fix Thrift compiler warning

2018-09-26 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628950#comment-16628950
 ] 

Nandor Kollar commented on PARQUET-1425:


[~wesmckinn] would you like this to be fixed in the 2.6.0 format release, or is 
it fine for you to get this done in a later release?

Currently parquet-format depends on Thrift 0.9.3, and we also use this older 
version downstream. i8 was introduced in 0.10.0 (according to [Thrift 
Jira|https://issues.apache.org/jira/browse/THRIFT-3393]), and if we change byte 
to i8, the 0.9.3 thrift compiler will fail.

> [Format] Fix Thrift compiler warning
> 
>
> Key: PARQUET-1425
> URL: https://issues.apache.org/jira/browse/PARQUET-1425
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: format-2.6.0
>
>
> I see this warning frequently
> {code}
> [1/127] Running thrift compiler on parquet.thrift
> [WARNING:/home/wesm/code/arrow/cpp/src/parquet/parquet.thrift:295] The "byte" 
> type is a compatibility alias for "i8". Use "i8" to emphasize the signedness 
> of this type.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1387) Nanosecond precision time and timestamp - parquet-format

2018-09-26 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1387:
---
Fix Version/s: format-2.6.0

> Nanosecond precision time and timestamp - parquet-format
> 
>
> Key: PARQUET-1387
> URL: https://issues.apache.org/jira/browse/PARQUET-1387
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.6.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1266) LogicalTypes union in parquet-format doesn't include UUID

2018-09-26 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1266:
---
Fix Version/s: format-2.6.0

> LogicalTypes union in parquet-format doesn't include UUID
> -
>
> Key: PARQUET-1266
> URL: https://issues.apache.org/jira/browse/PARQUET-1266
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
> Fix For: format-2.6.0
>
>
> The new parquet-format logical type representation doesn't include the UUID type



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1290) Clarify maximum run lengths for RLE encoding

2018-09-26 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1290:
---
Fix Version/s: format-2.6.0

> Clarify maximum run lengths for RLE encoding
> 
>
> Key: PARQUET-1290
> URL: https://issues.apache.org/jira/browse/PARQUET-1290
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
> Fix For: format-2.6.0
>
>
> The Parquet spec isn't clear about what the upper bound on run lengths in the 
> RLE encoding is - 
> https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
>  .
> It sounds like, in practice, the major implementations don't support run 
> lengths > (2^31 - 1) - see 
> https://lists.apache.org/thread.html/6731a94a98b790ad24a9a5bb4e1bf9bb799d729e948e046efb40014f@%3Cdev.parquet.apache.org%3E
> I propose that we limit {{bit-pack-count}} and {{number of times repeated}} 
> to <= 2^31.
> It seems unlikely that there are parquet files in existence with larger run 
> lengths, given that it requires huge numbers of values per page and major 
> implementations can't write or read such files without overflowing integers. 
> Maybe it would be possible if all the columns in a file were extremely 
> compressible, but it seems like in practice most implementations will hit 
> page or file size limits before producing a very large run.
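
As a sketch of where the proposed bound bites, assuming the varint run header 
described in Encodings.md (names and class are illustrative, not parquet-mr 
source):

{code:java}
import java.io.ByteArrayOutputStream;

public class RleRunHeader {
  // Proposed cap: run lengths must fit in a signed 32-bit int
  static final long MAX_RUN_LENGTH = Integer.MAX_VALUE;

  // RLE run header = varint of (run-length << 1); the LSB 0 marks an RLE run
  static void writeRleRunHeader(ByteArrayOutputStream out, long runLength) {
    if (runLength > MAX_RUN_LENGTH) {
      // Readers that keep run lengths in 32-bit ints overflow past this point,
      // which is what the proposed spec clarification rules out
      throw new IllegalArgumentException("run length too large: " + runLength);
    }
    writeUnsignedVarInt(out, runLength << 1);
  }

  // ULEB128 / unsigned varint, as used by the hybrid encoding's headers
  static void writeUnsignedVarInt(ByteArrayOutputStream out, long v) {
    while ((v & ~0x7FL) != 0) {
      out.write((int) ((v & 0x7F) | 0x80));
      v >>>= 7;
    }
    out.write((int) v);
  }
}
{code}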



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1424) Release parquet-format 2.6.0

2018-09-25 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1424:
--

 Summary: Release parquet-format 2.6.0
 Key: PARQUET-1424
 URL: https://issues.apache.org/jira/browse/PARQUET-1424
 Project: Parquet
  Issue Type: Task
  Components: parquet-format
Reporter: Nandor Kollar
 Fix For: format-2.6.0


Release parquet-format 2.6.0

The release requires reverting the merged PRs related to columnar 
encryption, since there's no signed spec yet. Those PRs should be developed on 
a feature branch instead and merged to master once the spec is signed and the 
format changes are ready to be released.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1383) Parquet tools should indicate UTC parameter for time/timestamp types

2018-09-12 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1383:
---
Description: Parquet-tools should indicate if a time/timestamp is UTC 
adjusted or timezone agnostic, and the values written by the tools should take 
UTC normalized parameters into account. Right now, every time and timestamp 
value is adjusted to UTC when printed via parquet-tools  (was: Currently, 
parquet-tools prints the original type. Since the new logical type API has been 
introduced, it would be better to print it instead of, or besides, the original 
type.

Also, the values written by the tools should take UTC normalized parameters 
into account. Right now, every time and timestamp value is adjusted to UTC when 
printed via parquet-tools)

> Parquet tools should indicate UTC parameter for time/timestamp types
> 
>
> Key: PARQUET-1383
> URL: https://issues.apache.org/jira/browse/PARQUET-1383
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
>  Labels: pull-request-available
>
> Parquet-tools should indicate if a time/timestamp is UTC adjusted or timezone 
> agnostic, and the values written by the tools should take UTC normalized 
> parameters into account. Right now, every time and timestamp value is 
> adjusted to UTC when printed via parquet-tools
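
A minimal sketch of what the tool could print, assuming the new 
LogicalTypeAnnotation API; the label strings and method are illustrative:

{code:java}
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.LogicalTypeAnnotation.TimestampLogicalTypeAnnotation;
import org.apache.parquet.schema.Type;

public class TimestampDescriber {
  // Returns a human-readable label that includes the UTC adjustment flag
  static String describe(Type type) {
    LogicalTypeAnnotation annotation = type.getLogicalTypeAnnotation();
    if (annotation instanceof TimestampLogicalTypeAnnotation) {
      TimestampLogicalTypeAnnotation ts = (TimestampLogicalTypeAnnotation) annotation;
      // isAdjustedToUTC() distinguishes instant semantics from local semantics
      return ts.isAdjustedToUTC()
          ? "TIMESTAMP (UTC adjusted)"
          : "TIMESTAMP (timezone agnostic)";
    }
    return String.valueOf(annotation);
  }
}
{code}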



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1383) Parquet tools should indicate UTC parameter for time/timestamp types

2018-09-12 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1383:
---
Summary: Parquet tools should indicate UTC parameter for time/timestamp 
types  (was: Parquet tools should print logical type instead of (or besides) 
original type)

> Parquet tools should indicate UTC parameter for time/timestamp types
> 
>
> Key: PARQUET-1383
> URL: https://issues.apache.org/jira/browse/PARQUET-1383
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, parquet-tools prints the original type. Since the new logical 
> type API has been introduced, it would be better to print it instead of, or 
> besides, the original type.
> Also, the values written by the tools should take UTC normalized parameters 
> into account. Right now, every time and timestamp value is adjusted to UTC 
> when printed via parquet-tools



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1371) Time/Timestamp UTC normalization parameter doesn't work

2018-09-06 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1371:
---
Fix Version/s: 1.11.0

> Time/Timestamp UTC normalization parameter doesn't work
> ---
>
> Key: PARQUET-1371
> URL: https://issues.apache.org/jira/browse/PARQUET-1371
> Project: Parquet
>  Issue Type: Bug
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> After creating a Parquet file with a non-UTC normalized logical type, when 
> reading it back with the API, the result shows it is UTC normalized. It looks 
> like the read path incorrectly reads the actual logical type (with the new API).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1410) Refactor modules to use the new logical type API

2018-08-31 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1410:
--

 Summary: Refactor modules to use the new logical type API
 Key: PARQUET-1410
 URL: https://issues.apache.org/jira/browse/PARQUET-1410
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Nandor Kollar
Assignee: Nandor Kollar


Refactor parquet-mr modules to use the new logical type API for internal 
decisions (e.g.: replace the OriginalType-based switch cases with a more 
flexible solution, for example in the type builder when checking if the proper 
annotation is present on the physical type)
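
A sketch of the kind of call-site change meant here, assuming the 
Optional-returning visitor API (a stand-in example, not an actual parquet-mr 
call site):

{code:java}
import java.util.Optional;
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.LogicalTypeAnnotation.LogicalTypeAnnotationVisitor;
import org.apache.parquet.schema.LogicalTypeAnnotation.StringLogicalTypeAnnotation;

public class AnnotationDispatch {
  // Before: switch (type.getOriginalType()) { case UTF8: ... }
  // After: dispatch on the annotation itself, with no enum switch to maintain
  static boolean isString(LogicalTypeAnnotation annotation) {
    return annotation.accept(new LogicalTypeAnnotationVisitor<Boolean>() {
      @Override
      public Optional<Boolean> visit(StringLogicalTypeAnnotation stringType) {
        return Optional.of(true);
      }
    }).orElse(false);
  }
}
{code}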



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1388) Nanosecond precision time and timestamp - parquet-mr

2018-08-31 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PARQUET-1388:
--

Assignee: Nandor Kollar

> Nanosecond precision time and timestamp - parquet-mr
> 
>
> Key: PARQUET-1388
> URL: https://issues.apache.org/jira/browse/PARQUET-1388
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1401) RowGroup offset and total compressed size fields

2018-08-23 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1401:
---
Labels: pull-request-available  (was: )

> RowGroup offset and total compressed size fields
> 
>
> Key: PARQUET-1401
> URL: https://issues.apache.org/jira/browse/PARQUET-1401
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
>
> Spark uses the filterFileMetaData* methods in the ParquetMetadataConverter 
> class, which calculate the offset and total compressed size of a RowGroup's data.
> The offset calculation is done by extracting the ColumnMetaData of the first 
> column, and using its offset fields.
> The total compressed size calculation is done by running a loop over all 
> column chunks in the RowGroup, and summing up the size values from each 
> chunk's ColumnMetaData.
> If one or more columns are hidden (encrypted with a key unavailable to the 
> reader), these calculations can't be performed, because the column metadata 
> is protected. 
>  
> But: these calculations don't really need the individual column values. The 
> results pertain to the whole RowGroup, not specific columns. 
> Therefore, we will define two new optional fields in the RowGroup Thrift 
> structure:
>  
> _optional i64 file_offset_
> _optional i64 total_compressed_size_
>  
> and calculate/set them upon file writing. Then, Spark will be able to query a 
> file with hidden columns (of course, only if the query itself doesn't need 
> the hidden columns, but works with a masked version of them, or reads columns 
> with available keys).
>  
> These values can be set only for encrypted files (or for all files, to skip 
> the loop upon reading). I've tested this, works fine in Spark writers and 
> readers.
>  
> I've also checked other references to ColumnMetaData fields in parquet-mr. 
> There are none - therefore, it's the only change we need in parquet.thrift to 
> handle hidden columns.
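
For reference, a sketch of the reader-side computation these fields would make 
unnecessary, assuming parquet-mr's metadata accessors:

{code:java}
import java.util.List;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;

public class RowGroupSize {
  static long totalCompressedSize(BlockMetaData rowGroup) {
    long total = 0;
    // This loop needs every column's metadata; with hidden (encrypted)
    // columns it cannot run, which is what the new optional fields avoid
    for (ColumnChunkMetaData chunk : rowGroup.getColumns()) {
      total += chunk.getTotalSize();
    }
    return total;
  }

  static long fileOffset(BlockMetaData rowGroup) {
    List<ColumnChunkMetaData> columns = rowGroup.getColumns();
    // The offset is taken from the first column chunk's metadata
    return columns.get(0).getStartingPos();
  }
}
{code}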



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1388) Nanosecond precision time and timestamp - parquet-mr

2018-08-17 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1388:
--

 Summary: Nanosecond precision time and timestamp - parquet-mr
 Key: PARQUET-1388
 URL: https://issues.apache.org/jira/browse/PARQUET-1388
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Nandor Kollar






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1387) Nanosecond precision time and timestamp - parquet-format

2018-08-17 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1387:
---
Fix Version/s: (was: format-2.6.0)

> Nanosecond precision time and timestamp - parquet-format
> 
>
> Key: PARQUET-1387
> URL: https://issues.apache.org/jira/browse/PARQUET-1387
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1387) Nanosecond precision time and timestamp - parquet-format

2018-08-17 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1387:
---
Fix Version/s: format-2.6.0

> Nanosecond precision time and timestamp - parquet-format
> 
>
> Key: PARQUET-1387
> URL: https://issues.apache.org/jira/browse/PARQUET-1387
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
> Fix For: format-2.6.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1387) Nanosecond precision time and timestamp - parquet-format

2018-08-17 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1387:
--

 Summary: Nanosecond precision time and timestamp - parquet-format
 Key: PARQUET-1387
 URL: https://issues.apache.org/jira/browse/PARQUET-1387
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Nandor Kollar
Assignee: Nandor Kollar






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1383) Parquet tools should print logical type instead of (or besides) original type

2018-08-16 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1383:
--

 Summary: Parquet tools should print logical type instead of (or 
besides) original type
 Key: PARQUET-1383
 URL: https://issues.apache.org/jira/browse/PARQUET-1383
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Nandor Kollar
Assignee: Nandor Kollar


Currently, parquet-tools prints the original type. Since the new logical type 
API has been introduced, it would be better to print it instead of, or besides, 
the original type.

Also, the values written by the tools should take UTC normalized parameters 
into account. Right now, every time and timestamp value is adjusted to UTC when 
printed via parquet-tools



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1371) Time/Timestamp UTC normalization parameter doesn't work

2018-08-06 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1371:
--

 Summary: Time/Timestamp UTC normalization parameter doesn't work
 Key: PARQUET-1371
 URL: https://issues.apache.org/jira/browse/PARQUET-1371
 Project: Parquet
  Issue Type: Bug
Reporter: Nandor Kollar
Assignee: Nandor Kollar


After creating a Parquet file with a non-UTC normalized logical type, when 
reading it back with the API, the result shows it is UTC normalized. It looks 
like the read path incorrectly reads the actual logical type (with the new API).
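
A minimal sketch of the schema side of the reproduction, assuming the new 
logical type API (the faulty flip happens in the file read path, not shown 
here):

{code:java}
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class UtcFlagRepro {
  public static void main(String[] args) {
    // A timestamp that is explicitly NOT adjusted to UTC
    PrimitiveType ts = Types.required(PrimitiveTypeName.INT64)
        .as(LogicalTypeAnnotation.timestampType(false /* isAdjustedToUTC */,
            LogicalTypeAnnotation.TimeUnit.MILLIS))
        .named("ts");
    // Per this bug, after writing a file with this schema and reading it back,
    // the annotation reported isAdjustedToUTC == true instead of false
    System.out.println(ts.getLogicalTypeAnnotation());
  }
}
{code}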



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1364) Column Indexes: Invalid row indexes for pages starting with nulls

2018-08-03 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1364:
---
Fix Version/s: 1.11.0

> Column Indexes: Invalid row indexes for pages starting with nulls
> -
>
> Key: PARQUET-1364
> URL: https://issues.apache.org/jira/browse/PARQUET-1364
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> The current implementation for managing row indexes for the pages is 
> not reliable. There is logic in 
> [MessageColumnIO|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L153]
>  which caches null values and flushes them just *before* opening a new group. 
> This logic might cause pages to start with these cached nulls, which are not 
> correctly counted in the written rows, so the row indexes are incorrect. It 
> does not cause any issues if all the pages are read continuously, but it is a 
> huge problem for column index based filtering.
> The implementation described above is really complicated, and we would not 
> like to redesign it because of the mentioned issue. It is easier to simply 
> count the {{0}} repetition levels as record boundaries at the column writer level.
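
A sketch of the proposed counting at the column writer level (illustrative 
only, not the actual patch):

{code:java}
public class RecordBoundaryCounter {
  private long rowCount = 0;

  // Called for every value written to the column, with its repetition level
  void valueWritten(int repetitionLevel) {
    if (repetitionLevel == 0) {
      // Repetition level 0 means "start of a new record", so cached nulls
      // flushed at group boundaries can no longer skew the page row counts
      rowCount++;
    }
  }

  long getRowCount() {
    return rowCount;
  }
}
{code}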



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1367) upgrade libraries to work around security issues

2018-08-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1367:
---
Description: 
There are a number of libraries which need updating.  Among other reasons, 
there are several security issues filed in CVE for 
[Hadoop|https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=hadoop] and 
[guava|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-10237]

 

 

  was:
There are a number of libraries which need updating.  Among other reasons, 
there are [several security issues filed in CVE for 
Hadoop|https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=hadoop] and 
[guava|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-10237]

 

 


> upgrade libraries to work around security issues
> 
>
> Key: PARQUET-1367
> URL: https://issues.apache.org/jira/browse/PARQUET-1367
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Matt Darwin
>Priority: Major
>  Labels: pull-request-available, security
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> There are a number of libraries which need updating.  Among other reasons, 
> there are several security issues filed in CVE for 
> [Hadoop|https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=hadoop] and 
> [guava|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-10237]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1336) PrimitiveComparator should implement Serializable

2018-08-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1336:
---
Fix Version/s: 1.11.0

> PrimitiveComparator should implement Serializable 
> ---
>
> Key: PARQUET-1336
> URL: https://issues.apache.org/jira/browse/PARQUET-1336
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> {code:java}
> [info] Cause: java.lang.RuntimeException: java.io.NotSerializableException: 
> org.apache.parquet.schema.PrimitiveComparator$8
> [info] at 
> org.apache.parquet.hadoop.ParquetInputFormat.setFilterPredicate(ParquetInputFormat.java:211)
> [info] at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:399)
> [info] at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:349)
> [info] at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:128)
> [info] at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
> [info] at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
> [info] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> [info] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> [info] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> [info] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> [info] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> [info] at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1791)
> [info] at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1162)
> [info] at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1162)
> [info] at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2071)
> [info] at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2071)
> [info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> [info] at org.apache.spark.scheduler.Task.run(Task.scala:109)
> [info] at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:367)
> {code}
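
The fix amounts to making the comparator hierarchy serializable; a stand-in 
sketch, not the actual parquet-mr source:

{code:java}
import java.io.Serializable;
import java.util.Comparator;

// Anonymous subclasses (like PrimitiveComparator$8 in the stack trace above)
// inherit Serializable, so a FilterPredicate holding one can be shipped to
// Spark executors via Java serialization.
public abstract class SerializableComparator<T> implements Comparator<T>, Serializable {
}
{code}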



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1321) LogicalTypeAnnotation.LogicalTypeAnnotationVisitor#visit methods should have a return value

2018-08-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1321:
---
Fix Version/s: 1.11.0

> LogicalTypeAnnotation.LogicalTypeAnnotationVisitor#visit methods should have 
> a return value
> ---
>
> Key: PARQUET-1321
> URL: https://issues.apache.org/jira/browse/PARQUET-1321
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
> Fix For: 1.11.0
>
>
> LogicalTypeAnnotationVisitor inside LogicalTypeAnnotation is intended to be 
> used by clients who would like to execute custom code which depends on the 
> type of the logical type annotation. However, it is difficult to give back 
> any result from the visitor, since both LogicalTypeAnnotation#accept and 
> LogicalTypeAnnotationVisitor have void return types.
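
A minimal sketch of the requested shape, with visit methods returning Optional 
so callers can extract a result (a toy interface, not the final API):

{code:java}
import java.util.Optional;

interface AnnotationVisitor<T> {
  // Default implementations return empty, so clients override only the
  // annotation types they care about
  default Optional<T> visitString() { return Optional.empty(); }
  default Optional<T> visitInteger(int bitWidth, boolean signed) { return Optional.empty(); }
}

public class VisitorReturnValue {
  public static void main(String[] args) {
    AnnotationVisitor<String> toName = new AnnotationVisitor<String>() {
      @Override
      public Optional<String> visitString() { return Optional.of("STRING"); }
    };
    // With a void visitor, the caller would need mutable state to extract this
    System.out.println(toName.visitString().orElse("UNKNOWN"));
  }
}
{code}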



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1227) Thrift crypto metadata structures

2018-08-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1227:
---
Fix Version/s: format-2.6.0

> Thrift crypto metadata structures
> -
>
> Key: PARQUET-1227
> URL: https://issues.apache.org/jira/browse/PARQUET-1227
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp, parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0, format-2.6.0
>
>
> New Thrift structures for Parquet modular encryption



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1285) [Java] SchemaConverter should not convert from TimeUnit.SECOND AND TimeUnit.NANOSECOND of Arrow

2018-08-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1285:
---
Fix Version/s: 1.11.0  (was: 1.10.0)

> [Java] SchemaConverter should not convert from TimeUnit.SECOND AND 
> TimeUnit.NANOSECOND of Arrow
> ---
>
> Key: PARQUET-1285
> URL: https://issues.apache.org/jira/browse/PARQUET-1285
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Masayuki Takahashi
>Assignee: Masayuki Takahashi
>Priority: Minor
> Fix For: 1.11.0
>
>
> Arrow's 'Time' definition is below:
> {code:java}
> { "name" : "time", "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND", 
> "bitWidth": /* integer: 32 or 64 */ }{code}
> [http://arrow.apache.org/docs/metadata.html]
>  
> But Parquet only supports 'TIME_MILLIS' and 'TIME_MICROS'.
>  [https://github.com/Apache/parquet-format/blob/master/LogicalTypes.md]
> Therefore SchemaConverter should not convert Arrow's TimeUnit.SECOND and 
> TimeUnit.NANOSECOND to Parquet.
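
A sketch of the guard the converter could apply, assuming Arrow's Java 
TimeUnit enum; the method and class names are hypothetical:

{code:java}
import org.apache.arrow.vector.types.TimeUnit;

public class TimeUnitGuard {
  static void checkConvertible(TimeUnit unit) {
    switch (unit) {
      case MILLISECOND:
      case MICROSECOND:
        return; // Parquet has TIME_MILLIS / TIME_MICROS counterparts
      case SECOND:
      case NANOSECOND:
      default:
        // No corresponding Parquet time type at the time of this issue
        throw new UnsupportedOperationException("Unsupported Arrow time unit: " + unit);
    }
  }
}
{code}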



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1335) Logical type names in parquet-mr are not consistent with parquet-format

2018-08-02 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1335:
---
Fix Version/s: 1.11.0

> Logical type names in parquet-mr are not consistent with parquet-format
> ---
>
> Key: PARQUET-1335
> URL: https://issues.apache.org/jira/browse/PARQUET-1335
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> The UTF8 logical type should be called STRING, and INT should be called INTEGER.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

