[jira] [Created] (PARQUET-1264) Update Javadoc for Java 1.8

2018-03-30 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-1264:
--

 Summary: Update Javadoc for Java 1.8
 Key: PARQUET-1264
 URL: https://issues.apache.org/jira/browse/PARQUET-1264
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.9.0
Reporter: Ryan Blue
Assignee: Ryan Blue
 Fix For: 1.10.0


After moving the build to Java 1.8, the release procedure no longer works 
because Javadoc generation fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1263) ParquetReader's builder should use Configuration from the InputFile

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1263.

Resolution: Fixed
  Assignee: Ryan Blue

Merged #464.

> ParquetReader's builder should use Configuration from the InputFile
> ---
>
> Key: PARQUET-1263
> URL: https://issues.apache.org/jira/browse/PARQUET-1263
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>
> ParquetReader can be built using an InputFile, which may be a HadoopInputFile 
> and have a Configuration. If it is, the ParquetReadOptions should be based 
> on that configuration instance.
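The selection logic the issue asks for can be sketched in isolation. The classes below (InputFile, HadoopInputFile, Configuration, ReaderBuilder) are simplified stand-ins, not the real parquet-hadoop types, so the sketch compiles on its own:

```java
// Stand-in for org.apache.parquet.io.InputFile
interface InputFile {}

// Stand-in for org.apache.hadoop.conf.Configuration
class Configuration {}

// Stand-in for org.apache.parquet.hadoop.util.HadoopInputFile,
// which carries the Configuration it was opened with.
class HadoopInputFile implements InputFile {
  private final Configuration conf;
  HadoopInputFile(Configuration conf) { this.conf = conf; }
  Configuration getConfiguration() { return conf; }
}

class ReaderBuilder {
  final Configuration conf;

  ReaderBuilder(InputFile file) {
    // Reuse the file's Configuration when it has one; otherwise fall
    // back to a fresh default, as the builder did before this change.
    if (file instanceof HadoopInputFile) {
      this.conf = ((HadoopInputFile) file).getConfiguration();
    } else {
      this.conf = new Configuration();
    }
  }
}
```

The point is that options derived from `conf` then reflect the caller's settings instead of a freshly constructed default.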





[jira] [Commented] (PARQUET-1263) ParquetReader's builder should use Configuration from the InputFile

2018-03-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421024#comment-16421024
 ] 

ASF GitHub Bot commented on PARQUET-1263:
-

rdblue closed pull request #464: PARQUET-1263: If file has a config, use it for 
ParquetReadOptions.
URL: https://github.com/apache/parquet-mr/pull/464
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git 
a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java 
b/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java
index 1ba5380c8..22c219885 100644
--- a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java
+++ b/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java
@@ -177,14 +177,16 @@ public void close() throws IOException {
 private final InputFile file;
 private final Path path;
 private Filter filter = null;
-protected Configuration conf = new Configuration();
-private ParquetReadOptions.Builder optionsBuilder = HadoopReadOptions.builder(conf);
+protected Configuration conf;
+private ParquetReadOptions.Builder optionsBuilder;
 
 @Deprecated
 private Builder(ReadSupport readSupport, Path path) {
   this.readSupport = checkNotNull(readSupport, "readSupport");
   this.file = null;
   this.path = checkNotNull(path, "path");
+  this.conf = new Configuration();
+  this.optionsBuilder = HadoopReadOptions.builder(conf);
 }
 
 @Deprecated
@@ -192,12 +194,20 @@ protected Builder(Path path) {
   this.readSupport = null;
   this.file = null;
   this.path = checkNotNull(path, "path");
+  this.conf = new Configuration();
+  this.optionsBuilder = HadoopReadOptions.builder(conf);
 }
 
 protected Builder(InputFile file) {
   this.readSupport = null;
   this.file = checkNotNull(file, "file");
   this.path = null;
+  if (file instanceof HadoopInputFile) {
+this.conf = ((HadoopInputFile) file).getConfiguration();
+  } else {
+this.conf = new Configuration();
+  }
+  optionsBuilder = HadoopReadOptions.builder(conf);
 }
 
 // when called, resets options to the defaults from conf


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> ParquetReader's builder should use Configuration from the InputFile
> ---
>
> Key: PARQUET-1263
> URL: https://issues.apache.org/jira/browse/PARQUET-1263
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>
> ParquetReader can be built using an InputFile, which may be a HadoopInputFile 
> and have a Configuration. If it is, the ParquetReadOptions should be based 
> on that configuration instance.





[jira] [Commented] (PARQUET-1183) AvroParquetWriter needs OutputFile based Builder

2018-03-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421017#comment-16421017
 ] 

ASF GitHub Bot commented on PARQUET-1183:
-

rdblue closed pull request #460: PARQUET-1183: Add Avro builders using 
InputFile and OutputFile.
URL: https://github.com/apache/parquet-mr/pull/460
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git 
a/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetReader.java 
b/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetReader.java
index a361c62fd..442c5b78f 100644
--- a/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetReader.java
+++ b/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetReader.java
@@ -28,16 +28,25 @@
 import org.apache.parquet.filter.UnboundRecordFilter;
 import org.apache.parquet.hadoop.ParquetReader;
 import org.apache.parquet.hadoop.api.ReadSupport;
+import org.apache.parquet.io.InputFile;
 
 /**
  * Read Avro records from a Parquet file.
  */
 public class AvroParquetReader<T> extends ParquetReader<T> {
 
+  /**
+   * @deprecated will be removed in 2.0.0; use {@link #builder(InputFile)} 
instead.
+   */
+  @Deprecated
   public static <T> Builder<T> builder(Path file) {
 return new Builder<T>(file);
   }
 
+  public static <T> Builder<T> builder(InputFile file) {
+return new Builder<T>(file);
+  }
+
   /**
* @deprecated use {@link #builder(Path)}
*/
@@ -76,10 +85,15 @@ public AvroParquetReader(Configuration conf, Path file, 
UnboundRecordFilter unbo
 private boolean enableCompatibility = true;
 private boolean isReflect = true;
 
+@Deprecated
 private Builder(Path path) {
   super(path);
 }
 
+private Builder(InputFile file) {
+  super(file);
+}
+
 public Builder withDataModel(GenericData model) {
   this.model = model;
 
diff --git 
a/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java 
b/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java
index d0c063325..3e802a84f 100644
--- a/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java
+++ b/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java
@@ -28,6 +28,7 @@
 import org.apache.parquet.hadoop.ParquetWriter;
 import org.apache.parquet.hadoop.api.WriteSupport;
 import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.io.OutputFile;
 
 /**
  * Write Avro records to a Parquet file.
@@ -38,6 +39,10 @@
 return new Builder(file);
   }
 
+  public static <T> Builder<T> builder(OutputFile file) {
+return new Builder<T>(file);
+  }
+
   /** Create a new {@link AvroParquetWriter}.
*
* @param file
@@ -153,6 +158,10 @@ private Builder(Path file) {
   super(file);
 }
 
+private Builder(OutputFile file) {
+  super(file);
+}
+
 public Builder withSchema(Schema schema) {
   this.schema = schema;
   return this;


 




> AvroParquetWriter needs OutputFile based Builder
> 
>
> Key: PARQUET-1183
> URL: https://issues.apache.org/jira/browse/PARQUET-1183
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.9.1
>Reporter: Werner Daehn
>Priority: Major
> Fix For: 1.10.0
>
>
> The ParquetWriter got a new Builder(OutputFile). 
> But it cannot be used by the AvroParquetWriter as there is no matching 
> Builder/Constructor.
> Changes are quite simple:
> public static <T> Builder<T> builder(OutputFile file) {
>   return new Builder<T>(file);
> }
> and in the static Builder class below
> private Builder(OutputFile file) {
>   super(file);
> }
> Note: I am not good enough with builds, maven and git to create a pull 
> request yet. Sorry. Will try to get better here.
> See: https://issues.apache.org/jira/browse/PARQUET-1142
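The requested change is the usual builder idiom in parquet-mr: a static factory on the subclass plus a private constructor that delegates to the superclass. A compact stand-in sketch (BaseBuilder and AvroBuilder are hypothetical names standing in for ParquetWriter.Builder and AvroParquetWriter.Builder):

```java
// Stand-in for org.apache.parquet.io.OutputFile
interface OutputFile {}

// Stand-in for the superclass builder, which stores the output target.
class BaseBuilder {
  final OutputFile file;
  BaseBuilder(OutputFile file) { this.file = file; }
}

// Stand-in for the Avro-specific builder: the static factory is the
// only public entry point, and the private constructor just delegates.
class AvroBuilder extends BaseBuilder {
  private AvroBuilder(OutputFile file) { super(file); }

  static AvroBuilder builder(OutputFile file) {
    return new AvroBuilder(file);
  }
}
```

With this pair in place, callers can build from an OutputFile without going through a Hadoop Path.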





[jira] [Commented] (PARQUET-1183) AvroParquetWriter needs OutputFile based Builder

2018-03-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421018#comment-16421018
 ] 

ASF GitHub Bot commented on PARQUET-1183:
-

rdblue closed pull request #446: PARQUET-1183 AvroParquetWriter needs 
OutputFile based Builder
URL: https://github.com/apache/parquet-mr/pull/446
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git 
a/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java 
b/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java
index d0c063325..7b937b99a 100644
--- a/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java
+++ b/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java
@@ -28,6 +28,7 @@
 import org.apache.parquet.hadoop.ParquetWriter;
 import org.apache.parquet.hadoop.api.WriteSupport;
 import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.io.OutputFile;
 
 /**
  * Write Avro records to a Parquet file.
@@ -38,6 +39,11 @@
 return new Builder(file);
   }
 
+  public static <T> Builder<T> builder(OutputFile file) {
+   return new Builder<T>(file);
+  }
+
+
   /** Create a new {@link AvroParquetWriter}.
*
* @param file
@@ -153,6 +159,10 @@ private Builder(Path file) {
   super(file);
 }
 
+private Builder(OutputFile file) {
+  super(file);
+}
+
 public Builder withSchema(Schema schema) {
   this.schema = schema;
   return this;
diff --git 
a/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java 
b/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java
index 70b6525f6..84a4bb728 100644
--- 
a/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java
+++ 
b/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java
@@ -58,6 +58,9 @@
 
   private final boolean assumeRepeatedIsListElement;
   private final boolean writeOldListStructure;
+  
+  private ArrayList<Schema> schemapath;
+  private ArrayList<GroupType> grouppath;
 
   public AvroSchemaConverter() {
 this.assumeRepeatedIsListElement = ADD_LIST_ELEMENT_RECORDS_DEFAULT;
@@ -112,7 +115,13 @@ public MessageType convert(Schema avroSchema) {
 if (!avroSchema.getType().equals(Schema.Type.RECORD)) {
   throw new IllegalArgumentException("Avro schema must be a record.");
 }
-return new MessageType(avroSchema.getFullName(), convertFields(avroSchema.getFields()));
+schemapath = new ArrayList<Schema>();
+schemapath.add(avroSchema);
+grouppath = new ArrayList<GroupType>();
+MessageType m = new MessageType(avroSchema.getFullName());
+grouppath.add(m);
+m.addFields(convertFields(avroSchema.getFields()));
+return m;
   }
 
  private List<Type> convertFields(List<Schema.Field> fields) {
@@ -149,7 +158,50 @@ private Type convertField(String fieldName, Schema schema, 
Type.Repetition repet
 } else if (type.equals(Schema.Type.STRING)) {
   builder = Types.primitive(BINARY, repetition).as(UTF8);
 } else if (type.equals(Schema.Type.RECORD)) {
-  return new GroupType(repetition, fieldName, convertFields(schema.getFields()));
+   /*
+* A Schema might contain directly or indirectly a parent schema.
+* Example1: "Person"-Schema has a field of type array-of-"Person" named "children" --> A "Person" can have multiple Person records in the field "children"
+* Example2: "Person"-Schema has a field "contacts" which lists various contact options. These contact options have an optional field naturalperson which is of type "Person"
+* 
+* To solve that, whenever a new record schema is found, we check if this schema had been used somewhere along the path.
+* If No, then it is just a regular structure tree, no circular references where one schema has itself as child.
+* If Yes, then this field is redefined as a INT64 containing a generated ID and records of that element can be found in the parent structure via the __ID field. 
+*/
+   int index = schemapath.lastIndexOf(schema); // Has the current schema been used in the schema tree already? 
+   if (index == -1) {
+   /*
+* No, it has not been used, it is the first time this schema appears in this section of the tree, hence simply add it.
+* But we need to build the schema tree so the recursive calls know the tree structure.
+* And we need to build the same tree with the generated GroupTypes so we can add the __ID column in case it is needed.
+*/
+   schemapath.add(schema);
+   GroupType group = new GroupType(repetition, fieldName);
+   

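The cycle-detection idea in the (truncated) diff above can be sketched on its own: keep a list of the schemas on the current conversion path and treat a reappearance as a circular reference. CycleDetector is a hypothetical stand-in, tracking schema names instead of Avro Schema objects:

```java
import java.util.ArrayList;
import java.util.List;

class CycleDetector {
  // Schemas seen along the current path, in order. A schema that is
  // already on the path indicates a circular reference, e.g.
  // Person -> children -> Person.
  private final List<String> path = new ArrayList<>();

  // Returns true if the schema is new on this path and was pushed;
  // false if it was already present (a cycle).
  boolean enter(String schemaName) {
    if (path.lastIndexOf(schemaName) != -1) {
      return false;
    }
    path.add(schemaName);
    return true;
  }

  // Pop the most recent schema when its subtree is fully converted.
  void leave() {
    path.remove(path.size() - 1);
  }
}
```

The diff applies the same `lastIndexOf` test to the `schemapath` list before recursing into a record field.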
[jira] [Resolved] (PARQUET-1183) AvroParquetWriter needs OutputFile based Builder

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1183.

Resolution: Fixed
  Assignee: Ryan Blue

Merged #460. Thanks [~zi] for reviewing!

> AvroParquetWriter needs OutputFile based Builder
> 
>
> Key: PARQUET-1183
> URL: https://issues.apache.org/jira/browse/PARQUET-1183
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.9.1
>Reporter: Werner Daehn
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>
> The ParquetWriter got a new Builder(OutputFile). 
> But it cannot be used by the AvroParquetWriter as there is no matching 
> Builder/Constructor.
> Changes are quite simple:
> public static <T> Builder<T> builder(OutputFile file) {
>   return new Builder<T>(file);
> }
> and in the static Builder class below
> private Builder(OutputFile file) {
>   super(file);
> }
> Note: I am not good enough with builds, maven and git to create a pull 
> request yet. Sorry. Will try to get better here.
> See: https://issues.apache.org/jira/browse/PARQUET-1142





[jira] [Commented] (PARQUET-1263) ParquetReader's builder should use Configuration from the InputFile

2018-03-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421004#comment-16421004
 ] 

ASF GitHub Bot commented on PARQUET-1263:
-

rdblue opened a new pull request #464: PARQUET-1263: If file has a config, use 
it for ParquetReadOptions.
URL: https://github.com/apache/parquet-mr/pull/464
 
 
   




> ParquetReader's builder should use Configuration from the InputFile
> ---
>
> Key: PARQUET-1263
> URL: https://issues.apache.org/jira/browse/PARQUET-1263
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>
> ParquetReader can be built using an InputFile, which may be a HadoopInputFile 
> and have a Configuration. If it is, the ParquetReadOptions should be based 
> on that configuration instance.





[jira] [Created] (PARQUET-1263) ParquetReader's builder should use Configuration from the InputFile

2018-03-30 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-1263:
--

 Summary: ParquetReader's builder should use Configuration from the 
InputFile
 Key: PARQUET-1263
 URL: https://issues.apache.org/jira/browse/PARQUET-1263
 Project: Parquet
  Issue Type: Improvement
Reporter: Ryan Blue


ParquetReader can be built using an InputFile, which may be a HadoopInputFile 
and have a Configuration. If it is, the ParquetReadOptions should be based on 
that configuration instance.





[jira] [Updated] (PARQUET-1263) ParquetReader's builder should use Configuration from the InputFile

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1263:
---
Fix Version/s: 1.10.0

> ParquetReader's builder should use Configuration from the InputFile
> ---
>
> Key: PARQUET-1263
> URL: https://issues.apache.org/jira/browse/PARQUET-1263
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>
> ParquetReader can be built using an InputFile, which may be a HadoopInputFile 
> and have a Configuration. If it is, the ParquetReadOptions should be based 
> on that configuration instance.





[jira] [Resolved] (PARQUET-1184) Make DelegatingPositionOutputStream a concrete class

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1184.

   Resolution: Won't Fix
Fix Version/s: (was: 1.10.0)

> Make DelegatingPositionOutputStream a concrete class
> 
>
> Key: PARQUET-1184
> URL: https://issues.apache.org/jira/browse/PARQUET-1184
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.9.1
>Reporter: Werner Daehn
>Priority: Major
>
> I fail to understand why this is an abstract class. In my example I want to 
> write the Parquet file to a java.io.FileOutputStream, hence have to extend 
> the DelegatingPositionOutputStream and store the pos information, increase it 
> in all write(..) methods and return its value in getPos().
> Doable of course, but useful? Previously yes but now with the OutputFile 
> changes to decouple it from Hadoop more, I believe no.
> related to: https://issues.apache.org/jira/browse/PARQUET-1142





[jira] [Commented] (PARQUET-1184) Make DelegatingPositionOutputStream a concrete class

2018-03-30 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420982#comment-16420982
 ] 

Ryan Blue commented on PARQUET-1184:


The reason why this is an abstract class is so that you can use it to wrap 
implementations that provide a position, like Hadoop's FsOutputStream. It would 
not be correct to assume that the position is at the current number of bytes 
written to the underlying stream. An implementation could wrap RandomAccessFile 
and expose its seek method, which would invalidate the delegating stream's 
position.

The delegating class is present for convenience only. You don't have to use it 
and can implement your own logic as long as you implement PositionOutputStream.
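For the java.io.FileOutputStream case from the issue, the "store the pos and increase it in all write methods" approach is a short wrapper. A minimal sketch (CountingPositionOutputStream is a hypothetical name; the real interface is org.apache.parquet.io.PositionOutputStream, and this only stays correct while nothing seeks the underlying stream, which is exactly the caveat above):

```java
import java.io.IOException;
import java.io.OutputStream;

// Counts bytes written so getPos() can report the current position.
class CountingPositionOutputStream extends OutputStream {
  private final OutputStream out;
  private long pos = 0;

  CountingPositionOutputStream(OutputStream out) { this.out = out; }

  // Mirrors PositionOutputStream.getPos()
  public long getPos() { return pos; }

  @Override
  public void write(int b) throws IOException {
    out.write(b);
    pos += 1;
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    out.write(b, off, len);
    pos += len;
  }
}
```

A stream that can be repositioned (e.g. one wrapping RandomAccessFile) would make `pos` wrong, which is why the abstract delegating class does not assume this behavior.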

> Make DelegatingPositionOutputStream a concrete class
> 
>
> Key: PARQUET-1184
> URL: https://issues.apache.org/jira/browse/PARQUET-1184
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.9.1
>Reporter: Werner Daehn
>Priority: Major
> Fix For: 1.10.0
>
>
> I fail to understand why this is an abstract class. In my example I want to 
> write the Parquet file to a java.io.FileOutputStream, hence have to extend 
> the DelegatingPositionOutputStream and store the pos information, increase it 
> in all write(..) methods and return its value in getPos().
> Doable of course, but useful? Previously yes but now with the OutputFile 
> changes to decouple it from Hadoop more, I believe no.
> related to: https://issues.apache.org/jira/browse/PARQUET-1142





[jira] [Updated] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1028:
---
Fix Version/s: 1.10.0

> [JAVA] When reading old Spark-generated files with INT96, stats are reported 
> as valid when they aren't 
> ---
>
> Key: PARQUET-1028
> URL: https://issues.apache.org/jira/browse/PARQUET-1028
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Jacques Nadeau
>Priority: Major
> Fix For: 1.10.0
>
>
> Found that the condition 
> [here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55]
>  is missing a check for INT96. Since INT96 stats are also corrupt with old 
> versions of Parquet, the code here shouldn't short-circuit return.





[jira] [Resolved] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1028.

Resolution: Fixed
  Assignee: Zoltan Ivanfi

> [JAVA] When reading old Spark-generated files with INT96, stats are reported 
> as valid when they aren't 
> ---
>
> Key: PARQUET-1028
> URL: https://issues.apache.org/jira/browse/PARQUET-1028
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Jacques Nadeau
>Assignee: Zoltan Ivanfi
>Priority: Major
> Fix For: 1.10.0
>
>
> Found that the condition 
> [here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55]
>  is missing a check for INT96. Since INT96 stats are also corrupt with old 
> versions of Parquet, the code here shouldn't short-circuit return.





[jira] [Commented] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't

2018-03-30 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420962#comment-16420962
 ] 

Ryan Blue commented on PARQUET-1028:


This was fixed by PARQUET-1065. The expected sort order for INT96 is now 
UNKNOWN, so stats are discarded.
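The resulting behavior can be sketched with stand-ins (SortOrder and StatsFilter are hypothetical names; the real logic lives in parquet-mr's type/statistics code introduced by PARQUET-1065):

```java
enum SortOrder { SIGNED, UNSIGNED, UNKNOWN }

class StatsFilter {
  // Statistics are only usable when the column's sort order is known.
  static boolean useStats(SortOrder order) {
    return order != SortOrder.UNKNOWN;
  }

  // INT96 has no defined ordering, so its stats are discarded.
  // The non-INT96 branch is simplified; the real mapping is per-type.
  static SortOrder sortOrderOf(String primitiveType) {
    if (primitiveType.equals("INT96")) {
      return SortOrder.UNKNOWN;
    }
    return SortOrder.SIGNED;
  }
}
```

So rather than adding INT96 to the corrupt-statistics check, the fix makes INT96 stats unusable across the board.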

> [JAVA] When reading old Spark-generated files with INT96, stats are reported 
> as valid when they aren't 
> ---
>
> Key: PARQUET-1028
> URL: https://issues.apache.org/jira/browse/PARQUET-1028
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Jacques Nadeau
>Priority: Major
> Fix For: 1.10.0
>
>
> Found that the condition 
> [here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55]
>  is missing a check for INT96. Since INT96 stats are also corrupt with old 
> versions of Parquet, the code here shouldn't short-circuit return.





[jira] [Updated] (PARQUET-1055) Improve the creation of ExecutorService when reading footers

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1055:
---
Fix Version/s: (was: 1.9.1)

> Improve the creation of ExecutorService when reading footers
> 
>
> Key: PARQUET-1055
> URL: https://issues.apache.org/jira/browse/PARQUET-1055
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Benoit Lacelle
>Priority: Minor
>
> Doing some benchmarks loading a large set of parquet files (3000+) from the 
> local FS, we observed some inefficiencies in the number of created threads 
> when reading footers.
> When reading footers, the code reads the configured parallelism from the 
> Hadoop configuration (defaulted to 5) and allocates 2 ExecutorServices with 
> 5 threads each. This is especially inefficient if there are fewer Callables 
> to handle than the configured parallelism.
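One way to avoid the oversized pools described above is to cap the pool size at the amount of work available. A hedged sketch of that sizing rule (FooterReadPool is a hypothetical helper, not the actual parquet-mr code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class FooterReadPool {
  // Size the pool by the number of tasks, not just the configured
  // parallelism, so 3 footers never spawn 5 (or 10) threads.
  static int poolSize(int configuredParallelism, int numTasks) {
    return Math.max(1, Math.min(configuredParallelism, numTasks));
  }

  static ExecutorService newPool(int configuredParallelism, int numTasks) {
    return Executors.newFixedThreadPool(poolSize(configuredParallelism, numTasks));
  }
}
```

A single shared pool (instead of two) would address the other half of the observation.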





[jira] [Updated] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1028:
---
Fix Version/s: (was: 1.9.1)

> [JAVA] When reading old Spark-generated files with INT96, stats are reported 
> as valid when they aren't 
> ---
>
> Key: PARQUET-1028
> URL: https://issues.apache.org/jira/browse/PARQUET-1028
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Jacques Nadeau
>Priority: Major
>
> Found that the condition 
> [here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55]
>  is missing a check for INT96. Since INT96 stats are also corrupt with old 
> versions of Parquet, the code here shouldn't short-circuit return.





[jira] [Updated] (PARQUET-1174) Concurrent read micro benchmarks

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1174:
---
Fix Version/s: (was: 1.9.1)

> Concurrent read micro benchmarks
> 
>
> Key: PARQUET-1174
> URL: https://issues.apache.org/jira/browse/PARQUET-1174
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Takeshi Yoshimura
>Priority: Minor
>
> parquet-benchmarks only contains read and write benchmarks with a single 
> thread.
> This adds concurrent Parquet file scans, as in typical data-parallel computing.





[jira] [Updated] (PARQUET-796) Delta Encoding is not used when dictionary enabled

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-796:
--
Fix Version/s: (was: 1.9.1)

> Delta Encoding is not used when dictionary enabled
> --
>
> Key: PARQUET-796
> URL: https://issues.apache.org/jira/browse/PARQUET-796
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Jakub Liska
>Priority: Critical
>
> Current code doesn't enable using both Delta Encoding and Dictionary 
> Encoding. If I instantiate ParquetWriter like this : 
> {code}
> val writer = new ParquetWriter[Group](outFile, new GroupWriteSupport, codec, 
> blockSize, pageSize, dictPageSize, enableDictionary = true, true, 
> ParquetProperties.WriterVersion.PARQUET_2_0, configuration)
> {code}
> Then this piece of code : 
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultValuesWriterFactory.java#L78-L86
> This causes the DictionaryValuesWriter to be used instead of the inferred 
> DeltaLongEncodingWriter. 
> The original issue is here : 
> https://github.com/apache/parquet-mr/pull/154#issuecomment-266489768
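The selection problem can be reduced to a one-line rule: when dictionary encoding is enabled, the factory starts with the dictionary writer and the inferred delta writer only serves as a fallback. A stand-in sketch (Encoding and WriterSelector are hypothetical names, not the actual DefaultValuesWriterFactory API):

```java
enum Encoding { DICTIONARY, DELTA_BINARY_PACKED }

class WriterSelector {
  // Simplified: with dictionary enabled, the initial encoding is always
  // DICTIONARY regardless of what was inferred for the column type.
  static Encoding initialEncoding(boolean dictionaryEnabled, Encoding inferred) {
    return dictionaryEnabled ? Encoding.DICTIONARY : inferred;
  }
}
```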





[jira] [Updated] (PARQUET-1153) Parquet-thrift doesn't compile with Thrift 0.10.0

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1153:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> Parquet-thrift doesn't compile with Thrift 0.10.0
> -
>
> Key: PARQUET-1153
> URL: https://issues.apache.org/jira/browse/PARQUET-1153
> Project: Parquet
>  Issue Type: Bug
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
> Fix For: 1.10.0
>
>
> Parquet-thrift doesn't compile with Thrift 0.10.0 due to THRIFT-2263. The 
> default generator parameter used for {{--gen}} argument by Thrift Maven 
> plugin is no longer supported, this can be fixed with an additional 
> {{java}} parameter to Thrift Maven plugin.





[jira] [Updated] (PARQUET-1135) upgrade thrift and protobuf dependencies

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1135:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> upgrade thrift and protobuf dependencies
> 
>
> Key: PARQUET-1135
> URL: https://issues.apache.org/jira/browse/PARQUET-1135
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>Priority: Major
> Fix For: 1.10.0
>
>
> thrift 0.7.0 -> 0.9.3
>  protobuf 3.2 -> 3.5.1





[jira] [Resolved] (PARQUET-777) Add new Parquet CLI tools

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-777.
---
Resolution: Fixed

> Add new Parquet CLI tools
> -
>
> Key: PARQUET-777
> URL: https://issues.apache.org/jira/browse/PARQUET-777
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.9.1
>
>
> This issue tracks adding parquet-cli from 
> [rdblue/parquet-cli|https://github.com/rdblue/parquet-cli].





[jira] [Updated] (PARQUET-1152) Parquet-thrift doesn't compile with Thrift 0.9.3

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1152:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> Parquet-thrift doesn't compile with Thrift 0.9.3
> 
>
> Key: PARQUET-1152
> URL: https://issues.apache.org/jira/browse/PARQUET-1152
> Project: Parquet
>  Issue Type: Bug
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
> Fix For: 1.10.0
>
>
> Parquet-thrift doesn't compile with Thrift 0.9.3, because 
> TBinaryProtocol#setReadLength method was removed.
> PARQUET-180 already addressed the problem, but only in runtime.





[jira] [Updated] (PARQUET-777) Add new Parquet CLI tools

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-777:
--
Fix Version/s: (was: 1.9.1)
   1.10.0

> Add new Parquet CLI tools
> -
>
> Key: PARQUET-777
> URL: https://issues.apache.org/jira/browse/PARQUET-777
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>
> This issue tracks adding parquet-cli from 
> [rdblue/parquet-cli|https://github.com/rdblue/parquet-cli].





[jira] [Updated] (PARQUET-1115) Warn users when misusing parquet-tools merge

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1115:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> Warn users when misusing parquet-tools merge
> 
>
> Key: PARQUET-1115
> URL: https://issues.apache.org/jira/browse/PARQUET-1115
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Zoltan Ivanfi
>Assignee: Nandor Kollar
>Priority: Major
> Fix For: 1.10.0
>
>
> To prevent users from using {{parquet-tools merge}} in scenarios where its 
> use is not practical, we should describe its limitations in the help text of 
> this command. Additionally, we should add a warning to the output of the 
> merge command if the size of the original row groups is below a threshold.
> Reasoning:
> Many users are tempted to use the new {{parquet-tools merge}} functionality 
> because they want to achieve good performance, and historically that has been 
> associated with large Parquet files. However, in practice Hive performance 
> won't change significantly after using {{parquet-tools merge}}, but Impala 
> performance will be much worse. The reason is that good performance comes not 
> from large files but from large row groups (up to the HDFS block size).
> However, {{parquet-tools merge}} does not merge row groups; it just places 
> them one after the other. It was intended for Parquet files that are already 
> arranged in row groups of the desired size. When used to merge many small 
> files, the resulting file will still contain small row groups, and one loses 
> most of the advantages of larger files (the only one that remains is that it 
> takes a single HDFS operation to read them).
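The behavior described above can be modeled in a few lines (an illustrative sketch only; the class and method names are invented, not parquet-tools code): merging files concatenates their row groups, so the row-group count and sizes are unchanged.

```java
import java.util.ArrayList;
import java.util.List;

// Models the merge behavior described above: parquet-tools merge
// concatenates row groups, it does not combine them. Merging small files
// only reduces the file count; every small row group survives.
public class MergeSketch {
    /** Row-group sizes of the merged file: simply the concatenation. */
    static List<Long> mergedRowGroups(List<List<Long>> files) {
        List<Long> merged = new ArrayList<>();
        for (List<Long> file : files) {
            merged.addAll(file);
        }
        return merged;
    }
}
```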





[jira] [Updated] (PARQUET-1149) Upgrade Avro dependency to 1.8.2

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1149:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> Upgrade Avro dependency to 1.8.2
> 
>
> Key: PARQUET-1149
> URL: https://issues.apache.org/jira/browse/PARQUET-1149
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Fokko Driesprong
>Priority: Major
> Fix For: 1.10.0
>
>
> I would like to update the Avro dependency to 1.8.2.





[jira] [Updated] (PARQUET-1141) IDs are dropped in metadata conversion

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1141:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> IDs are dropped in metadata conversion
> --
>
> Key: PARQUET-1141
> URL: https://issues.apache.org/jira/browse/PARQUET-1141
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0, 1.8.2
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>






[jira] [Updated] (PARQUET-1025) Support new min-max statistics in parquet-mr

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1025:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> Support new min-max statistics in parquet-mr
> 
>
> Key: PARQUET-1025
> URL: https://issues.apache.org/jira/browse/PARQUET-1025
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.9.1
>Reporter: Zoltan Ivanfi
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.10.0
>
>
> Impala started using new min-max statistics that were specified as part of 
> PARQUET-686. Support for these should be added to parquet-mr as well.





[jira] [Updated] (PARQUET-1077) [MR] Switch to long key ids in KEYs file

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1077:
---
Fix Version/s: (was: 1.9.1)

> [MR] Switch to long key ids in KEYs file
> 
>
> Key: PARQUET-1077
> URL: https://issues.apache.org/jira/browse/PARQUET-1077
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Lars Volker
>Assignee: Lars Volker
>Priority: Major
> Fix For: 2.0.0, 1.10.0
>
>
> PGP key ids should be longer than 32 bits, as outlined on https://evil32.com/. 
> We should fix the KEYS file in parquet-mr. I will push a PR shortly.





[jira] [Updated] (PARQUET-791) Predicate pushing down on missing columns should work on UserDefinedPredicate too

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-791:
--
Fix Version/s: (was: 1.9.1)
   1.10.0

> Predicate pushing down on missing columns should work on UserDefinedPredicate 
> too
> -
>
> Key: PARQUET-791
> URL: https://issues.apache.org/jira/browse/PARQUET-791
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 1.10.0
>
>
> This is related to PARQUET-389. PARQUET-389 fixes the predicate pushing down 
> on missing columns. But it doesn't fix it for UserDefinedPredicate.





[jira] [Updated] (PARQUET-1024) allow for case insensitive parquet-xxx prefix in PR title

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1024:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> allow for case insensitive parquet-xxx prefix in PR title
> -
>
> Key: PARQUET-1024
> URL: https://issues.apache.org/jira/browse/PARQUET-1024
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>Priority: Major
> Fix For: 1.10.0
>
>






[jira] [Updated] (PARQUET-1005) Fix DumpCommand parsing to allow column projection

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1005:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> Fix DumpCommand parsing to allow column projection
> --
>
> Key: PARQUET-1005
> URL: https://issues.apache.org/jira/browse/PARQUET-1005
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.8.0, 1.8.1, 1.9.0, 2.0.0
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
>Priority: Major
> Fix For: 1.10.0
>
>
> The DumpCommand -c option is specified with hasArgs(), which consumes an
> unlimited number of arguments following -c. The option's own description
> shows the real intent of using hasArg(), so that multiple columns can be
> specified as '-c c1 -c c2 ...'. Otherwise, the input path is parsed as an
> argument to -c instead of as an argument to the command itself.
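A minimal model of the difference (illustrative only; this is not the actual Commons CLI or DumpCommand code): with hasArg() semantics, each -c consumes exactly one following token, so the input path survives as a positional argument instead of being swallowed by the option.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of hasArg()-style parsing: each occurrence of -c takes exactly
// one value, and any other token (such as the input path) is left as a
// positional argument for the command itself. With hasArgs()-style
// (greedy) parsing, -c would also consume the input path.
public class DumpArgsSketch {
    static List<String> parseColumnsOneArgEach(String[] args, List<String> leftover) {
        List<String> cols = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if ("-c".equals(args[i]) && i + 1 < args.length) {
                cols.add(args[++i]);    // hasArg(): exactly one value per -c
            } else {
                leftover.add(args[i]);  // everything else stays positional
            }
        }
        return cols;
    }
}
```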





[jira] [Updated] (PARQUET-801) Allow UserDefinedPredicates in DictionaryFilter

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-801:
--
Fix Version/s: (was: 1.9.1)
   1.10.0

> Allow UserDefinedPredicates in DictionaryFilter
> ---
>
> Key: PARQUET-801
> URL: https://issues.apache.org/jira/browse/PARQUET-801
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Patrick Woody
>Assignee: Patrick Woody
>Priority: Major
> Fix For: 1.10.0
>
>
> UserDefinedPredicate is not implemented for dictionary filtering.





[jira] [Updated] (PARQUET-321) Set the HDFS padding default to 8MB

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-321:
--
Fix Version/s: (was: 1.9.1)
   1.10.0

> Set the HDFS padding default to 8MB
> ---
>
> Key: PARQUET-321
> URL: https://issues.apache.org/jira/browse/PARQUET-321
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>
> PARQUET-306 added the ability to pad row groups so that they align with HDFS 
> blocks to avoid remote reads. The ParquetFileWriter will now either pad the 
> remaining space in the block or target a row group for the remaining size.
> The padding maximum sets the threshold for the amount of padding that will 
> be used. If the space left in the block is under this threshold, it is padded. 
> If it is greater than this threshold, then the next row group is fit into the 
> remaining space. The current padding maximum is 0.
> I think we should change the padding maximum to 8MB. My reasoning is this: we 
> want this number to be small enough that it won't prevent the library from 
> writing reasonably sized row groups, but larger than the smallest row group we 
> would want to write. 8MB is 1/16th of the default row group size, so I think 
> it is reasonable: we don't want a row group to be smaller than 8MB.
> We also want this to be large enough that a few row groups in a block don't 
> cause a tiny row group to be written in the excess space. 8MB accounts for 4 
> row groups that are 2MB under-size. In addition, it is reasonable to not 
> allow row groups under 8MB.
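The proposed threshold logic can be sketched roughly as follows (the constants and method names here are illustrative assumptions, not parquet-mr's actual API):

```java
// Sketch of the padding decision described above. If the space left in
// the current HDFS block is at or below the padding threshold, pad it
// out; otherwise the next row group is targeted at the remaining space.
// Names and constants are illustrative, not parquet-mr's API.
public class PaddingSketch {
    static final long DEFAULT_BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB
    static final long MAX_PADDING = 8L * 1024 * 1024;          // proposed 8 MB

    /** Bytes of padding to insert, or 0 if the next row group should be
     *  fit into the remaining space instead. */
    static long paddingFor(long remainingInBlock, long maxPadding) {
        return remainingInBlock <= maxPadding ? remainingInBlock : 0;
    }

    /** Target size for the next row group: the remaining space when it is
     *  larger than the threshold, otherwise a fresh full block. */
    static long nextRowGroupTarget(long remainingInBlock, long maxPadding) {
        return remainingInBlock <= maxPadding ? DEFAULT_BLOCK_SIZE : remainingInBlock;
    }
}
```

With an 8MB threshold, a 4MB tail is padded rather than receiving a tiny row group, while a 50MB tail still gets a reasonably sized row group.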





[jira] [Commented] (PARQUET-1251) Clarify ambiguous min/max stats for FLOAT/DOUBLE

2018-03-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420942#comment-16420942
 ] 

ASF GitHub Bot commented on PARQUET-1251:
-

rdblue commented on issue #88: PARQUET-1251: Clarify ambiguous min/max stats 
for FLOAT/DOUBLE
URL: https://github.com/apache/parquet-format/pull/88#issuecomment-377624823
 
 
   +1
   
   Thanks for working on this @gszadovszky and @zivanfi!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Clarify ambiguous min/max stats for FLOAT/DOUBLE
> 
>
> Key: PARQUET-1251
> URL: https://issues.apache.org/jira/browse/PARQUET-1251
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Affects Versions: format-2.4.0
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: format-2.5.0
>
>
> Describe the handling of the ambiguous min/max statistics for FLOAT/DOUBLE 
> types in case of TypeDefinedOrder. (See PARQUET-1222 for details.)
> * When looking for NaN values, min and max should be ignored.
> * If the min is a NaN, it should be ignored.
> * If the max is a NaN, it should be ignored.
> * If the min is +0, the row group may contain -0 values as well.
> * If the max is -0, the row group may contain +0 values as well.
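A reader honoring these rules might interpret FLOAT/DOUBLE bounds like this (a hedged sketch, not parquet-mr's filter implementation):

```java
// Sketch of reader-side handling of the ambiguous FLOAT/DOUBLE stats
// rules above: a NaN bound is ignored (widened to infinity), a +0 min
// may hide -0 values, a -0 max may hide +0 values, and min/max can
// never prove the absence of NaN. Illustrative only.
public class FloatStatsSketch {
    static double usableMin(double min) {
        if (Double.isNaN(min)) return Double.NEGATIVE_INFINITY; // NaN min ignored
        if (min == 0.0d) return -0.0d; // matches +0 and -0; widen to -0
        return min;
    }

    static double usableMax(double max) {
        if (Double.isNaN(max)) return Double.POSITIVE_INFINITY; // NaN max ignored
        if (max == 0.0d) return 0.0d;  // matches +0 and -0; widen to +0
        return max;
    }

    /** min/max can never be used to rule out NaN values. */
    static boolean statsCanRuleOutNaN() {
        return false;
    }
}
```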





Re: parquet-mr next release with PARQUET-1217?

2018-03-30 Thread Ryan Blue
I have no plan for 1.9.1.

On Fri, Mar 30, 2018 at 10:42 AM, Henry Robinson  wrote:

> Great! Do you know of any plans to do a 1.9.1?
>
> On 30 March 2018 at 09:35, Ryan Blue  wrote:
>
>> I'm planning on getting a 1.10.0 rc out today, if I don't find problems
>> with the stats changes.
>>
>> On Thu, Mar 29, 2018 at 4:18 PM, Henry Robinson  wrote:
>>
>> > Hi all -
>> >
>> > While using Spark, I got hit by PARQUET-1217 today on some data written
>> by
>> > Impala. This is a pretty nasty bug, and one that affects Apache Spark
>> right
>> > now because, AFAICT, there's no release to move to that contains the
>> fix,
>> > and parquet-mr 1.9.0 is affected. There is a workaround, but it's
>> expensive
>> > in terms of lost performance.
>> >
>> > I'm new to the community, so wanted to see if there was a plan to make a
>> > release (1.9.1?) in the near future. I'd rather that than have to build
>> > short-term workarounds into Spark.
>> >
>> > Best,
>> > Henry
>> >
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
>
> --
> Henry Robinson
> Software Engineer
> Cloudera
> 415-994-6679
>



-- 
Ryan Blue
Software Engineer
Netflix


Re: parquet-mr next release with PARQUET-1217?

2018-03-30 Thread Henry Robinson
Great! Do you know of any plans to do a 1.9.1?

On 30 March 2018 at 09:35, Ryan Blue  wrote:

> I'm planning on getting a 1.10.0 rc out today, if I don't find problems
> with the stats changes.
>
> On Thu, Mar 29, 2018 at 4:18 PM, Henry Robinson  wrote:
>
> > Hi all -
> >
> > While using Spark, I got hit by PARQUET-1217 today on some data written
> by
> > Impala. This is a pretty nasty bug, and one that affects Apache Spark
> right
> > now because, AFAICT, there's no release to move to that contains the fix,
> > and parquet-mr 1.9.0 is affected. There is a workaround, but it's
> expensive
> > in terms of lost performance.
> >
> > I'm new to the community, so wanted to see if there was a plan to make a
> > release (1.9.1?) in the near future. I'd rather that than have to build
> > short-term workarounds into Spark.
> >
> > Best,
> > Henry
> >
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679


Re: parquet-mr next release with PARQUET-1217?

2018-03-30 Thread Ryan Blue
I'm planning on getting a 1.10.0 rc out today, if I don't find problems
with the stats changes.

On Thu, Mar 29, 2018 at 4:18 PM, Henry Robinson  wrote:

> Hi all -
>
> While using Spark, I got hit by PARQUET-1217 today on some data written by
> Impala. This is a pretty nasty bug, and one that affects Apache Spark right
> now because, AFAICT, there's no release to move to that contains the fix,
> and parquet-mr 1.9.0 is affected. There is a workaround, but it's expensive
> in terms of lost performance.
>
> I'm new to the community, so wanted to see if there was a plan to make a
> release (1.9.1?) in the near future. I'd rather that than have to build
> short-term workarounds into Spark.
>
> Best,
> Henry
>



-- 
Ryan Blue
Software Engineer
Netflix


[jira] [Commented] (PARQUET-1143) Update Java for format 2.4.0 changes

2018-03-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420676#comment-16420676
 ] 

ASF GitHub Bot commented on PARQUET-1143:
-

rdblue commented on issue #430: PARQUET-1143: Update to Parquet format 2.4.0.
URL: https://github.com/apache/parquet-mr/pull/430#issuecomment-377564457
 
 
   @scottcarey, you don't need to update Spark, I have a branch with it updated 
that we're already running in production.




> Update Java for format 2.4.0 changes
> 
>
> Key: PARQUET-1143
> URL: https://issues.apache.org/jira/browse/PARQUET-1143
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.9.0, 1.8.2
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>






[jira] [Commented] (PARQUET-1143) Update Java for format 2.4.0 changes

2018-03-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420243#comment-16420243
 ] 

ASF GitHub Bot commented on PARQUET-1143:
-

scottcarey commented on issue #430: PARQUET-1143: Update to Parquet format 
2.4.0.
URL: https://github.com/apache/parquet-mr/pull/430#issuecomment-377463522
 
 
   Yeah, I looked a little further into what is needed on the Spark side too. 
Partway into modifying the vectorized readers to use method signatures that 
take a ByteBufferInputStream rather than (byte[], offset), I hit a spot where 
they called back into code here that did not take a ByteBufferInputStream.
   
   It looks like changes on both sides are needed.
   
   I think that whole area of code would work better if coded against a 
DataInput interface instead. You can wrap a ByteBufferInputStream in a 
DataInputStream and get free (and decently efficient, but not amazing) tools 
for reading little-endian ints, etc. DataInputStream will be quite a bit 
faster than calling read() four times in a row and constructing the int by 
hand, though its technique of maintaining a small buffer for reading 
primitives can be emulated.
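The wrapping idea can be sketched as follows (illustrative only; note that DataInput.readInt() is defined as big-endian, so Parquet's little-endian ints need a byte swap via Integer.reverseBytes):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Sketch of wrapping a stream-backed buffer in DataInputStream to read
// primitives. DataInput.readInt() is big-endian, so a little-endian int
// is recovered by swapping the bytes afterwards. Illustrative only.
public class LittleEndianSketch {
    static int readLittleEndianInt(DataInputStream in) throws IOException {
        return Integer.reverseBytes(in.readInt());
    }
}
```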
   
   
   




> Update Java for format 2.4.0 changes
> 
>
> Key: PARQUET-1143
> URL: https://issues.apache.org/jira/browse/PARQUET-1143
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.9.0, 1.8.2
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>



