[jira] [Created] (PARQUET-1264) Update Javadoc for Java 1.8
Ryan Blue created PARQUET-1264: -- Summary: Update Javadoc for Java 1.8 Key: PARQUET-1264 URL: https://issues.apache.org/jira/browse/PARQUET-1264 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.9.0 Reporter: Ryan Blue Assignee: Ryan Blue Fix For: 1.10.0 After moving the build to Java 1.8, the release procedure no longer works because Javadoc generation fails. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1263) ParquetReader's builder should use Configuration from the InputFile
[ https://issues.apache.org/jira/browse/PARQUET-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1263. Resolution: Fixed Assignee: Ryan Blue Merged #464. > ParquetReader's builder should use Configuration from the InputFile > --- > > Key: PARQUET-1263 > URL: https://issues.apache.org/jira/browse/PARQUET-1263 > Project: Parquet > Issue Type: Improvement >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > > ParquetReader can be built using an InputFile, which may be a HadoopInputFile > and have a Configuration. If it is, ParquetHadoopOptions should be based > on that configuration instance. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1263) ParquetReader's builder should use Configuration from the InputFile
[ https://issues.apache.org/jira/browse/PARQUET-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421024#comment-16421024 ] ASF GitHub Bot commented on PARQUET-1263: - rdblue closed pull request #464: PARQUET-1263: If file has a config, use it for ParquetReadOptions. URL: https://github.com/apache/parquet-mr/pull/464 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java b/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java
index 1ba5380c8..22c219885 100644
--- a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java
+++ b/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java
@@ -177,14 +177,16 @@ public void close() throws IOException {
     private final InputFile file;
     private final Path path;
     private Filter filter = null;
-    protected Configuration conf = new Configuration();
-    private ParquetReadOptions.Builder optionsBuilder = HadoopReadOptions.builder(conf);
+    protected Configuration conf;
+    private ParquetReadOptions.Builder optionsBuilder;

     @Deprecated
     private Builder(ReadSupport<T> readSupport, Path path) {
       this.readSupport = checkNotNull(readSupport, "readSupport");
       this.file = null;
       this.path = checkNotNull(path, "path");
+      this.conf = new Configuration();
+      this.optionsBuilder = HadoopReadOptions.builder(conf);
     }

     @Deprecated
@@ -192,12 +194,20 @@ protected Builder(Path path) {
       this.readSupport = null;
       this.file = null;
       this.path = checkNotNull(path, "path");
+      this.conf = new Configuration();
+      this.optionsBuilder = HadoopReadOptions.builder(conf);
     }

     protected Builder(InputFile file) {
       this.readSupport = null;
       this.file = checkNotNull(file, "file");
       this.path = null;
+      if (file instanceof HadoopInputFile) {
+        this.conf = ((HadoopInputFile) file).getConfiguration();
+      } else {
+        this.conf = new Configuration();
+      }
+      optionsBuilder = HadoopReadOptions.builder(conf);
     }

     // when called, resets options to the defaults from conf

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > ParquetReader's builder should use Configuration from the InputFile > --- > > Key: PARQUET-1263 > URL: https://issues.apache.org/jira/browse/PARQUET-1263 > Project: Parquet > Issue Type: Improvement >Reporter: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > > ParquetReader can be built using an InputFile, which may be a HadoopInputFile > and have a Configuration. If it is, ParquetHadoopOptions should be based > on that configuration instance. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
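The change above can be sketched in plain Java. This is a minimal, hypothetical model of the pattern, not parquet-mr's actual classes (Configuration, InputFile, HadoopInputFile, and ReaderBuilder below are simplified stand-ins): the builder reuses the Configuration carried by the input file when one is present, and only falls back to fresh defaults otherwise.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for Hadoop's Configuration: a string-to-string map.
class Configuration {
  private final Map<String, String> props = new HashMap<>();
  void set(String key, String value) { props.put(key, value); }
  String get(String key) { return props.get(key); }
}

// Simplified stand-ins for Parquet's InputFile / HadoopInputFile.
interface InputFile {}

class HadoopInputFile implements InputFile {
  private final Configuration conf;
  HadoopInputFile(Configuration conf) { this.conf = conf; }
  Configuration getConfiguration() { return conf; }
}

class ReaderBuilder {
  final Configuration conf;

  ReaderBuilder(InputFile file) {
    // The fix: reuse the Configuration carried by the file when it has one,
    // instead of always constructing fresh defaults.
    if (file instanceof HadoopInputFile) {
      this.conf = ((HadoopInputFile) file).getConfiguration();
    } else {
      this.conf = new Configuration();
    }
  }
}

class ConfigFromFileDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("parquet.read.example", "custom");
    // Settings on the file's Configuration now survive into the builder.
    ReaderBuilder builder = new ReaderBuilder(new HadoopInputFile(conf));
    System.out.println(builder.conf.get("parquet.read.example"));
  }
}
```

Before this change, options set on a HadoopInputFile's Configuration were silently ignored because the builder always created its own Configuration.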
[jira] [Commented] (PARQUET-1183) AvroParquetWriter needs OutputFile based Builder
[ https://issues.apache.org/jira/browse/PARQUET-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421017#comment-16421017 ] ASF GitHub Bot commented on PARQUET-1183: - rdblue closed pull request #460: PARQUET-1183: Add Avro builders using InputFile and OutputFile. URL: https://github.com/apache/parquet-mr/pull/460 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetReader.java b/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetReader.java
index a361c62fd..442c5b78f 100644
--- a/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetReader.java
+++ b/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetReader.java
@@ -28,16 +28,25 @@ import org.apache.parquet.filter.UnboundRecordFilter;
 import org.apache.parquet.hadoop.ParquetReader;
 import org.apache.parquet.hadoop.api.ReadSupport;
+import org.apache.parquet.io.InputFile;

 /**
  * Read Avro records from a Parquet file.
  */
 public class AvroParquetReader<T> extends ParquetReader<T> {

+  /**
+   * @deprecated will be removed in 2.0.0; use {@link #builder(InputFile)} instead.
+   */
+  @Deprecated
   public static <T> Builder<T> builder(Path file) {
     return new Builder<T>(file);
   }

+  public static <T> Builder<T> builder(InputFile file) {
+    return new Builder<T>(file);
+  }
+
   /**
    * @deprecated use {@link #builder(Path)}
    */
@@ -76,10 +85,15 @@ public AvroParquetReader(Configuration conf, Path file, UnboundRecordFilter unbo
     private boolean enableCompatibility = true;
     private boolean isReflect = true;

+    @Deprecated
     private Builder(Path path) {
       super(path);
     }

+    private Builder(InputFile file) {
+      super(file);
+    }
+
     public Builder<T> withDataModel(GenericData model) {
       this.model = model;

diff --git a/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java b/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java
index d0c063325..3e802a84f 100644
--- a/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java
+++ b/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java
@@ -28,6 +28,7 @@ import org.apache.parquet.hadoop.ParquetWriter;
 import org.apache.parquet.hadoop.api.WriteSupport;
 import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.io.OutputFile;

 /**
  * Write Avro records to a Parquet file.
@@ -38,6 +39,10 @@
     return new Builder<T>(file);
   }

+  public static <T> Builder<T> builder(OutputFile file) {
+    return new Builder<T>(file);
+  }
+
   /** Create a new {@link AvroParquetWriter}.
    *
    * @param file
@@ -153,6 +158,10 @@ private Builder(Path file) {
       super(file);
     }

+    private Builder(OutputFile file) {
+      super(file);
+    }
+
     public Builder<T> withSchema(Schema schema) {
       this.schema = schema;
       return this;

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > AvroParquetWriter needs OutputFile based Builder > > > Key: PARQUET-1183 > URL: https://issues.apache.org/jira/browse/PARQUET-1183 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro >Affects Versions: 1.9.1 >Reporter: Werner Daehn >Priority: Major > Fix For: 1.10.0 > > > The ParquetWriter got a new Builder(OutputFile). > But it cannot be used by the AvroParquetWriter as there is no matching > Builder/Constructor. > Changes are quite simple: > public static Builder builder(OutputFile file) { > return new Builder(file) > } > and in the static Builder class below > private Builder(OutputFile file) { > super(file); > } > Note: I am not good enough with builds, maven and git to create a pull > request yet. Sorry. Will try to get better here. > See: https://issues.apache.org/jira/browse/PARQUET-1142 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
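The issue boils down to a builder subclass that cannot reach a new superclass constructor until it grows a matching factory and constructor of its own. A hedged sketch of that pattern in plain Java, with stand-in types (Path, OutputFile, ParentBuilder, AvroishBuilder are illustrative names, not the real parquet-avro classes):

```java
// Stand-in for a file path source.
class Path {
  final String location;
  Path(String location) { this.location = location; }
}

// Stand-in for Parquet's OutputFile abstraction.
interface OutputFile {
  String describe();
}

// Parent builder already accepts either source type.
class ParentBuilder<T> {
  final String source;
  protected ParentBuilder(Path file) { this.source = "path:" + file.location; }
  protected ParentBuilder(OutputFile file) { this.source = "file:" + file.describe(); }
}

// Subclass: without the OutputFile constructor below, the parent's
// OutputFile support would be unreachable from this builder -- exactly
// the gap PARQUET-1183 describes.
class AvroishBuilder<T> extends ParentBuilder<T> {
  static <T> AvroishBuilder<T> builder(Path file) { return new AvroishBuilder<>(file); }
  static <T> AvroishBuilder<T> builder(OutputFile file) { return new AvroishBuilder<>(file); }

  private AvroishBuilder(Path file) { super(file); }
  private AvroishBuilder(OutputFile file) { super(file); } // the missing piece
}

class BuilderOverloadDemo {
  public static void main(String[] args) {
    System.out.println(AvroishBuilder.builder(new Path("/tmp/data.parquet")).source);
    System.out.println(AvroishBuilder.builder((OutputFile) () -> "in-memory").source);
  }
}
```

Since the constructors are private, a subclass builder must explicitly mirror every source type the parent supports; nothing is inherited through the static factories.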
[jira] [Commented] (PARQUET-1183) AvroParquetWriter needs OutputFile based Builder
[ https://issues.apache.org/jira/browse/PARQUET-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421018#comment-16421018 ] ASF GitHub Bot commented on PARQUET-1183: - rdblue closed pull request #446: PARQUET-1183 AvroParquetWriter needs OutputFile based Builder URL: https://github.com/apache/parquet-mr/pull/446 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java b/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java
index d0c063325..7b937b99a 100644
--- a/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java
+++ b/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java
@@ -28,6 +28,7 @@ import org.apache.parquet.hadoop.ParquetWriter;
 import org.apache.parquet.hadoop.api.WriteSupport;
 import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.io.OutputFile;

 /**
  * Write Avro records to a Parquet file.
@@ -38,6 +39,11 @@
     return new Builder<T>(file);
   }

+  public static <T> Builder<T> builder(OutputFile file) {
+    return new Builder<T>(file);
+  }
+
+
  /** Create a new {@link AvroParquetWriter}.
   *
   * @param file
@@ -153,6 +159,10 @@ private Builder(Path file) {
       super(file);
     }

+    private Builder(OutputFile file) {
+      super(file);
+    }
+
     public Builder<T> withSchema(Schema schema) {
       this.schema = schema;
       return this;

diff --git a/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java b/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java
index 70b6525f6..84a4bb728 100644
--- a/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java
+++ b/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java
@@ -58,6 +58,9 @@
   private final boolean assumeRepeatedIsListElement;
   private final boolean writeOldListStructure;
+
+  private ArrayList<Schema> schemapath;
+  private ArrayList<GroupType> grouppath;

   public AvroSchemaConverter() {
     this.assumeRepeatedIsListElement = ADD_LIST_ELEMENT_RECORDS_DEFAULT;
@@ -112,7 +115,13 @@ public MessageType convert(Schema avroSchema) {
     if (!avroSchema.getType().equals(Schema.Type.RECORD)) {
       throw new IllegalArgumentException("Avro schema must be a record.");
     }
-    return new MessageType(avroSchema.getFullName(), convertFields(avroSchema.getFields()));
+    schemapath = new ArrayList<Schema>();
+    schemapath.add(avroSchema);
+    grouppath = new ArrayList<GroupType>();
+    MessageType m = new MessageType(avroSchema.getFullName());
+    grouppath.add(m);
+    m.addFields(convertFields(avroSchema.getFields()));
+    return m;
   }

   private List<Type> convertFields(List<Schema.Field> fields) {
@@ -149,7 +158,50 @@ private Type convertField(String fieldName, Schema schema, Type.Repetition repet
     } else if (type.equals(Schema.Type.STRING)) {
       builder = Types.primitive(BINARY, repetition).as(UTF8);
     } else if (type.equals(Schema.Type.RECORD)) {
-      return new GroupType(repetition, fieldName, convertFields(schema.getFields()));
+      /*
+       * A Schema might contain directly or indirectly a parent schema.
+       * Example1: "Person"-Schema has a field of type array-of-"Person" named "children" --> A "Person" can have multiple Person records in the field "children"
+       * Example2: "Person"-Schema has a field "contacts" which lists various contact options. These contact options have an optional field naturalperson which is of type "Person"
+       *
+       * To solve that, whenever a new record schema is found, we check if this schema had been used somewhere along the path.
+       * If No, then it is just a regular structure tree, no circular references where one schema has itself as child.
+       * If Yes, then this field is redefined as a INT64 containing a generated ID and records of that element can be found in the parent structure via the __ID field.
+       */
+      int index = schemapath.lastIndexOf(schema); // Has the current schema been used in the schema tree already?
+      if (index == -1) {
+        /*
+         * No, it has not been used, it is the first time this schema appears in this section of the tree, hence simply add it.
+         * But we need to build the schema tree so the recursive calls know the tree structure.
+         * And we need to build the same tree with the generated GroupTypes so we can add the __ID column in case it is needed.
+         */
+        schemapath.add(schema);
+        GroupType group = new GroupType(repetition, fieldName);
+
[jira] [Resolved] (PARQUET-1183) AvroParquetWriter needs OutputFile based Builder
[ https://issues.apache.org/jira/browse/PARQUET-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1183. Resolution: Fixed Assignee: Ryan Blue Merged #460. Thanks [~zi] for reviewing! > AvroParquetWriter needs OutputFile based Builder > > > Key: PARQUET-1183 > URL: https://issues.apache.org/jira/browse/PARQUET-1183 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro >Affects Versions: 1.9.1 >Reporter: Werner Daehn >Assignee: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > > The ParquetWriter got a new Builder(OutputFile). > But it cannot be used by the AvroParquetWriter as there is no matching > Builder/Constructor. > Changes are quite simple: > public static Builder builder(OutputFile file) { > return new Builder(file) > } > and in the static Builder class below > private Builder(OutputFile file) { > super(file); > } > Note: I am not good enough with builds, maven and git to create a pull > request yet. Sorry. Will try to get better here. > See: https://issues.apache.org/jira/browse/PARQUET-1142 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1263) ParquetReader's builder should use Configuration from the InputFile
[ https://issues.apache.org/jira/browse/PARQUET-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421004#comment-16421004 ] ASF GitHub Bot commented on PARQUET-1263: - rdblue opened a new pull request #464: PARQUET-1263: If file has a config, use it for ParquetReadOptions. URL: https://github.com/apache/parquet-mr/pull/464 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > ParquetReader's builder should use Configuration from the InputFile > --- > > Key: PARQUET-1263 > URL: https://issues.apache.org/jira/browse/PARQUET-1263 > Project: Parquet > Issue Type: Improvement >Reporter: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > > ParquetReader can be built using an InputFile, which may be a HadoopInputFile > and have a Configuration. If it is, ParquetHadoopOptions should be based > on that configuration instance. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (PARQUET-1263) ParquetReader's builder should use Configuration from the InputFile
Ryan Blue created PARQUET-1263: -- Summary: ParquetReader's builder should use Configuration from the InputFile Key: PARQUET-1263 URL: https://issues.apache.org/jira/browse/PARQUET-1263 Project: Parquet Issue Type: Improvement Reporter: Ryan Blue ParquetReader can be built using an InputFile, which may be a HadoopInputFile and have a Configuration. If it is, ParquetHadoopOptions should be based on that configuration instance. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1263) ParquetReader's builder should use Configuration from the InputFile
[ https://issues.apache.org/jira/browse/PARQUET-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1263: --- Fix Version/s: 1.10.0 > ParquetReader's builder should use Configuration from the InputFile > --- > > Key: PARQUET-1263 > URL: https://issues.apache.org/jira/browse/PARQUET-1263 > Project: Parquet > Issue Type: Improvement >Reporter: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > > ParquetReader can be built using an InputFile, which may be a HadoopInputFile > and have a Configuration. If it is, ParquetHadoopOptions should be based > on that configuration instance. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1184) Make DelegatingPositionOutputStream a concrete class
[ https://issues.apache.org/jira/browse/PARQUET-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1184. Resolution: Won't Fix Fix Version/s: (was: 1.10.0) > Make DelegatingPositionOutputStream a concrete class > > > Key: PARQUET-1184 > URL: https://issues.apache.org/jira/browse/PARQUET-1184 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro >Affects Versions: 1.9.1 >Reporter: Werner Daehn >Priority: Major > > I fail to understand why this is an abstract class. In my example I want to > write the Parquet file to a java.io.FileOutputStream, hence have to extend > the DelegatingPositionOutputStream and store the pos information, increase it > in all write(..) methods and return its value in getPos(). > Doable of course, but useful? Previously yes but now with the OutputFile > changes to decouple it from Hadoop more, I believe no. > related to: https://issues.apache.org/jira/browse/PARQUET-1142 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1184) Make DelegatingPositionOutputStream a concrete class
[ https://issues.apache.org/jira/browse/PARQUET-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420982#comment-16420982 ] Ryan Blue commented on PARQUET-1184: The reason why this is an abstract class is so that you can use it to wrap implementations that provide a position, like Hadoop's FsOutputStream. It would not be correct to assume that the position is at the current number of bytes written to the underlying stream. An implementation could wrap RandomAccessFile and expose its seek method, which would invalidate the delegating stream's position. The delegating class is present for convenience only. You don't have to use it and can implement your own logic as long as you implement PositionOutputStream. > Make DelegatingPositionOutputStream a concrete class > > > Key: PARQUET-1184 > URL: https://issues.apache.org/jira/browse/PARQUET-1184 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro >Affects Versions: 1.9.1 >Reporter: Werner Daehn >Priority: Major > Fix For: 1.10.0 > > > I fail to understand why this is an abstract class. In my example I want to > write the Parquet file to a java.io.FileOutputStream, hence have to extend > the DelegatingPositionOutputStream and store the pos information, increase it > in all write(..) methods and return its value in getPos(). > Doable of course, but useful? Previously yes but now with the OutputFile > changes to decouple it from Hadoop more, I believe no. > related to: https://issues.apache.org/jira/browse/PARQUET-1142 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
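The trade-off described above can be illustrated with a plain-Java sketch (the names below are illustrative, not the actual parquet-mr API): a position-tracking stream that counts bytes written is only correct when nothing can reposition the underlying stream behind its back, which is exactly why parquet-mr keeps the delegating class abstract and lets wrappers of position-aware streams report the real position instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Sketch of the PositionOutputStream contract: an OutputStream that can
// report the current write position.
abstract class PositionOutputStream extends OutputStream {
  public abstract long getPos() throws IOException;
}

// What the reporter wants for java.io.FileOutputStream: track position by
// counting bytes written. This is only safe if the wrapped stream cannot be
// seeked or otherwise repositioned independently.
class CountingPositionOutputStream extends PositionOutputStream {
  private final OutputStream out;
  private long pos = 0;

  CountingPositionOutputStream(OutputStream out) { this.out = out; }

  @Override public void write(int b) throws IOException {
    out.write(b);
    pos += 1; // one byte written
  }

  @Override public void write(byte[] b, int off, int len) throws IOException {
    out.write(b, off, len);
    pos += len; // len bytes written
  }

  @Override public long getPos() { return pos; }

  @Override public void flush() throws IOException { out.flush(); }
  @Override public void close() throws IOException { out.close(); }
}

class PositionStreamDemo {
  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    CountingPositionOutputStream out = new CountingPositionOutputStream(bytes);
    out.write(new byte[] {1, 2, 3}, 0, 3);
    System.out.println(out.getPos());
  }
}
```

For a stream wrapping RandomAccessFile, the counter above would diverge from reality after a seek; that is the case the abstract delegating class is designed to leave to the implementor.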
[jira] [Updated] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't
[ https://issues.apache.org/jira/browse/PARQUET-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1028: --- Fix Version/s: 1.10.0 > [JAVA] When reading old Spark-generated files with INT96, stats are reported > as valid when they aren't > --- > > Key: PARQUET-1028 > URL: https://issues.apache.org/jira/browse/PARQUET-1028 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Jacques Nadeau >Priority: Major > Fix For: 1.10.0 > > > Found that the condition > [here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55] > is missing a check for INT96. Since INT96 stats are also corrupt with old > versions of Parquet, the code here shouldn't short-circuit return. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't
[ https://issues.apache.org/jira/browse/PARQUET-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1028. Resolution: Fixed Assignee: Zoltan Ivanfi > [JAVA] When reading old Spark-generated files with INT96, stats are reported > as valid when they aren't > --- > > Key: PARQUET-1028 > URL: https://issues.apache.org/jira/browse/PARQUET-1028 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Jacques Nadeau >Assignee: Zoltan Ivanfi >Priority: Major > Fix For: 1.10.0 > > > Found that the condition > [here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55] > is missing a check for INT96. Since INT96 stats are also corrupt with old > versions of Parquet, the code here shouldn't short-circuit return. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't
[ https://issues.apache.org/jira/browse/PARQUET-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420962#comment-16420962 ] Ryan Blue commented on PARQUET-1028: This was fixed by PARQUET-1065. The expected sort order for INT96 is now UNKNOWN, so stats are discarded. > [JAVA] When reading old Spark-generated files with INT96, stats are reported > as valid when they aren't > --- > > Key: PARQUET-1028 > URL: https://issues.apache.org/jira/browse/PARQUET-1028 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Jacques Nadeau >Priority: Major > Fix For: 1.10.0 > > > Found that the condition > [here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55] > is missing a check for INT96. Since INT96 stats are also corrupt with old > versions of Parquet, the code here shouldn't short-circuit return. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
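The resolution above can be sketched in simplified form (this is an illustrative model, not parquet-mr's actual code; the enum and method names are hypothetical): once a type's expected sort order is UNKNOWN, its min/max statistics cannot be interpreted, so readers must not use them for filtering.

```java
// Simplified model of the PARQUET-1065 behavior: sort order decides whether
// min/max statistics are trustworthy at all.
enum SortOrder { SIGNED, UNSIGNED, UNKNOWN }

class StatsPolicy {
  // Hypothetical mapping; the real rules live in parquet-mr's type handling.
  static SortOrder sortOrder(String primitiveType) {
    switch (primitiveType) {
      case "INT32":
      case "INT64":
        return SortOrder.SIGNED;
      case "INT96":
        // INT96 has no defined comparison semantics, so its order is UNKNOWN
        // and any stored min/max must be discarded.
        return SortOrder.UNKNOWN;
      default:
        return SortOrder.UNKNOWN;
    }
  }

  static boolean shouldUseStats(String primitiveType) {
    return sortOrder(primitiveType) != SortOrder.UNKNOWN;
  }
}

class StatsPolicyDemo {
  public static void main(String[] args) {
    System.out.println("INT96 stats usable: " + StatsPolicy.shouldUseStats("INT96"));
    System.out.println("INT64 stats usable: " + StatsPolicy.shouldUseStats("INT64"));
  }
}
```

Note that this supersedes the per-writer-version corruption check from CorruptStatistics: an UNKNOWN sort order discards INT96 stats regardless of which writer produced the file.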
[jira] [Updated] (PARQUET-1055) Improve the creation of ExecutorService when reading footers
[ https://issues.apache.org/jira/browse/PARQUET-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1055: --- Fix Version/s: (was: 1.9.1) > Improve the creation of ExecutorService when reading footers > > > Key: PARQUET-1055 > URL: https://issues.apache.org/jira/browse/PARQUET-1055 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Benoit Lacelle >Priority: Minor > > While benchmarking the loading of a large set of parquet files (3000+) from the > local FS, we observed some inefficiencies in the number of threads created > when reading footers. > When reading, the code takes the parallelism from the Hadoop configuration > (defaulted to 5) and allocates 2 ExecutorServices with 5 threads each to read > footers. This is especially inefficient if there are fewer Callables to handle > than the configured parallelism. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't
[ https://issues.apache.org/jira/browse/PARQUET-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1028: --- Fix Version/s: (was: 1.9.1) > [JAVA] When reading old Spark-generated files with INT96, stats are reported > as valid when they aren't > --- > > Key: PARQUET-1028 > URL: https://issues.apache.org/jira/browse/PARQUET-1028 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Jacques Nadeau >Priority: Major > > Found that the condition > [here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55] > is missing a check for INT96. Since INT96 stats are also corrupt with old > versions of Parquet, the code here shouldn't short-circuit return. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1174) Concurrent read micro benchmarks
[ https://issues.apache.org/jira/browse/PARQUET-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1174: --- Fix Version/s: (was: 1.9.1) > Concurrent read micro benchmarks > > > Key: PARQUET-1174 > URL: https://issues.apache.org/jira/browse/PARQUET-1174 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Takeshi Yoshimura >Priority: Minor > > parquet-benchmarks only contains read and write benchmarks with a single > thread. > I added concurrent Parquet file scans, similar to typical data-parallel computing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-796) Delta Encoding is not used when dictionary enabled
[ https://issues.apache.org/jira/browse/PARQUET-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-796: -- Fix Version/s: (was: 1.9.1) > Delta Encoding is not used when dictionary enabled > -- > > Key: PARQUET-796 > URL: https://issues.apache.org/jira/browse/PARQUET-796 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Jakub Liska >Priority: Critical > > The current code doesn't allow using both Delta Encoding and Dictionary > Encoding. If I instantiate ParquetWriter like this : > {code} > val writer = new ParquetWriter[Group](outFile, new GroupWriteSupport, codec, > blockSize, pageSize, dictPageSize, enableDictionary = true, true, > ParquetProperties.WriterVersion.PARQUET_2_0, configuration) > {code} > Then this piece of code : > https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultValuesWriterFactory.java#L78-L86 > causes DictionaryValuesWriter to be used instead of the inferred > DeltaLongEncodingWriter. > The original issue is here : > https://github.com/apache/parquet-mr/pull/154#issuecomment-266489768 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1153) Parquet-thrift doesn't compile with Thrift 0.10.0
[ https://issues.apache.org/jira/browse/PARQUET-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1153: --- Fix Version/s: (was: 1.9.1) 1.10.0 > Parquet-thrift doesn't compile with Thrift 0.10.0 > - > > Key: PARQUET-1153 > URL: https://issues.apache.org/jira/browse/PARQUET-1153 > Project: Parquet > Issue Type: Bug >Reporter: Nandor Kollar >Assignee: Nandor Kollar >Priority: Major > Fix For: 1.10.0 > > > Parquet-thrift doesn't compile with Thrift 0.10.0 due to THRIFT-2263. The > default generator parameter used for the {{--gen}} argument by the Thrift Maven > plugin is no longer supported; this can be fixed with an additional > {{java}} parameter to the Thrift Maven plugin. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1135) upgrade thrift and protobuf dependencies
[ https://issues.apache.org/jira/browse/PARQUET-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1135: --- Fix Version/s: (was: 1.9.1) 1.10.0 > upgrade thrift and protobuf dependencies > > > Key: PARQUET-1135 > URL: https://issues.apache.org/jira/browse/PARQUET-1135 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Julien Le Dem >Assignee: Julien Le Dem >Priority: Major > Fix For: 1.10.0 > > > thrift 0.7.0 -> 0.9.3 > protobuf 3.2 -> 3.5.1 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-777) Add new Parquet CLI tools
[ https://issues.apache.org/jira/browse/PARQUET-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-777. --- Resolution: Fixed > Add new Parquet CLI tools > - > > Key: PARQUET-777 > URL: https://issues.apache.org/jira/browse/PARQUET-777 > Project: Parquet > Issue Type: Improvement > Components: parquet-cli >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 1.9.1 > > > This issue tracks adding parquet-cli from > [rdblue/parquet-cli|https://github.com/rdblue/parquet-cli]. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1152) Parquet-thrift doesn't compile with Thrift 0.9.3
[ https://issues.apache.org/jira/browse/PARQUET-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1152: --- Fix Version/s: (was: 1.9.1) 1.10.0 > Parquet-thrift doesn't compile with Thrift 0.9.3 > > > Key: PARQUET-1152 > URL: https://issues.apache.org/jira/browse/PARQUET-1152 > Project: Parquet > Issue Type: Bug >Reporter: Nandor Kollar >Assignee: Nandor Kollar >Priority: Major > Fix For: 1.10.0 > > > Parquet-thrift doesn't compile with Thrift 0.9.3, because the > TBinaryProtocol#setReadLength method was removed. > PARQUET-180 already addressed the problem, but only at runtime. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-777) Add new Parquet CLI tools
[ https://issues.apache.org/jira/browse/PARQUET-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-777: -- Fix Version/s: (was: 1.9.1) 1.10.0 > Add new Parquet CLI tools > - > > Key: PARQUET-777 > URL: https://issues.apache.org/jira/browse/PARQUET-777 > Project: Parquet > Issue Type: Improvement > Components: parquet-cli >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > > This issue tracks adding parquet-cli from > [rdblue/parquet-cli|https://github.com/rdblue/parquet-cli]. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1115) Warn users when misusing parquet-tools merge
[ https://issues.apache.org/jira/browse/PARQUET-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1115: --- Fix Version/s: (was: 1.9.1) 1.10.0 > Warn users when misusing parquet-tools merge > > > Key: PARQUET-1115 > URL: https://issues.apache.org/jira/browse/PARQUET-1115 > Project: Parquet > Issue Type: Improvement >Reporter: Zoltan Ivanfi >Assignee: Nandor Kollar >Priority: Major > Fix For: 1.10.0 > > > To prevent users from using {{parquet-tools merge}} in scenarios where its > use is not practical, we should describe its limitations in the help text of > this command. Additionally, we should add a warning to the output of the > merge command if the size of the original row groups is below a threshold. > Reasoning: > Many users are tempted to use the new {{parquet-tools merge}} functionality, > because they want to achieve good performance and historically that has been > associated with large Parquet files. However, in practice Hive performance > won't change significantly after using {{parquet-tools merge}}, but Impala > performance will be much worse. The reason for that is that good performance > is not a result of large files but of large row groups instead (up to the HDFS > block size). > However, {{parquet-tools merge}} does not merge row groups, it just places > them one after the other. It was intended to be used for Parquet files that > are already arranged in row groups of the desired size. When used to merge > many small files, the resulting file will still contain small row groups and > one loses most of the advantages of larger files (the only one that remains > is that it takes a single HDFS operation to read them). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
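The proposed warning could look roughly like this (an illustrative sketch, not the actual parquet-tools implementation; the class and threshold are hypothetical): since merge concatenates row groups without rewriting them, flag inputs whose row groups fall below a size threshold.

```java
// Sketch of the warning logic proposed above: merging files with small row
// groups yields a file that still has small row groups, so warn the user.
class MergeRowGroupCheck {
  // Hypothetical threshold; a real tool might derive this from the HDFS
  // block size or make it configurable.
  static final long DEFAULT_THRESHOLD_BYTES = 128L * 1024 * 1024;

  static boolean shouldWarn(long[] rowGroupSizes, long thresholdBytes) {
    for (long size : rowGroupSizes) {
      if (size < thresholdBytes) {
        return true; // the merged file would keep this small row group as-is
      }
    }
    return false;
  }
}

class MergeCheckDemo {
  public static void main(String[] args) {
    long[] smallGroups = {4L * 1024 * 1024, 2L * 1024 * 1024};
    if (MergeRowGroupCheck.shouldWarn(smallGroups, MergeRowGroupCheck.DEFAULT_THRESHOLD_BYTES)) {
      System.out.println("Warning: merge will not combine small row groups;"
          + " read performance may not improve.");
    }
  }
}
```

The check deliberately inspects row group sizes rather than file sizes, matching the reasoning in the issue: performance comes from large row groups, not large files.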
[jira] [Updated] (PARQUET-1149) Upgrade Avro dependency to 1.8.2
[ https://issues.apache.org/jira/browse/PARQUET-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1149: --- Fix Version/s: (was: 1.9.1) 1.10.0 > Upgrade Avro dependency to 1.8.2 > > > Key: PARQUET-1149 > URL: https://issues.apache.org/jira/browse/PARQUET-1149 > Project: Parquet > Issue Type: Improvement >Reporter: Fokko Driesprong >Priority: Major > Fix For: 1.10.0 > > > I would like to update the Avro dependency to 1.8.2. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1141) IDs are dropped in metadata conversion
[ https://issues.apache.org/jira/browse/PARQUET-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1141: --- Fix Version/s: (was: 1.9.1) 1.10.0 > IDs are dropped in metadata conversion > -- > > Key: PARQUET-1141 > URL: https://issues.apache.org/jira/browse/PARQUET-1141 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0, 1.8.2 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1025) Support new min-max statistics in parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1025: --- Fix Version/s: (was: 1.9.1) 1.10.0 > Support new min-max statistics in parquet-mr > > > Key: PARQUET-1025 > URL: https://issues.apache.org/jira/browse/PARQUET-1025 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Affects Versions: 1.9.1 >Reporter: Zoltan Ivanfi >Assignee: Gabor Szadovszky >Priority: Major > Fix For: 1.10.0 > > > Impala started using new min-max statistics that got specified as part of > PARQUET-686. Support for these should be added to parquet-mr as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1077) [MR] Switch to long key ids in KEYS file
[ https://issues.apache.org/jira/browse/PARQUET-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1077: --- Fix Version/s: (was: 1.9.1) > [MR] Switch to long key ids in KEYS file > > > Key: PARQUET-1077 > URL: https://issues.apache.org/jira/browse/PARQUET-1077 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Reporter: Lars Volker >Assignee: Lars Volker >Priority: Major > Fix For: 2.0.0, 1.10.0 > > > PGP key ids should be longer than 32 bits, as outlined on https://evil32.com/. We > should fix the KEYS file in parquet-mr. I will push a PR shortly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-791) Predicate pushing down on missing columns should work on UserDefinedPredicate too
[ https://issues.apache.org/jira/browse/PARQUET-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-791: -- Fix Version/s: (was: 1.9.1) 1.10.0 > Predicate pushing down on missing columns should work on UserDefinedPredicate > too > - > > Key: PARQUET-791 > URL: https://issues.apache.org/jira/browse/PARQUET-791 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 1.10.0 > > > This is related to PARQUET-389. PARQUET-389 fixes the predicate pushing down > on missing columns. But it doesn't fix it for UserDefinedPredicate. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1024) allow for case insensitive parquet-xxx prefix in PR title
[ https://issues.apache.org/jira/browse/PARQUET-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1024: --- Fix Version/s: (was: 1.9.1) 1.10.0 > allow for case insensitive parquet-xxx prefix in PR title > - > > Key: PARQUET-1024 > URL: https://issues.apache.org/jira/browse/PARQUET-1024 > Project: Parquet > Issue Type: Improvement >Reporter: Julien Le Dem >Assignee: Julien Le Dem >Priority: Major > Fix For: 1.10.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1005) Fix DumpCommand parsing to allow column projection
[ https://issues.apache.org/jira/browse/PARQUET-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1005: --- Fix Version/s: (was: 1.9.1) 1.10.0 > Fix DumpCommand parsing to allow column projection > -- > > Key: PARQUET-1005 > URL: https://issues.apache.org/jira/browse/PARQUET-1005 > Project: Parquet > Issue Type: Bug > Components: parquet-cli >Affects Versions: 1.8.0, 1.8.1, 1.9.0, 2.0.0 >Reporter: Gera Shegalov >Assignee: Gera Shegalov >Priority: Major > Fix For: 1.10.0 > > > The DumpCommand option -c is specified with hasArgs(), which accepts an > unlimited number of arguments following -c. The very description of the option > shows the real intent was hasArg(), so that multiple columns > can be specified as '-c c1 -c c2 ...'. Otherwise, the input path > is parsed as an argument of -c instead of an argument of the command itself. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
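[Editor's note] The hasArg()/hasArgs() distinction can be illustrated without Commons CLI. The editorial sketch below mimics the two semantics with a hand-rolled parser (it is not the actual DumpCommand code): the greedy hasArgs()-style option swallows the trailing input path, while the one-value-per-occurrence hasArg()-style option leaves it as an operand.

```java
import java.util.ArrayList;
import java.util.List;

public class OptionDemo {
    // hasArgs()-style: -c consumes every following token up to the next option
    static List<String> parseGreedy(String[] args) {
        List<String> cols = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-c")) {
                while (i + 1 < args.length && !args[i + 1].startsWith("-")) {
                    cols.add(args[++i]);
                }
            }
        }
        return cols;
    }

    // hasArg()-style: each -c takes exactly one value; other tokens are operands
    static List<String> parseOnePer(String[] args) {
        List<String> cols = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-c") && i + 1 < args.length) {
                cols.add(args[++i]);
            }
        }
        return cols;
    }

    public static void main(String[] args) {
        String[] cmd = {"-c", "c1", "-c", "c2", "file.parquet"};
        // greedy parsing mistakes the input path for a column name
        assert parseGreedy(cmd).contains("file.parquet");
        // one value per occurrence keeps the path as the command's operand
        assert parseOnePer(cmd).equals(List.of("c1", "c2"));
    }
}
```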
[jira] [Updated] (PARQUET-801) Allow UserDefinedPredicates in DictionaryFilter
[ https://issues.apache.org/jira/browse/PARQUET-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-801: -- Fix Version/s: (was: 1.9.1) 1.10.0 > Allow UserDefinedPredicates in DictionaryFilter > --- > > Key: PARQUET-801 > URL: https://issues.apache.org/jira/browse/PARQUET-801 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Patrick Woody >Assignee: Patrick Woody >Priority: Major > Fix For: 1.10.0 > > > UserDefinedPredicate is not implemented for dictionary filtering. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-321) Set the HDFS padding default to 8MB
[ https://issues.apache.org/jira/browse/PARQUET-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-321: -- Fix Version/s: (was: 1.9.1) 1.10.0 > Set the HDFS padding default to 8MB > --- > > Key: PARQUET-321 > URL: https://issues.apache.org/jira/browse/PARQUET-321 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > > PARQUET-306 added the ability to pad row groups so that they align with HDFS > blocks to avoid remote reads. The ParquetFileWriter will now either pad the > remaining space in the block or target a row group for the remaining size. > The padding maximum controls the threshold of the amount of padding that will > be used. If the space left is under this threshold, it is padded. If it is > greater than this threshold, then the next row group is fit into the > remaining space. The current padding maximum is 0. > I think we should change the padding maximum to 8MB. My reasoning is this: we > want this number to be small enough that it won't prevent the library from > writing reasonable row groups, but larger than the minimum size row group we > would want to write. 8MB is 1/16th of the row group default, so I think it is > reasonable: we don't want a row group to be smaller than 8 MB. > We also want this to be large enough that a few row groups in a block don't > cause a tiny row group to be written in the excess space. 8MB accounts for 4 > row groups that are 2MB under-size. In addition, it is reasonable to not > allow row groups under 8MB. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
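[Editor's note] The threshold decision described above can be sketched as follows. This is an editorial simplification under assumed parameter names, not the actual ParquetFileWriter logic: when the space left in the block is at most the padding maximum, it is padded out and the next row group targets the default size; otherwise the next row group is sized to fit the remaining space.

```java
public class PaddingDemo {
    // Decide the target size of the next row group given the space left in the
    // current HDFS block and the padding threshold ("padding maximum").
    static long nextRowGroupSize(long remainingInBlock, long maxPadding,
                                 long defaultRowGroupSize) {
        if (remainingInBlock <= maxPadding) {
            // pad the remainder; the next row group starts at a block boundary
            return defaultRowGroupSize;
        }
        // fit the next row group into the remaining space
        return Math.min(remainingInBlock, defaultRowGroupSize);
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        // 5 MB left, 8 MB padding max: pad rather than write a tiny row group
        assert nextRowGroupSize(5 * mb, 8 * mb, 128 * mb) == 128 * mb;
        // 20 MB left: target a 20 MB row group for the remaining space
        assert nextRowGroupSize(20 * mb, 8 * mb, 128 * mb) == 20 * mb;
    }
}
```

With the old default of 0, any leftover space, however small, became the target for a row group, which is how sub-8 MB row groups could be produced.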
[jira] [Commented] (PARQUET-1251) Clarify ambiguous min/max stats for FLOAT/DOUBLE
[ https://issues.apache.org/jira/browse/PARQUET-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420942#comment-16420942 ] ASF GitHub Bot commented on PARQUET-1251: - rdblue commented on issue #88: PARQUET-1251: Clarify ambiguous min/max stats for FLOAT/DOUBLE URL: https://github.com/apache/parquet-format/pull/88#issuecomment-377624823 +1 Thanks for working on this @gszadovszky and @zivanfi! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Clarify ambiguous min/max stats for FLOAT/DOUBLE > > > Key: PARQUET-1251 > URL: https://issues.apache.org/jira/browse/PARQUET-1251 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Affects Versions: format-2.4.0 >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > Fix For: format-2.5.0 > > > Describe the handling of the ambiguous min/max statistics for FLOAT/DOUBLE > types in case of TypeDefinedOrder. (See PARQUET-1222 for details.) > * When looking for NaN values, min and max should be ignored. > * If the min is a NaN, it should be ignored. > * If the max is a NaN, it should be ignored. > * If the min is +0, the row group may contain -0 values as well. > * If the max is -0, the row group may contain +0 values as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
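[Editor's note] A minimal sketch of how a reader might apply these rules when deciding whether a row group can be skipped for an equality predicate. This is an editorial illustration, not parquet-mr's actual filter code; the method name and signature are assumptions.

```java
public class FloatStatsDemo {
    // Conservative check: can a row group with the given min/max statistics be
    // skipped for the predicate "value == target", under TypeDefinedOrder?
    static boolean canSkip(double min, double max, double target) {
        // When looking for NaN values, min/max must be ignored entirely.
        if (Double.isNaN(target)) return false;
        // A NaN min or max carries no information; keep the row group.
        if (Double.isNaN(min) || Double.isNaN(max)) return false;
        // Widen zeros: a +0 min may hide -0 values, a -0 max may hide +0 values.
        // (min == 0.0d matches both signed zeros, since -0.0 == 0.0 is true.)
        if (min == 0.0d) min = -0.0d;
        if (max == 0.0d) max = +0.0d;
        // Double.compare gives a total order in which -0.0 < +0.0.
        return Double.compare(target, min) < 0 || Double.compare(target, max) > 0;
    }

    public static void main(String[] args) {
        assert canSkip(1.0, 2.0, 3.0);          // 3 lies outside [1, 2]
        assert !canSkip(1.0, 2.0, Double.NaN);  // searching for NaN: keep
        assert !canSkip(Double.NaN, 2.0, 1.5);  // NaN min: keep
        assert !canSkip(+0.0, 5.0, -0.0);       // +0 min may hide -0 values
    }
}
```

Without the zero-widening step, comparing with Double.compare would wrongly skip a row group whose stored min is +0.0 when searching for -0.0.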
Re: parquet-mr next release with PARQUET-1217?
I have no plan for 1.9.1. On Fri, Mar 30, 2018 at 10:42 AM, Henry Robinson wrote: > Great! Do you know of any plans to do a 1.9.1? > > On 30 March 2018 at 09:35, Ryan Blue wrote: > >> I'm planning on getting a 1.10.0 rc out today, if I don't find problems >> with the stats changes. >> >> On Thu, Mar 29, 2018 at 4:18 PM, Henry Robinson wrote: >> >> > Hi all - >> > >> > While using Spark, I got hit by PARQUET-1217 today on some data written >> by >> > Impala. This is a pretty nasty bug, and one that affects Apache Spark >> right >> > now because, AFAICT, there's no release to move to that contains the >> fix, >> > and parquet-mr 1.9.0 is affected. There is a workaround, but it's >> expensive >> > in terms of lost performance. >> > >> > I'm new to the community, so wanted to see if there was a plan to make a >> > release (1.9.1?) in the near future. I'd rather that than have to build >> > short-term workarounds into Spark. >> > >> > Best, >> > Henry >> > >> >> >> >> -- >> Ryan Blue >> Software Engineer >> Netflix >> > > > > -- > Henry Robinson > Software Engineer > Cloudera > 415-994-6679 > -- Ryan Blue Software Engineer Netflix
Re: parquet-mr next release with PARQUET-1217?
Great! Do you know of any plans to do a 1.9.1? On 30 March 2018 at 09:35, Ryan Blue wrote: > I'm planning on getting a 1.10.0 rc out today, if I don't find problems > with the stats changes. > > On Thu, Mar 29, 2018 at 4:18 PM, Henry Robinson wrote: > > > Hi all - > > > > While using Spark, I got hit by PARQUET-1217 today on some data written > by > > Impala. This is a pretty nasty bug, and one that affects Apache Spark > right > > now because, AFAICT, there's no release to move to that contains the fix, > > and parquet-mr 1.9.0 is affected. There is a workaround, but it's > expensive > > in terms of lost performance. > > > > I'm new to the community, so wanted to see if there was a plan to make a > > release (1.9.1?) in the near future. I'd rather that than have to build > > short-term workarounds into Spark. > > > > Best, > > Henry > > > > > > -- > Ryan Blue > Software Engineer > Netflix > -- Henry Robinson Software Engineer Cloudera 415-994-6679
Re: parquet-mr next release with PARQUET-1217?
I'm planning on getting a 1.10.0 rc out today, if I don't find problems with the stats changes. On Thu, Mar 29, 2018 at 4:18 PM, Henry Robinson wrote: > Hi all - > > While using Spark, I got hit by PARQUET-1217 today on some data written by > Impala. This is a pretty nasty bug, and one that affects Apache Spark right > now because, AFAICT, there's no release to move to that contains the fix, > and parquet-mr 1.9.0 is affected. There is a workaround, but it's expensive > in terms of lost performance. > > I'm new to the community, so wanted to see if there was a plan to make a > release (1.9.1?) in the near future. I'd rather that than have to build > short-term workarounds into Spark. > > Best, > Henry > -- Ryan Blue Software Engineer Netflix
[jira] [Commented] (PARQUET-1143) Update Java for format 2.4.0 changes
[ https://issues.apache.org/jira/browse/PARQUET-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420676#comment-16420676 ] ASF GitHub Bot commented on PARQUET-1143: - rdblue commented on issue #430: PARQUET-1143: Update to Parquet format 2.4.0. URL: https://github.com/apache/parquet-mr/pull/430#issuecomment-377564457 @scottcarey, you don't need to update Spark, I have a branch with it updated that we're already running in production. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Update Java for format 2.4.0 changes > > > Key: PARQUET-1143 > URL: https://issues.apache.org/jira/browse/PARQUET-1143 > Project: Parquet > Issue Type: Task > Components: parquet-mr >Affects Versions: 1.9.0, 1.8.2 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1143) Update Java for format 2.4.0 changes
[ https://issues.apache.org/jira/browse/PARQUET-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420243#comment-16420243 ] ASF GitHub Bot commented on PARQUET-1143: - scottcarey commented on issue #430: PARQUET-1143: Update to Parquet format 2.4.0. URL: https://github.com/apache/parquet-mr/pull/430#issuecomment-377463522 Yeah, I looked a little further into what is needed on the Spark side too. Partway through modifying the vectorized readers to use method signatures that take ByteBufferInputStream rather than (byte[], offset), I hit a spot where they called back into code here that did not take a ByteBufferInputStream. It looks like changes on both sides are needed. I think that whole area of code would work better if coded with a DataInput interface instead. You can wrap a ByteBufferInputStream in a DataInputStream, and get free (and decently efficient but not amazing) tools for reading little-endian ints, etc. DataInputStream will be quite a bit faster than calling read() 4 times in a row and constructing the int by hand, though its technique of maintaining a small buffer for reading primitives can be emulated. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Update Java for format 2.4.0 changes > > > Key: PARQUET-1143 > URL: https://issues.apache.org/jira/browse/PARQUET-1143 > Project: Parquet > Issue Type: Task > Components: parquet-mr >Affects Versions: 1.9.0, 1.8.2 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
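[Editor's note] For reference, the JDK offers both approaches mentioned in the comment above, with one caveat: DataInputStream reads big-endian by specification, so little-endian primitives (the byte order Parquet uses on disk) are usually read through a ByteBuffer with its order set explicitly. The sketch below contrasts the three options; it is an editorial illustration, not parquet-mr code.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ReadIntDemo {
    // Little-endian encoding of the int 1: 01 00 00 00
    static final byte[] BYTES = {1, 0, 0, 0};

    // Hand-rolled: four byte reads assembled into an int
    // (the per-value pattern the comment argues against).
    static int byHand(byte[] b) {
        return (b[0] & 0xFF) | (b[1] & 0xFF) << 8
             | (b[2] & 0xFF) << 16 | (b[3] & 0xFF) << 24;
    }

    // ByteBuffer: the JDK's built-in little-endian reader.
    static int viaBuffer(byte[] b) {
        return ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    // DataInputStream: convenient, but big-endian by specification.
    static int viaDataInput(byte[] b) throws IOException {
        return new DataInputStream(new ByteArrayInputStream(b)).readInt();
    }

    public static void main(String[] args) throws IOException {
        assert byHand(BYTES) == 1;
        assert viaBuffer(BYTES) == 1;
        // the same bytes read big-endian give 0x01000000
        assert viaDataInput(BYTES) == 0x01000000;
    }
}
```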