rdblue commented on a change in pull request #1388:
URL: https://github.com/apache/iceberg/pull/1388#discussion_r479427662
##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java
##########
@@ -690,4 +692,25 @@ private ParquetReadBuilder(org.apache.parquet.io.InputFile file) {
      return new ParquetReadSupport<>(schema, readSupport, callInit, nameMapping);
    }
  }
+
+  /**
+   * Concatenates the row groups of several Parquet files into a single output file.
+   *
+   * @param inputFiles an {@link Iterable} of Parquet files; the iteration order determines the order in which
+   *                   the contents of the files are read and written to {@code outputFile}
+   * @param outputFile the output Parquet file containing all the data from {@code inputFiles}
+   * @param rowGroupSize the row group size to use when writing {@code outputFile}
+   * @param schema the schema of the data
+   * @param metadata extra metadata to write in the footer of {@code outputFile}
+   */
+  public static void concat(Iterable<File> inputFiles, File outputFile, int rowGroupSize, Schema schema,
+                            Map<String, String> metadata) throws IOException {
+    OutputFile file = Files.localOutput(outputFile);
+    ParquetFileWriter writer = new ParquetFileWriter(
+        ParquetIO.file(file), ParquetSchemaUtil.convert(schema, "table"),
+        ParquetFileWriter.Mode.CREATE, rowGroupSize, 0);
Review comment:
We can use the default row group size from table properties here. It
will be ignored when appending files because row groups are appended directly
and not rewritten.
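For context, the suggested change might look roughly like the sketch below. This is not code from the PR: it assumes the default comes from the `TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT` string constant in Iceberg, and that the writer appends row groups with `ParquetFileWriter.appendFile`, which copies them without rewriting (so the row group size only matters if new data were ever written).

```java
// Hedged sketch, not the PR's implementation: take the row group size from the
// table-property default instead of a caller-supplied parameter. The constant
// name and parsing below are assumptions about the Iceberg API.
public static void concat(Iterable<File> inputFiles, File outputFile, Schema schema,
                          Map<String, String> metadata) throws IOException {
  OutputFile file = Files.localOutput(outputFile);
  int rowGroupSize = Integer.parseInt(TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT);
  ParquetFileWriter writer = new ParquetFileWriter(
      ParquetIO.file(file), ParquetSchemaUtil.convert(schema, "table"),
      ParquetFileWriter.Mode.CREATE, rowGroupSize, 0);
  writer.start();
  for (File inputFile : inputFiles) {
    // appendFile copies row groups byte-for-byte, so rowGroupSize is ignored here.
    writer.appendFile(HadoopInputFile.fromPath(new Path(inputFile.toURI()), new Configuration()));
  }
  writer.end(metadata);
}
```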