MaxNevermind commented on code in PR #1273:
URL: https://github.com/apache/parquet-mr/pull/1273#discussion_r1526967519


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -115,19 +108,49 @@ public class ParquetRewriter implements Closeable {
   private ParquetMetadata meta = null;
   // created_by information of current reader being processed
   private String originalCreatedBy = "";
-  // Unique created_by information from all input files
-  private final Set<String> allOriginalCreatedBys = new HashSet<>();
   // The index cache strategy
   private final IndexCache.CacheStrategy indexCacheStrategy;
 
   public ParquetRewriter(RewriteOptions options) throws IOException {
     ParquetConfiguration conf = options.getParquetConfiguration();
     OutputFile out = options.getParquetOutputFile();
-    openInputFiles(options.getParquetInputFiles(), conf);
+    inputFiles.addAll(getFileReaders(options.getParquetInputFiles(), conf));
+    List<Queue<TransParquetFileReader>> inputFilesR = options.getParquetInputFilesR()
+        .stream()
+        .map(x -> getFileReaders(x, conf))
+        .collect(Collectors.toList());
+    ensureSameSchema(inputFiles);
+    inputFilesR.forEach(this::ensureSameSchema);

Review Comment:
   I thought a little more about your suggestion. My notes:
   - We can do that, but we should not require the same number of rows for 
VerticalInputFiles in each HorizontalInputFile. In my opinion, tools that 
produce Parquet files often use a size threshold in bytes (e.g. the HDFS block 
size) to decide when to start the next file, so it will be challenging for 
users to produce such files with matching row counts.
   - If we do that, the state that tracks the current file for each column on 
the right side will become more complex. Right now the state is easily 
maintained by an input file queue for each file group.
   - I can see that we would definitely cover more use cases that way, but I'm 
not sure it would be useful. In my experience, Parquet files are usually 
produced by widespread libraries/frameworks, and those tools usually produce 
Parquet datasets/tables, which implies that all output files have the same 
schema. I can see that somebody could produce the Parquet files you described 
programmatically, but in such cases those files are unlikely to be large. The 
main use case of my feature is optimizing the stitching of very large 
datasets; if no large datasets are involved, it might be easier to just 
rewrite the files completely using a full read & write.
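
To illustrate the second note, here is a minimal sketch of the "one queue per file group" state. It is not the PR's code: `FakeFile`, `GroupCursor`, and `advance` are hypothetical names, and a file is reduced to a name plus a row count instead of a real `TransParquetFileReader`. It only shows that, when file boundaries in the groups do not line up, the per-group state can stay as small as the head of the queue plus the rows left in it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

class FileGroupQueues {
  // Hypothetical stand-in for a Parquet file reader: just a name and a row count.
  static final class FakeFile {
    final String name;
    final long rows;

    FakeFile(String name, long rows) {
      this.name = name;
      this.rows = rows;
    }
  }

  // Per-group state: the remaining files and the rows left in the head file.
  static final class GroupCursor {
    private final Queue<FakeFile> files;
    private long rowsLeftInHead;

    GroupCursor(List<FakeFile> files) {
      this.files = new ArrayDeque<>(files);
      this.rowsLeftInHead = this.files.isEmpty() ? 0 : this.files.peek().rows;
    }

    // Consume n rows from this group and return the names of the files touched.
    List<String> advance(long n) {
      List<String> touched = new ArrayList<>();
      while (n > 0) {
        if (files.isEmpty()) {
          throw new IllegalStateException("file group has fewer rows than requested");
        }
        touched.add(files.peek().name);
        long take = Math.min(n, rowsLeftInHead);
        n -= take;
        rowsLeftInHead -= take;
        if (rowsLeftInHead == 0) {
          // Head file is exhausted: drop it and move on to the next one.
          files.poll();
          rowsLeftInHead = files.isEmpty() ? 0 : files.peek().rows;
        }
      }
      return touched;
    }
  }

  public static void main(String[] args) {
    // Left files hold 100 and 50 rows; right files hold 60 and 90 rows,
    // so file boundaries differ although the totals match (150 rows each).
    GroupCursor right = new GroupCursor(
        Arrays.asList(new FakeFile("r1", 60), new FakeFile("r2", 90)));
    System.out.println(right.advance(100)); // rows of the first left file
    System.out.println(right.advance(50));  // rows of the second left file
  }
}
```

With per-column file choices on the right side, a cursor like this would be needed for every column instead of one per group, which is the extra complexity the note refers to.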



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

