MaxNevermind commented on code in PR #1273:
URL: https://github.com/apache/parquet-mr/pull/1273#discussion_r1526967519
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -115,19 +108,49 @@ public class ParquetRewriter implements Closeable {
private ParquetMetadata meta = null;
// created_by information of current reader being processed
private String originalCreatedBy = "";
- // Unique created_by information from all input files
- private final Set<String> allOriginalCreatedBys = new HashSet<>();
// The index cache strategy
private final IndexCache.CacheStrategy indexCacheStrategy;
public ParquetRewriter(RewriteOptions options) throws IOException {
ParquetConfiguration conf = options.getParquetConfiguration();
OutputFile out = options.getParquetOutputFile();
- openInputFiles(options.getParquetInputFiles(), conf);
+ inputFiles.addAll(getFileReaders(options.getParquetInputFiles(), conf));
+ List<Queue<TransParquetFileReader>> inputFilesR = options.getParquetInputFilesR()
+ .stream()
+ .map(x -> getFileReaders(x, conf))
+ .collect(Collectors.toList());
+ ensureSameSchema(inputFiles);
+ inputFilesR.forEach(this::ensureSameSchema);
Review Comment:
I thought a little more about your suggestion. My notes:
- We can do that, but we should not require the same number of rows for
VerticalInputFiles in each HorizontalInputFile. In my opinion, tools that
produce Parquet files often use a size threshold in bytes (e.g. the HDFS block
size) to decide when to split off the next file, so it will be challenging for
users to produce such files with matching row counts.
- If we do that, the state that tracks the current file for each column on the
right side will become more complex. Right now that state is easily maintained
by an input file queue for each file group.
- I can see that we would definitely cover more use cases that way, but I'm not
sure it would be useful. In my experience, Parquet files are usually produced
by widespread libraries/frameworks, and those tools usually produce Parquet
datasets/tables, which implies that all output files have the same schema. I can
see that somebody could produce Parquet files like the ones you described
programmatically, but in such cases those files are unlikely to be large. The
main use case of my feature is optimizing the stitching of very large datasets;
if no large datasets are involved, it might be easier to just rewrite the files
completely using a full read & write.
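
To make the "one input file queue per file group" point concrete, here is a minimal sketch of how that state can stay simple even when file boundaries do not line up between groups. It is not the code in this PR: `FileGroupStitchSketch`, `GroupCursor`, and `planChunks` are hypothetical names, files are modeled only by their row counts, and the sketch assumes each group has the same total number of rows.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch, not ParquetRewriter code: each file group is a queue
// of per-file row counts, and the only state per group is "rows left in the
// file currently being read".
final class FileGroupStitchSketch {

    static final class GroupCursor {
        private final Queue<Long> pendingRowCounts;
        private long remainingInCurrent = 0;

        GroupCursor(List<Long> rowCounts) {
            this.pendingRowCounts = new ArrayDeque<>(rowCounts);
        }

        // Rows left in the current file; moves on to the next queued file
        // when the current one is exhausted. Returns 0 once the group is done.
        long remaining() {
            while (remainingInCurrent == 0 && !pendingRowCounts.isEmpty()) {
                remainingInCurrent = pendingRowCounts.poll();
            }
            return remainingInCurrent;
        }

        void consume(long rows) {
            if (rows > remaining()) {
                throw new IllegalArgumentException("cannot consume past the current file");
            }
            remainingInCurrent -= rows;
        }
    }

    // Returns the sizes of the row ranges that can be copied without crossing
    // a file boundary in either group.
    static List<Long> planChunks(List<Long> leftRowCounts, List<Long> rightRowCounts) {
        GroupCursor left = new GroupCursor(leftRowCounts);
        GroupCursor right = new GroupCursor(rightRowCounts);
        List<Long> chunks = new ArrayList<>();
        while (left.remaining() > 0 && right.remaining() > 0) {
            long n = Math.min(left.remaining(), right.remaining());
            left.consume(n);
            right.consume(n);
            chunks.add(n);
        }
        // Assumption of this sketch: both groups describe the same rows.
        if (left.remaining() != 0 || right.remaining() != 0) {
            throw new IllegalArgumentException("file groups must have the same total row count");
        }
        return chunks;
    }
}
```

For example, a left group with files of 100 and 50 rows and a right group with files of 30 and 120 rows yields chunks of 30, 70, and 50 rows. Allowing a different file set per column, as suggested, would presumably need one such cursor per column rather than per group.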