MaxNevermind commented on code in PR #1273:
URL: https://github.com/apache/parquet-mr/pull/1273#discussion_r1524250339
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -115,19 +108,49 @@ public class ParquetRewriter implements Closeable {
private ParquetMetadata meta = null;
// created_by information of current reader being processed
private String originalCreatedBy = "";
- // Unique created_by information from all input files
- private final Set<String> allOriginalCreatedBys = new HashSet<>();
// The index cache strategy
private final IndexCache.CacheStrategy indexCacheStrategy;
public ParquetRewriter(RewriteOptions options) throws IOException {
ParquetConfiguration conf = options.getParquetConfiguration();
OutputFile out = options.getParquetOutputFile();
- openInputFiles(options.getParquetInputFiles(), conf);
+ inputFiles.addAll(getFileReaders(options.getParquetInputFiles(), conf));
+ List<Queue<TransParquetFileReader>> inputFilesR = options.getParquetInputFilesR()
+ .stream()
+ .map(x -> getFileReaders(x, conf))
+ .collect(Collectors.toList());
+ ensureSameSchema(inputFiles);
+ inputFilesR.forEach(this::ensureSameSchema);
Review Comment:
On second thought, the right side could actually remain a
List<List<InputFile>>. The requirements are: the order of files for each column
must be expressed to us somehow, and the total number of rows for each column
across all of its files must equal the total on the left side. With those
satisfied, we can deduce which file holds which column, etc. Let me think about
it a little more.
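
A minimal sketch of the row-count requirement above. It is not code from this PR: the class and method names (RowCountCheck, totalRows, rowCountsMatch) are hypothetical, and each file is represented only by its row count, where real code would read that count from the file footer.

```java
import java.util.Arrays;
import java.util.List;

public class RowCountCheck {
    // Hypothetical helper: sums the per-file row counts of one group of files.
    static long totalRows(List<Long> rowCountsPerFile) {
        long total = 0;
        for (long n : rowCountsPerFile) {
            total += n;
        }
        return total;
    }

    // Returns true when every right-side group has the same total row count
    // as the left side. Files inside a group are assumed to be listed in order.
    static boolean rowCountsMatch(List<Long> left, List<List<Long>> right) {
        long expected = totalRows(left);
        for (List<Long> group : right) {
            if (totalRows(group) != expected) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Left side: two files. Right side: two groups, split differently.
        List<Long> left = Arrays.asList(100L, 50L);
        List<List<Long>> right = Arrays.asList(
                Arrays.asList(150L),
                Arrays.asList(70L, 80L));
        System.out.println(rowCountsMatch(left, right));
    }
}
```

Only the totals have to agree. The file boundaries on the right do not have to line up with those on the left, which is why the order of files within each group has to be given explicitly.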
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]