MaxNevermind commented on code in PR #3036:
URL: https://github.com/apache/parquet-java/pull/3036#discussion_r1818179742


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -145,15 +145,18 @@ public class ParquetRewriter implements Closeable {
   private final Queue<TransParquetFileReader> inputFiles = new LinkedList<>();
   private final Queue<TransParquetFileReader> inputFilesToJoin = new LinkedList<>();
   private final MessageType outSchema;
+  private final MessageType outSchemaWithRenamedColumns;

Review Comment:
   As I understand it, that behavior is not supported by the current master
branch version of ParquetRewriter, and I didn't plan to change it in this PR.
The rewriter expects all files within an input group to share the same schema,
for both the main input and the join input (if provided). So there are at most
two schemas: one if no join files are provided, two if join files are provided.
If some of the files are missing a column, then I think `ensureSameSchema()`
will fail.
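
   To illustrate the expectation described above, here is a minimal sketch. It
is not the actual `ensureSameSchema()` implementation: it models a schema as a
plain list of column names, and `SchemaCheckSketch` and `requireSameSchema` are
hypothetical names used only for this example.

```java
import java.util.List;

final class SchemaCheckSketch {

    // Hypothetical stand-in for a schema: an ordered list of column names.
    // Throws as soon as one file's columns differ from the first file's.
    static void requireSameSchema(List<List<String>> fileSchemas) {
        if (fileSchemas.isEmpty()) {
            return;
        }
        List<String> first = fileSchemas.get(0);
        for (List<String> schema : fileSchemas) {
            if (!first.equals(schema)) {
                throw new IllegalArgumentException(
                        "Input files have different schemas: " + first + " vs " + schema);
            }
        }
    }

    public static void main(String[] args) {
        // Two files with identical columns pass the check.
        requireSameSchema(List.of(List.of("id", "name"), List.of("id", "name")));

        // A file that is missing a column is rejected.
        try {
            requireSameSchema(List.of(List.of("id", "name"), List.of("id")));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

   In this model the check would run once per input group (main files, then
join files), which is why at most two distinct schemas are ever involved.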
   
   The renaming feature doesn't change that: a renamed column can be in the main
input or in a join input. If a join column name overlaps with a main input
column, the result of the join depends on the `overwriteInputWithJoinColumns`
option, but that doesn't change the output of renaming. Renaming a column should
not change whether its data comes from the main files or the join files.
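
   The separation between source selection and renaming can be sketched as
follows. This is a simplified model under stated assumptions, not the
ParquetRewriter code: columns are plain strings, and `RenameSourceSketch`,
`pickSources` and `rename` are hypothetical names. Only the
`overwriteInputWithJoinColumns` flag name comes from the rewriter options.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RenameSourceSketch {

    enum Source { MAIN, JOIN }

    // Step 1 (hypothetical model): decide which input supplies each column.
    // On a name overlap the join input wins only if the flag is set.
    static Map<String, Source> pickSources(
            List<String> mainCols, List<String> joinCols, boolean overwriteInputWithJoinColumns) {
        Map<String, Source> sources = new LinkedHashMap<>();
        for (String col : mainCols) {
            sources.put(col, Source.MAIN);
        }
        for (String col : joinCols) {
            if (!sources.containsKey(col) || overwriteInputWithJoinColumns) {
                sources.put(col, Source.JOIN);
            }
        }
        return sources;
    }

    // Step 2: apply renames to the output names only.
    // The source chosen in step 1 is carried over unchanged.
    // Assumption: a new name never collides with another output column.
    static Map<String, Source> rename(Map<String, Source> sources, Map<String, String> renames) {
        Map<String, Source> out = new LinkedHashMap<>();
        for (Map.Entry<String, Source> e : sources.entrySet()) {
            out.put(renames.getOrDefault(e.getKey(), e.getKey()), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> mainCols = List.of("id", "name");
        List<String> joinCols = List.of("name", "score");
        Map<String, String> renames = Map.of("name", "full_name");

        System.out.println(rename(pickSources(mainCols, joinCols, false), renames));
        System.out.println(rename(pickSources(mainCols, joinCols, true), renames));
    }
}
```

   In both runs the overlapping column `name` is emitted as `full_name`. Only
the flag decides whether its data is taken from the main or the join input.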



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

