MaxNevermind commented on code in PR #3036:
URL: https://github.com/apache/parquet-java/pull/3036#discussion_r1818179742
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -145,15 +145,18 @@ public class ParquetRewriter implements Closeable {
private final Queue<TransParquetFileReader> inputFiles = new LinkedList<>();
private final Queue<TransParquetFileReader> inputFilesToJoin = new LinkedList<>();
private final MessageType outSchema;
+ private final MessageType outSchemaWithRenamedColumns;
Review Comment:
As I understand it, that behavior is not supported by the current master
branch version of ParquetRewriter, and I didn't plan to change it in this PR.
It expects all main input files to share the same schema, and all join input
files (if provided) to share the same schema, so there are at most two
schemas: one if no join files are provided, two if join files are provided.
If some of the files are missing a column, then I think `ensureSameSchema()`
will fail.
The renaming feature doesn't change that: a renamed column can be in the main
input or in a join input. If a join column's name overlaps with a main input
column, the result of the join depends on the `overwriteInputWithJoinColumns`
option, but that option doesn't change the outcome of renaming. If that column
is to be renamed, the renaming should not affect whether its data comes from
the main or the join files.
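
To illustrate what I mean, here is a rough sketch that models the columns as
plain strings. It is not the actual ParquetRewriter code: the class
`RenameJoinSketch` and the method `resolve` are hypothetical, and I am assuming
that `overwriteInputWithJoinColumns = true` makes the join column win on a name
overlap while `false` keeps the main column. The point is only that the source
is picked first and the rename is applied afterwards, so the two don't
interfere.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not part of parquet-java.
class RenameJoinSketch {

    // Returns output column name -> source of its data ("main" or "join").
    static Map<String, String> resolve(
            List<String> mainColumns,
            List<String> joinColumns,
            boolean overwriteInputWithJoinColumns,
            Map<String, String> renames) {
        // Step 1: decide the source of every column, using the original names.
        Map<String, String> sourceByColumn = new LinkedHashMap<>();
        for (String column : mainColumns) {
            sourceByColumn.put(column, "main");
        }
        for (String column : joinColumns) {
            if (overwriteInputWithJoinColumns) {
                sourceByColumn.put(column, "join"); // join wins on overlap (assumed)
            } else {
                sourceByColumn.putIfAbsent(column, "join"); // main wins on overlap (assumed)
            }
        }
        // Step 2: apply renames; this never touches the chosen source.
        Map<String, String> output = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : sourceByColumn.entrySet()) {
            String outName = renames.getOrDefault(entry.getKey(), entry.getKey());
            output.put(outName, entry.getValue());
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> main = List.of("id", "name");
        List<String> join = List.of("name", "extra");
        Map<String, String> renames = Map.of("name", "full_name");
        System.out.println(resolve(main, join, false, renames));
        System.out.println(resolve(main, join, true, renames));
    }
}
```

In both calls the overlapping column `name` comes out as `full_name`; only its
source differs depending on the flag.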
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]