wgtmac commented on code in PR #3036:
URL: https://github.com/apache/parquet-java/pull/3036#discussion_r1814254827
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -431,7 +480,27 @@ private void processBlock(
if (chunk.isEncrypted()) {
throw new IOException("Column " + chunk.getPath().toDotString() + " is
already encrypted");
}
- ColumnDescriptor descriptor = outSchema.getColumns().get(outColumnIdx);
+
+ ColumnChunkMetaData chunkColumnsRenamed = chunk;
+ if (renamedColumns != null && !renamedColumns.isEmpty()) {
+ chunkColumnsRenamed = ColumnChunkMetaData.get(
Review Comment:
If we add a field to ColumnChunkMetaData in the future, the current
`ColumnChunkMetaData.get` method gets deprecated and we can easily get out of
sync here. I'm not sure if it is a good idea to introduce a copy|clone method
to it with only the name change.
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -431,7 +480,27 @@ private void processBlock(
if (chunk.isEncrypted()) {
throw new IOException("Column " + chunk.getPath().toDotString() + " is
already encrypted");
}
- ColumnDescriptor descriptor = outSchema.getColumns().get(outColumnIdx);
+
+ ColumnChunkMetaData chunkColumnsRenamed = chunk;
Review Comment:
nit: perhaps use `normalize` to replace `rename`?
- chunkColumnsRenamed -> normalizedColumnChunk
- renameFieldsInPath -> normalizePath
- renameNameInType -> normalizeType
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -145,15 +145,18 @@ public class ParquetRewriter implements Closeable {
private final Queue<TransParquetFileReader> inputFiles = new LinkedList<>();
private final Queue<TransParquetFileReader> inputFilesToJoin = new
LinkedList<>();
private final MessageType outSchema;
+ private final MessageType outSchemaWithRenamedColumns;
Review Comment:
Is it possible not to add a new `outSchemaWithRenamedColumns` and stick to
`outSchema` with field name replaced?
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java:
##########
@@ -561,13 +582,29 @@ public RewriteOptions build() {
!maskColumns.containsKey(pruneColumn), "Cannot prune and mask
same column");
}
}
-
if (encryptColumns != null) {
for (String pruneColumn : pruneColumns) {
Preconditions.checkArgument(
!encryptColumns.contains(pruneColumn), "Cannot prune and
encrypt same column");
}
}
+ if (renameColumns != null) {
+ for (Map.Entry<String, String> entry : renameColumns.entrySet()) {
+ Preconditions.checkArgument(
+ !encryptColumns.contains(entry.getKey()), "Cannot prune and
rename same column");
+ }
+ }
+ }
+
+ if (renameColumns != null && !renameColumns.isEmpty()) {
+ for (Map.Entry<String, String> entry : renameColumns.entrySet()) {
Review Comment:
We need to check if the renamed field name conflicts with any other output
field name.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]