[GitHub] [iceberg] pvary commented on a change in pull request #3980: Hive: Parquet vectorization support for Hive 3

GitBox Fri, 28 Jan 2022 05:42:00 -0800


pvary commented on a change in pull request #3980:
URL: https://github.com/apache/iceberg/pull/3980#discussion_r794511997




##########
File path: hive3/src/main/java/org/apache/iceberg/mr/hive/vector/ParquetSchemaFieldNameVisitor.java
##########
@@ -55,17 +58,25 @@ public Type struct(Types.StructType expected, GroupType struct, List<Type> field
 
     for (Types.NestedField field : expectedFields) {
       int id = field.fieldId();
-      if (id != MetadataColumns.ROW_POSITION.fieldId() && id != MetadataColumns.IS_DELETED.fieldId()) {
-        Type fieldInFileSchema = typesById.get(id);
-        if (fieldInFileSchema == null) {
-          // New field - not in this parquet file yet, add the new field name instead of null
+      if (id == MetadataColumns.ROW_POSITION.fieldId() || id == MetadataColumns.IS_DELETED.fieldId()) {
+        continue;
+      }
+      Type fieldInPrunedFileSchema = typesById.get(id);
+      if (fieldInPrunedFileSchema == null) {
+        if (!originalFileSchema.containsField(field.name())) {
+          // Must be a new field - it isn't in this parquet file yet, so add the new field name instead of null
           appendToColNamesList(isMessageType, field.name());
         } else {
-          // Already present column in this parquet file, add the original name
-          types.add(fieldInFileSchema);
-          appendToColNamesList(isMessageType, fieldInFileSchema.getName());
+          // This field is found in the parquet file with a different ID, so it must have been recreated since.
+          // Inserting a dummy col name to force Hive Parquet reader returning null for this column.
+          appendToColNamesList(isMessageType, DUMMY_COL_NAME);

Review comment:
       to -> two 😄 
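
For readers following the hunk without the full file, the three-way decision it introduces can be sketched in isolation. This is only an illustrative sketch, not the code from the PR: it uses plain maps and sets instead of the Iceberg and Parquet schema types, `resolveColumnNames` and its parameters are hypothetical names, the `"<dummy>"` value stands in for a `DUMMY_COL_NAME` constant whose real value is not shown above, and the branch for fields that are found by ID is inferred from the removed lines.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ColumnNameSketch {
  // Hypothetical placeholder; the real DUMMY_COL_NAME value is not shown in the hunk.
  static final String DUMMY_COL_NAME = "<dummy>";

  /**
   * expectedFields: field ID -> name in the table schema (iteration order matters).
   * prunedFileNamesById: field ID -> name in the pruned file schema.
   * originalFileFieldNames: top-level field names in the unpruned file schema.
   * metadataFieldIds: IDs of metadata columns (e.g. row position, is-deleted) to skip.
   */
  static List<String> resolveColumnNames(
      Map<Integer, String> expectedFields,
      Map<Integer, String> prunedFileNamesById,
      Set<String> originalFileFieldNames,
      Set<Integer> metadataFieldIds) {
    List<String> colNames = new ArrayList<>();
    for (Map.Entry<Integer, String> field : expectedFields.entrySet()) {
      int id = field.getKey();
      if (metadataFieldIds.contains(id)) {
        continue; // metadata columns are not read from the file
      }
      String nameInPrunedFile = prunedFileNamesById.get(id);
      if (nameInPrunedFile != null) {
        // Assumed from the removed lines: column exists in the file, keep its original name.
        colNames.add(nameInPrunedFile);
      } else if (!originalFileFieldNames.contains(field.getValue())) {
        // New field that this file does not contain yet: use the new name.
        colNames.add(field.getValue());
      } else {
        // Same name but a different ID in the file: the column was recreated,
        // so use a dummy name that matches nothing in the file.
        colNames.add(DUMMY_COL_NAME);
      }
    }
    return colNames;
  }

  public static void main(String[] args) {
    Map<Integer, String> expected = new LinkedHashMap<>();
    expected.put(1, "id");
    expected.put(2, "name_new");   // renamed; the file still calls it "name"
    expected.put(3, "added");      // added after the file was written
    expected.put(4, "recreated");  // dropped and re-added under a new ID
    expected.put(1000, "_pos");    // hypothetical metadata column ID

    System.out.println(resolveColumnNames(
        expected,
        Map.of(1, "id", 2, "name"),
        Set.of("id", "name", "recreated"),
        Set.of(1000)));
  }
}
```

The point of the last branch is that matching by name alone would read stale data from the old column; a name that cannot match anything in the file avoids that.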




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


