[GitHub] [iceberg] zhongyujiang commented on a diff in pull request #3249: Optimized spark vectorized read parquet decimal

via GitHub Thu, 16 Mar 2023 22:51:32 -0700


zhongyujiang commented on code in PR #3249:
URL: https://github.com/apache/iceberg/pull/3249#discussion_r1139776700



##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java:
##########
@@ -235,7 +237,7 @@ private void allocateDictEncodedVector() {
     this.readType = ReadType.DICTIONARY;
   }
 
-  private void allocateVectorBasedOnOriginalType(PrimitiveType primitive, Field arrowField) {
+  private void allocateVectorBasedOnOriginalParquetType(PrimitiveType primitive, Field arrowField) {

Review Comment:
   I think this change is unnecessary.



##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java:
##########
@@ -129,11 +130,31 @@ public VectorizedReader<?> primitive(
     if (desc.getMaxRepetitionLevel() > 0) {
       return null;
     }
-    Types.NestedField icebergField = icebergSchema.findField(parquetFieldId);
-    if (icebergField == null) {
+    Types.NestedField logicalType = icebergSchema.findField(parquetFieldId);
+    if (logicalType == null) {
       return null;
     }
+
+    Types.NestedField physicalType = logicalType;
+    PrimitiveType.PrimitiveTypeName typeName = primitive.getPrimitiveTypeName();
+    if (OriginalType.DECIMAL.equals(primitive.getOriginalType())) {
+      org.apache.iceberg.types.Type type;
+      if (PrimitiveType.PrimitiveTypeName.INT64.equals(typeName)) {
+        // Use BigIntVector for long backed decimal
+        type = Types.LongType.get();
+      } else if (PrimitiveType.PrimitiveTypeName.INT32.equals(typeName)) {
+        // Use IntVector for int backed decimal
+        type = Types.IntegerType.get();
+      } else {
+        // Use FixedSizeBinaryVector for binary backed decimal
+        type = Types.FixedType.ofLength(primitive.getTypeLength());
+      }
+      physicalType =
+          Types.NestedField.of(
+              logicalType.fieldId(), logicalType.isOptional(), logicalType.name(), type);

Review Comment:
   Do we really need to work out both the physical type and the logical type here and pass them to the reader?
   I understand that we need to know the physical type of a decimal field when constructing the Arrow `Field`, but can we move this step into `VectorizedArrowReader#allocateFieldVector` directly? That is how schema evolution is handled today: the correct Arrow `Field` is built from the actual Parquet column type; see `VectorizedArrowReader#allocateVectorBasedOnTypeName`.
   The change would be much simpler if this is possible.
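
   To illustrate the suggestion, here is a minimal standalone sketch of choosing the vector for a decimal column from its physical Parquet type at allocation time. The class, the enums, and `vectorForDecimal` are hypothetical stand-ins, not the Parquet, Arrow, or Iceberg APIs; only the INT32/INT64/fixed-length mapping is taken from the diff above.

```java
// Hypothetical stand-ins: these enums only model the decision discussed in
// this comment. They are NOT the real Parquet or Arrow types.
public class DecimalVectorSketch {
  enum ParquetPhysicalType { INT32, INT64, FIXED_LEN_BYTE_ARRAY }

  enum VectorKind { INT_VECTOR, BIG_INT_VECTOR, FIXED_SIZE_BINARY_VECTOR }

  // Pick the vector kind for a decimal column from its physical Parquet type
  // when the vector is allocated, so no second "physical" field has to be
  // built earlier and passed to the reader.
  static VectorKind vectorForDecimal(ParquetPhysicalType physicalType) {
    switch (physicalType) {
      case INT32:
        return VectorKind.INT_VECTOR; // int-backed decimal
      case INT64:
        return VectorKind.BIG_INT_VECTOR; // long-backed decimal
      default:
        return VectorKind.FIXED_SIZE_BINARY_VECTOR; // binary-backed decimal
    }
  }

  public static void main(String[] args) {
    for (ParquetPhysicalType type : ParquetPhysicalType.values()) {
      System.out.println(type + " -> " + vectorForDecimal(type));
    }
  }
}
```

   With the decision kept in one place like this, the builder would only need to pass the Iceberg field through unchanged, assuming the reader already has access to the Parquet column type when it allocates.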


