zhongyujiang commented on code in PR #3249:
URL: https://github.com/apache/iceberg/pull/3249#discussion_r1139776700
##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java:
##########
@@ -235,7 +237,7 @@ private void allocateDictEncodedVector() {
this.readType = ReadType.DICTIONARY;
}
- private void allocateVectorBasedOnOriginalType(PrimitiveType primitive, Field arrowField) {
+ private void allocateVectorBasedOnOriginalParquetType(PrimitiveType primitive, Field arrowField) {
Review Comment:
I think this change is unnecessary.
##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java:
##########
@@ -129,11 +130,31 @@ public VectorizedReader<?> primitive(
if (desc.getMaxRepetitionLevel() > 0) {
return null;
}
- Types.NestedField icebergField = icebergSchema.findField(parquetFieldId);
- if (icebergField == null) {
+ Types.NestedField logicalType = icebergSchema.findField(parquetFieldId);
+ if (logicalType == null) {
return null;
}
+
+ Types.NestedField physicalType = logicalType;
+ PrimitiveType.PrimitiveTypeName typeName = primitive.getPrimitiveTypeName();
+ if (OriginalType.DECIMAL.equals(primitive.getOriginalType())) {
+ org.apache.iceberg.types.Type type;
+ if (PrimitiveType.PrimitiveTypeName.INT64.equals(typeName)) {
+ // Use BigIntVector for long backed decimal
+ type = Types.LongType.get();
+ } else if (PrimitiveType.PrimitiveTypeName.INT32.equals(typeName)) {
+ // Use IntVector for int backed decimal
+ type = Types.IntegerType.get();
+ } else {
+ // Use FixedSizeBinaryVector for binary backed decimal
+ type = Types.FixedType.ofLength(primitive.getTypeLength());
+ }
+ physicalType =
+ Types.NestedField.of(
+ logicalType.fieldId(), logicalType.isOptional(), logicalType.name(), type);
Review Comment:
Do we really need to work out both the physical type and the logical type here
and pass them to the reader?
I understand that we need to know the physical type of a decimal field when
constructing the Arrow `Field`, but can we move this step directly into
`VectorizedArrowReader#allocateFieldVector`? That would build the correct Arrow
`Field` from the actual Parquet column type, as is already done for schema
evolution; see `VectorizedArrowReader#allocateVectorBasedOnTypeName`.
The change would be much simpler if this can be done.
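To make the suggestion concrete, here is a rough, self-contained sketch of the decision I have in mind, made at allocation time from the Parquet column type alone. Everything in it is a stand-in: `PhysicalType` mimics Parquet's `PrimitiveType.PrimitiveTypeName`, and `VectorSpec` and `forDecimal` are hypothetical names, not existing Iceberg or Arrow APIs. The real code would build an Arrow `Field` and allocate the vector instead of returning a description.

```java
// Sketch only: stand-in types, not the real Parquet/Arrow/Iceberg classes.
final class DecimalVectorChoice {

  // Hypothetical stand-in for Parquet's PrimitiveType.PrimitiveTypeName.
  enum PhysicalType { INT32, INT64, FIXED_LEN_BYTE_ARRAY }

  // Hypothetical description of the Arrow vector the reader would allocate.
  static final class VectorSpec {
    final String vectorClass; // simple name of the Arrow vector class
    final int byteWidth;      // width in bytes of one value

    VectorSpec(String vectorClass, int byteWidth) {
      this.vectorClass = vectorClass;
      this.byteWidth = byteWidth;
    }
  }

  // Picks the vector for a decimal column from its physical type,
  // mirroring the three branches in the diff under review.
  static VectorSpec forDecimal(PhysicalType physical, int typeLength) {
    switch (physical) {
      case INT64:
        // long-backed decimal
        return new VectorSpec("BigIntVector", 8);
      case INT32:
        // int-backed decimal
        return new VectorSpec("IntVector", 4);
      default:
        // binary-backed decimal, width taken from the Parquet type length
        return new VectorSpec("FixedSizeBinaryVector", typeLength);
    }
  }

  public static void main(String[] args) {
    VectorSpec spec = forDecimal(PhysicalType.FIXED_LEN_BYTE_ARRAY, 16);
    System.out.println(spec.vectorClass + " width=" + spec.byteWidth);
  }
}
```

With something like this inside the reader, `VectorizedReaderBuilder` could keep passing only the Iceberg field, and the physical/logical split would not need to leak into the builder.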
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]