Copilot commented on code in PR #6527:
URL: https://github.com/apache/hive/pull/6527#discussion_r3365024680
##########
iceberg/iceberg-handler/src/test/results/positive/llap/vectorized_iceberg_read_parquet.q.out:
##########
@@ -429,7 +429,7 @@ POSTHOOK: query: select max(t_float), t_double, t_boolean,
t_int, t_bigint, t_bi
POSTHOOK: type: QUERY
POSTHOOK: Input: default@tbl_ice_parquet_all_types
#### A masked pattern was here ####
-1.1 1.2 false 4 567890123456789 6 col7 2012-10-03
19:58:08 1234-09-02 10.01
+1.1 1.2 false 4 567890123456789 6 col7 2012-10-03
19:58:08 1234-09-02 0.00
Review Comment:
This test output expects the Parquet-written decimal value to be 0.00, but
the query inserts cast('10.01' as decimal(4,2)) into tbl_ice_parquet_all_types
(see
iceberg-handler/src/test/queries/positive/vectorized_iceberg_read_parquet.q:27).
The 0.00 here indicates a regression in Parquet DECIMAL_64 reading rather than
a legitimate output change; the expected output should remain 10.01 once the
reader is fixed.
##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedPrimitiveColumnReader.java:
##########
@@ -686,27 +655,80 @@ private void decodeDictionaryIds(
}
}
+ private void fillDecimalPrecisionScale(DecimalColumnVector c) {
+ short[] ps = getDecimalPrecisionScale();
+ c.precision = ps[0];
+ c.scale = ps[1];
+ }
+
+ private void fillDecimal64PrecisionScale(Decimal64ColumnVector c) {
+ short[] ps = getDecimalPrecisionScale();
+ c.precision = ps[0];
+ c.scale = ps[1];
+ }
+
/**
- * The decimal precision and scale is filled into decimalColumnVector. If
the data in
- * Parquet is in decimal, the precision and scale will come in from
decimalLogicalType. If parquet
- * is not in decimal, then this call is made because HMS shows the type as
decimal. So, the
- * precision and scale are picked from hiveType.
- *
- * @param decimalLogicalType
- * @param decimalColumnVector
+ * Precision/scale for this decimal column: from the Parquet decimal logical
type when present,
+ * otherwise from the Hive type (Parquet stores it as a non-decimal physical
type but HMS reports
+ * decimal).
*/
- private void fillDecimalPrecisionScale(DecimalLogicalTypeAnnotation
decimalLogicalType,
- DecimalColumnVector decimalColumnVector) {
- if (decimalLogicalType != null) {
- decimalColumnVector.precision = (short)
decimalLogicalType.getPrecision();
- decimalColumnVector.scale = (short) decimalLogicalType.getScale();
+ private short[] getDecimalPrecisionScale() {
+ if (type.getLogicalTypeAnnotation() instanceof
DecimalLogicalTypeAnnotation d) {
+ return new short[] { (short) d.getPrecision(), (short) d.getScale() };
} else if (TypeInfoUtils.getBaseName(hiveType.getTypeName())
.equalsIgnoreCase(serdeConstants.DECIMAL_TYPE_NAME)) {
- decimalColumnVector.precision = (short) ((DecimalTypeInfo)
hiveType).getPrecision();
- decimalColumnVector.scale = (short) ((DecimalTypeInfo)
hiveType).getScale();
- } else {
- throw new UnsupportedOperationException(
- "The underlying Parquet type cannot be converted to Hive Decimal
type: " + type);
+ DecimalTypeInfo dti = (DecimalTypeInfo) hiveType;
+ return new short[] { (short) dti.getPrecision(), (short) dti.getScale()
};
+ }
+ throw new UnsupportedOperationException(
+ "The underlying Parquet type cannot be converted to Hive Decimal type:
" + type);
+ }
+
+ /**
+ * Decimal64 fast path: read the unscaled value straight into the
long-backed vector instead of
+ * materializing a HiveDecimal per row. Only reached for decimal columns the
vectorizer tagged
+ * DECIMAL_64 (precision <= 18); higher precision still uses {@link
#readDecimal}.
+ */
+ private void readDecimal64(int total, Decimal64ColumnVector c, int rowId) {
+ fillDecimal64PrecisionScale(c);
+ PrimitiveTypeName physical = type.asPrimitiveType().getPrimitiveTypeName();
+ int left = total;
+ while (left > 0) {
+ readRepetitionAndDefinitionLevels();
+ if (definitionLevel >= maxDefLevel) {
+ c.vector[rowId] = readUnscaledLong(physical, c.scale);
+ c.isNull[rowId] = false;
+ c.isRepeating = c.isRepeating && (c.vector[0] == c.vector[rowId]);
+ } else {
+ setNullValue(c, rowId);
+ }
+ rowId++;
+ left--;
}
}
+
+ // INT32/INT64-backed decimals already give the unscaled value via the
Parquet reader; for the
+ // (rare) byte-array-backed case reuse HiveDecimalWritable.serialize64 --
the same encoding
+ // Decimal64ColumnVector.set uses -- rather than decoding the bytes by hand.
+ private long readUnscaledLong(PrimitiveTypeName physical, short scale) {
+ return switch (physical) {
+ case INT32 -> dataColumn.readInteger();
+ case INT64 -> dataColumn.readLong();
+ default -> {
+ scratchDecimal.set(dataColumn.readDecimal(), scale);
+ yield scratchDecimal.serialize64(scale);
+ }
+ };
+ }
+
+ private long readUnscaledLongFromDict(PrimitiveTypeName physical, int
dictId, short scale) {
+ return switch (physical) {
+ case INT32 -> dictionary.readInteger(dictId);
+ case INT64 -> dictionary.readLong(dictId);
+ default -> {
+ scratchDecimal.set(dictionary.readDecimal(dictId), scale);
+ yield scratchDecimal.serialize64(scale);
+ }
+ };
+ }
Review Comment:
Decimal64 Parquet read path is incorrect for INT32/INT64 physical decimals:
ParquetDataColumnReader implementations for decimal columns (e.g.,
TypesFromInt32DecimalPageReader/TypesFromInt64DecimalPageReader) implement
readInteger/readLong as *cast-to-int/long* (validatedDouble), which returns 0
for scaled decimals like 10.01. Using readInteger/readLong here causes wrong
Decimal64 values (seen as 0.00 in Iceberg Parquet test outputs). Use
readDecimal()/dictionary.readDecimal() for Decimal64, then serialize64.
##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedPrimitiveColumnReader.java:
##########
@@ -651,14 +614,20 @@ private void decodeDictionaryIds(
}
break;
case DECIMAL:
- DecimalLogicalTypeAnnotation decimalLogicalType = null;
- if (type.getLogicalTypeAnnotation() instanceof
DecimalLogicalTypeAnnotation) {
- decimalLogicalType = (DecimalLogicalTypeAnnotation)
type.getLogicalTypeAnnotation();
+ if (column instanceof Decimal64ColumnVector dec64) {
+ fillDecimal64PrecisionScale(dec64);
+ PrimitiveTypeName dictPhysical =
type.asPrimitiveType().getPrimitiveTypeName();
+ for (int i = rowId; i < rowId + num; ++i) {
+ if (!column.isNull[i]) {
+ dec64.vector[i] = readUnscaledLongFromDict(dictPhysical, (int)
dictionaryIds.vector[i], dec64.scale);
+ }
Review Comment:
In dictionary decoding for Decimal64, invalid dictionary values are not
handled: if dictionary.isValid() becomes false, the vector entry is still
populated, which can surface as incorrect 0.00 values. This should mirror the
DecimalColumnVector path and null out invalid entries.
##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedPrimitiveColumnReader.java:
##########
@@ -686,27 +655,80 @@ private void decodeDictionaryIds(
}
}
+ private void fillDecimalPrecisionScale(DecimalColumnVector c) {
+ short[] ps = getDecimalPrecisionScale();
+ c.precision = ps[0];
+ c.scale = ps[1];
+ }
+
+ private void fillDecimal64PrecisionScale(Decimal64ColumnVector c) {
+ short[] ps = getDecimalPrecisionScale();
+ c.precision = ps[0];
+ c.scale = ps[1];
+ }
+
/**
- * The decimal precision and scale is filled into decimalColumnVector. If
the data in
- * Parquet is in decimal, the precision and scale will come in from
decimalLogicalType. If parquet
- * is not in decimal, then this call is made because HMS shows the type as
decimal. So, the
- * precision and scale are picked from hiveType.
- *
- * @param decimalLogicalType
- * @param decimalColumnVector
+ * Precision/scale for this decimal column: from the Parquet decimal logical
type when present,
+ * otherwise from the Hive type (Parquet stores it as a non-decimal physical
type but HMS reports
+ * decimal).
*/
- private void fillDecimalPrecisionScale(DecimalLogicalTypeAnnotation
decimalLogicalType,
- DecimalColumnVector decimalColumnVector) {
- if (decimalLogicalType != null) {
- decimalColumnVector.precision = (short)
decimalLogicalType.getPrecision();
- decimalColumnVector.scale = (short) decimalLogicalType.getScale();
+ private short[] getDecimalPrecisionScale() {
+ if (type.getLogicalTypeAnnotation() instanceof
DecimalLogicalTypeAnnotation d) {
+ return new short[] { (short) d.getPrecision(), (short) d.getScale() };
} else if (TypeInfoUtils.getBaseName(hiveType.getTypeName())
.equalsIgnoreCase(serdeConstants.DECIMAL_TYPE_NAME)) {
- decimalColumnVector.precision = (short) ((DecimalTypeInfo)
hiveType).getPrecision();
- decimalColumnVector.scale = (short) ((DecimalTypeInfo)
hiveType).getScale();
- } else {
- throw new UnsupportedOperationException(
- "The underlying Parquet type cannot be converted to Hive Decimal
type: " + type);
+ DecimalTypeInfo dti = (DecimalTypeInfo) hiveType;
+ return new short[] { (short) dti.getPrecision(), (short) dti.getScale()
};
+ }
+ throw new UnsupportedOperationException(
+ "The underlying Parquet type cannot be converted to Hive Decimal type:
" + type);
+ }
+
+ /**
+ * Decimal64 fast path: read the unscaled value straight into the
long-backed vector instead of
+ * materializing a HiveDecimal per row. Only reached for decimal columns the
vectorizer tagged
+ * DECIMAL_64 (precision <= 18); higher precision still uses {@link
#readDecimal}.
+ */
+ private void readDecimal64(int total, Decimal64ColumnVector c, int rowId) {
+ fillDecimal64PrecisionScale(c);
+ PrimitiveTypeName physical = type.asPrimitiveType().getPrimitiveTypeName();
+ int left = total;
+ while (left > 0) {
+ readRepetitionAndDefinitionLevels();
+ if (definitionLevel >= maxDefLevel) {
+ c.vector[rowId] = readUnscaledLong(physical, c.scale);
+ c.isNull[rowId] = false;
+ c.isRepeating = c.isRepeating && (c.vector[0] == c.vector[rowId]);
+ } else {
+ setNullValue(c, rowId);
+ }
+ rowId++;
+ left--;
}
Review Comment:
readDecimal64 writes a value even when the underlying
ParquetDataColumnReader marks the read as invalid (e.g., scale/precision
enforcement failure). This can silently materialize invalid decimals as 0.00
instead of null, diverging from the existing readDecimal behavior that nulls
invalid values.
##########
iceberg/iceberg-handler/src/test/results/positive/llap/vectorized_iceberg_read_mixed.q.out:
##########
@@ -663,7 +663,7 @@ POSTHOOK: type: QUERY
POSTHOOK: Input: default@tbl_ice_mixed_all_types
#### A masked pattern was here ####
1.1 1.2 false 4 567890123456789 6 col7 2012-10-03
19:58:08 1234-09-02 10.01
-5.1 6.2 true 40 567890123456780 8 col07 2012-10-03
19:58:09 1234-09-03 10.02
+5.1 6.2 true 40 567890123456780 8 col07 2012-10-03
19:58:09 1234-09-03 0.00
Review Comment:
This output row corresponds to the Parquet-written insert cast('10.02' as
decimal(4,2)) in vectorized_iceberg_read_mixed.q
(iceberg-handler/src/test/queries/positive/vectorized_iceberg_read_mixed.q:39),
but the expected result here is 0.00. That suggests the Parquet DECIMAL_64
vectorized read path is corrupting scaled decimals; the expected value should
be 10.02 after fixing the reader.
##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedPrimitiveColumnReader.java:
##########
@@ -686,27 +655,80 @@ private void decodeDictionaryIds(
}
}
+ private void fillDecimalPrecisionScale(DecimalColumnVector c) {
+ short[] ps = getDecimalPrecisionScale();
+ c.precision = ps[0];
+ c.scale = ps[1];
+ }
+
+ private void fillDecimal64PrecisionScale(Decimal64ColumnVector c) {
+ short[] ps = getDecimalPrecisionScale();
+ c.precision = ps[0];
+ c.scale = ps[1];
+ }
+
/**
- * The decimal precision and scale is filled into decimalColumnVector. If
the data in
- * Parquet is in decimal, the precision and scale will come in from
decimalLogicalType. If parquet
- * is not in decimal, then this call is made because HMS shows the type as
decimal. So, the
- * precision and scale are picked from hiveType.
- *
- * @param decimalLogicalType
- * @param decimalColumnVector
+ * Precision/scale for this decimal column: from the Parquet decimal logical
type when present,
+ * otherwise from the Hive type (Parquet stores it as a non-decimal physical
type but HMS reports
+ * decimal).
*/
- private void fillDecimalPrecisionScale(DecimalLogicalTypeAnnotation
decimalLogicalType,
- DecimalColumnVector decimalColumnVector) {
- if (decimalLogicalType != null) {
- decimalColumnVector.precision = (short)
decimalLogicalType.getPrecision();
- decimalColumnVector.scale = (short) decimalLogicalType.getScale();
+ private short[] getDecimalPrecisionScale() {
+ if (type.getLogicalTypeAnnotation() instanceof
DecimalLogicalTypeAnnotation d) {
+ return new short[] { (short) d.getPrecision(), (short) d.getScale() };
} else if (TypeInfoUtils.getBaseName(hiveType.getTypeName())
.equalsIgnoreCase(serdeConstants.DECIMAL_TYPE_NAME)) {
- decimalColumnVector.precision = (short) ((DecimalTypeInfo)
hiveType).getPrecision();
- decimalColumnVector.scale = (short) ((DecimalTypeInfo)
hiveType).getScale();
- } else {
- throw new UnsupportedOperationException(
- "The underlying Parquet type cannot be converted to Hive Decimal
type: " + type);
+ DecimalTypeInfo dti = (DecimalTypeInfo) hiveType;
+ return new short[] { (short) dti.getPrecision(), (short) dti.getScale()
};
+ }
+ throw new UnsupportedOperationException(
+ "The underlying Parquet type cannot be converted to Hive Decimal type:
" + type);
+ }
+
+ /**
+ * Decimal64 fast path: read the unscaled value straight into the
long-backed vector instead of
+ * materializing a HiveDecimal per row. Only reached for decimal columns the
vectorizer tagged
+ * DECIMAL_64 (precision <= 18); higher precision still uses {@link
#readDecimal}.
+ */
+ private void readDecimal64(int total, Decimal64ColumnVector c, int rowId) {
+ fillDecimal64PrecisionScale(c);
+ PrimitiveTypeName physical = type.asPrimitiveType().getPrimitiveTypeName();
Review Comment:
The new Decimal64 fast path is intended to handle DECIMAL logical types
backed by INT32/INT64 (common for Parquet decimals with precision <= 18), but
the added unit tests only exercise a BINARY-backed decimal schema ("required
binary ... (DECIMAL)"). Please add coverage that reads a DECIMAL logical type
stored in INT32 and/or INT64 while tagged DECIMAL_64 to prevent regressions
like the Iceberg Parquet 10.01 -> 0.00 issue.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]