Re: [PR] HIVE-29649: Iceberg: Support Parquet DECIMAL_64 vectorization [hive]

via GitHub Fri, 05 Jun 2026 12:55:26 -0700


Copilot commented on code in PR #6527:
URL: https://github.com/apache/hive/pull/6527#discussion_r3365024680



##########
iceberg/iceberg-handler/src/test/results/positive/llap/vectorized_iceberg_read_parquet.q.out:
##########
@@ -429,7 +429,7 @@ POSTHOOK: query: select max(t_float), t_double, t_boolean, 
t_int, t_bigint, t_bi
 POSTHOOK: type: QUERY
 POSTHOOK: Input: default@tbl_ice_parquet_all_types
 #### A masked pattern was here ####
-1.1    1.2     false   4       567890123456789 6       col7    2012-10-03 
19:58:08     1234-09-02      10.01
+1.1    1.2     false   4       567890123456789 6       col7    2012-10-03 
19:58:08     1234-09-02      0.00

Review Comment:
   This test output expects the Parquet-written decimal value to be 0.00, but 
the query inserts cast('10.01' as decimal(4,2)) into tbl_ice_parquet_all_types 
(see 
iceberg-handler/src/test/queries/positive/vectorized_iceberg_read_parquet.q:27).
 The 0.00 here indicates a regression in Parquet DECIMAL_64 reading rather than 
a legitimate output change; the expected output should remain 10.01 once the 
reader is fixed.



##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedPrimitiveColumnReader.java:
##########
@@ -686,27 +655,80 @@ private void decodeDictionaryIds(
     }
   }
 
+  private void fillDecimalPrecisionScale(DecimalColumnVector c) {
+    short[] ps = getDecimalPrecisionScale();
+    c.precision = ps[0];
+    c.scale = ps[1];
+  }
+
+  private void fillDecimal64PrecisionScale(Decimal64ColumnVector c) {
+    short[] ps = getDecimalPrecisionScale();
+    c.precision = ps[0];
+    c.scale = ps[1];
+  }
+
   /**
-   * The decimal precision and scale is filled into decimalColumnVector.  If 
the data in
-   * Parquet is in decimal, the precision and scale will come in from 
decimalLogicalType.  If parquet
-   * is not in decimal, then this call is made because HMS shows the type as 
decimal.  So, the
-   * precision and scale are picked from hiveType.
-   *
-   * @param decimalLogicalType
-   * @param decimalColumnVector
+   * Precision/scale for this decimal column: from the Parquet decimal logical 
type when present,
+   * otherwise from the Hive type (Parquet stores it as a non-decimal physical 
type but HMS reports
+   * decimal).
    */
-  private void fillDecimalPrecisionScale(DecimalLogicalTypeAnnotation 
decimalLogicalType,
-      DecimalColumnVector decimalColumnVector) {
-    if (decimalLogicalType != null) {
-      decimalColumnVector.precision = (short) 
decimalLogicalType.getPrecision();
-      decimalColumnVector.scale = (short) decimalLogicalType.getScale();
+  private short[] getDecimalPrecisionScale() {
+    if (type.getLogicalTypeAnnotation() instanceof 
DecimalLogicalTypeAnnotation d) {
+      return new short[] { (short) d.getPrecision(), (short) d.getScale() };
     } else if (TypeInfoUtils.getBaseName(hiveType.getTypeName())
         .equalsIgnoreCase(serdeConstants.DECIMAL_TYPE_NAME)) {
-      decimalColumnVector.precision = (short) ((DecimalTypeInfo) 
hiveType).getPrecision();
-      decimalColumnVector.scale = (short) ((DecimalTypeInfo) 
hiveType).getScale();
-    } else {
-      throw new UnsupportedOperationException(
-          "The underlying Parquet type cannot be converted to Hive Decimal 
type: " + type);
+      DecimalTypeInfo dti = (DecimalTypeInfo) hiveType;
+      return new short[] { (short) dti.getPrecision(), (short) dti.getScale() 
};
+    }
+    throw new UnsupportedOperationException(
+        "The underlying Parquet type cannot be converted to Hive Decimal type: 
" + type);
+  }
+
+  /**
+   * Decimal64 fast path: read the unscaled value straight into the 
long-backed vector instead of
+   * materializing a HiveDecimal per row. Only reached for decimal columns the 
vectorizer tagged
+   * DECIMAL_64 (precision <= 18); higher precision still uses {@link 
#readDecimal}.
+   */
+  private void readDecimal64(int total, Decimal64ColumnVector c, int rowId) {
+    fillDecimal64PrecisionScale(c);
+    PrimitiveTypeName physical = type.asPrimitiveType().getPrimitiveTypeName();
+    int left = total;
+    while (left > 0) {
+      readRepetitionAndDefinitionLevels();
+      if (definitionLevel >= maxDefLevel) {
+        c.vector[rowId] = readUnscaledLong(physical, c.scale);
+        c.isNull[rowId] = false;
+        c.isRepeating = c.isRepeating && (c.vector[0] == c.vector[rowId]);
+      } else {
+        setNullValue(c, rowId);
+      }
+      rowId++;
+      left--;
     }
   }
+
+  // INT32/INT64-backed decimals already give the unscaled value via the 
Parquet reader; for the
+  // (rare) byte-array-backed case reuse HiveDecimalWritable.serialize64 -- 
the same encoding
+  // Decimal64ColumnVector.set uses -- rather than decoding the bytes by hand.
+  private long readUnscaledLong(PrimitiveTypeName physical, short scale) {
+    return switch (physical) {
+      case INT32 -> dataColumn.readInteger();
+      case INT64 -> dataColumn.readLong();
+      default -> {
+        scratchDecimal.set(dataColumn.readDecimal(), scale);
+        yield scratchDecimal.serialize64(scale);
+      }
+    };
+  }
+
+  private long readUnscaledLongFromDict(PrimitiveTypeName physical, int 
dictId, short scale) {
+    return switch (physical) {
+      case INT32 -> dictionary.readInteger(dictId);
+      case INT64 -> dictionary.readLong(dictId);
+      default -> {
+        scratchDecimal.set(dictionary.readDecimal(dictId), scale);
+        yield scratchDecimal.serialize64(scale);
+      }
+    };
+  }

Review Comment:
   Decimal64 Parquet read path is incorrect for INT32/INT64 physical decimals: 
ParquetDataColumnReader implementations for decimal columns (e.g., 
TypesFromInt32DecimalPageReader/TypesFromInt64DecimalPageReader) implement 
readInteger/readLong as *cast-to-int/long* (validatedDouble), which returns 0 
for scaled decimals like 10.01. Using readInteger/readLong here causes wrong 
Decimal64 values (seen as 0.00 in Iceberg Parquet test outputs). Use 
readDecimal()/dictionary.readDecimal() for Decimal64, then serialize64.



##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedPrimitiveColumnReader.java:
##########
@@ -651,14 +614,20 @@ private void decodeDictionaryIds(
       }
       break;
     case DECIMAL:
-      DecimalLogicalTypeAnnotation decimalLogicalType = null;
-      if (type.getLogicalTypeAnnotation() instanceof 
DecimalLogicalTypeAnnotation) {
-        decimalLogicalType = (DecimalLogicalTypeAnnotation) 
type.getLogicalTypeAnnotation();
+      if (column instanceof Decimal64ColumnVector dec64) {
+        fillDecimal64PrecisionScale(dec64);
+        PrimitiveTypeName dictPhysical = 
type.asPrimitiveType().getPrimitiveTypeName();
+        for (int i = rowId; i < rowId + num; ++i) {
+          if (!column.isNull[i]) {
+            dec64.vector[i] = readUnscaledLongFromDict(dictPhysical, (int) 
dictionaryIds.vector[i], dec64.scale);
+          }

Review Comment:
   In dictionary decoding for Decimal64, invalid dictionary values are not 
handled: if dictionary.isValid() becomes false, the vector entry is still 
populated, which can surface as incorrect 0.00 values. This should mirror the 
DecimalColumnVector path and null out invalid entries.



##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedPrimitiveColumnReader.java:
##########
@@ -686,27 +655,80 @@ private void decodeDictionaryIds(
     }
   }
 
+  private void fillDecimalPrecisionScale(DecimalColumnVector c) {
+    short[] ps = getDecimalPrecisionScale();
+    c.precision = ps[0];
+    c.scale = ps[1];
+  }
+
+  private void fillDecimal64PrecisionScale(Decimal64ColumnVector c) {
+    short[] ps = getDecimalPrecisionScale();
+    c.precision = ps[0];
+    c.scale = ps[1];
+  }
+
   /**
-   * The decimal precision and scale is filled into decimalColumnVector.  If 
the data in
-   * Parquet is in decimal, the precision and scale will come in from 
decimalLogicalType.  If parquet
-   * is not in decimal, then this call is made because HMS shows the type as 
decimal.  So, the
-   * precision and scale are picked from hiveType.
-   *
-   * @param decimalLogicalType
-   * @param decimalColumnVector
+   * Precision/scale for this decimal column: from the Parquet decimal logical 
type when present,
+   * otherwise from the Hive type (Parquet stores it as a non-decimal physical 
type but HMS reports
+   * decimal).
    */
-  private void fillDecimalPrecisionScale(DecimalLogicalTypeAnnotation 
decimalLogicalType,
-      DecimalColumnVector decimalColumnVector) {
-    if (decimalLogicalType != null) {
-      decimalColumnVector.precision = (short) 
decimalLogicalType.getPrecision();
-      decimalColumnVector.scale = (short) decimalLogicalType.getScale();
+  private short[] getDecimalPrecisionScale() {
+    if (type.getLogicalTypeAnnotation() instanceof 
DecimalLogicalTypeAnnotation d) {
+      return new short[] { (short) d.getPrecision(), (short) d.getScale() };
     } else if (TypeInfoUtils.getBaseName(hiveType.getTypeName())
         .equalsIgnoreCase(serdeConstants.DECIMAL_TYPE_NAME)) {
-      decimalColumnVector.precision = (short) ((DecimalTypeInfo) 
hiveType).getPrecision();
-      decimalColumnVector.scale = (short) ((DecimalTypeInfo) 
hiveType).getScale();
-    } else {
-      throw new UnsupportedOperationException(
-          "The underlying Parquet type cannot be converted to Hive Decimal 
type: " + type);
+      DecimalTypeInfo dti = (DecimalTypeInfo) hiveType;
+      return new short[] { (short) dti.getPrecision(), (short) dti.getScale() 
};
+    }
+    throw new UnsupportedOperationException(
+        "The underlying Parquet type cannot be converted to Hive Decimal type: 
" + type);
+  }
+
+  /**
+   * Decimal64 fast path: read the unscaled value straight into the 
long-backed vector instead of
+   * materializing a HiveDecimal per row. Only reached for decimal columns the 
vectorizer tagged
+   * DECIMAL_64 (precision <= 18); higher precision still uses {@link 
#readDecimal}.
+   */
+  private void readDecimal64(int total, Decimal64ColumnVector c, int rowId) {
+    fillDecimal64PrecisionScale(c);
+    PrimitiveTypeName physical = type.asPrimitiveType().getPrimitiveTypeName();
+    int left = total;
+    while (left > 0) {
+      readRepetitionAndDefinitionLevels();
+      if (definitionLevel >= maxDefLevel) {
+        c.vector[rowId] = readUnscaledLong(physical, c.scale);
+        c.isNull[rowId] = false;
+        c.isRepeating = c.isRepeating && (c.vector[0] == c.vector[rowId]);
+      } else {
+        setNullValue(c, rowId);
+      }
+      rowId++;
+      left--;
     }

Review Comment:
   readDecimal64 writes a value even when the underlying 
ParquetDataColumnReader marks the read as invalid (e.g., scale/precision 
enforcement failure). This can silently materialize invalid decimals as 0.00 
instead of null, diverging from the existing readDecimal behavior that nulls 
invalid values.



##########
iceberg/iceberg-handler/src/test/results/positive/llap/vectorized_iceberg_read_mixed.q.out:
##########
@@ -663,7 +663,7 @@ POSTHOOK: type: QUERY
 POSTHOOK: Input: default@tbl_ice_mixed_all_types
 #### A masked pattern was here ####
 1.1    1.2     false   4       567890123456789 6       col7    2012-10-03 
19:58:08     1234-09-02      10.01
-5.1    6.2     true    40      567890123456780 8       col07   2012-10-03 
19:58:09     1234-09-03      10.02
+5.1    6.2     true    40      567890123456780 8       col07   2012-10-03 
19:58:09     1234-09-03      0.00

Review Comment:
   This output row corresponds to the Parquet-written insert cast('10.02' as 
decimal(4,2)) in vectorized_iceberg_read_mixed.q 
(iceberg-handler/src/test/queries/positive/vectorized_iceberg_read_mixed.q:39), 
but the expected result here is 0.00. That suggests the Parquet DECIMAL_64 
vectorized read path is corrupting scaled decimals; the expected value should 
be 10.02 after fixing the reader.



##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedPrimitiveColumnReader.java:
##########
@@ -686,27 +655,80 @@ private void decodeDictionaryIds(
     }
   }
 
+  private void fillDecimalPrecisionScale(DecimalColumnVector c) {
+    short[] ps = getDecimalPrecisionScale();
+    c.precision = ps[0];
+    c.scale = ps[1];
+  }
+
+  private void fillDecimal64PrecisionScale(Decimal64ColumnVector c) {
+    short[] ps = getDecimalPrecisionScale();
+    c.precision = ps[0];
+    c.scale = ps[1];
+  }
+
   /**
-   * The decimal precision and scale is filled into decimalColumnVector.  If 
the data in
-   * Parquet is in decimal, the precision and scale will come in from 
decimalLogicalType.  If parquet
-   * is not in decimal, then this call is made because HMS shows the type as 
decimal.  So, the
-   * precision and scale are picked from hiveType.
-   *
-   * @param decimalLogicalType
-   * @param decimalColumnVector
+   * Precision/scale for this decimal column: from the Parquet decimal logical 
type when present,
+   * otherwise from the Hive type (Parquet stores it as a non-decimal physical 
type but HMS reports
+   * decimal).
    */
-  private void fillDecimalPrecisionScale(DecimalLogicalTypeAnnotation 
decimalLogicalType,
-      DecimalColumnVector decimalColumnVector) {
-    if (decimalLogicalType != null) {
-      decimalColumnVector.precision = (short) 
decimalLogicalType.getPrecision();
-      decimalColumnVector.scale = (short) decimalLogicalType.getScale();
+  private short[] getDecimalPrecisionScale() {
+    if (type.getLogicalTypeAnnotation() instanceof 
DecimalLogicalTypeAnnotation d) {
+      return new short[] { (short) d.getPrecision(), (short) d.getScale() };
     } else if (TypeInfoUtils.getBaseName(hiveType.getTypeName())
         .equalsIgnoreCase(serdeConstants.DECIMAL_TYPE_NAME)) {
-      decimalColumnVector.precision = (short) ((DecimalTypeInfo) 
hiveType).getPrecision();
-      decimalColumnVector.scale = (short) ((DecimalTypeInfo) 
hiveType).getScale();
-    } else {
-      throw new UnsupportedOperationException(
-          "The underlying Parquet type cannot be converted to Hive Decimal 
type: " + type);
+      DecimalTypeInfo dti = (DecimalTypeInfo) hiveType;
+      return new short[] { (short) dti.getPrecision(), (short) dti.getScale() 
};
+    }
+    throw new UnsupportedOperationException(
+        "The underlying Parquet type cannot be converted to Hive Decimal type: 
" + type);
+  }
+
+  /**
+   * Decimal64 fast path: read the unscaled value straight into the 
long-backed vector instead of
+   * materializing a HiveDecimal per row. Only reached for decimal columns the 
vectorizer tagged
+   * DECIMAL_64 (precision <= 18); higher precision still uses {@link 
#readDecimal}.
+   */
+  private void readDecimal64(int total, Decimal64ColumnVector c, int rowId) {
+    fillDecimal64PrecisionScale(c);
+    PrimitiveTypeName physical = type.asPrimitiveType().getPrimitiveTypeName();

Review Comment:
   The new Decimal64 fast path is intended to handle DECIMAL logical types 
backed by INT32/INT64 (common for Parquet decimals with precision <= 18), but 
the added unit tests only exercise a BINARY-backed decimal schema ("required 
binary ... (DECIMAL)"). Please add coverage that reads a DECIMAL logical type 
stored in INT32 and/or INT64 while tagged DECIMAL_64 to prevent regressions 
like the Iceberg Parquet 10.01 -> 0.00 issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HIVE-29649: Iceberg: Support Parquet DECIMAL_64 vectorization [hive]

Reply via email to