zabetak commented on code in PR #6229:
URL: https://github.com/apache/hive/pull/6229#discussion_r2610245330
##########
ql/src/test/results/clientpositive/llap/ctas_dec_pre_scale_issue.q.out:
##########
@@ -0,0 +1,224 @@
+PREHOOK: query: CREATE TABLE table_a (col_dec_a decimal(12,7))
+PREHOOK: type: CREATETABLE
+PREHOOK: Output: database:default
+PREHOOK: Output: default@table_a
+POSTHOOK: query: CREATE TABLE table_a (col_dec_a decimal(12,7))
+POSTHOOK: type: CREATETABLE
+POSTHOOK: Output: database:default
+POSTHOOK: Output: default@table_a
+PREHOOK: query: CREATE TABLE table_b(col_dec_b decimal(15,5))
+PREHOOK: type: CREATETABLE
+PREHOOK: Output: database:default
+PREHOOK: Output: default@table_b
+POSTHOOK: query: CREATE TABLE table_b(col_dec_b decimal(15,5))
+POSTHOOK: type: CREATETABLE
+POSTHOOK: Output: database:default
+POSTHOOK: Output: default@table_b
+PREHOOK: query: INSERT INTO table_a VALUES (12345.6789101)
+PREHOOK: type: QUERY
+PREHOOK: Input: _dummy_database@_dummy_table
+PREHOOK: Output: default@table_a
+POSTHOOK: query: INSERT INTO table_a VALUES (12345.6789101)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: _dummy_database@_dummy_table
+POSTHOOK: Output: default@table_a
+POSTHOOK: Lineage: table_a.col_dec_a SCRIPT []
+PREHOOK: query: INSERT INTO table_b VALUES (1234567891.01112)
+PREHOOK: type: QUERY
+PREHOOK: Input: _dummy_database@_dummy_table
+PREHOOK: Output: default@table_b
+POSTHOOK: query: INSERT INTO table_b VALUES (1234567891.01112)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: _dummy_database@_dummy_table
+POSTHOOK: Output: default@table_b
+POSTHOOK: Lineage: table_b.col_dec_b SCRIPT []
+PREHOOK: query: explain create table target as
+select table_a.col_dec_a target_col
+from table_a
+left outer join table_b on
+table_a.col_dec_a = table_b.col_dec_b
+PREHOOK: type: CREATETABLE_AS_SELECT
+PREHOOK: Input: default@table_a
+PREHOOK: Input: default@table_b
+PREHOOK: Output: database:default
+PREHOOK: Output: default@target
+POSTHOOK: query: explain create table target as
+select table_a.col_dec_a target_col
+from table_a
+left outer join table_b on
+table_a.col_dec_a = table_b.col_dec_b
+POSTHOOK: type: CREATETABLE_AS_SELECT
+POSTHOOK: Input: default@table_a
+POSTHOOK: Input: default@table_b
+POSTHOOK: Output: database:default
+POSTHOOK: Output: default@target
+STAGE DEPENDENCIES:
+ Stage-1 is a root stage
+ Stage-2 depends on stages: Stage-1
+ Stage-4 depends on stages: Stage-0, Stage-2
+ Stage-3 depends on stages: Stage-4
+ Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+ Stage: Stage-1
+ Tez
+#### A masked pattern was here ####
+ Edges:
+ Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE)
+ Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)
+#### A masked pattern was here ####
+ Vertices:
+ Map 1
+ Map Operator Tree:
+ TableScan
+ alias: table_a
+                    Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE
+ Select Operator
+ expressions: col_dec_a (type: decimal(12,7))
+ outputColumnNames: _col0
+                      Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE
+ Reduce Output Operator
+ key expressions: _col0 (type: decimal(17,7))
+ null sort order: z
+ sort order: +
+ Map-reduce partition columns: _col0 (type: decimal(17,7))
+                        Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE
+ Execution mode: vectorized, llap
+ LLAP IO: all inputs
+ Map 4
+ Map Operator Tree:
+ TableScan
+ alias: table_b
+ filterExpr: col_dec_b is not null (type: boolean)
+                    Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE
+ Filter Operator
+ predicate: col_dec_b is not null (type: boolean)
+                      Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE
+ Select Operator
+ expressions: col_dec_b (type: decimal(15,5))
+ outputColumnNames: _col0
+                        Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE
+ Reduce Output Operator
+ key expressions: _col0 (type: decimal(17,7))
+ null sort order: z
+ sort order: +
+                          Map-reduce partition columns: _col0 (type: decimal(17,7))
+                          Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE
+ Execution mode: vectorized, llap
+ LLAP IO: all inputs
+ Reducer 2
+ Execution mode: llap
+ Reduce Operator Tree:
+ Merge Join Operator
+ condition map:
+ Left Outer Join 0 to 1
+ keys:
+ 0 _col0 (type: decimal(17,7))
+ 1 _col0 (type: decimal(17,7))
+ outputColumnNames: _col0
+                Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE
+ File Output Operator
+ compressed: false
+                  Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE
+ table:
+                      input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
+                      output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
+                      serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
+ name: default.target
+ Select Operator
+ expressions: _col0 (type: decimal(12,7))
+ outputColumnNames: col1
+                  Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE
Review Comment:
This `Select Operator` is interesting. Although it is not involved in writing to the table, it appears to carry the correct data type for the column resulting from the join, so if we had something similar before calling the `File Output Operator` we wouldn't hit the problem in the first place.
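For context, the widened join-key type `decimal(17,7)` that appears in the `Reduce Output Operator`s above, alongside the original `decimal(12,7)` of the selected column, follows from combining `decimal(12,7)` and `decimal(15,5)`. A minimal sketch of the usual common-decimal-type rule (max integer digits plus max scale); the class and method names here are hypothetical, not Hive's API:

```java
// Sketch of the common-decimal-type rule used when comparing two decimals:
//   scale      = max(s1, s2)
//   precision  = max(p1 - s1, p2 - s2) + max(s1, s2)
public class DecimalWiden {
    // Returns {precision, scale} of the common type of decimal(p1,s1) and decimal(p2,s2).
    static int[] commonDecimal(int p1, int s1, int p2, int s2) {
        int scale = Math.max(s1, s2);
        int intDigits = Math.max(p1 - s1, p2 - s2);
        return new int[] { intDigits + scale, scale };
    }

    public static void main(String[] args) {
        // table_a.col_dec_a decimal(12,7) joined with table_b.col_dec_b decimal(15,5)
        int[] t = commonDecimal(12, 7, 15, 5);
        System.out.println("decimal(" + t[0] + "," + t[1] + ")"); // decimal(17,7)
    }
}
```

This matches the `decimal(17,7)` key expressions in the plan, while the selected column itself keeps its declared `decimal(12,7)` type.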
##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java:
##########
@@ -143,7 +147,14 @@ public Writable serialize(final Object obj, final ObjectInspector objInspector)
}
parquetRow.value = obj;
- parquetRow.inspector= (StructObjectInspector)objInspector;
+    // The 'objInspector' coming from Operator may have different type infos than table column type infos which will lead to the issues like HIVE-26877
+    // so comparing the object inspector created during initialize phase of this SerDe class and the object inspector coming from Operator
+    // if they are different then using the object inspector created during initialize phase which is proper
+    if (!ObjectInspectorUtils.compareTypes(writableObjectInspector, objInspector)) {
+      parquetRow.inspector = (StructObjectInspector) writableObjectInspector;
+    } else {
+      parquetRow.inspector = (StructObjectInspector) objInspector;
+    }
Review Comment:
The `if` statement here is a bit strange because it says: whenever there is a type disagreement I will use the original inspector, and when the types are equal I will trust what is given to me.
The fact that all tests pass implies that most of the time (if not always) in the existing tests the types are equal, so we are essentially hitting the else branch.
Type inequality seems to be an outlier, and `ctas_dec_pre_scale_issue.q` may be the only test that covers it. Does the proposed solution still work if you add more column types to the table/queries?
```sql
CREATE TABLE table_a (cint int, cstring string, cdec decimal(12,7));
INSERT INTO table_a VALUES (100, 'Bob', 12345.6789101);
...
CREATE TABLE target AS
SELECT ta.cint, ta.cstring, ta.cdec
FROM table_a ta
...
```
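To illustrate the point, the branch can be reduced to the following stand-in sketch (plain strings in place of Hive `ObjectInspector`s; `InspectorChoice` and `pickInspector` are hypothetical names, not the PR's code): when the types agree the two branches pick equivalent type information anyway, so the `if` only matters in the mismatch case, where it always falls back to the inspector fixed at `initialize()` time.

```java
// Stand-in model of the reviewed `if`: trust the operator-provided type
// description only when it matches the one established at SerDe
// initialization; on any disagreement, fall back to the initialization-time one.
public class InspectorChoice {
    static String pickInspector(String initTypes, String operatorTypes) {
        // mismatch -> use the table-schema types from initialize()
        return initTypes.equals(operatorTypes) ? operatorTypes : initTypes;
    }

    public static void main(String[] args) {
        // matching case: the operator inspector is trusted
        System.out.println(pickInspector("struct<c:decimal(12,7)>", "struct<c:decimal(12,7)>"));
        // HIVE-26877-style case: operator reports a widened decimal(17,7)
        System.out.println(pickInspector("struct<c:decimal(12,7)>", "struct<c:decimal(17,7)>"));
    }
}
```

Both calls return `struct<c:decimal(12,7)>`, which shows why, type-wise, the result is the same as always using the initialization-time inspector.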
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]