[ https://issues.apache.org/jira/browse/HIVE-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698459#comment-16698459 ]
Gopal V commented on HIVE-20954:
--------------------------------

To recap the changes, here are the compatibility matrices for comparing two RS operators:

||RS_1||RS_2||Result||
|UNSET|UNSET|Dedup with UNSET|
|FIXED|FIXED|Dedup only if the number of reducers is the same|
|UNIFORM+AUTOPARALLEL|UNIFORM+AUTOPARALLEL|Dedup always (use the higher number of reducers)|

That's the easy case. Now for the combinations (and vice versa):

||RS_1||RS_2||Result||
|UNSET|FIXED|Dedup with FIXED|
|UNSET|UNIFORM|Dedup with UNIFORM|
|UNSET|UNIFORM+AUTOPARALLEL|Dedup with UNIFORM|
|UNIFORM|UNIFORM+AUTOPARALLEL|Dedup with UNIFORM|
|UNIFORM|AUTOPARALLEL|No dedup|
|UNIFORM|FIXED|No dedup|

[~teddy.choi]: the patch LGTM +1 - in several queries the shared work is kicking in properly (i.e., reducers are getting removed).

The cbo_limit.q failure seems to be test-diff flakiness. The others are failing with an odd NPE:

{code}
java.lang.NullPointerException
	at org.apache.hive.jdbc.BaseJdbcWithMiniLlap.tearDown(BaseJdbcWithMiniLlap.java:153)
{code}

Both failures look unrelated, but deserve their own follow-up bugs.

> Vector RS operator is not using uniform hash function for TPC-DS query 95
> -------------------------------------------------------------------------
>
>                 Key: HIVE-20954
>                 URL: https://issues.apache.org/jira/browse/HIVE-20954
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Teddy Choi
>            Assignee: Teddy Choi
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-20954.1.patch, HIVE-20954.2.patch
>
>
> Distribution of rows is skewed in DHJ, causing slowdown.
> Same RS outputs, but the two branches use VectorReduceSinkObjectHashOperator
> and VectorReduceSinkLongOperator.
> {code}
> | Select Operator |
> | expressions: ws_warehouse_sk (type: bigint), ws_order_number (type: bigint) |
> | outputColumnNames: _col0, _col1 |
> | Select Vectorization: |
> | className: VectorSelectOperator |
> | native: true |
> | projectedOutputColumnNums: [14, 16] |
> | Statistics: Num rows: 7199963324 Data size: 115185006696 Basic stats: COMPLETE Column stats: COMPLETE |
> | Reduce Output Operator |
> | key expressions: _col1 (type: bigint) |
> | sort order: + |
> | Map-reduce partition columns: _col1 (type: bigint) |
> | Reduce Sink Vectorization: |
> | className: VectorReduceSinkObjectHashOperator |
> | keyColumnNums: [16] |
> | native: true |
> | nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true, No PTF TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true |
> | partitionColumnNums: [16] |
> | valueColumnNums: [14] |
> +----------------------------------------------------+
> | Explain |
> +----------------------------------------------------+
> | Statistics: Num rows: 7199963324 Data size: 115185006696 Basic stats: COMPLETE Column stats: COMPLETE |
> | value expressions: _col0 (type: bigint) |
> | Reduce Output Operator |
> | key expressions: _col1 (type: bigint) |
> | sort order: + |
> | Map-reduce partition columns: _col1 (type: bigint) |
> | Reduce Sink Vectorization: |
> | className: VectorReduceSinkLongOperator |
> | keyColumnNums: [16] |
> | native: true |
> | nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, hive.execution.engine tez IN [tez, spark] IS true, No PTF TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true |
> | valueColumnNums: [14] |
> | Statistics: Num rows: 7199963324 Data size: 115185006696 Basic stats: COMPLETE Column stats: COMPLETE |
> | value expressions: _col0 (type: bigint) |
> | Execution mode: vectorized, llap |
> {code}
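
For readers who find the compatibility matrices in Gopal V's comment easier to follow as code, here is a minimal Java sketch of the same decision table. It is only an illustration, not the code in the patch: the class `RsDedupSketch`, the `Trait` enum, and the `merge` method are hypothetical names, and treating any pair the matrices do not list (for example UNIFORM with UNIFORM) as "no dedup" is a conservative assumption of this sketch.

```java
import java.util.Optional;
import java.util.Set;

public class RsDedupSketch {
    // Hypothetical trait labels, copied from the matrices in the comment.
    enum Trait { UNSET, FIXED, UNIFORM, AUTOPARALLEL }

    static final Set<Trait> UNSET = Set.of(Trait.UNSET);
    static final Set<Trait> FIXED = Set.of(Trait.FIXED);
    static final Set<Trait> UNIFORM = Set.of(Trait.UNIFORM);
    static final Set<Trait> AUTOPARALLEL = Set.of(Trait.AUTOPARALLEL);
    static final Set<Trait> UNIFORM_AUTO = Set.of(Trait.UNIFORM, Trait.AUTOPARALLEL);

    // True if {a, b} is the pair {x, y} in either order ("and vice versa").
    private static boolean isPair(Set<Trait> a, Set<Trait> b, Set<Trait> x, Set<Trait> y) {
        return (a.equals(x) && b.equals(y)) || (a.equals(y) && b.equals(x));
    }

    /**
     * Returns the traits of the deduplicated RS, or empty when the matrices
     * say "No dedup". Pairs the matrices do not list also return empty,
     * which is an assumption of this sketch.
     */
    static Optional<Set<Trait>> merge(Set<Trait> a, int reducersA, Set<Trait> b, int reducersB) {
        if (a.equals(b)) {
            if (a.equals(UNSET)) {
                return Optional.of(UNSET);
            }
            if (a.equals(FIXED)) {
                // FIXED + FIXED: dedup only if the reducer counts match.
                return reducersA == reducersB ? Optional.of(FIXED) : Optional.empty();
            }
            if (a.equals(UNIFORM_AUTO)) {
                // Dedup always; per the comment the merged RS would use
                // the higher of the two reducer counts.
                return Optional.of(UNIFORM_AUTO);
            }
            return Optional.empty(); // not listed in the matrices
        }
        if (isPair(a, b, UNSET, FIXED)) return Optional.of(FIXED);
        if (isPair(a, b, UNSET, UNIFORM)) return Optional.of(UNIFORM);
        if (isPair(a, b, UNSET, UNIFORM_AUTO)) return Optional.of(UNIFORM);
        if (isPair(a, b, UNIFORM, UNIFORM_AUTO)) return Optional.of(UNIFORM);
        // UNIFORM + AUTOPARALLEL, UNIFORM + FIXED, and anything unlisted.
        return Optional.empty();
    }
}
```

For example, `merge(UNSET, 4, FIXED, 8)` yields the FIXED trait set, while `merge(FIXED, 4, FIXED, 8)` yields an empty result because the reducer counts differ.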