[jira] [Updated] (HIVE-7544) Changes related to TEZ-1288 (FastTezSerialization)
[ https://issues.apache.org/jira/browse/HIVE-7544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-7544:
-----------------------------------
    Attachment: HIVE-7544.tez-branch.2.patch

Uploading the rebased patch for the tez branch.

> Changes related to TEZ-1288 (FastTezSerialization)
> --------------------------------------------------
>
>                 Key: HIVE-7544
>                 URL: https://issues.apache.org/jira/browse/HIVE-7544
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Tez
>    Affects Versions: 0.14.0
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: HIVE-7544.1.patch, HIVE-7544.tez-branch.2.patch
>
> Add ability to make use of TezBytesWritableSerialization.
> NO PRECOMMIT TESTS

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Resolved] (HIVE-7910) Enhance natural order scheduler to prevent downstream vertex from monopolizing the cluster resources
[ https://issues.apache.org/jira/browse/HIVE-7910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan resolved HIVE-7910.
------------------------------------
    Resolution: Won't Fix

Apologies, this was meant for the Tez project. Closing this bug.

> Enhance natural order scheduler to prevent downstream vertex from
> monopolizing the cluster resources
> -----------------------------------------------------------------
>
>                 Key: HIVE-7910
>                 URL: https://issues.apache.org/jira/browse/HIVE-7910
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>              Labels: performance
>
> M2 M7
>  \ /
> (sg) \/
> R3 / (b)
>  \ /
> (b) \ /
>  \ /
>  M5
>  |
>  R6
>
> Please refer to the attachment (task runtime SVG). In this case, M5 got
> scheduled much earlier than R3 (R3 is shown in green in the diagram) and
> retained lots of containers. R3 got fewer containers to work with.
> Attaching the output from the status monitor when the job ran; Map_5 has
> taken up almost all containers, whereas Reducer_3 got a fraction of the
> capacity.
> Map_2: 1/1  Map_5: 0(+373)/1000  Map_7: 1/1  Reducer_3: 0/8000        Reducer_6: 0/1
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 0/8000        Reducer_6: 0/1
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 0(+1)/8000    Reducer_6: 0/1
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 14(+7)/8000   Reducer_6: 0/1
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 63(+14)/8000  Reducer_6: 0/1
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 159(+22)/8000 Reducer_6: 0/1
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 308(+29)/8000 Reducer_6: 0/1
> ...
> Creating this JIRA as a placeholder for scheduler enhancement. One
> possibility could be to schedule fewer tasks in downstream vertices,
> based on the information available for the upstream vertex.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Created] (HIVE-7910) Enhance natural order scheduler to prevent downstream vertex from monopolizing the cluster resources
Rajesh Balamohan created HIVE-7910:
--------------------------------------

             Summary: Enhance natural order scheduler to prevent downstream vertex from monopolizing the cluster resources
                 Key: HIVE-7910
                 URL: https://issues.apache.org/jira/browse/HIVE-7910
             Project: Hive
          Issue Type: Bug
            Reporter: Rajesh Balamohan

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Created] (HIVE-8071) hive shell tries to write hive-exec.jar for each run
Rajesh Balamohan created HIVE-8071:
--------------------------------------

             Summary: hive shell tries to write hive-exec.jar for each run
                 Key: HIVE-8071
                 URL: https://issues.apache.org/jira/browse/HIVE-8071
             Project: Hive
          Issue Type: Bug
            Reporter: Rajesh Balamohan
            Assignee: Rajesh Balamohan

For every run of the hive CLI there is a delay during shell startup:

14/07/31 23:07:19 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
14/07/31 23:07:19 INFO tez.DagUtils: Hive jar directory is hdfs://mac-10:8020/user/gopal/apps/2014-Jul-31/hive/
14/07/31 23:07:19 INFO tez.DagUtils: Localizing resource because it does not exist: file:/home/gopal/tez-autobuild/dist/hive/lib/hive-exec-0.14.0-SNAPSHOT.jar to dest: hdfs://mac-10:8020/user/gopal/apps/2014-Jul-31/hive/hive-exec-0.14.0-SNAPSHOTde1f82f0b5561d3db9e3080dfb2897210a3bda4ca5e7b14e881e381115837fd8.jar
14/07/31 23:07:19 INFO tez.DagUtils: Looks like another thread is writing the same file will wait.
14/07/31 23:07:19 INFO tez.DagUtils: Number of wait attempts: 5. Wait interval: 5000
14/07/31 23:07:19 INFO tez.DagUtils: Resource modification time: 1406870512963
14/07/31 23:07:20 INFO tez.TezSessionState: Opening new Tez Session (id: 02d6b558-44cc-4182-b2f2-6a37ffdd25d2, scratch dir: hdfs://mac-10:8020/tmp/hive-gopal/_tez_session_dir/02d6b558-44cc-4182-b2f2-6a37ffdd25d2)

Traced this to a method which does PRIVATE LRs - this is marked as PRIVATE even if it is from a common install dir.

{code}
public LocalResource localizeResource(Path src, Path dest, Configuration conf)
    throws IOException {
  FileSystem destFS = dest.getFileSystem(conf);
  return createLocalResource(destFS, dest, LocalResourceType.FILE,
      LocalResourceVisibility.PRIVATE);
}
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
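A minimal plain-JDK sketch of the idea behind the issue above, not Hive's actual fix: derive the resource visibility from whether the jar comes from a shared install directory, instead of hard-coding PRIVATE. The `Visibility` enum and `chooseVisibility` helper are hypothetical stand-ins for YARN's `LocalResourceVisibility` and the selection logic.

```java
import java.util.Set;

public class ResourceVisibility {
    // Hypothetical stand-in for YARN's LocalResourceVisibility enum.
    enum Visibility { PUBLIC, PRIVATE }

    /**
     * Pick PUBLIC visibility when the jar comes from a shared install
     * directory, so YARN can localize it once per node instead of once
     * per user; fall back to PRIVATE otherwise.
     */
    static Visibility chooseVisibility(String jarPath, Set<String> sharedDirs) {
        for (String dir : sharedDirs) {
            if (jarPath.startsWith(dir)) {
                return Visibility.PUBLIC;
            }
        }
        return Visibility.PRIVATE;
    }

    public static void main(String[] args) {
        Set<String> shared = Set.of("/apps/hive/install/");
        System.out.println(chooseVisibility("/apps/hive/install/hive-exec.jar", shared)); // PUBLIC
        System.out.println(chooseVisibility("/home/gopal/hive-exec.jar", shared));        // PRIVATE
    }
}
```

With PUBLIC visibility the localized jar can be shared across users and sessions, avoiding the repeated upload and the "another thread is writing the same file" wait seen in the log.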
[jira] [Updated] (HIVE-8071) hive shell tries to write hive-exec.jar for each run
[ https://issues.apache.org/jira/browse/HIVE-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-8071:
-----------------------------------
    Attachment: HIVE-8071.1.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HIVE-8071) hive shell tries to write hive-exec.jar for each run
[ https://issues.apache.org/jira/browse/HIVE-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-8071:
-----------------------------------
    Status: Patch Available  (was: Open)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (HIVE-8158) Optimize writeValue/setValue in VectorExpressionWriterFactory (in VectorReduceSinkOperator codepath)
Rajesh Balamohan created HIVE-8158:
--------------------------------------

             Summary: Optimize writeValue/setValue in VectorExpressionWriterFactory (in VectorReduceSinkOperator codepath)
                 Key: HIVE-8158
                 URL: https://issues.apache.org/jira/browse/HIVE-8158
             Project: Hive
          Issue Type: Bug
            Reporter: Rajesh Balamohan
            Assignee: Rajesh Balamohan

VectorReduceSinkOperator --> processOp --> makeValueWritable --> VectorExpressionWriterFactory --> writeValue(byte[], int, int) / setValue.

It appears that this goes through an additional layer of Text.encode/decode, causing CPU pressure (profiler output attached).

SettableStringObjectInspector / WritableStringObjectInspector has a "set(Object o, Text value)" method. It would be beneficial to use set(Object, Text) directly to save CPU cycles.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
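A plain-JDK sketch of why the direct set path above is cheaper. `TextLike` is a hypothetical stand-in for Hadoop's `Text`, not the real class: the slow path round-trips the bytes through a `String` (UTF-8 decode then re-encode), while the fast path just copies the byte range into a reusable buffer, which is what calling `set(Object, Text)` directly avoids.

```java
import java.nio.charset.StandardCharsets;

public class DirectSetSketch {
    // Minimal stand-in for Hadoop's Text: a reusable UTF-8 byte buffer.
    static final class TextLike {
        byte[] bytes = new byte[0];
        int length;

        // Slow path: decode to String, then re-encode (the extra layer).
        void setViaString(byte[] src, int off, int len) {
            String s = new String(src, off, len, StandardCharsets.UTF_8);
            byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
            bytes = encoded;
            length = encoded.length;
        }

        // Fast path: copy the byte range directly, no charset round trip.
        void setDirect(byte[] src, int off, int len) {
            if (bytes.length < len) {
                bytes = new byte[len];
            }
            System.arraycopy(src, off, bytes, 0, len);
            length = len;
        }
    }

    public static void main(String[] args) {
        byte[] row = "hello-world".getBytes(StandardCharsets.UTF_8);
        TextLike t = new TextLike();
        t.setDirect(row, 0, 5);
        System.out.println(new String(t.bytes, 0, t.length, StandardCharsets.UTF_8)); // hello
    }
}
```

Both paths produce the same bytes for valid UTF-8 input; the direct copy simply skips the per-row decode/encode work that showed up in the profiler.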
[jira] [Updated] (HIVE-8158) Optimize writeValue/setValue in VectorExpressionWriterFactory (in VectorReduceSinkOperator codepath)
[ https://issues.apache.org/jira/browse/HIVE-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-8158:
-----------------------------------
    Attachment: profiler_output.png

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HIVE-8158) Optimize writeValue/setValue in VectorExpressionWriterFactory (in VectorReduceSinkOperator codepath)
[ https://issues.apache.org/jira/browse/HIVE-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-8158:
-----------------------------------
    Attachment: HIVE-8158.1.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-7389) Reduce number of metastore calls in MoveTask (when loading dynamic partitions)
[ https://issues.apache.org/jira/browse/HIVE-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148659#comment-14148659 ]

Rajesh Balamohan commented on HIVE-7389:
----------------------------------------

[~hagleitn] Looks like I need to rebase the patch. I will upload it soon.

> Reduce number of metastore calls in MoveTask (when loading dynamic partitions)
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-7389
>                 URL: https://issues.apache.org/jira/browse/HIVE-7389
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.14.0
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>              Labels: performance
>         Attachments: HIVE-7389.1.patch, local_vm_testcase.txt
>
> When the number of dynamic partitions to be loaded is high, the time taken
> for 'MoveTask' is greater than that of the actual job in some scenarios. It
> would be possible to reduce overall runtime by reducing the number of calls
> made to the metastore from the MoveTask operation.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HIVE-7389) Reduce number of metastore calls in MoveTask (when loading dynamic partitions)
[ https://issues.apache.org/jira/browse/HIVE-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-7389:
-----------------------------------
    Attachment: HIVE-7389.2.patch

Rebasing the patch to trunk.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (HIVE-25827) Parquet file footer is read multiple times, when multiple splits are created in same file
Rajesh Balamohan created HIVE-25827:
---------------------------------------

             Summary: Parquet file footer is read multiple times, when multiple splits are created in same file
                 Key: HIVE-25827
                 URL: https://issues.apache.org/jira/browse/HIVE-25827
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan
         Attachments: image-2021-12-21-03-19-38-577.png

With large files, it is possible that multiple splits are created in the same file. With the current codebase, "ParquetRecordReaderBase" ends up reading the file footer for each split. This can be optimized so that the footer information is not read multiple times for the same file.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java#L160

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L91

!image-2021-12-21-03-19-38-577.png|width=1363,height=1256!

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
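A plain-JDK sketch of the caching idea above, not Hive's actual implementation: memoize the parsed footer per file path so that several splits of the same file pay the footer I/O only once. `Footer` and `readFooterFromFile` are hypothetical stand-ins; a production cache would also key on the file's modification time and bound its size.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class FooterCacheSketch {
    // Stand-in for a parsed Parquet footer (schema, row groups, stats...).
    record Footer(String filePath) {}

    private final Map<String, Footer> cache = new ConcurrentHashMap<>();
    final AtomicInteger footerReads = new AtomicInteger();

    // Pretend this is the expensive tail read + parse of the footer.
    private Footer readFooterFromFile(String filePath) {
        footerReads.incrementAndGet();
        return new Footer(filePath);
    }

    /** Every split asks for its file's footer; only the first request pays the I/O. */
    Footer footerFor(String filePath) {
        return cache.computeIfAbsent(filePath, this::readFooterFromFile);
    }
}
```

`computeIfAbsent` on a `ConcurrentHashMap` guarantees the loader runs at most once per key even when several record readers race on the same file.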
[jira] [Created] (HIVE-25845) Support ColumnIndexes for Parq files
Rajesh Balamohan created HIVE-25845:
---------------------------------------

             Summary: Support ColumnIndexes for Parq files
                 Key: HIVE-25845
                 URL: https://issues.apache.org/jira/browse/HIVE-25845
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan

https://issues.apache.org/jira/browse/PARQUET-1201

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java#L271-L273

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (HIVE-25913) Dynamic Partition Pruning Operator: Not working in iceberg tables
Rajesh Balamohan created HIVE-25913:
---------------------------------------

             Summary: Dynamic Partition Pruning Operator: Not working in iceberg tables
                 Key: HIVE-25913
                 URL: https://issues.apache.org/jira/browse/HIVE-25913
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan

Note that the "Dynamic Partitioning Event Operator" is missing under Map 3 for iceberg tables. This causes heavy IO in iceberg tables, leading to perf degradation.

{noformat}
ACID table
==========
explain select count(*) from store_sales, date_dim where d_month_seq between 1212 and 1212+11 and ss_store_sk is not null and ss_sold_date_sk=d_date_sk;

Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: hive_20220131032425_be2fab7f-7943-4aa1-bbdd-289139ea0f90:17
      Edges:
        Map 1 <- Map 3 (BROADCAST_EDGE)
        Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
      DagName: hive_20220131032425_be2fab7f-7943-4aa1-bbdd-289139ea0f90:17
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: ss_store_sk is not null (type: boolean)
                  Statistics: Num rows: 27503885621 Data size: 434880571744 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ss_store_sk is not null (type: boolean)
                    Statistics: Num rows: 26856185846 Data size: 424639398832 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: ss_sold_date_sk (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 26856185846 Data size: 214849486768 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: bigint)
                          1 _col0 (type: bigint)
                        input vertices:
                          1 Map 3
                        Statistics: Num rows: 5279977323 Data size: 42239818584 Basic stats: COMPLETE Column stats: COMPLETE
                        Group By Operator
                          aggregations: count()
                          minReductionHashAggr: 0.99
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            null sort order:
                            sort order:
                            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                            value expressions: _col0 (type: bigint)
            Execution mode: vectorized, llap
            LLAP IO: may be used (ACID table)
        Map 3
            Map Operator Tree:
                TableScan
                  alias: date_dim
                  filterExpr: (d_month_seq BETWEEN 1212 AND 1223 and d_date_sk is not null) (type: boolean)
                  Statistics: Num rows: 73049 Data size: 876588 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (d_month_seq BETWEEN 1212 AND 1223 and d_date_sk is not null) (type: boolean)
                    Statistics: Num rows: 359 Data size: 4308 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: d_date_sk (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 359 Data size: 2872 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: bigint)
                        null sort order: a
                        sort order: +
                        Map-reduce partition columns: _col0 (type: bigint)
                        Statistics: Num rows: 359 Data size: 2872 Basic stats: COMPLETE Column stats: COMPLETE
                      Select Operator
                        expressions: _col0 (type: bigint)
                        outputColumnNames: _col0
                        Statistics: Num rows: 359 Data size: 2872 Basic stats: COMPLETE Column stats: COMPLETE
                        Group By Operator
                          keys: _col0 (type: bigint)
                          minReductionHashAggr: 0.5013927
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 179 Data size: 1432 Basic stats: COMPLETE Column stats: COMPLETE
                          Dynamic Partitioning Event Operator
{noformat}
[jira] [Created] (HIVE-25927) Fix DataWritableReadSupport
Rajesh Balamohan created HIVE-25927:
---------------------------------------

             Summary: Fix DataWritableReadSupport
                 Key: HIVE-25927
                 URL: https://issues.apache.org/jira/browse/HIVE-25927
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan
         Attachments: Screenshot 2022-02-04 at 4.57.22 AM.png

!Screenshot 2022-02-04 at 4.57.22 AM.png|width=530,height=406!

Column matching takes O(n^2) operations.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (HIVE-25958) Optimise BasicStatsNoJobTask
Rajesh Balamohan created HIVE-25958:
---------------------------------------

             Summary: Optimise BasicStatsNoJobTask
                 Key: HIVE-25958
                 URL: https://issues.apache.org/jira/browse/HIVE-25958
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan

When a large number of files are present, analyzing a table (for stats) takes a lot longer, especially on cloud platforms. Each file is read sequentially for computing stats, which can be optimized.

{code:java}
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:293)
at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:506)
- locked <0x000642995b10> (a org.apache.hadoop.fs.s3a.S3AInputStream)
at org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:775)
- locked <0x000642995b10> (a org.apache.hadoop.fs.s3a.S3AInputStream)
at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:116)
at org.apache.orc.impl.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:574)
at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readFileData(RecordReaderUtils.java:282)
at org.apache.orc.impl.RecordReaderImpl.readAllDataStreams(RecordReaderImpl.java:1172)
at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1128)
at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1281)
at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1316)
at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:302)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:68)
at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:83)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.createReaderFromFile(OrcInputFormat.java:367)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.<init>(OrcInputFormat.java:276)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:2027)
at org.apache.hadoop.hive.ql.stats.BasicStatsNoJobTask$FooterStatCollector.run(BasicStatsNoJobTask.java:235)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

"HiveServer2-Background-Pool: Thread-5161" #5161 prio=5 os_prio=0 tid=0x7f271217d800 nid=0x21b7 waiting on condition [0x7f26fce88000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0006bee1b3a0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
at org.apache.hadoop.hive.ql.stats.BasicStatsNoJobTask.shutdownAndAwaitTermination(BasicStatsNoJobTask.java:426)
at org.apache.hadoop.hive.ql.stats.BasicStatsNoJobTask.aggregateStats(BasicStatsNoJobTask.java:338)
at org.apache.hadoop.hive.ql.stats.BasicStatsNoJobTask.process(BasicStatsNoJobTask.java:121)
at org.apache.hadoop.hive.ql.exec.StatsTask.execute(StatsTask.java:107)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:361)
at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:334)
at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:250)
{code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
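A plain-JDK sketch of the pattern the issue above points toward: collect per-file stats concurrently and rely on footer metadata only, instead of scanning data stripes per file. `FileStats` and `readFooterStats` are hypothetical stand-ins for what a columnar footer already carries; the real stripe-skipping logic lives in the ORC/Parquet readers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class StatsAggregationSketch {
    // Stand-in for the per-file stats a columnar file footer already carries.
    record FileStats(long rowCount, long rawDataSize) {}

    // Pretend this reads only the file footer/metadata, not the data stripes.
    static FileStats readFooterStats(String file) {
        return new FileStats(100, 4096);
    }

    /** Collect footer stats for all files concurrently and fold them into totals. */
    static FileStats aggregate(List<String> files, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<FileStats>> futures = new ArrayList<>();
            for (String f : files) {
                futures.add(pool.submit(() -> readFooterStats(f)));
            }
            long rows = 0, size = 0;
            for (Future<FileStats> fut : futures) {
                FileStats s;
                try {
                    s = fut.get();
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
                rows += s.rowCount();
                size += s.rawDataSize();
            }
            return new FileStats(rows, size);
        } finally {
            pool.shutdown();
        }
    }
}
```

On high-latency object stores, issuing the small footer reads concurrently (rather than one full-file read after another) is where most of the wall-clock win comes from.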
[jira] [Created] (HIVE-25981) Avoid checking for archived parts in analyze table
Rajesh Balamohan created HIVE-25981:
---------------------------------------

             Summary: Avoid checking for archived parts in analyze table
                 Key: HIVE-25981
                 URL: https://issues.apache.org/jira/browse/HIVE-25981
             Project: Hive
          Issue Type: Improvement
          Components: HiveServer2
            Reporter: Rajesh Balamohan

Analyze table on a large partitioned table is expensive due to unwanted checks on archived data.

{noformat}
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:3908)
- locked <0x0003d4c4c070> (a org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler)
at com.sun.proxy.$Proxy56.listPartitionsWithAuthInfo(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:3845)
at org.apache.hadoop.hive.ql.exec.ArchiveUtils.conflictingArchiveNameOrNull(ArchiveUtils.java:299)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.validate(SemanticAnalyzer.java:13579)
at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:241)
at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:196)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:615)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:561)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:555)
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:127)
at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:204)
at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:265)
at org.apache.hive.service.cli.operation.Operation.run(Operation.java:285)
{noformat}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (HIVE-26008) Dynamic partition pruning not sending right partitions with subqueries
Rajesh Balamohan created HIVE-26008:
---------------------------------------

             Summary: Dynamic partition pruning not sending right partitions with subqueries
                 Key: HIVE-26008
                 URL: https://issues.apache.org/jira/browse/HIVE-26008
             Project: Hive
          Issue Type: Improvement
          Components: HiveServer2
            Reporter: Rajesh Balamohan

DPP isn't working correctly when there are subqueries involved. Here is an example query (q83). Note that "date_dim" has another subquery involved. Due to this, the DPP operator ends up sending the entire "date_dim" to the fact tables. Because of this, the data scanned for the fact tables is much higher and the query runtime is increased.

For context, on a very small cluster, this query ran for 265 seconds; the rewritten query finished in 11 seconds! The fact table scan was 10 MB vs 10 GB.

{noformat}
HiveJoin(condition=[=($2, $5)], joinType=[inner])
  HiveJoin(condition=[=($0, $3)], joinType=[inner])
    HiveProject(cr_item_sk=[$1], cr_return_quantity=[$16], cr_returned_date_sk=[$26])
      HiveFilter(condition=[AND(IS NOT NULL($26), IS NOT NULL($1))])
        HiveTableScan(table=[[tpcds_bin_partitioned_orc_1, catalog_returns]], table:alias=[catalog_returns])
    HiveProject(i_item_sk=[$0], i_item_id=[$1])
      HiveFilter(condition=[AND(IS NOT NULL($1), IS NOT NULL($0))])
        HiveTableScan(table=[[tpcds_bin_partitioned_orc_1, item]], table:alias=[item])
  HiveProject(d_date_sk=[$0], d_date=[$2])
    HiveFilter(condition=[AND(IS NOT NULL($2), IS NOT NULL($0))])
      HiveTableScan(table=[[tpcds_bin_partitioned_orc_1, date_dim]], table:alias=[date_dim])
  HiveProject(d_date=[$0])
    HiveSemiJoin(condition=[=($1, $2)], joinType=[semi])
      HiveProject(d_date=[$2], d_week_seq=[$4])
        HiveFilter(condition=[AND(IS NOT NULL($4), IS NOT NULL($2))])
          HiveTableScan(table=[[tpcds_bin_partitioned_orc_1, date_dim]], table:alias=[date_dim])
      HiveProject(d_week_seq=[$4])
        HiveFilter(condition=[AND(IN($2, 1998-01-02:DATE, 1998-10-15:DATE, 1998-11-10:DATE), IS NOT NULL($4))])
          HiveTableScan(table=[[tpcds_bin_partitioned_orc_1, date_dim]], table:alias=[date_dim])
{noformat}

*Original Query & Plan:*

{noformat}
explain cbo
with sr_items as
 (select i_item_id item_id, sum(sr_return_quantity) sr_item_qty
  from store_returns, item, date_dim
  where sr_item_sk = i_item_sk
    and d_date in (select d_date
                   from date_dim
                   where d_week_seq in (select d_week_seq
                                        from date_dim
                                        where d_date in ('1998-01-02','1998-10-15','1998-11-10')))
    and sr_returned_date_sk = d_date_sk
  group by i_item_id),
 cr_items as
 (select i_item_id item_id, sum(cr_return_quantity) cr_item_qty
  from catalog_returns, item, date_dim
  where cr_item_sk = i_item_sk
    and d_date in (select d_date
                   from date_dim
                   where d_week_seq in (select d_week_seq
                                        from date_dim
                                        where d_date in ('1998-01-02','1998-10-15','1998-11-10')))
    and cr_returned_date_sk = d_date_sk
  group by i_item_id),
 wr_items as
 (select i_item_id item_id, sum(wr_return_quantity) wr_item_qty
  from web_returns, item, date_dim
  where wr_item_sk = i_item_sk
    and d_date in (select d_date
                   from date_dim
                   where d_week_seq in (select d_week_seq
                                        from date_dim
                                        where d_date in ('1998-01-02','1998-10-15','1998-11-10')))
    and wr_returned_date_sk = d_date_sk
  group by i_item_id)
select sr_items.item_id
      ,sr_item_qty
      ,sr_item_qty/(sr_item_qty+cr_item_qty+wr_item_qty)/3.0 * 100 sr_dev
      ,cr_item_qty
      ,cr_item_qty/(sr_item_qty+cr_item_qty+wr_item_qty)/3.0 * 100 cr_dev
      ,wr_item_qty
      ,wr_item_qty/(sr_item_qty+cr_item_qty+wr_item_qty)/3.0 * 100 wr_dev
      ,(sr_item_qty+cr_item_qty+wr_item_qty)/3.0 average
from sr_items, cr_items, wr_items
where sr_items.item_id=cr_items.item_id
  and sr_items.item_id=wr_items.item_id
order by sr_items.item_id, sr_item_qty
limit 100

INFO : Starting task [Stage-3:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20220307055109_88ad0cbd-bd40-45bc-92ae-ab15fa6b1da4); Time taken: 0.973 seconds
INFO : OK

Explain
CBO PLAN:
HiveSortLimit(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC], fetch=[100])
  HiveProject(item_id=[$0], sr_item_qty=[$4], sr_dev=[*(/(/($5, CAST(+(+($4, $1), $7)):DOUBLE), 3), 100)], cr_item_qty=[$1], cr_dev=[*(/(/($2, CAST(+(+($4, $1), $7)):DOUBLE), 3), 100)], wr_item_qty=[$7], wr_dev=[*(/(/($8, CAST(+(+($4, $1), $7)):DOUBLE), 3), 100)], average=[/(CAST(+(+($4, $1), $7)):DECIMAL(19, 0), 3:DECIMAL(1, 0))])
    HiveJoin(condition=[=($0, $6)], joinType=[inner])
      HiveJoin(condition=[=($3, $0)], joinType=[inner])
        HiveProject($f0=[$0], $f1=[$1], EXPR$0=[CAST($1):DOUBLE])
          HiveAggregate(group=[{4}], agg#0=[sum($1)])
            HiveSemiJoin(co
{noformat}
[jira] [Created] (HIVE-26013) Parquet predicate filters are not properly propagated to task configs at runtime
Rajesh Balamohan created HIVE-26013:
---------------------------------------

             Summary: Parquet predicate filters are not properly propagated to task configs at runtime
                 Key: HIVE-26013
                 URL: https://issues.apache.org/jira/browse/HIVE-26013
             Project: Hive
          Issue Type: Bug
            Reporter: Rajesh Balamohan

Hive's ParquetRecordReader sets the predicate filter in the config for the parquet libraries to read.

Ref: https://github.com/apache/hive/blob/master/ql%2Fsrc%2Fjava%2Forg%2Fapache%2Fhadoop%2Fhive%2Fql%2Fio%2Fparquet%2FParquetRecordReaderBase.java#L188

{code:java}
ParquetInputFormat.setFilterPredicate(conf, p);
{code}

This internally sets the "parquet.private.read.filter.predicate" variable in the config.

Ref: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fparquet%2Fhadoop%2FParquetInputFormat.java#L231

Config set in the compilation phase isn't visible to the tasks at runtime. This causes the filters to be lost, and tasks run with excessive IO.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
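A plain-JDK sketch of the failure mode described above, with a `Map` standing in for a Hadoop `Configuration`: the task's configuration is a snapshot taken at launch, so a property set on the driver-side config after (or outside of) that snapshot never reaches the task. The fix direction is to attach the serialized predicate to something that is actually shipped to tasks.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfPropagationSketch {
    // Stand-in for a Hadoop Configuration: just a map of properties.
    static Map<String, String> cloneForTask(Map<String, String> driverConf) {
        return new HashMap<>(driverConf); // snapshot taken at task launch
    }

    public static void main(String[] args) {
        Map<String, String> driverConf = new HashMap<>();
        Map<String, String> taskConf = cloneForTask(driverConf);

        // Setting the filter on the driver conf after the snapshot is taken
        // never reaches the task - the filter is silently lost.
        driverConf.put("parquet.private.read.filter.predicate", "serialized-filter");
        System.out.println(taskConf.containsKey("parquet.private.read.filter.predicate")); // false
    }
}
```

The property name is the real one from parquet-mr; the `cloneForTask` helper is illustrative only.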
[jira] [Created] (HIVE-26035) Move to directsql for ObjectStore::addPartitions
Rajesh Balamohan created HIVE-26035: --- Summary: Move to directsql for ObjectStore::addPartitions Key: HIVE-26035 URL: https://issues.apache.org/jira/browse/HIVE-26035 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan Currently {{addPartitions}} uses DataNucleus and is very slow for a large number of partitions. It would be good to move it to direct SQL. Lots of repeated SQL statements can be avoided as well (e.g. SDS, SERDE, TABLE_PARAMS). -- This message was sent by Atlassian Jira (v8.20.1#820001)
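As a rough illustration of the direct-SQL idea: the per-partition ORM round trips can collapse into one multi-row statement. The table and column names below are simplified stand-ins, not the exact metastore schema, and real code would bind parameters via a PreparedStatement rather than concatenate strings.

```java
import java.util.List;
import java.util.StringJoiner;

// Sketch: batch N partitions into a single INSERT instead of N
// DataNucleus round trips. PARTITIONS/TBL_ID/PART_NAME are simplified
// stand-ins for the real metastore schema, and production code would
// use a PreparedStatement with bound parameters, not string building.
public class DirectSqlBatch {

    static String buildPartitionInsert(long tblId, List<String> partNames) {
        StringJoiner rows = new StringJoiner(", ");
        for (String name : partNames) {
            rows.add("(" + tblId + ", '" + name + "')");
        }
        return "INSERT INTO PARTITIONS (TBL_ID, PART_NAME) VALUES " + rows;
    }

    public static void main(String[] args) {
        // One statement for all partitions; shared SDS/SERDE/TABLE_PARAMS rows
        // would similarly be written once, not once per partition.
        System.out.println(buildPartitionInsert(42,
                List.of("ds=2022-01-01", "ds=2022-01-02")));
    }
}
```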
[jira] [Created] (HIVE-26072) Enable vectorization for stats gathering (tablescan op)
Rajesh Balamohan created HIVE-26072: --- Summary: Enable vectorization for stats gathering (tablescan op) Key: HIVE-26072 URL: https://issues.apache.org/jira/browse/HIVE-26072 Project: Hive Issue Type: Bug Components: Hive Reporter: Rajesh Balamohan https://issues.apache.org/jira/browse/HIVE-24510 enabled vectorization for compute_bit_vector, but vectorization of the tablescan operator for stats gathering is still disabled by default. [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java#L2577] We need to enable vectorization for this; it can significantly reduce runtimes of analyze statements on large tables. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HIVE-26091) Support DecimalFilterPredicateLeafBuilder for parquet
Rajesh Balamohan created HIVE-26091: --- Summary: Support DecimalFilterPredicateLeafBuilder for parquet Key: HIVE-26091 URL: https://issues.apache.org/jira/browse/HIVE-26091 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/LeafFilterFactory.java#L41 It would be nice to have a DecimalFilterPredicateLeafBuilder. This would help support SARG pushdowns for DECIMAL columns.
{noformat}
2022-03-30 08:59:50,040 [ERROR] [TezChild] |read.ParquetFilterPredicateConverter|: fail to build predicate filter leaf with errors
org.apache.hadoop.hive.ql.metadata.HiveException: Conversion to Parquet FilterPredicate not supported for DECIMAL
org.apache.hadoop.hive.ql.metadata.HiveException: Conversion to Parquet FilterPredicate not supported for DECIMAL
 at org.apache.hadoop.hive.ql.io.parquet.LeafFilterFactory.getLeafFilterBuilderByType(LeafFilterFactory.java:223)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.buildFilterPredicateFromPredicateLeaf(ParquetFilterPredicateConverter.java:130)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:111)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:97)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:71)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:88)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.toFilterPredicate(ParquetFilterPredicateConverter.java:57)
 at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.setFilter(ParquetRecordReaderBase.java:184)
 at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:124)
 at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.(VectorizedParquetRecordReader.java:158)
 at org.apache.hadoop.hive.ql.io.parquet.VectorizedParquetInputFormat.getRecordReader(VectorizedParquetInputFormat.java:50)
 at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:87)
 at org.apache.hadoop.hive.ql.io.RecordReaderWrapper.create(RecordReaderWrapper.java:72)
 at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:429)
 at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
 at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
 at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
 at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
 at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:437)
 at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:282)
 at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:265)
 at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
 at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75)
 at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
 at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62)
 at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38)
 at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
 at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
 at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
 at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
{noformat}
-- This message was sent by Atlassian Jira (
[jira] [Created] (HIVE-26110) bulk insert into partitioned table creates lots of files in iceberg
Rajesh Balamohan created HIVE-26110: --- Summary: bulk insert into partitioned table creates lots of files in iceberg Key: HIVE-26110 URL: https://issues.apache.org/jira/browse/HIVE-26110 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan For example, create the web_returns table from tpcds in iceberg format and try to copy over the data from the regular table, i.e. "insert into web_returns_iceberg select * from web_returns". This inserts the data correctly; however, there are a lot of files present in each partition. IMO, the dynamic sort optimisation isn't working correctly, and this causes records not to be grouped by partition in the final phase. -- This message was sent by Atlassian Jira (v8.20.1#820001)
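What the sort optimisation is supposed to achieve can be sketched without Hive. The following is an illustrative model (not the Hive implementation): if rows reaching the final writers are grouped by partition key, each partition is handled by one writer and yields one file per task, regardless of how the input was interleaved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model: grouping incoming rows by partition key before
// writing means one writer (hence one file) per partition per task.
// Without the grouping, each task may open a writer per partition it
// happens to see, which is how a bulk insert ends up with many small files.
public class PartitionGrouping {

    // partition key -> rows; each map entry corresponds to one output file.
    static Map<String, List<String>> groupByPartition(List<String[]> rows) {
        Map<String, List<String>> files = new TreeMap<>();
        for (String[] row : rows) {          // row[0] = partition key, row[1] = payload
            files.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return files;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[]{"wr_returned_date_sk=2450816", "r1"},
                new String[]{"wr_returned_date_sk=2450817", "r2"},
                new String[]{"wr_returned_date_sk=2450816", "r3"});
        // Two partitions -> two "files", despite the interleaved input.
        System.out.println(groupByPartition(rows).keySet());
    }
}
```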
[jira] [Created] (HIVE-26115) Parquet footer is read 3 times when reading iceberg data
Rajesh Balamohan created HIVE-26115: --- Summary: Parquet footer is read 3 times when reading iceberg data Key: HIVE-26115 URL: https://issues.apache.org/jira/browse/HIVE-26115 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan Attachments: Screenshot 2022-04-05 at 10.08.27 AM.png, Screenshot 2022-04-05 at 10.08.35 AM.png, Screenshot 2022-04-05 at 10.08.50 AM.png, Screenshot 2022-04-05 at 10.09.03 AM.png !Screenshot 2022-04-05 at 10.08.27 AM.png|width=627,height=331! Here is the breakup of 3 footer reads per file. !Screenshot 2022-04-05 at 10.08.35 AM.png|width=1109,height=500! !Screenshot 2022-04-05 at 10.08.50 AM.png|width=1067,height=447! !Screenshot 2022-04-05 at 10.09.03 AM.png|width=827,height=303! HIVE-25827 already talks about the initial 2 footer reads per file. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HIVE-26128) Enabling dynamic runtime filtering in iceberg tables throws exception at runtime
Rajesh Balamohan created HIVE-26128: --- Summary: Enabling dynamic runtime filtering in iceberg tables throws exception at runtime Key: HIVE-26128 URL: https://issues.apache.org/jira/browse/HIVE-26128 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan E.g., TPCDS Q2 at 10 TB scale throws the following error when run with "hive.disable.unsafe.external.table.operations=false". The iceberg tables were created as external tables; setting "hive.disable.unsafe.external.table.operations=false" enables dynamic runtime filtering for them, but the query then throws the following error at runtime: {noformat} ]Vertex failed, vertexName=Map 6, vertexId=vertex_1649658279052__1_03, diagnostics=[Vertex vertex_1649658279052__1_03 [Map 6] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: date_dim initializer failed, vertex=vertex_1649658279052__1_03 [Map 6], java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:659) at java.util.ArrayList.get(ArrayList.java:435) at org.apache.iceberg.mr.hive.HiveIcebergFilterFactory.translateLeaf(HiveIcebergFilterFactory.java:114) at org.apache.iceberg.mr.hive.HiveIcebergFilterFactory.translate(HiveIcebergFilterFactory.java:86) at org.apache.iceberg.mr.hive.HiveIcebergFilterFactory.translate(HiveIcebergFilterFactory.java:80) at org.apache.iceberg.mr.hive.HiveIcebergFilterFactory.generateFilterExpression(HiveIcebergFilterFactory.java:59) at org.apache.iceberg.mr.hive.HiveIcebergInputFormat.getSplits(HiveIcebergInputFormat.java:92) at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:592) at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:900) at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:274) at org.apache.tez.dag.app.dag.RootInputInitializerManager.lambda$runInitializer$3(RootInputInitializerManager.java:199) at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898) at org.apache.tez.dag.app.dag.RootInputInitializerManager.runInitializer(RootInputInitializerManager.java:192) at org.apache.tez.dag.app.dag.RootInputInitializerManager.runInitializerAndProcessResult(RootInputInitializerManager.java:173) at org.apache.tez.dag.app.dag.RootInputInitializerManager.lambda$createAndStartInitializing$2(RootInputInitializerManager.java:167) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69) at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) ]Vertex killed, vertexName=Map 13, vertexId=vertex_1649658279052__1_07, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1649658279052__1_07 [Map 13] killed/failed due to:OTHER_VERTEX_FAILURE]Vertex killed, vertexName=Map 10, vertexId=vertex_1649658279052__1_06, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1649658279052__1_06 [Map 10] killed/failed due to:OTHER_VERTEX_FAILURE]Vertex killed, vertexName=Map 5, vertexId=vertex_1649658279052__1_04, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1649658279052__1_04 [Map 5] killed/failed due to:OTHER_VERTEX_FAILURE]Vertex killed, vertexName=Reducer 4, vertexId=vertex_1649658279052__1_11, diagnostics=[Vertex received Kill in NEW state., Vertex vertex_1649658279052__1_11 [Reducer 4] killed/failed due to:OTHER_VERTEX_FAILURE]Vertex killed, 
vertexName=Reducer 3, vertexId=vertex_1649658279052__1_10, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1649658279052__1_10 [Reducer 3] killed/failed due to:OTHER_VERTEX_FAILURE]Vertex killed, vertexName=Reducer 12, vertexId=vertex_1649658279052__1_09, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1649658279052__1_09 [Reducer 12] killed/failed due to:OTHER_VERTEX_FAILURE]Vertex killed, vertexName=Map 1, vertexId=vertex_1649658279052__1_08, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1649658279052__1_08 [Map 1] killed/failed due to:OT
[jira] [Created] (HIVE-26181) Add details on the number of partitions/entries in dynamic partition pruning
Rajesh Balamohan created HIVE-26181: --- Summary: Add details on the number of partitions/entries in dynamic partition pruning Key: HIVE-26181 URL: https://issues.apache.org/jira/browse/HIVE-26181 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan Related ticket: HIVE-26008 It will be good to print details on the number of partition pruning entries for debugging and for understanding the eff* of the query. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (HIVE-26185) Need support for metadataonly operations with iceberg (e.g select distinct on partition column)
Rajesh Balamohan created HIVE-26185: --- Summary: Need support for metadataonly operations with iceberg (e.g select distinct on partition column) Key: HIVE-26185 URL: https://issues.apache.org/jira/browse/HIVE-26185 Project: Hive Issue Type: Bug Components: HiveServer2 Reporter: Rajesh Balamohan
{noformat}
select distinct ss_sold_date_sk from store_sales
{noformat}
This query scans only 1800+ rows in Hive ACID, though it takes ages in the NullScanOptimiser during the compilation phase (https://issues.apache.org/jira/browse/HIVE-24262):
{noformat}
Hive ACID
INFO : Executing command(queryId=hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14): select distinct ss_sold_date_sk from store_sales
INFO : Compute 'ndembla-test2' is active.
INFO : Query ID = hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Subscribed to counters: [] for queryId: hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: select distinct ss_sold_date_s...store_sales (Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id application_1651102345385_)
INFO : Status: DAG finished successfully in 1.81 seconds
INFO : DAG ID: dag_1651102345385__5
INFO :
INFO : Query Execution Summary
INFO : --
INFO : OPERATION  DURATION
INFO : --
INFO : Compile Query  55.47s
INFO : Prepare Plan  2.32s
INFO : Get Query Coordinator (AM)  0.13s
INFO : Submit Plan  0.03s
INFO : Start DAG  0.09s
INFO : Run DAG  1.80s
INFO : --
INFO :
INFO : Task Execution Summary
INFO : --
INFO : VERTICES  DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
INFO : --
INFO : Map 1  1009.00  0  0  1,824  1,824
INFO : Reducer 2  0.00  0  0  1,824  0
INFO : --
INFO :
{noformat}
However, the same query scans *2.8 billion records* in iceberg format. This can be fixed. 
{noformat}
INFO : Executing command(queryId=hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72): select distinct ss_sold_date_sk from store_sales
INFO : Compute 'ndembla-test2' is active.
INFO : Query ID = hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Subscribed to counters: [] for queryId: hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: select distinct ss_sold_date_s...store_sales (Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id application_1651102345385_)
--
VERTICES  MODE  STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--
Map 1 ..  llap  SUCCEEDED  7141  7141  0  0  0  0
Reducer 2 ..  llap  SUCCEEDED  2  2  0  0  0  0
--
VERTICES: 02/02 [==>>] 100% ELAPSED TIME: 18.48 s
--
INFO : Status: DAG finished successfully in 17.97 seconds
INFO : DAG ID: dag_1651102345385__4
INFO :
INFO : Query Execution Summary
INFO : --
INFO : OPERATION  DURATION
INFO : --
INFO : Compile Query  1.81s
INFO : Prepare Plan  0.04s
INFO : Get Query Coordinator
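The metadata-only idea above can be sketched without Hive: "select distinct <partition column>" needs no row scan at all, because the distinct values are exactly the partition values the table metadata already tracks. Partition specs are modeled here as hypothetical "col=value" strings; real Hive/Iceberg partition metadata is richer.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Sketch of the metadata-only optimisation: the answer to
// "select distinct ss_sold_date_sk" is derivable from the partition
// list alone, with zero data-file reads.
public class MetadataOnlyDistinct {

    static Set<String> distinctFromPartitions(List<String> partitionSpecs, String column) {
        Set<String> values = new TreeSet<>();
        for (String spec : partitionSpecs) {
            String[] kv = spec.split("=", 2);
            if (kv[0].equals(column)) {
                values.add(kv[1]);   // value comes from metadata, not from data files
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> specs = List.of(
                "ss_sold_date_sk=2450816",
                "ss_sold_date_sk=2450817",
                "ss_sold_date_sk=2450816");
        System.out.println(distinctFromPartitions(specs, "ss_sold_date_sk"));
    }
}
```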
[jira] [Created] (HIVE-26194) Unable to interrupt query in the middle of long compilation
Rajesh Balamohan created HIVE-26194: --- Summary: Unable to interrupt query in the middle of long compilation Key: HIVE-26194 URL: https://issues.apache.org/jira/browse/HIVE-26194 Project: Hive Issue Type: Bug Components: HiveServer2 Reporter: Rajesh Balamohan *Issue:* * Certain queries can take lot longer time to compile, depending on the number of interactions with HMS. * When user tries to cancel such queries in the middle of compilation, it doesn't work. It interrupts the process only when the entire compilation phase is complete. * Example is given below (Q66 at 10 TB TPCDS) {noformat} . . . . . . . . . . . . . . . . . . . . . . .>,d_year . . . . . . . . . . . . . . . . . . . . . . .> ) . . . . . . . . . . . . . . . . . . . . . . .> ) x . . . . . . . . . . . . . . . . . . . . . . .> group by . . . . . . . . . . . . . . . . . . . . . . .> w_warehouse_name . . . . . . . . . . . . . . . . . . . . . . .>,w_warehouse_sq_ft . . . . . . . . . . . . . . . . . . . . . . .>,w_city . . . . . . . . . . . . . . . . . . . . . . .>,w_county . . . . . . . . . . . . . . . . . . . . . . .>,w_state . . . . . . . . . . . . . . . . . . . . . . .>,w_country . . . . . . . . . . . . . . . . . . . . . . .>,ship_carriers . . . . . . . . . . . . . . . . . . . . . . .>,year . . . . . . . . . . . . . . . . . . . . . . .> order by w_warehouse_name . . . . . . . . . . . . . . . . . . . . . . .> limit 100; Interrupting... Please be patient this may take some time. Interrupting... Please be patient this may take some time. Interrupting... Please be patient this may take some time. Interrupting... Please be patient this may take some time. Interrupting... Please be patient this may take some time. Interrupting... Please be patient this may take some time. ... ... ... ,w_city ,w_county ,w_state ,w_country ,ship_carriers ,year order by w_warehouse_name limit 100 INFO : Semantic Analysis Completed (retrial = false) ERROR : FAILED: command has been interrupted: after analyzing query. 
INFO : Compiling command(queryId=hive_20220502040541_14c76b6f-f6d2-4ab3-ad82-522f17ede63a) has been interrupted after 32.872 seconds <<< Notice that it interrupted only after the entire compilation was done, at ~32 seconds. Error: Query was cancelled. Illegal Operation state transition from CANCELED to ERROR (state=01000,code=0) {noformat} This becomes an issue in a busy cluster. Interrupt handling should be fixed in the compilation phase. -- This message was sent by Atlassian Jira (v8.20.7#820007)
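The fix direction is cooperative cancellation inside compilation itself. The sketch below is illustrative (plain Runnables standing in for the expensive metastore calls, not Hive's Driver code): the analyzer checks the interrupt flag between steps instead of noticing the cancel only once the whole phase finishes.

```java
import java.util.List;

// Sketch of cooperative cancellation during compilation: check the
// thread's interrupt flag between expensive steps (e.g. HMS calls),
// so a cancel takes effect promptly rather than after the full phase.
public class InterruptibleCompile {

    // Runs each step, bailing out quickly if the query was cancelled.
    static int analyze(List<Runnable> steps) throws InterruptedException {
        int done = 0;
        for (Runnable step : steps) {
            if (Thread.interrupted()) {   // clears the flag and reports cancellation
                throw new InterruptedException("query cancelled after " + done + " steps");
            }
            step.run();
            done++;
        }
        return done;
    }

    public static void main(String[] args) throws InterruptedException {
        // Without an interrupt, all steps run to completion.
        System.out.println(analyze(List.of(() -> {}, () -> {})));   // prints 2
    }
}
```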
[jira] [Created] (HIVE-26490) Iceberg: Residual expression is constructed for the task from multiple places causing CPU burn
Rajesh Balamohan created HIVE-26490: --- Summary: Iceberg: Residual expression is constructed for the task from multiple places causing CPU burn Key: HIVE-26490 URL: https://issues.apache.org/jira/browse/HIVE-26490 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan Attachments: Screenshot 2022-08-22 at 12.58.47 PM.jpg "HiveIcebergInputFormat.residualForTask(task, job)" is invoked from multiple places causing CPU burn. !Screenshot 2022-08-22 at 12.58.47 PM.jpg|width=918,height=932! -- This message was sent by Atlassian Jira (v8.20.10#820010)
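One possible shape of a fix (illustrative only, not the actual patch): memoize the residual expression per task, so the several call sites that each invoke residualForTask(task, job) today share one computed result instead of re-deriving it. Strings stand in for the task and expression types here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Sketch: cache the residual expression per task so repeated callers
// hit the cache instead of re-running the expensive derivation.
public class ResidualCache {

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    final AtomicInteger derivations = new AtomicInteger();   // exposed for testing

    String residualFor(String taskId, Function<String, String> derive) {
        return cache.computeIfAbsent(taskId, id -> {
            derivations.incrementAndGet();   // the expensive derivation runs once per task
            return derive.apply(id);
        });
    }

    public static void main(String[] args) {
        ResidualCache cache = new ResidualCache();
        Function<String, String> derive = id -> "residual-for-" + id;
        cache.residualFor("task-1", derive);
        cache.residualFor("task-1", derive);          // second caller hits the cache
        System.out.println(cache.derivations.get());  // prints 1
    }
}
```

ConcurrentHashMap.computeIfAbsent also makes the derivation safe if multiple threads ask for the same task concurrently.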
[jira] [Created] (HIVE-26491) Iceberg: Drop table should purge the data for V2 tables
Rajesh Balamohan created HIVE-26491: --- Summary: Iceberg: Drop table should purge the data for V2 tables Key: HIVE-26491 URL: https://issues.apache.org/jira/browse/HIVE-26491 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan
# Create an external table stored by iceberg in orc format, and convert it to the iceberg v2 table format via alter table statements. This should ideally set the "'external.table.purge'='true'" property by default, but that is missing for V2 tables.
# Insert data into it.
# Drop the table. This drops the metadata information, but retains the actual data.
Set "'external.table.purge'='true'" as the default for iceberg (if it hasn't been set yet). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26496) FetchOperator scans delete_delta folders multiple times causing slowness
Rajesh Balamohan created HIVE-26496: --- Summary: FetchOperator scans delete_delta folders multiple times causing slowness Key: HIVE-26496 URL: https://issues.apache.org/jira/browse/HIVE-26496 Project: Hive Issue Type: Bug Components: HiveServer2 Reporter: Rajesh Balamohan FetchOperator scans far more files/directories than needed. For example, here is the layout of a table which had a set of updates and deletes; a set of "delta" and "delete_delta" folders has been created.
{noformat}
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/base_001
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_002_002_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_003_003_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_004_004_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_005_005_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_006_006_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_007_007_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_008_008_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_009_009_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_010_010_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_011_011_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_012_012_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_013_013_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_014_014_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_015_015_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_016_016_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_017_017_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_018_018_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_019_019_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_020_020_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_021_021_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_022_022_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_002_002_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_003_003_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_004_004_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_005_005_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_006_006_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_007_007_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_008_008_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_009_009_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_010_010_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_011_011_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_012_012_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_013_013_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_014_014_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_015_015_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_016_016_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_017_017_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_018_018_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_019_019_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_020_020_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_021_021_
{noformat}
When a user runs *{color:#0747a6}{{select * from date_dim}}{color}* from beeline, FetchOperator tries to compute splits in "date_dim". This "base" and "delta" folders and computes 2
[jira] [Created] (HIVE-26507) Iceberg: In place metadata generation may not work for certain datatypes
Rajesh Balamohan created HIVE-26507: --- Summary: Iceberg: In place metadata generation may not work for certain datatypes Key: HIVE-26507 URL: https://issues.apache.org/jira/browse/HIVE-26507 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan "alter table" statements can be used for generating iceberg metadata information (i.e. for converting external tables -> iceberg tables). As part of this process, it also converts certain datatypes to iceberg-compatible types (e.g. char -> string); "iceberg.mr.schema.auto.conversion" enables this conversion. This can cause issues at runtime. Here is an example:
{noformat}
Before conversion (external table):
select count(*) from customer_demographics where cd_gender = 'F' and cd_marital_status = 'U' and cd_education_status = '2 yr Degree';
27440

After conversion (iceberg table):
select count(*) from customer_demographics where cd_gender = 'F' and cd_marital_status = 'U' and cd_education_status = '2 yr Degree';
0

select count(*) from customer_demographics where cd_gender = 'F' and cd_marital_status = 'U' and trim(cd_education_status) = '2 yr Degree';
27440
{noformat}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
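A plausible explanation for the behaviour above, consistent with the trim() workaround but an assumption rather than a confirmed root cause: CHAR(n) values are stored space-padded and compared ignoring trailing spaces, while plain strings are compared verbatim, so the padded bytes no longer match the unpadded literal after conversion. A self-contained sketch of the two comparison semantics:

```java
// Assumed-cause sketch: CHAR(n) pads stored values to n characters but
// compares ignoring trailing pad spaces; once the column is read back
// as a plain string, the padded value is compared byte-for-byte and the
// unpadded literal stops matching (only trim() restores the match).
public class CharVsString {

    static String charPad(String value, int n) {
        return String.format("%-" + n + "s", value);   // right-pad with spaces to length n
    }

    static boolean charEquals(String stored, String literal) {
        return stored.stripTrailing().equals(literal.stripTrailing());  // CHAR semantics
    }

    static boolean stringEquals(String stored, String literal) {
        return stored.equals(literal);                                  // string semantics
    }

    public static void main(String[] args) {
        String stored = charPad("2 yr Degree", 20);    // what a CHAR(20) column would hold
        System.out.println(charEquals(stored, "2 yr Degree"));            // prints true
        System.out.println(stringEquals(stored, "2 yr Degree"));          // prints false
        System.out.println(stringEquals(stored.trim(), "2 yr Degree"));   // prints true
    }
}
```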
[jira] [Created] (HIVE-26520) Improve dynamic partition pruning operator when subqueries are involved
Rajesh Balamohan created HIVE-26520: --- Summary: Improve dynamic partition pruning operator when subqueries are involved Key: HIVE-26520 URL: https://issues.apache.org/jira/browse/HIVE-26520 Project: Hive Issue Type: Improvement Components: HiveServer2 Reporter: Rajesh Balamohan Attachments: q58_test.pdf Dynamic partition pruning operator sends entire date_dim table and due to this, entire catalog_sales data is scanned causing huge IO and decoding cost. If dynamic partition pruning operator was created after the "date_dim" subquery has been evaluated, it would have saved huge IO cost. E.g It would have just taken 6-7 partition scans instead of 1800+ partitions. Consider the following simplified query as example {noformat} select count(*) from (select i_item_id item_id ,sum(cs_ext_sales_price) cs_item_rev from catalog_sales ,item ,date_dim where cs_item_sk = i_item_sk and d_date in (select d_date from date_dim where d_week_seq = (select d_week_seq from date_dim where d_date = '1998-02-21')) and cs_sold_date_sk = d_date_sk group by i_item_id) a; CBO PLAN: HiveAggregate(group=[{}], agg#0=[count()]) HiveProject(i_item_id=[$0]) HiveAggregate(group=[{4}]) HiveSemiJoin(condition=[=($6, $7)], joinType=[semi]) HiveJoin(condition=[=($2, $5)], joinType=[inner]) HiveJoin(condition=[=($0, $3)], joinType=[inner]) HiveProject(cs_item_sk=[$14], cs_ext_sales_price=[$22], cs_sold_date_sk=[$33]) HiveFilter(condition=[AND(IS NOT NULL($33), IS NOT NULL($14))]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_1_external, catalog_sales]], table:alias=[catalog_sales]) HiveProject(i_item_sk=[$0], i_item_id=[$1]) HiveFilter(condition=[IS NOT NULL($0)]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_1_external, item]], table:alias=[item]) HiveProject(d_date_sk=[$0], d_date=[$2]) HiveFilter(condition=[AND(IS NOT NULL($2), IS NOT NULL($0))]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_1_external, date_dim]], table:alias=[date_dim]) HiveProject(d_date=[$0]) HiveJoin(condition=[=($1, $3)], 
joinType=[inner]) HiveJoin(condition=[true], joinType=[inner]) HiveProject(d_date=[$2], d_week_seq=[$4]) HiveFilter(condition=[AND(IS NOT NULL($2), IS NOT NULL($4))]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_1_external, date_dim]], table:alias=[date_dim]) HiveProject(cnt=[$0]) HiveFilter(condition=[<=(sq_count_check($0), 1)]) HiveProject(cnt=[$0]) HiveAggregate(group=[{}], cnt=[COUNT()]) HiveFilter(condition=[=($2, 1998-02-21)]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_1_external, date_dim]], table:alias=[date_dim]) HiveProject(d_week_seq=[$4]) HiveFilter(condition=[AND(=($2, 1998-02-21), IS NOT NULL($4))]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_1_external, date_dim]], table:alias=[date_dim]) {noformat} I will attach the formatted plan for reference as well. If the planner generated the dynamic partition pruning event after "date_dim" got evaluated in "Map 7", it would have been very efficient. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26529) Fix VectorizedSupport support for DECIMAL_64 in HiveIcebergInputFormat
Rajesh Balamohan created HIVE-26529: --- Summary: Fix VectorizedSupport support for DECIMAL_64 in HiveIcebergInputFormat Key: HIVE-26529 URL: https://issues.apache.org/jira/browse/HIVE-26529 Project: Hive Issue Type: Improvement Components: HiveServer2 Reporter: Rajesh Balamohan For supporting vectorized reads in parquet, DECIMAL_64 support in ORC was disabled in HiveIcebergInputFormat. This causes regressions in queries. [https://github.com/apache/hive/blob/master/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergInputFormat.java#L182] It would be good to restore DECIMAL_64 support in the iceberg input format. -- This message was sent by Atlassian Jira (v8.20.10#820010)
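To see why losing DECIMAL_64 is costly, here is a conceptual sketch (plain BigDecimal/long, not Hive's internal classes): a decimal whose precision fits in 18 digits can be carried as a long scaled by 10^scale, so vectorized arithmetic stays in primitive longs instead of allocating a decimal object per value.

```java
import java.math.BigDecimal;

// Conceptual sketch of the DECIMAL_64 representation: store the decimal's
// unscaled value in a long and do column arithmetic entirely in longs,
// converting back to a decimal only at the end. Disabling DECIMAL_64
// forces the slower object-per-value representation.
public class Decimal64Sketch {

    static long toScaledLong(BigDecimal value, int scale) {
        return value.setScale(scale).unscaledValue().longValueExact();
    }

    static BigDecimal fromScaledLong(long raw, int scale) {
        return BigDecimal.valueOf(raw, scale);
    }

    public static void main(String[] args) {
        // Sum a "column" of decimals using only long arithmetic.
        long[] column = {
                toScaledLong(new BigDecimal("12.34"), 2),
                toScaledLong(new BigDecimal("0.66"), 2)};
        long sum = column[0] + column[1];
        System.out.println(fromScaledLong(sum, 2));   // prints 13.00
    }
}
```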
[jira] [Created] (HIVE-26532) Remove logger from critical path in VectorMapJoinInnerLongOperator::processBatch
Rajesh Balamohan created HIVE-26532: --- Summary: Remove logger from critical path in VectorMapJoinInnerLongOperator::processBatch Key: HIVE-26532 URL: https://issues.apache.org/jira/browse/HIVE-26532 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan Attachments: Screenshot 2022-09-12 at 10.03.43 AM.png !Screenshot 2022-09-12 at 10.03.43 AM.png|width=895,height=872! -- This message was sent by Atlassian Jira (v8.20.10#820010)
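The standard fix pattern for this class of problem can be sketched without the Hive operator (the Logger interface and batch logic below are illustrative): keep logging out of the per-row hot path, and guard any remaining per-batch message so it is only built when the level is enabled.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the fix pattern: no logging work per row in processBatch;
// at most one guarded, per-batch debug message.
public class HotPathLogging {

    interface Logger {
        boolean isDebugEnabled();
        void debug(String msg);
    }

    static long processBatch(long[] keys, Logger log) {
        if (log.isDebugEnabled()) {
            // Message built at most once per batch, and only when needed.
            log.debug("processBatch: " + keys.length + " rows");
        }
        long matches = 0;
        for (long key : keys) {
            if (key % 2 == 0) {   // stand-in for the hash-table probe
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        AtomicInteger debugCalls = new AtomicInteger();
        Logger disabled = new Logger() {
            public boolean isDebugEnabled() { return false; }
            public void debug(String msg) { debugCalls.incrementAndGet(); }
        };
        System.out.println(processBatch(new long[]{1, 2, 3, 4}, disabled));  // prints 2
        System.out.println(debugCalls.get());                                // prints 0
    }
}
```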
[jira] [Created] (HIVE-26540) Iceberg: Select queries after update/delete become expensive in reading contents
Rajesh Balamohan created HIVE-26540: --- Summary: Iceberg: Select queries after update/delete become expensive in reading contents Key: HIVE-26540 URL: https://issues.apache.org/jira/browse/HIVE-26540 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan
- Create the basic date_dim table in tpcds and store it in iceberg v2 format
- Update a few thousand records a couple of times
- Run a simple select query
{{select count ( * ) from date_dim_ice where d_qoy = 11 and d_dom=2 and d_fy_week_seq=3;}}
This takes 8-18 seconds, whereas ACID takes 1.5 seconds. The basic issue is that it reads files multiple times (i.e. both data and delete files). Lines of interest in IcebergInputFormat.java:
{noformat}
InternalRecordWrapper wrapper = new InternalRecordWrapper(readSchema.asStruct());
Evaluator filter = new Evaluator(readSchema.asStruct(), residual, caseSensitive);
return CloseableIterable.filter(iter, record -> filter.eval(wrapper.wrap((StructLike) record)));
{noformat}
{noformat}
case GENERIC:
  DeleteFilter deletes = new GenericDeleteFilter(table.io(), currentTask, table.schema(), readSchema);
  Schema requiredSchema = deletes.requiredSchema();
  return deletes.filter(openGeneric(currentTask, requiredSchema));
{noformat}
These get evaluated for each row in the data file, causing the delay. -- This message was sent by Atlassian Jira (v8.20.10#820010)
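The per-row cost described above can be modeled with a toy read path (illustrative only; positional deletes as a set of row positions, the residual as a predicate). One obvious mitigation, knowable once per task rather than per row, is to skip the evaluations entirely when the residual is trivially true or the delete set is empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.LongPredicate;

// Toy model of the iceberg read path: both the delete filter and the
// residual filter are applied per row; the fast paths below skip them
// when they cannot change the result.
public class ReadPath {

    static List<Long> read(List<Long> rows, Set<Integer> deletedPositions,
                           LongPredicate residual, boolean residualAlwaysTrue) {
        boolean hasDeletes = !deletedPositions.isEmpty();   // decided once, not per row
        List<Long> out = new ArrayList<>();
        for (int pos = 0; pos < rows.size(); pos++) {
            if (hasDeletes && deletedPositions.contains(pos)) {
                continue;                                   // row removed by a delete file
            }
            long value = rows.get(pos);
            if (!residualAlwaysTrue && !residual.test(value)) {
                continue;                                   // residual only when non-trivial
            }
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> rows = List.of(10L, 11L, 12L, 13L);
        // Position 1 deleted; residual keeps even values only.
        System.out.println(read(rows, Set.of(1), v -> v % 2 == 0, false));   // prints [10, 12]
    }
}
```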
[jira] [Created] (HIVE-26686) Iceberg: Having a lot of snapshots impacts runtime due to multiple loads of the table
Rajesh Balamohan created HIVE-26686: --- Summary: Iceberg: Having lot of snapshots impacts runtime due to multiple loads of the table Key: HIVE-26686 URL: https://issues.apache.org/jira/browse/HIVE-26686 Project: Hive Issue Type: Improvement Components: HiveServer2 Reporter: Rajesh Balamohan When a large number of snapshots is present in the manifest file, it adversely impacts the runtime of queries (e.g. 15 mts trickle feed). Having more snapshots slows down runtime in 2 additional places: 1. At the time of populating statistics, it loads the table details again (i.e. a refresh-table invocation). 2. In the hive metastore hook (HiveIcebergMetaHook::doPreAlterTable), during pre alter table. Need to check whether the entire table information, along with snapshot details, is needed for this. {noformat} at org.apache.hive.iceberg.com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeArray(JsonNodeDeserializer.java:437) at org.apache.hive.iceberg.com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:261) at org.apache.hive.iceberg.com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:68) at org.apache.hive.iceberg.com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:15) at org.apache.hive.iceberg.com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4218) at org.apache.hive.iceberg.com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3251) at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:264) at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:258) at org.apache.iceberg.BaseMetastoreTableOperations.lambda$refreshFromMetadataLocation$0(BaseMetastoreTableOperations.java:177) at org.apache.iceberg.BaseMetastoreTableOperations$$Lambda$685/0x000840e1b440.apply(Unknown Source) at 
org.apache.iceberg.BaseMetastoreTableOperations.lambda$refreshFromMetadataLocation$1(BaseMetastoreTableOperations.java:191) at org.apache.iceberg.BaseMetastoreTableOperations$$Lambda$686/0x000840e1a840.run(Unknown Source) at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:404) at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:214) at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:198) at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:190) at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:191) at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:176) at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:171) at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:153) at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:96) at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:79) at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:44) at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:116) at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:106) at org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.getBasicStatistics(HiveIcebergStorageHandler.java:309) at org.apache.hadoop.hive.ql.stats.BasicStatsTask$BasicStatsProcessor.(BasicStatsTask.java:138) at org.apache.hadoop.hive.ql.stats.BasicStatsTask.aggregateStats(BasicStatsTask.java:301) at org.apache.hadoop.hive.ql.stats.BasicStatsTask.process(BasicStatsTask.java:108) at org.apache.hadoop.hive.ql.exec.StatsTask.execute(StatsTask.java:107) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) at 
org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:360) at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:333) at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:250) at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:111) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:806) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:540) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:534) at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:232) at org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:89) at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork
[jira] [Created] (HIVE-26699) Iceberg: S3 fadvise can hurt JSON parsing significantly in DWX
Rajesh Balamohan created HIVE-26699: --- Summary: Iceberg: S3 fadvise can hurt JSON parsing significantly in DWX Key: HIVE-26699 URL: https://issues.apache.org/jira/browse/HIVE-26699 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan Hive reads the JSON metadata information (TableMetadataParser::read()) multiple times, e.g. during query compilation, AM split computation, stats computation, during commits, etc. With large JSON files (due to multiple inserts), it takes a lot longer on S3 with "fs.s3a.experimental.input.fadvise" set to "random" (e.g. on the order of 10x). To be on the safer side, it would be good to set this to "normal" mode in the configs when reading iceberg tables. -- This message was sent by Atlassian Jira (v8.20.10#820010)
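A minimal sketch of the suggested override, assuming it is placed in the deployment's hive-site.xml or core-site.xml (the key is the standard hadoop-aws one with values sequential/random/normal; where exactly to set it is deployment-specific):

```xml
<!-- Hypothetical site-config override: prefer "normal" (adaptive) fadvise
     so large sequential reads of Iceberg metadata JSON are not penalized. -->
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>normal</value>
</property>
```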
[jira] [Created] (HIVE-26714) Iceberg delete files are read twice during query processing causing delays
Rajesh Balamohan created HIVE-26714: --- Summary: Iceberg delete files are read twice during query processing causing delays Key: HIVE-26714 URL: https://issues.apache.org/jira/browse/HIVE-26714 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan Attachments: Screenshot 2022-11-08 at 9.37.17 PM.png Delete positions are read twice during query processing, causing delays at runtime. !Screenshot 2022-11-08 at 9.37.17 PM.png|width=707,height=629! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26874) Iceberg: Positional delete files are not cached
Rajesh Balamohan created HIVE-26874: --- Summary: Iceberg: Positional delete files are not cached Key: HIVE-26874 URL: https://issues.apache.org/jira/browse/HIVE-26874 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan With iceberg v2 (MOR mode), "positional delete" files are not cached, causing runtime delays. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26913) HiveVectorizedReader::parquetRecordReader should reuse footer information
Rajesh Balamohan created HIVE-26913: --- Summary: HiveVectorizedReader::parquetRecordReader should reuse footer information Key: HIVE-26913 URL: https://issues.apache.org/jira/browse/HIVE-26913 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan HiveVectorizedReader::parquetRecordReader should reuse the details of the parquet footer instead of reading it again. It reads the parquet footer here: [https://github.com/apache/hive/blob/master/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/vector/HiveVectorizedReader.java#L230-L232] It then reads the footer again when constructing the vectorized record reader: [https://github.com/apache/hive/blob/master/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/vector/HiveVectorizedReader.java#L249] [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/VectorizedParquetInputFormat.java#L50] Check the codepath of VectorizedParquetRecordReader::setupMetadataAndParquetSplit: [https://github.com/apache/hive/blob/6b0139188aba6a95808c8d1bec63a651ec9e4bdc/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java#L180] It should be possible to share "ParquetMetadata" in VectorizedParquetRecordReader. -- This message was sent by Atlassian Jira (v8.20.10#820010)
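One way to share the footer, sketched as self-contained Java (FooterInfo and memoize are illustrative stand-ins, not the actual Hive/Parquet types): wrap the expensive footer read in a memoizing supplier so that both consumers (split setup and vectorized reader construction) see the same parsed metadata, and the file is read only once.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class FooterReuseSketch {
    static final AtomicInteger footerReads = new AtomicInteger();

    // Stand-in for the parsed footer; the real type would be
    // org.apache.parquet.hadoop.metadata.ParquetMetadata.
    record FooterInfo(long rowCount) {}

    // Wrap an expensive loader so it runs at most once and its result is
    // shared by every caller.
    static <T> Supplier<T> memoize(Supplier<T> loader) {
        return new Supplier<T>() {
            private T value;
            @Override public synchronized T get() {
                if (value == null) value = loader.get();
                return value;
            }
        };
    }

    public static void main(String[] args) {
        Supplier<FooterInfo> footer = memoize(() -> {
            footerReads.incrementAndGet();      // simulates the actual file read
            return new FooterInfo(1000L);
        });
        long rows = footer.get().rowCount();    // first consumer triggers the read
        long again = footer.get().rowCount();   // second consumer reuses it
        System.out.println(rows + " " + again + " " + footerReads.get()); // 1000 1000 1
    }
}
```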
[jira] [Created] (HIVE-26917) Upgrade parquet to 1.12.3
Rajesh Balamohan created HIVE-26917: --- Summary: Upgrade parquet to 1.12.3 Key: HIVE-26917 URL: https://issues.apache.org/jira/browse/HIVE-26917 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26927) Iceberg: Add support for set_current_snapshotid
Rajesh Balamohan created HIVE-26927: --- Summary: Iceberg: Add support for set_current_snapshotid Key: HIVE-26927 URL: https://issues.apache.org/jira/browse/HIVE-26927 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan Hive currently supports a "rollback" feature. Once rolled back, it is not possible to move from an older snapshot to a newer snapshot; it ends up throwing an {color:#0747a6}"org.apache.iceberg.exceptions.ValidationException: Cannot roll back to snapshot, not an ancestor of the current state:" {color}error. It would be good to support a "set_current_snapshot" function to move to different snapshot ids. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26928) LlapIoImpl::getParquetFooterBuffersFromCache throws exception when metadata cache is disabled
Rajesh Balamohan created HIVE-26928: --- Summary: LlapIoImpl::getParquetFooterBuffersFromCache throws exception when metadata cache is disabled Key: HIVE-26928 URL: https://issues.apache.org/jira/browse/HIVE-26928 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan When metadata / LLAP cache is disabled, "iceberg + parquet" throws the following error. It should check for "metadatacache" correctly or fix it in LlapIoImpl. {noformat} Caused by: java.lang.NullPointerException: Metadata cache must not be null at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:897) at org.apache.hadoop.hive.llap.io.api.impl.LlapIoImpl.getParquetFooterBuffersFromCache(LlapIoImpl.java:467) at org.apache.iceberg.mr.hive.vector.HiveVectorizedReader.parquetRecordReader(HiveVectorizedReader.java:227) at org.apache.iceberg.mr.hive.vector.HiveVectorizedReader.reader(HiveVectorizedReader.java:162) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.apache.iceberg.common.DynMethods$UnboundMethod.invokeChecked(DynMethods.java:65) at org.apache.iceberg.common.DynMethods$UnboundMethod.invoke(DynMethods.java:77) at org.apache.iceberg.common.DynMethods$StaticMethod.invoke(DynMethods.java:196) at org.apache.iceberg.mr.mapreduce.IcebergInputFormat$IcebergRecordReader.openVectorized(IcebergInputFormat.java:331) at org.apache.iceberg.mr.mapreduce.IcebergInputFormat$IcebergRecordReader.open(IcebergInputFormat.java:377) at org.apache.iceberg.mr.mapreduce.IcebergInputFormat$IcebergRecordReader.nextTask(IcebergInputFormat.java:270) at org.apache.iceberg.mr.mapreduce.IcebergInputFormat$IcebergRecordReader.initialize(IcebergInputFormat.java:266) 
at org.apache.iceberg.mr.mapred.AbstractMapredIcebergRecordReader.<init>(AbstractMapredIcebergRecordReader.java:40) at org.apache.iceberg.mr.hive.vector.HiveIcebergVectorizedRecordReader.<init>(HiveIcebergVectorizedRecordReader.java:41) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26944) FileSinkOperator shouldn't check for compactiontable for every row being processed
Rajesh Balamohan created HIVE-26944: --- Summary: FileSinkOperator shouldn't check for compactiontable for every row being processed Key: HIVE-26944 URL: https://issues.apache.org/jira/browse/HIVE-26944 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan Attachments: Screenshot 2023-01-16 at 10.32.24 AM.png -- This message was sent by Atlassian Jira (v8.20.10#820010)
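The fix implied by the summary above can be sketched in self-contained Java (an illustrative class, not the actual FileSinkOperator): evaluate the "is this a compaction table" property once at operator initialization, and have the per-row path read only a cached boolean.

```java
public class FileSinkSketch {
    // Counts how many times the table property is actually inspected.
    static int propertyLookups = 0;

    // Stand-in for the per-row check being hoisted.
    static boolean isCompactionTable(String tableProps) {
        propertyLookups++;
        return tableProps.contains("compaction");
    }

    private final boolean compactionTable; // decided once, at init time
    int rowsWritten = 0;

    FileSinkSketch(String tableProps) {
        this.compactionTable = isCompactionTable(tableProps);
    }

    void process(int row) {
        // Per-row path reads the cached boolean; no property lookup per row.
        if (!compactionTable) {
            rowsWritten++;
        }
    }

    public static void main(String[] args) {
        FileSinkSketch sink = new FileSinkSketch("plain-table");
        for (int i = 0; i < 1_000; i++) sink.process(i);
        System.out.println(sink.rowsWritten + " " + propertyLookups); // 1000 1
    }
}
```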
[jira] [Created] (HIVE-26950) (CTLT) Create external table like V2 table is not preserving table properties
Rajesh Balamohan created HIVE-26950: --- Summary: (CTLT) Create external table like V2 table is not preserving table properties Key: HIVE-26950 URL: https://issues.apache.org/jira/browse/HIVE-26950 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan # Create an external iceberg V2 table, e.g. t1 # "create external table t2 like t1" <--- This ends up creating a V1 table: "format-version=2" is not retained, and "'format'='iceberg/parquet'" is also not retained. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26951) Setting details in PositionDeleteInfo takes up lot of CPU cycles
Rajesh Balamohan created HIVE-26951: --- Summary: Setting details in PositionDeleteInfo takes up lot of CPU cycles Key: HIVE-26951 URL: https://issues.apache.org/jira/browse/HIVE-26951 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan Attachments: Screenshot 2023-01-17 at 11.29.29 AM.png, Screenshot 2023-01-17 at 11.29.36 AM.png !Screenshot 2023-01-17 at 11.29.29 AM.png|width=898,height=532! !Screenshot 2023-01-17 at 11.29.36 AM.png|width=1000,height=591! This was observed with merge-into statements. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26974) CTL from iceberg table should copy partition fields correctly
Rajesh Balamohan created HIVE-26974: --- Summary: CTL from iceberg table should copy partition fields correctly Key: HIVE-26974 URL: https://issues.apache.org/jira/browse/HIVE-26974 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan # Create an iceberg table. Ensure it has a partition field. # Run "create external table like x" # The table created in #2 misses out on creating the relevant partition field. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26975) MERGE: Wrong reducer estimate causing smaller files to be created
Rajesh Balamohan created HIVE-26975: --- Summary: MERGE: Wrong reducer estimate causing smaller files to be created Key: HIVE-26975 URL: https://issues.apache.org/jira/browse/HIVE-26975 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan * "Merge into" estimates the wrong number of reducers, causing a large number of small files to be created (e.g. 400+ files of 3+ MB each). * This can be reproduced by writing data into the "store_sales" table in iceberg format from another source table (using merge-into). ** e.g. running the following a few times creates the wrong number of reduce tasks, causing a lot of small files to be created in the iceberg table. {noformat} MERGE INTO store_sales_t t using ssv s ON ( t.ss_item_sk = s.ss_item_sk AND t.ss_customer_sk = s.ss_customer_sk AND t.ss_sold_date_sk = "2451181" AND ( ( Floor(( s.ss_item_sk ) / 1000) * 1000 ) BETWEEN 1000 AND 2000 ) AND s.ss_ext_discount_amt < 0.0 ) WHEN matched AND t.ss_ext_discount_amt IS NULL THEN UPDATE SET ss_ext_discount_amt = 0.0 WHEN NOT matched THEN INSERT ( ss_sold_time_sk, ss_item_sk, ss_customer_sk, ss_cdemo_sk, ss_hdemo_sk, ss_addr_sk, ss_store_sk, ss_promo_sk, ss_ticket_number, ss_quantity, ss_wholesale_cost, ss_list_price, ss_sales_price, ss_ext_discount_amt, ss_ext_sales_price, ss_ext_wholesale_cost, ss_ext_list_price, ss_ext_tax, ss_coupon_amt, ss_net_paid, ss_net_paid_inc_tax, ss_net_profit, ss_sold_date_sk ) VALUES ( s.ss_sold_time_sk, s.ss_item_sk, s.ss_customer_sk, s.ss_cdemo_sk, s.ss_hdemo_sk, s.ss_addr_sk, s.ss_store_sk, s.ss_promo_sk, s.ss_ticket_number, s.ss_quantity, s.ss_wholesale_cost, s.ss_list_price, s.ss_sales_price, s.ss_ext_discount_amt, s.ss_ext_sales_price, s.ss_ext_wholesale_cost, s.ss_ext_list_price, s.ss_ext_tax, s.ss_coupon_amt, s.ss_net_paid, s.ss_net_paid_inc_tax, s.ss_net_profit, "2451181") {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26978) Stale "Runtime stats" causes poor query planning
Rajesh Balamohan created HIVE-26978: --- Summary: Stale "Runtime stats" causes poor query planning Key: HIVE-26978 URL: https://issues.apache.org/jira/browse/HIVE-26978 Project: Hive Issue Type: Improvement Components: HiveServer2 Reporter: Rajesh Balamohan Attachments: Screenshot 2023-01-24 at 10.23.16 AM.png * Runtime stats can be stored in hiveserver or in the metastore via "hive.query.reexecution.stats.persist.scope". * Even though the table is dropped and recreated, it ends up showing the old stats via "RUNTIME" stats. Here is an example (note that the table is empty, but gets dataSize and numRows from RUNTIME stats). * This causes a suboptimal plan for "MERGE INTO" queries by creating a CUSTOM_EDGE instead of a broadcast edge. !Screenshot 2023-01-24 at 10.23.16 AM.png|width=2053,height=753! -- This message was sent by Atlassian Jira (v8.20.10#820010)
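One possible shape of a fix, as a self-contained Java sketch (TableKey/Stats are hypothetical; the real runtime-stats store is keyed differently): include the table's creation time in the cache key, so stats recorded against a dropped table can never match its recreated namesake.

```java
import java.util.HashMap;
import java.util.Map;

public class RuntimeStatsCacheSketch {
    // Hypothetical cache key: table name plus creation time.
    record TableKey(String name, long createTime) {}
    record Stats(long numRows, long dataSize) {}

    static final Map<TableKey, Stats> runtimeStats = new HashMap<>();

    public static void main(String[] args) {
        // Stats recorded against the original table instance.
        TableKey v1 = new TableKey("db.t", 1000L);
        runtimeStats.put(v1, new Stats(2_000_000L, 9_999_999L));

        // After drop + recreate the createTime differs, so the stale entry
        // no longer matches and the planner falls back to fresh stats.
        TableKey v2 = new TableKey("db.t", 2000L);
        System.out.println(runtimeStats.containsKey(v1) + " "
            + runtimeStats.containsKey(v2)); // true false
    }
}
```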
[jira] [Created] (HIVE-26997) Iceberg: Vectorization gets disabled at runtime in merge-into statements
Rajesh Balamohan created HIVE-26997: --- Summary: Iceberg: Vectorization gets disabled at runtime in merge-into statements Key: HIVE-26997 URL: https://issues.apache.org/jira/browse/HIVE-26997 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan Attachments: explain_merge_into.txt *Query:* Think of the "ssv" table as a table containing trickle-feed data in the following query. "store_sales_delete_1" is the destination table. {noformat} MERGE INTO tpcds_1000_iceberg_mor_v4.store_sales_delete_1 t USING tpcds_1000_update.ssv s ON (t.ss_item_sk = s.ss_item_sk AND t.ss_customer_sk=s.ss_customer_sk AND t.ss_sold_date_sk = "2451181" AND ((Floor((s.ss_item_sk) / 1000) * 1000) BETWEEN 1000 AND 2000) AND s.ss_ext_discount_amt < 0.0) WHEN matched AND t.ss_ext_discount_amt IS NULL THEN UPDATE SET ss_ext_discount_amt = 0.0 WHEN NOT matched THEN INSERT (ss_sold_time_sk, ss_item_sk, ss_customer_sk, ss_cdemo_sk, ss_hdemo_sk, ss_addr_sk, ss_store_sk, ss_promo_sk, ss_ticket_number, ss_quantity, ss_wholesale_cost, ss_list_price, ss_sales_price, ss_ext_discount_amt, ss_ext_sales_price, ss_ext_wholesale_cost, ss_ext_list_price, ss_ext_tax, ss_coupon_amt, ss_net_paid, ss_net_paid_inc_tax, ss_net_profit, ss_sold_date_sk) VALUES (s.ss_sold_time_sk, s.ss_item_sk, s.ss_customer_sk, s.ss_cdemo_sk, s.ss_hdemo_sk, s.ss_addr_sk, s.ss_store_sk, s.ss_promo_sk, s.ss_ticket_number, s.ss_quantity, s.ss_wholesale_cost, s.ss_list_price, s.ss_sales_price, s.ss_ext_discount_amt, s.ss_ext_sales_price, s.ss_ext_wholesale_cost, s.ss_ext_list_price, s.ss_ext_tax, s.ss_coupon_amt, s.ss_net_paid, s.ss_net_paid_inc_tax, s.ss_net_profit, "2451181") {noformat} *Issue:* # The Map phase is not getting vectorized due to the "PARTITION__SPEC__ID" column {noformat} Map notVectorizedReason: Select expression for SELECT operator: Virtual column PARTITION__SPEC__ID is not supported {noformat} 2. The "Reducer 2" stage isn't vectorized. 
{noformat} Reduce notVectorizedReason: exception: java.lang.RuntimeException: Full Outer Small Table Key Mapping duplicate column 0 in ordered column map {0=(value column: 30, type info: int), 1=(value column: 31, type info: int)} when adding value column 53, type into int stack trace: org.apache.hadoop.hive.ql.exec.vector.VectorColumnOrderedMap.add(VectorColumnOrderedMap.java:102), org.apache.hadoop.hive.ql.exec.vector.VectorColumnSourceMapping.add(VectorColumnSourceMapping.java:41), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.canSpecializeMapJoin(Vectorizer.java:3865), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.validateAndVectorizeOperator(Vectorizer.java:5246), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.doProcessChild(Vectorizer.java:988), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.doProcessChildren(Vectorizer.java:874), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.validateAndVectorizeOperatorTree(Vectorizer.java:841), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.access$2400(Vectorizer.java:251), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.validateAndVectorizeReduceOperators(Vectorizer.java:2298), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.validateAndVectorizeReduceOperators(Vectorizer.java:2246), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.validateAndVectorizeReduceWork(Vectorizer.java:2224), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.convertReduceWork(Vectorizer.java:2206), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.dispatch(Vectorizer.java:1038), org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111), org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180), ... {noformat} I have attached the explain plan for this, which has details on this. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27003) Iceberg: Vectorization missed out for update/delete due to virtual columns
Rajesh Balamohan created HIVE-27003: --- Summary: Iceberg: Vectorization missed out for update/delete due to virtual columns Key: HIVE-27003 URL: https://issues.apache.org/jira/browse/HIVE-27003 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan Attachments: delete_iceberg_vect.txt, update_iceberg_vect.txt Vectorization is missed out during table scan due to the addition of virtual columns during scans. I will attach the plan details here with. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27005) Iceberg: Col stats are not used in queries
Rajesh Balamohan created HIVE-27005: --- Summary: Iceberg: Col stats are not used in queries Key: HIVE-27005 URL: https://issues.apache.org/jira/browse/HIVE-27005 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan Attachments: col_stats.txt 1. Though insert queries compute col stats during runtime, the stats are not persisted in HMS during the final call. 2. Due to #1, col stats are not available during runtime for hive queries. This includes col stats, NDV etc. So unless users explicitly run "analyze table" statements, queries can have suboptimal plans. E.g. see the attached col_stats.txt (note that there are no col stats being used). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27010) Reduce compilation time
Rajesh Balamohan created HIVE-27010: --- Summary: Reduce compilation time Key: HIVE-27010 URL: https://issues.apache.org/jira/browse/HIVE-27010 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan Context: Post HIVE-24645, compilation time for queries has increased. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27013) Provide an option to enable iceberg manifest caching via table properties
Rajesh Balamohan created HIVE-27013: --- Summary: Provide an option to enable iceberg manifest caching via table properties Key: HIVE-27013 URL: https://issues.apache.org/jira/browse/HIVE-27013 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan I tried the following, thinking that it would work with iceberg manifest caching, but it didn't: {noformat} alter table store_sales set tblproperties('io.manifest.cache-enabled'='true'); {noformat} Creating this ticket as a placeholder to fix the same. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27014) Iceberg: getSplits/planTasks should filter out relevant folders instead of scanning entire table
Rajesh Balamohan created HIVE-27014: --- Summary: Iceberg: getSplits/planTasks should filter out relevant folders instead of scanning entire table Key: HIVE-27014 URL: https://issues.apache.org/jira/browse/HIVE-27014 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan With dynamic partition pruning, only the relevant folders in fact tables are scanned. In tez, DynamicPartitionPruner sets the relevant filters. In iceberg, these filters are applied only after "Table::planTasks()" is invoked. This forces the entire table metadata to be scanned, with the unwanted partitions thrown away afterwards, which makes split computation expensive (e.g. for store_sales, it has to look at all 1800+ partitions and throw away the unwanted ones). For short-running queries, split computation takes 3-5+ seconds. Creating this ticket as a placeholder to make use of the relevant filters from DPP. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27049) Iceberg: Provide current snapshot version in show-create-table
Rajesh Balamohan created HIVE-27049: --- Summary: Iceberg: Provide current snapshot version in show-create-table Key: HIVE-27049 URL: https://issues.apache.org/jira/browse/HIVE-27049 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan It would be helpful to show the "current snapshot" id in the "show create table" statement, to make debugging easier. Otherwise, the user has to explicitly query the metadata or read the JSON file to get this info. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27050) Iceberg: MOR: Restrict reducer extrapolation to contain number of small files being created
Rajesh Balamohan created HIVE-27050: --- Summary: Iceberg: MOR: Restrict reducer extrapolation to contain number of small files being created Key: HIVE-27050 URL: https://issues.apache.org/jira/browse/HIVE-27050 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan Scenario: # Create a simple table in iceberg (MOR mode), e.g. store_sales_delete_1 # Insert some data into it. # Run an update statement as follows ## "update store_sales_delete_1 set ss_sold_time_sk=699060 where ss_sold_time_sk=69906" Hive estimates the number of reducers as "1". But due to "hive.tez.max.partition.factor", which defaults to "2.0", it doubles the number of reducers. To put it in perspective, it creates very small positional delete files spread across different reducers. This causes problems during reading, as all of these files have to be opened. # When iceberg MOR tables are involved in updates/deletes/merges, disable "hive.tez.max.partition.factor", or set it to "1.0" irrespective of the user setting. # Have explicit logs for easier debugging; the user shouldn't be confused about why the setting is not taking effect. -- This message was sent by Atlassian Jira (v8.20.10#820010)
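The arithmetic described above, as a tiny self-contained Java sketch (the method names are illustrative, not Hive APIs): with an estimate of 1 reducer and the default factor of 2.0, extrapolation yields 2 reducers; clamping the factor to 1.0 on MOR write paths keeps it at 1, i.e. a single positional delete file instead of several tiny ones.

```java
public class ReducerFactorSketch {
    // Mirrors the described behaviour: the base estimate is scaled up by
    // hive.tez.max.partition.factor.
    static int extrapolatedReducers(int estimated, double maxPartitionFactor) {
        return (int) Math.ceil(estimated * maxPartitionFactor);
    }

    // Proposed behaviour for MOR write paths: clamp the factor to 1.0 so a
    // single-reducer estimate stays a single reducer.
    static int morWriteReducers(int estimated, double maxPartitionFactor) {
        return extrapolatedReducers(estimated, Math.min(maxPartitionFactor, 1.0));
    }

    public static void main(String[] args) {
        System.out.println(extrapolatedReducers(1, 2.0)); // default today: 2
        System.out.println(morWriteReducers(1, 2.0));     // clamped: 1
    }
}
```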
[jira] [Created] (HIVE-27084) Iceberg: Stats are not populated correctly during query compilation
Rajesh Balamohan created HIVE-27084: --- Summary: Iceberg: Stats are not populated correctly during query compilation Key: HIVE-27084 URL: https://issues.apache.org/jira/browse/HIVE-27084 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan - Table stats are not properly used/computed during the query compilation phase. - Here is an example: compare the scan data-size estimates in the two plans below; the query with the extra filter gets a larger estimate than the regular query. This is just an example; real-world queries can get bad query plans because of this. {{303658262936 with the extra filter, vs 10470974584 without}} {noformat} explain select count(*) from store_sales where ss_sold_date_sk=2450822 and ss_wholesale_cost > 0.0 Explain STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez DagId: hive_20230216065808_80d68e3f-3a6b-422b-9265-50bc707ae3c6:48 Edges: Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE) DagName: hive_20230216065808_80d68e3f-3a6b-422b-9265-50bc707ae3c6:48 Vertices: Map 1 Map Operator Tree: TableScan alias: store_sales filterExpr: ((ss_sold_date_sk = 2450822) and (ss_wholesale_cost > 0)) (type: boolean) Statistics: Num rows: 2755519629 Data size: 303658262936 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((ss_sold_date_sk = 2450822) and (ss_wholesale_cost > 0)) (type: boolean) Statistics: Num rows: 5 Data size: 550 Basic stats: COMPLETE Column stats: NONE Select Operator Statistics: Num rows: 5 Data size: 550 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: count() minReductionHashAggr: 0.99 mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator null sort order: sort order: Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: bigint) Execution mode: vectorized, llap LLAP IO: all inputs (cache only) Reducer 2 Execution mode: 
vectorized, llap Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink 58 rows selected (0.73 seconds) explain select count(*) from store_sales where ss_sold_date_sk=2450822 INFO : Starting task [Stage-3:EXPLAIN] in serial mode INFO : Completed executing command(queryId=hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7); Time taken: 0.061 seconds INFO : OK Explain STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez DagId: hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7:49 Edges: Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE) DagName: hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7:49 Vertices: Map 1 Map Operator Tree: TableScan alias: store_sales filterExpr: (ss_sold_date_sk = 2450822) (type: boolean) Statistics: Num rows: 2755519629 Data size: 10470974584 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (ss_sold_date_sk = 2450822) (type: boolean) Statistics: Num rows: 5 Data size: 18 Basic stats: COMPLETE Column stats: NONE Select Operator Statistics: Num rows: 5 Data size: 18 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: count() minReductionHashAggr: 0.99
[jira] [Created] (HIVE-27099) Iceberg: select count(*) from table queries all data
Rajesh Balamohan created HIVE-27099: --- Summary: Iceberg: select count(*) from table queries all data Key: HIVE-27099 URL: https://issues.apache.org/jira/browse/HIVE-27099 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan select count(*) is scanning all data. Though complete basic stats are available, it launched a Tez job that wasn't needed. The second issue is that it ended up scanning the ENTIRE 148 GB dataset, which is not required at all; it should have got the counts from the parquet files themselves. Ideally, the total record count would come from the manifests themselves. Data is stored in parquet format in external tables. This may be broken for parquet, as for ORC it is able to read less data (footer info). 1. Consider fixing count( * ) for parquet. 2. Check if it is possible to read stats from iceberg manifests after #1. {noformat} explain select count(*) from store_sales; Explain STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez DagId: hive_20230223031934_2abeb3b9-8c18-4ff7-a8f9-df7368010189:5 Edges: Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE) DagName: hive_20230223031934_2abeb3b9-8c18-4ff7-a8f9-df7368010189:5 Vertices: Map 1 Map Operator Tree: TableScan alias: store_sales Statistics: Num rows: 2879966589 Data size: 195666988943 Basic stats: COMPLETE Column stats: COMPLETE Select Operator Statistics: Num rows: 2879966589 Data size: 195666988943 Basic stats: COMPLETE Column stats: COMPLETE Group By Operator aggregations: count() minReductionHashAggr: 0.5 mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator null sort order: sort order: Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col0 (type: bigint) Execution mode: vectorized Reducer 2 Execution mode: vectorized Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) mode: mergepartial outputColumnNames: _col0 
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink 53 rows selected (1.454 seconds) 0: jdbc:hive2://ve0:218> select count(*) from store_sales; INFO : Query ID = hive_20230223031940_9ff5d61d-1fe2-4476-a561-7820e4a3a5f8 INFO : Total jobs = 1 INFO : Launching Job 1 out of 1 INFO : Starting task [Stage-1:MAPRED] in serial mode INFO : Subscribed to counters: [] for queryId: hive_20230223031940_9ff5d61d-1fe2-4476-a561-7820e4a3a5f8 INFO : Session is already open INFO : Dag name: select count(*) from store_sales (Stage-1) INFO : Status: Running (Executing on YARN cluster with App id application_1676286357243_0061) -- VERTICES MODESTATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED -- Map 1 .. container SUCCEEDED76776700 0 0 Reducer 2 .. container SUCCEEDED 1 100 0 0 -- VERTICES: 02/02 [==>>] 100% ELAPSED TIME: 54.94 s -- INFO : Status: DAG finished successfully in 54.85 seconds INFO : INFO : Query Execution Summary INFO : -- INFO : OPERATIONDURATION INFO : -- INFO : Compile Query
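The "ideal situation" above (answering count(*) from the manifests) can be sketched as follows. This is a hypothetical illustration, not Hive's or Iceberg's actual API: Iceberg manifest entries carry a per-file record count, so the total can be summed without touching the data files, falling back to a scan only when row-level delete files make the metadata alone insufficient.

```python
# Sketch (assumed field names, not the Iceberg library API): derive
# count(*) from per-file record counts stored in manifest metadata.

def count_from_manifests(manifest_entries):
    """Sum per-file record counts; return None (fall back to a scan)
    if any file has row-level deletes attached."""
    total = 0
    for entry in manifest_entries:
        if entry.get("delete_file", False):
            return None  # metadata alone cannot answer the query
        total += entry["record_count"]
    return total

# Toy manifest listing; the totals add up to the plan's 2879966589 rows.
entries = [
    {"path": "part-0.parquet", "record_count": 1_000},
    {"path": "part-1.parquet", "record_count": 2_879_965_589},
]
total = count_from_manifests(entries)
```

With such a path, the whole query would be answered from a handful of small metadata files instead of a 767-task Tez vertex.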
[jira] [Created] (HIVE-27119) Iceberg: Delete from table generates lot of files
Rajesh Balamohan created HIVE-27119: --- Summary: Iceberg: Delete from table generates lot of files Key: HIVE-27119 URL: https://issues.apache.org/jira/browse/HIVE-27119 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan

"delete" generates a lot of files due to the way data is distributed to the reducers: the number of files per partition is driven by the number of reduce tasks. One workaround could be to explicitly control the number of reducers; creating this ticket to have a long-term fix.

{noformat}
explain delete from store_Sales where ss_customer_sk % 10 = 0;
INFO : Compiling command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b): explain delete from store_Sales where ss_customer_sk % 10 = 0
INFO : No Stats for tpcds_1000_iceberg_mor_v4@store_sales, Columns: ss_sold_time_sk, ss_cdemo_sk, ss_promo_sk, ss_ext_discount_amt, ss_ext_sales_price, ss_net_profit, ss_addr_sk, ss_ticket_number, ss_wholesale_cost, ss_item_sk, ss_ext_list_price, ss_sold_date_sk, ss_store_sk, ss_coupon_amt, ss_quantity, ss_list_price, ss_sales_price, ss_customer_sk, ss_ext_wholesale_cost, ss_net_paid, ss_ext_tax, ss_hdemo_sk, ss_net_paid_inc_tax
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b); Time taken: 0.704 seconds
INFO : Executing command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b): explain delete from store_Sales where ss_customer_sk % 10 = 0
INFO : Starting task [Stage-4:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b); Time taken: 0.005 seconds
INFO : OK

Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2
  Stage-3 depends on stages: Stage-0

STAGE
PLANS: Stage: Stage-1 Tez DagId: hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b:377 Edges: Reducer 2 <- Map 1 (SIMPLE_EDGE) DagName: hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b:377 Vertices: Map 1 Map Operator Tree: TableScan alias: store_sales filterExpr: ((ss_customer_sk % 10) = 0) (type: boolean) Statistics: Num rows: 2755519629 Data size: 3643899155232 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((ss_customer_sk % 10) = 0) (type: boolean) Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: PARTITION__SPEC__ID (type: int), PARTITION__HASH (type: bigint), FILE__PATH (type: string), ROW__POSITION (type: bigint), ss_sold_time_sk (type: int), ss_item_sk (type: int), ss_customer_sk (type: int), ss_cdemo_sk (type: int), ss_hdemo_sk (type: int), ss_addr_sk (type: int), ss_store_sk (type: int), ss_promo_sk (type: int), ss_ticket_number (type: bigint), ss_quantity (type: int), ss_wholesale_cost (type: decimal(7,2)), ss_list_price (type: decimal(7,2)), ss_sales_price (type: decimal(7,2)), ss_ext_discount_amt (type: decimal(7,2)), ss_ext_sales_price (type: decimal(7,2)), ss_ext_wholesale_cost (type: decimal(7,2)), ss_ext_list_price (type: decimal(7,2)), ss_ext_tax (type: decimal(7,2)), ss_coupon_amt (type: decimal(7,2)), ss_net_paid (type: decimal(7,2)), ss_net_paid_inc_tax (type: decimal(7,2)), ss_net_profit (type: decimal(7,2)), ss_sold_date_sk (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26 Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: int), _col1 (type: bigint), _col2 (type: string), _col3 (type: bigint) null sort order: sort order: Statistics: Num rows: 
1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE value expressions: _col4 (type: int), _col5 (type: int), _col6 (type: int), _col7 (type: int), _col8 (type: int), _col9 (type: int), _col10 (type: int), _col11 (type: int), _col12 (type: bigint), _col13 (type: int), _col14 (type: decimal(7,2)), _col15 (type: decimal(7,2)), _col16 (type: decimal(7,2)), _col17 (type
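The file-count blow-up described above can be simulated. This is a toy model, not Hive's writer code: when delete rows are hash-distributed across R reducers, each reducer that receives rows for a table partition opens its own output file for that partition, so the file count scales with partitions × reducers rather than with partitions.

```python
# Toy simulation: count distinct (reducer, partition) writer pairs,
# each of which produces at least one delete file.

def files_written(num_partitions, num_reducers, rows_per_partition):
    writers = set()
    for p in range(num_partitions):
        for row in range(rows_per_partition):
            reducer = hash((p, row)) % num_reducers  # hash distribution
            writers.add((reducer, p))                # this pair opens a file
    return len(writers)

# One reducer: one file per partition. Many reducers: ~partitions * reducers.
few = files_written(10, 1, 100)
many = files_written(10, 100, 5000)
```

Clustering delete rows by partition (or capping the reducer count) collapses `many` back toward `few`, which is the long-term fix the ticket asks for.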
[jira] [Created] (HIVE-27144) Alter table partitions need not DBNotificationListener for external tables
Rajesh Balamohan created HIVE-27144: --- Summary: Alter table partitions need not DBNotificationListener for external tables Key: HIVE-27144 URL: https://issues.apache.org/jira/browse/HIVE-27144 Project: Hive Issue Type: Improvement Components: HiveServer2 Reporter: Rajesh Balamohan

DBNotificationListener may not be needed for external tables. Even for "analyze table blah compute statistics for columns" on external partitioned tables, it invokes DBNotificationListener for every partition.

{noformat}
at org.datanucleus.store.query.Query.execute(Query.java:1726)
at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374)
at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:216)
at org.apache.hadoop.hive.metastore.ObjectStore.addNotificationEvent(ObjectStore.java:11774)
at jdk.internal.reflect.GeneratedMethodAccessor135.invoke(Unknown Source)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.18/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(java.base@11.0.18/Method.java:566)
at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:97)
at com.sun.proxy.$Proxy33.addNotificationEvent(Unknown Source)
at org.apache.hive.hcatalog.listener.DbNotificationListener.process(DbNotificationListener.java:1308)
at org.apache.hive.hcatalog.listener.DbNotificationListener.onAlterPartition(DbNotificationListener.java:458)
at org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier$14.notify(MetaStoreListenerNotifier.java:161)
at org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:328)
at org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:390)
at org.apache.hadoop.hive.metastore.HiveAlterHandler.alterPartitions(HiveAlterHandler.java:863)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.alter_partitions_with_environment_context(HiveMetaStore.java:6253)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.alter_partitions_req(HiveMetaStore.java:6201)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@11.0.18/Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@11.0.18/NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.18/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(java.base@11.0.18/Method.java:566)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:160)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:121)
at com.sun.proxy.$Proxy34.alter_partitions_req(Unknown Source)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$alter_partitions_req.getResult(ThriftHiveMetastore.java:21532)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$alter_partitions_req.getResult(ThriftHiveMetastore.java:21511)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:652)
at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:647)
at java.security.AccessController.doPrivileged(java.base@11.0.18/Native Method)
{noformat}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
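The proposed change amounts to gating listener notification on the table type. A minimal sketch, with hypothetical shapes rather than the metastore's actual API: per-partition ALTER events on EXTERNAL tables skip the DBNotificationListener round trips shown in the stack above.

```python
# Sketch (hypothetical API, not HMS code): skip per-partition listener
# notifications for external tables.

def notify_alter_partitions(table, partitions, listeners):
    """Returns the number of listener invocations fired."""
    fired = 0
    if table.get("tableType") == "EXTERNAL_TABLE":
        # assumption: no replication/event consumer depends on these events
        return fired
    for _ in partitions:
        for listener in listeners:
            listener(table)
            fired += 1
    return fired

managed = {"tableType": "MANAGED_TABLE"}
external = {"tableType": "EXTERNAL_TABLE"}
noop = lambda t: None
```

For an analyze over thousands of external partitions, this turns thousands of NOTIFICATION_LOG inserts into zero.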
[jira] [Created] (HIVE-27159) Filters are not pushed down for decimal format in Parquet
Rajesh Balamohan created HIVE-27159: --- Summary: Filters are not pushed down for decimal format in Parquet Key: HIVE-27159 URL: https://issues.apache.org/jira/browse/HIVE-27159 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan

Decimal filters are not created and pushed down to Parquet readers. This causes latency and unwanted row processing in query execution: it throws an exception at runtime and ends up processing far more rows than needed. E.g. Q13.

{noformat}
Parquet: (Map 1)
INFO : Task Execution Summary
INFO : --------------------------------------------------------------------------------------------
INFO : VERTICES     DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
INFO : --------------------------------------------------------------------------------------------
INFO : Map 1            31254.00             0            0    549,181,950             133
INFO : Map 3                0.00             0            0         73,049             365
INFO : Map 4             2027.00             0            0      6,000,000       1,689,919
INFO : Map 5                0.00             0            0          7,200           1,440
INFO : Map 6              517.00             0            0      1,920,800         493,920
INFO : Map 7                0.00             0            0          1,002           1,002
INFO : Reducer 2        18716.00             0            0            133               0
INFO : --------------------------------------------------------------------------------------------

ORC:
INFO : Task Execution Summary
INFO : --------------------------------------------------------------------------------------------
INFO : VERTICES     DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
INFO : --------------------------------------------------------------------------------------------
INFO : Map 1             6556.00             0            0    267,146,063             152
INFO : Map 3                0.00             0            0         10,000             365
INFO : Map 4             2014.00             0            0      6,000,000       1,689,919
INFO : Map 5                0.00             0            0          7,200           1,440
INFO : Map 6              504.00             0            0      1,920,800         493,920
INFO : Reducer 2         3159.00             0            0            152               0
INFO : --------------------------------------------------------------------------------------------
{noformat}

{noformat}
Map 1
    Map Operator Tree:
        TableScan
          alias: store_sales
          filterExpr: (ss_hdemo_sk is not null and ss_addr_sk is not null and ss_cdemo_sk is not null and ss_store_sk is not null and ((ss_sales_price >= 100) or (ss_sales_price <= 150) or (ss_sales_price >= 50) or (ss_sales_price <= 100) or (ss_sales_price >= 150) or (ss_sales_price <= 200)) and ((ss_net_profit >= 100) or (ss_net_profit <= 200) or (ss_net_profit >= 150) or (ss_net_profit <= 300) or (ss_net_profit >= 50) or (ss_net_profit <= 250))) (type: boolean)
          probeDecodeDetails: cacheKey:HASH_MAP_MAPJOIN_112_container, bigKeyColName:ss_hdemo_sk, smallTablePos:1, keyRatio:5.042575832290721E-6
          Statistics: Num rows: 2750380056 Data size: 1321831086472 Basic stats:
COMPLETE Column stats: COMPLETE Filter Operator predicate: (ss_hdemo_sk is not null and ss_addr_sk is not null and ss_cdemo_sk is not null and ss_store_sk is not null and ((ss_sales_price >= 100) or (ss_sales_price <= 150) or (ss_sales_price >= 50) or (ss_sales_price <= 100) or (ss_sales_price >= 150) or (ss_sales_price <= 200)) and ((ss_net_profit >= 100) or (ss_net_profit <= 200) or (ss_net_profit >= 150) or (ss_net_profit <= 300) or (ss_net_profit >= 50) or (ss_net_profit <= 250))) (type: boolean) Statistics: Num rows: 2500252205 Data size: 1201619783884 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: ss_cdemo_sk (type: bigint), ss_hdemo_sk (type: bigint), ss_addr_sk (type: bigint), ss_store_sk (type: bigint), ss_quantity (type: int), ss_ext_sales_price (type: decimal(7,2)), ss_ext_wholesale_cost (type: decimal(7,2)), ss_sold_date_sk (type: bigint), ss_net_profit BETWEEN 100 AND 200 (type: boolean), ss_net_profit BETWEEN 150 AND 300 (type: boolean), ss_net_profit BETWEEN 50 AND 250 (type: boolean), ss_sales_price BETWEEN 100 AND 150 (type: boolean), ss_sales_price BETWEEN 50 AND 100 (type: boolean), ss_sales_price BETWEEN 150 AND 200 (type: boolean)
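The gap can be pictured with a toy leaf-filter builder; the names below are illustrative, not Hive's actual FilterPredicateLeafBuilder code. If the builder yields no predicate for DECIMAL columns, nothing is pushed to the Parquet reader and every row group is read; mapping the decimal literal to its unscaled integer representation (Parquet stores decimals as scaled integers) would let the filter be built. The hard-coded scale here is an assumption for illustration.

```python
# Toy model of the missing decimal pushdown (illustrative names only).
from decimal import Decimal

SUPPORTED = {"int", "bigint", "float", "double", "string"}

def build_leaf_filter(col_type, op, literal):
    if col_type not in SUPPORTED:
        return None  # decimal falls through today: no pushdown, full scan
    return (op, literal)

def build_leaf_filter_fixed(col_type, op, literal, scale=2):
    if col_type == "decimal":
        # assumption: compare on the unscaled value at the column's scale
        unscaled = int(Decimal(str(literal)).scaleb(scale))
        return (op, unscaled)
    return build_leaf_filter(col_type, op, literal)
```

The task summaries above show the cost of the `None` branch: the Parquet scan reads 549M rows where ORC, with working pushdown, reads 267M.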
[jira] [Created] (HIVE-27183) Iceberg: Table information is loaded multiple times
Rajesh Balamohan created HIVE-27183: --- Summary: Iceberg: Table information is loaded multiple times Key: HIVE-27183 URL: https://issues.apache.org/jira/browse/HIVE-27183 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan

HMS::getTable invokes "HiveIcebergMetaHook::postGetTable", which internally loads the Iceberg table again. If this isn't needed, or is needed only for show-create-table, do not load the table again.

{noformat}
at jdk.internal.misc.Unsafe.park(java.base@11.0.18/Native Method)
- parking to wait for <0x00066f84eef0> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.park(java.base@11.0.18/LockSupport.java:194)
at java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.18/CompletableFuture.java:1796)
at java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.18/ForkJoinPool.java:3128)
at java.util.concurrent.CompletableFuture.waitingGet(java.base@11.0.18/CompletableFuture.java:1823)
at java.util.concurrent.CompletableFuture.get(java.base@11.0.18/CompletableFuture.java:1998)
at org.apache.hadoop.util.functional.FutureIO.awaitFuture(FutureIO.java:77)
at org.apache.iceberg.hadoop.HadoopInputFile.newStream(HadoopInputFile.java:196)
at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:263)
at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:258)
at org.apache.iceberg.BaseMetastoreTableOperations.lambda$refreshFromMetadataLocation$0(BaseMetastoreTableOperations.java:177)
at org.apache.iceberg.BaseMetastoreTableOperations$$Lambda$609/0x000840e18040.apply(Unknown Source)
at org.apache.iceberg.BaseMetastoreTableOperations.lambda$refreshFromMetadataLocation$1(BaseMetastoreTableOperations.java:191)
at org.apache.iceberg.BaseMetastoreTableOperations$$Lambda$610/0x000840e18440.run(Unknown Source)
at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:404)
at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:214)
at
org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:198) at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:190) at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:191) at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:176) at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:171) at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:153) at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:96) at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:79) at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:44) at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:115) at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:105) at org.apache.iceberg.mr.hive.IcebergTableUtil.lambda$getTable$1(IcebergTableUtil.java:99) at org.apache.iceberg.mr.hive.IcebergTableUtil$$Lambda$552/0x000840d59840.apply(Unknown Source) at org.apache.iceberg.mr.hive.IcebergTableUtil.lambda$getTable$4(IcebergTableUtil.java:111) at org.apache.iceberg.mr.hive.IcebergTableUtil$$Lambda$557/0x000840d58c40.get(Unknown Source) at java.util.Optional.orElseGet(java.base@11.0.18/Optional.java:369) at org.apache.iceberg.mr.hive.IcebergTableUtil.getTable(IcebergTableUtil.java:108) at org.apache.iceberg.mr.hive.IcebergTableUtil.getTable(IcebergTableUtil.java:69) at org.apache.iceberg.mr.hive.IcebergTableUtil.getTable(IcebergTableUtil.java:73) at org.apache.iceberg.mr.hive.HiveIcebergMetaHook.postGetTable(HiveIcebergMetaHook.java:931) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.executePostGetTableHook(HiveMetaStoreClient.java:2638) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:2624) at 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:267) at jdk.internal.reflect.GeneratedMethodAccessor137.invoke(Unknown Source) at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.18/DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(java.base@11.0.18/Method.java:566) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:216) at com.sun.proxy.$Proxy56.getTable(Unknown Source) at jdk.internal.reflect.GeneratedMethodAccessor137.invoke(Unknown Source) at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.18/DelegatingMetho
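One way to avoid the duplicate metadata read is to memoize table loads for the duration of a query. A minimal sketch, where `load_table` is a stand-in for `Catalogs.loadTable` rather than the real API:

```python
# Sketch: cache Iceberg table loads so a postGetTable-style hook does not
# re-read metadata that was just loaded. Not the actual Iceberg/Hive API.
import functools

LOADS = []  # records every expensive metadata read

def load_table(name):
    LOADS.append(name)  # expensive: reads the metadata JSON from storage
    return {"name": name, "schema": "..."}

@functools.lru_cache(maxsize=128)
def load_table_cached(name):
    # Cache key is the fully-qualified name; a real fix would also key on
    # the snapshot/metadata location and invalidate on commit.
    return tuple(sorted(load_table(name).items()))

load_table_cached("db.store_sales")  # first call: actual load
load_table_cached("db.store_sales")  # second call: served from cache
```

The design question the ticket raises is exactly the invalidation noted in the comment: the cache must not serve stale metadata after a concurrent commit.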
[jira] [Created] (HIVE-27184) Add class name profiling option in ProfileServlet
Rajesh Balamohan created HIVE-27184: --- Summary: Add class name profiling option in ProfileServlet Key: HIVE-27184 URL: https://issues.apache.org/jira/browse/HIVE-27184 Project: Hive Issue Type: Improvement Components: HiveServer2 Reporter: Rajesh Balamohan With async-profiler "-e classname.method", it is possible to profile specific events. Currently ProfileServlet supports events like cpu, alloc, lock etc. It would be good to enhance it to support method-name profiling as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27188) Explore usage of FilterApi.in(C column, Set values) in Parquet instead of nested OR
Rajesh Balamohan created HIVE-27188: --- Summary: Explore usage of FilterApi.in(C column, Set values) in Parquet instead of nested OR Key: HIVE-27188 URL: https://issues.apache.org/jira/browse/HIVE-27188 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan

The following query can throw a StackOverflowError with "-Xss256K". Currently Hive generates a nested OR filter: [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/FilterPredicateLeafBuilder.java#L43-L52] Instead, we need to explore the possibility of using FilterApi.in(C column, Set<T> values) in Parquet.

{noformat}
drop table if exists test;
create external table test (i int) stored as parquet;
insert into test values (1),(2),(3);
select count(*) from test where i in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243);
{noformat}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
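The depth problem behind the overflow can be shown with toy predicate objects (tuples here, not parquet-mr's FilterApi types): building `IN (v1..vn)` as nested `or(or(or(...)))` produces a tree whose depth grows linearly with the value count, which deep recursive processing then overflows on a 256K stack, while a set-backed IN predicate stays flat.

```python
# Toy predicates: nested-OR depth grows with n; a set-backed IN does not.

def nested_or(values):
    pred = ("eq", values[0])
    for v in values[1:]:
        pred = ("or", pred, ("eq", v))   # left-deep chain, one level per value
    return pred

def depth(pred):
    return 1 + depth(pred[1]) if pred[0] == "or" else 1

def in_set(values):
    return ("in", frozenset(values))     # depth 1, regardless of len(values)

vals = list(range(1, 244))  # the 243-value IN list from the repro
```

With 243 values the nested tree is 243 levels deep; with tens of thousands of values even default stacks overflow, which is why a flat set-based predicate is the safer representation.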
[jira] [Created] (HIVE-20816) FastHiveDecimal throws Exception (RuntimeException: Unexpected #3)
Rajesh Balamohan created HIVE-20816: --- Summary: FastHiveDecimal throws Exception (RuntimeException: Unexpected #3) Key: HIVE-20816 URL: https://issues.apache.org/jira/browse/HIVE-20816 Project: Hive Issue Type: Improvement Affects Versions: 2.3.2 Reporter: Rajesh Balamohan

{noformat}
with t1 as ( ... ... )
select id, max(abs(c1)) from t1 group by id;
{noformat}

throws the following exception

{noformat}
g.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unexpected #3
at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1126)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
...
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unexpected #3
at org.apache.hadoop.hive.ql.exec.GroupByOperator.flush(GroupByOperator.java:1084)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1123)
... 18 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unexpected #3
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:397)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:1047)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.flush(GroupByOperator.java:1067)
... 19 more
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20886) Fix NPE: GenericUDFLower
Rajesh Balamohan created HIVE-20886: --- Summary: Fix NPE: GenericUDFLower Key: HIVE-20886 URL: https://issues.apache.org/jira/browse/HIVE-20886 Project: Hive Issue Type: Improvement Components: Hive Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan {noformat} create table if not exists test1(uuid array); select lower(uuid) from test1; Error: Error while compiling statement: FAILED: NullPointerException null (state=42000,code=4) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20928) NPE in StatsUtils for complex type
Rajesh Balamohan created HIVE-20928: --- Summary: NPE in StatsUtils for complex type Key: HIVE-20928 URL: https://issues.apache.org/jira/browse/HIVE-20928 Project: Hive Issue Type: Bug Components: Hive Affects Versions: 2.3.4 Reporter: Rajesh Balamohan

{noformat}
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.stats.StatsUtils.getWritableSize(StatsUtils.java:1147)
at org.apache.hadoop.hive.ql.stats.StatsUtils.getSizeOfMap(StatsUtils.java:1108)
at org.apache.hadoop.hive.ql.stats.StatsUtils.getSizeOfComplexTypes(StatsUtils.java:978)
at org.apache.hadoop.hive.ql.stats.StatsUtils.getAvgColLenOf(StatsUtils.java:916)
at org.apache.hadoop.hive.ql.stats.StatsUtils.getColStatisticsFromExpression(StatsUtils.java:1374)
at org.apache.hadoop.hive.ql.stats.StatsUtils.getColStatisticsFromExprMap(StatsUtils.java:1197)
at org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$GroupByStatsRule.process(StatsRulesProcFactory.java:1009)
at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
at org.apache.hadoop.hive.ql.lib.LevelOrderWalker.walk(LevelOrderWalker.java:143)
at org.apache.hadoop.hive.ql.lib.LevelOrderWalker.startWalking(LevelOrderWalker.java:122)
at org.apache.hadoop.hive.ql.optimizer.stats.annotation.AnnotateWithStatistics.transform(AnnotateWithStatistics.java:78)
at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.runStatsAnnotation(SparkCompiler.java:240)
{noformat}

The issue should exist in master as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20974) TezTask should set task exception on failures
Rajesh Balamohan created HIVE-20974: --- Summary: TezTask should set task exception on failures Key: HIVE-20974 URL: https://issues.apache.org/jira/browse/HIVE-20974 Project: Hive Issue Type: Improvement Components: Hive Reporter: Rajesh Balamohan TezTask logs the error as "Failed to execute tez graph" and proceeds further. The "TaskRunner.runSequential()" code would therefore not be able to get these exceptions from TezTask, and if any failure hooks are configured, these exceptions wouldn't show up in them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
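The fix being asked for is to record the failure on the task object instead of only logging it, so the sequential runner and any failure hooks can observe the root cause. A minimal sketch with illustrative class names, not Hive's actual code:

```python
# Sketch: propagate the failure via the task object (cf. setException).

class Task:
    def __init__(self):
        self.exception = None

class TezLikeTask(Task):
    def execute(self):
        try:
            raise RuntimeError("Failed to execute tez graph")  # simulated DAG failure
        except Exception as e:
            self.exception = e  # the proposed fix: record before returning
            return 1            # non-zero return code, as today

def run_sequential(task):
    """Returns the failure message now visible to hooks, or None on success."""
    rc = task.execute()
    if rc != 0 and task.exception is not None:
        return str(task.exception)
    return None
```

Without the `self.exception = e` line, the runner sees only the non-zero return code and the hooks get no throwable to report.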
[jira] [Created] (HIVE-21102) Optimize SparkPlanGenerator for getInputPaths (emptyFile checks)
Rajesh Balamohan created HIVE-21102: --- Summary: Optimize SparkPlanGenerator for getInputPaths (emptyFile checks) Key: HIVE-21102 URL: https://issues.apache.org/jira/browse/HIVE-21102 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21104) PTF with nested structure throws ClassCastException
Rajesh Balamohan created HIVE-21104: --- Summary: PTF with nested structure throws ClassCastException Key: HIVE-21104 URL: https://issues.apache.org/jira/browse/HIVE-21104 Project: Hive Issue Type: Bug Components: Hive Reporter: Rajesh Balamohan

{noformat}
DROP TABLE IF EXISTS dummy;
CREATE TABLE dummy (i int);
INSERT INTO TABLE dummy VALUES (1);

DROP TABLE IF EXISTS struct_table_example;
CREATE TABLE struct_table_example (a int, s1 struct<f1:boolean,f2:string,f3:int,f4:int>) STORED AS ORC;
INSERT INTO TABLE struct_table_example SELECT 1, named_struct('f1', false, 'f2', 'test', 'f3', 3, 'f4', 4) FROM dummy;

select s1.f1, s1.f2, rank() over (partition by s1.f2 order by s1.f4) from struct_table_example;
{noformat}

This throws the following error:

{noformat}
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"test","reducesinkkey1":4},"value":{"_col1":{"f1":false,"f2":"test","f3":3,"f4":4}}}
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:297)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
... 14 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"test","reducesinkkey1":4},"value":{"_col1":{"f1":false,"f2":"test","f3":3,"f4":4}}}
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:287)
...
16 more Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryStruct cannot be cast to org.apache.hadoop.io.IntWritable at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector.getPrimitiveJavaObject(WritableIntObjectInspector.java:46) at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:412) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank.copyToStandardObject(GenericUDAFRank.java:219) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$GenericUDAFAbstractRankEvaluator.iterate(GenericUDAFRank.java:154) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:192) at org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.processRow(WindowingTableFunction.java:407) at org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.processRow(PTFOperator.java:325) at org.apache.hadoop.hive.ql.exec.PTFOperator.process(PTFOperator.java:139) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897) at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356) ... 17 more ]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:0, Vertex vertex_1546783872011_263870_1_01 [Reducer 2] killed/failed due to:OWN_TASK_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0 at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:196) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100) at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:79) (state=08S01,code=2) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21136) Kryo exception : Unable to create serializer for class AtomicReference
Rajesh Balamohan created HIVE-21136: --- Summary: Kryo exception : Unable to create serializer for class AtomicReference Key: HIVE-21136 URL: https://issues.apache.org/jira/browse/HIVE-21136 Project: Hive Issue Type: Bug Components: Serializers/Deserializers Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan {noformat} Caused by: org.apache.hive.com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Unable to create serializer "org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer" for class: java.util.concurrent.atomic.AtomicReference Serialization trace: _tableInfo (org.codehaus.jackson.sym.BytesToNameCanonicalizer) _rootByteSymbols (org.codehaus.jackson.JsonFactory) jsonFactory (brickhouse.udf.json.ToJsonUDF) genericUDF (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc) chidren (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc) colExprMap (org.apache.hadoop.hive.ql.exec.GroupByOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.LateralViewJoinOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.LateralViewForwardOperator) childOperators (org.apache.hadoop.hive.ql.exec.LateralViewJoinOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.LateralViewForwardOperator) childOperators (org.apache.hadoop.hive.ql.exec.LateralViewJoinOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.LateralViewForwardOperator) childOperators (org.apache.hadoop.hive.ql.exec.GroupByOperator) reducer (org.apache.hadoop.hive.ql.plan.ReduceWork) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) at 
org.apache.hive.com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:759) at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObjectOrNull(SerializationUtilities.java:199) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:132) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708) at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:214) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708) at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:214) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790) at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readClassAndObject(SerializationUtilities.java:176) at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134) at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708) at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:214) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125) at 
org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790) at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readClassAndObject(SerializationUtilities.java:176) at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:161) at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:39) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708) at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:214) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125) at org.apache.hive.com.esotericsoftw
[jira] [Created] (HIVE-21162) MetaStoreListenerNotifier events can get fired even when exceptions are thrown
Rajesh Balamohan created HIVE-21162: --- Summary: MetaStoreListenerNotifier events can get fired even when exceptions are thrown Key: HIVE-21162 URL: https://issues.apache.org/jira/browse/HIVE-21162 Project: Hive Issue Type: Bug Components: Standalone Metastore Reporter: Rajesh Balamohan [https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L3870] When the same partition is added twice, it ends up throwing {{PartitionAlreadyExistsException}}. However, by that point the listeners have already been notified. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
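The ordering problem described above can be sketched outside of Hive. This is a minimal illustration, not the actual metastore code: the {{Listener}} interface and {{addPartition}} method below are hypothetical stand-ins for MetaStoreListenerNotifier and the add-partition path, showing the intended fix of notifying listeners only after the operation succeeds.

```java
import java.util.ArrayList;
import java.util.List;

public class NotifyAfterSuccess {
    // Hypothetical stand-in for the metastore listener machinery.
    interface Listener { void onEvent(String event); }

    static final List<String> store = new ArrayList<>();
    static final List<String> notified = new ArrayList<>();
    static final List<Listener> listeners = new ArrayList<>();

    static void addPartition(String part) {
        if (store.contains(part)) {
            // Fail before the notification loop below is ever reached.
            throw new IllegalStateException("PartitionAlreadyExists: " + part);
        }
        store.add(part);
        // Notify only after the add has actually succeeded.
        for (Listener l : listeners) {
            l.onEvent("ADD_PARTITION:" + part);
        }
    }

    public static void main(String[] args) {
        listeners.add(notified::add);
        addPartition("ds=2019-01-01");
        try {
            addPartition("ds=2019-01-01"); // duplicate: throws, no extra event
        } catch (IllegalStateException expected) {
            // swallowed for the demo
        }
        System.out.println(notified); // one event despite two add attempts
    }
}
```

The point of the sketch is the ordering: the duplicate-partition failure must fire before any listener sees an event for it.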
[jira] [Created] (HIVE-21312) FSStatsAggregator::connect is slow
Rajesh Balamohan created HIVE-21312: --- Summary: FSStatsAggregator::connect is slow Key: HIVE-21312 URL: https://issues.apache.org/jira/browse/HIVE-21312 Project: Hive Issue Type: Improvement Components: Statistics Reporter: Rajesh Balamohan -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21331) Metastore should throw exception back if it is not able to delete the folder
Rajesh Balamohan created HIVE-21331: --- Summary: Metastore should throw exception back if it is not able to delete the folder Key: HIVE-21331 URL: https://issues.apache.org/jira/browse/HIVE-21331 Project: Hive Issue Type: Improvement Components: Metastore Reporter: Rajesh Balamohan [https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L2678] In one case, the table got deleted from HMS, but the data was not deleted. On investigating, `deleteDir` does not throw the exception back. The real exception gets logged (in this case a user-quota-limit-exceeded exception), but the managed table gets dropped without its data being deleted. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
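A minimal sketch of the proposed behavior follows; {{FsShim}} and the {{deleteDir}} signature are illustrative names, not the actual Warehouse/FileSystem API. The idea is that the delete failure is wrapped and rethrown so the drop-table path can abort before metadata is removed, instead of the failure only being logged.

```java
import java.io.IOException;

public class DeleteDirSketch {
    // Illustrative stand-in for the FileSystem call used by the metastore.
    interface FsShim { boolean delete(String path) throws IOException; }

    static String lastCause = "";

    // Before the fix, the catch block only logged; the caller carried on
    // and dropped the table metadata anyway. Rethrowing surfaces the real
    // cause (e.g. a quota-exceeded error) to the caller.
    static void deleteDir(FsShim fs, String path) throws IOException {
        try {
            if (!fs.delete(path)) {
                throw new IOException("delete returned false for " + path);
            }
        } catch (IOException e) {
            throw new IOException("Unable to delete directory: " + path, e);
        }
    }

    public static void main(String[] args) {
        FsShim quotaFs = p -> { throw new IOException("quota exceeded"); };
        try {
            deleteDir(quotaFs, "/warehouse/t1");
        } catch (IOException e) {
            lastCause = e.getCause().getMessage();
        }
        System.out.println(lastCause); // quota exceeded
    }
}
```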
[jira] [Created] (HIVE-21431) Vectorization: ltrim throws ArrayIndexOutOfBounds in corner cases
Rajesh Balamohan created HIVE-21431: --- Summary: Vectorization: ltrim throws ArrayIndexOutOfBounds in corner cases Key: HIVE-21431 URL: https://issues.apache.org/jira/browse/HIVE-21431 Project: Hive Issue Type: Bug Components: Vectorization Affects Versions: 2.3.4 Reporter: Rajesh Balamohan In corner cases, {{ltrim}} with string columns throws ArrayIndexOutOfBoundsException with vectorization enabled. {{HIVE-19565}} seems to fix some corner cases. But in another corner case, {{length[]}} was all {{0}}, and this causes {{-1}} to be returned as the length set in the target vector. I will check if I can get an easier repro for this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21439) Provide an option to reduce lookup overhead for bucketed tables
Rajesh Balamohan created HIVE-21439: --- Summary: Provide an option to reduce lookup overhead for bucketed tables Key: HIVE-21439 URL: https://issues.apache.org/jira/browse/HIVE-21439 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan If a table is bucketed, `OpTraitsRulesProcFactory::TableScanRule` ends up verifying that the partitions have the same number of files as the number of buckets in the table. https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L185 For large tables, this turns out to be a very time-consuming operation. It would be good to have an option to bypass this check when needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21475) SparkClientUtilities::urlFromPathString should handle viewfs to avoid UDF ClassNotFoundException
Rajesh Balamohan created HIVE-21475: --- Summary: SparkClientUtilities::urlFromPathString should handle viewfs to avoid UDF ClassNotFoundException Key: HIVE-21475 URL: https://issues.apache.org/jira/browse/HIVE-21475 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21503) Vectorization: query with regex gives incorrect results with vectorization
Rajesh Balamohan created HIVE-21503: --- Summary: Vectorization: query with regex gives incorrect results with vectorization Key: HIVE-21503 URL: https://issues.apache.org/jira/browse/HIVE-21503 Project: Hive Issue Type: Bug Components: Vectorization Reporter: Rajesh Balamohan I see wrong results with vectorization; without vectorization, it works fine. Suspecting a minor issue in {{StringGroupColConcatCharScalar}}.
{noformat}
e.g.
WHEN x like '%radio%' THEN 'radio'
WHEN x like '%tv%' THEN 'tv'
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21520) Query "Submit plan" time reported is incorrect
Rajesh Balamohan created HIVE-21520: --- Summary: Query "Submit plan" time reported is incorrect Key: HIVE-21520 URL: https://issues.apache.org/jira/browse/HIVE-21520 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan Hive master branch + LLAP
{noformat}
Query Execution Summary
----------------------------------------------------------
OPERATION                                 DURATION
----------------------------------------------------------
Compile Query                                0.00s
Prepare Plan                                 0.00s
Get Query Coordinator (AM)                   0.00s
Submit Plan                         1553658149.89s
Start DAG                                    0.53s
Run DAG                                      0.43s
----------------------------------------------------------
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
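The magnitude of the bogus figure hints at the cause: 1553658149.89 s is a Unix epoch timestamp (late March 2019), which is exactly what end − start yields if the start timestamp is never recorded and stays 0. A hedged arithmetic sketch of that hypothesis (variable names are illustrative, not the actual perf-logger fields):

```java
public class SubmitPlanDurationSketch {
    static double reportedDurationSec(long startMs, long endMs) {
        return (endMs - startMs) / 1000.0;
    }

    public static void main(String[] args) {
        long endMs = 1553658149890L; // wall-clock when "Submit Plan" ended
        // Bug hypothesis: the start was never captured and defaulted to 0,
        // so the reported "duration" is the absolute epoch time itself.
        System.out.println(reportedDurationSec(0L, endMs));
        // With a real start timestamp the duration is sane again.
        System.out.println(reportedDurationSec(endMs - 250L, endMs)); // 0.25
    }
}
```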
[jira] [Created] (HIVE-21565) Utilities::isEmptyPath should throw back FNFE instead of returning true
Rajesh Balamohan created HIVE-21565: --- Summary: Utilities::isEmptyPath should throw back FNFE instead of returning true Key: HIVE-21565 URL: https://issues.apache.org/jira/browse/HIVE-21565 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan In case there is a {{viewfs}} configured and it ends up throwing FNFE, current codepath silently ignores the error and ends up creating an empty file. {noformat} at org.apache.hadoop.fs.viewfs.InodeTree.resolve(InodeTree.java:403) at org.apache.hadoop.fs.viewfs.ViewFileSystem.listStatus(ViewFileSystem.java:374) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1497) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1537) at org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2350) at org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2343) at org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3128) at org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3092) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:303) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:226) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109) at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:346) at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358) at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21622) Provide an option to invoke `ReflectionUtil::newInstance` without storing in constructor_cache
Rajesh Balamohan created HIVE-21622: --- Summary: Provide an option to invoke `ReflectionUtil::newInstance` without storing in constructor_cache Key: HIVE-21622 URL: https://issues.apache.org/jira/browse/HIVE-21622 Project: Hive Issue Type: Bug Components: HiveServer2 Reporter: Rajesh Balamohan Attachments: Screenshot 2019-04-17 at 2.17.21 PM.png In certain cases, UDFs are dynamically registered/deregistered often. This can clutter the "constructor_cache" of "ReflectionUtil" and cause memory pressure. !Screenshot 2019-04-17 at 2.17.21 PM.png! It would be good to provide an option to invoke ReflectionUtil without hitting the constructor cache. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
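The option being asked for amounts to plain reflection with no retained cache entry. A minimal sketch, under the assumption that the UDF has a no-arg constructor; the method name is hypothetical, not the actual ReflectionUtil API:

```java
import java.lang.reflect.Constructor;

public class UncachedNewInstance {
    // Unlike the cached newInstance path, nothing is stored in a
    // class-keyed cache here, so a temporary UDF's class (and its
    // classloader) stays garbage-collectable after use.
    static <T> T newInstanceUncached(Class<T> cls) {
        try {
            Constructor<T> ctor = cls.getDeclaredConstructor();
            ctor.setAccessible(true);
            return ctor.newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("Could not instantiate " + cls, e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = newInstanceUncached(StringBuilder.class);
        sb.append("ok");
        System.out.println(sb); // ok
    }
}
```

The trade-off is obvious: each call pays the `getDeclaredConstructor` lookup, which is why a cache exists in the first place; an opt-out only makes sense on paths where the class may be transient.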
[jira] [Created] (HIVE-21684) tmp table space directory should be removed on session close
Rajesh Balamohan created HIVE-21684: --- Summary: tmp table space directory should be removed on session close Key: HIVE-21684 URL: https://issues.apache.org/jira/browse/HIVE-21684 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan `_tmp_space.db` folder should be deleted on session close. {noformat} org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException): The directory item limit of... {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21778) CBO: "Struct is not null" gets evaluated as `nullable` always causing pushdown miss in the query
Rajesh Balamohan created HIVE-21778: --- Summary: CBO: "Struct is not null" gets evaluated as `nullable` always causing pushdown miss in the query Key: HIVE-21778 URL: https://issues.apache.org/jira/browse/HIVE-21778 Project: Hive Issue Type: Bug Components: CBO Affects Versions: 2.3.5 Reporter: Rajesh Balamohan
{noformat}
drop table if exists test_struct;
CREATE external TABLE test_struct
(
  f1 string,
  demo_struct struct,
  datestr string
);

set hive.cbo.enable=true;
explain select * from etltmp.test_struct where datestr='2019-01-01' and demo_struct is not null;

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: test_struct
          filterExpr: (datestr = '2019-01-01') (type: boolean) <- Note that demo_struct filter is not added here
          Filter Operator
            predicate: (datestr = '2019-01-01') (type: boolean)
            Select Operator
              expressions: f1 (type: string), demo_struct (type: struct), '2019-01-01' (type: string)
              outputColumnNames: _col0, _col1, _col2
              ListSink

set hive.cbo.enable=false;
explain select * from etltmp.test_struct where datestr='2019-01-01' and demo_struct is not null;

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: test_struct
          filterExpr: ((datestr = '2019-01-01') and demo_struct is not null) (type: boolean) <- Note that demo_struct filter is added when CBO is turned off
          Filter Operator
            predicate: ((datestr = '2019-01-01') and demo_struct is not null) (type: boolean)
            Select Operator
              expressions: f1 (type: string), demo_struct (type: struct), '2019-01-01' (type: string)
              outputColumnNames: _col0, _col1, _col2
              ListSink
{noformat}
In CalcitePlanner::genFilterRelNode, the following code fails to evaluate this filter.
{noformat}
RexNode factoredFilterExpr =
    RexUtil.pullFactors(cluster.getRexBuilder(), convertedFilterExpr);
{noformat}
Note that if we add `demo_struct.f1` to the predicate, the filter ends up being pushed correctly. Suspecting {code}RexCall::isAlwaysTrue{code} is evaluating to true in this case. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21971) HS2 leaks classloaders due to `ReflectionUtils::CONSTRUCTOR_CACHE` with temporary functions + GenericUDF
Rajesh Balamohan created HIVE-21971: --- Summary: HS2 leaks classloaders due to `ReflectionUtils::CONSTRUCTOR_CACHE` with temporary functions + GenericUDF Key: HIVE-21971 URL: https://issues.apache.org/jira/browse/HIVE-21971 Project: Hive Issue Type: Bug Components: HiveServer2 Affects Versions: 2.3.4 Reporter: Rajesh Balamohan https://issues.apache.org/jira/browse/HIVE-10329 helped in moving away from hadoop's ReflectionUtils constructor cache issue (https://issues.apache.org/jira/browse/HADOOP-10513). However, there are corner cases where hadoop's {{ReflectionUtils}} is in use, and this causes a gradual build-up of memory in HS2. I have observed this in Hive 2.3, but the codepath in master has not changed much. The easiest way to repro would be to add a temp function which extends {{GenericUDF}}. In {{FunctionRegistry::cloneGenericUDF}}, this ends up using {{org.apache.hadoop.util.ReflectionUtils.newInstance}}, which in turn lands in the CONSTRUCTOR_CACHE of ReflectionUtils.
{noformat}
CREATE TEMPORARY FUNCTION dummy AS 'com.hive.test.DummyGenericUDF' USING JAR 'file:///home/test/udf/dummy.jar';
select dummy();

        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:107)
        at org.apache.hadoop.hive.ql.exec.FunctionRegistry.cloneGenericUDF(FunctionRegistry.java:1353)
        at org.apache.hadoop.hive.ql.exec.FunctionInfo.getGenericUDF(FunctionInfo.java:122)
        at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:983)
        at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1359)
        at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
        at org.apache.hadoop.hive.ql.lib.ExpressionWalker.walk(ExpressionWalker.java:76)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
{noformat}
Note: Reflection-based invocation of hadoop's `ReflectionUtils::clear` was removed in 2.x. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21993) HS/HMS delegationstore with ZK can degrade performance when jute.maxBuffer is reached
Rajesh Balamohan created HIVE-21993: --- Summary: HS/HMS delegationstore with ZK can degrade performance when jute.maxBuffer is reached Key: HIVE-21993 URL: https://issues.apache.org/jira/browse/HIVE-21993 Project: Hive Issue Type: Bug Components: Hive Affects Versions: 2.3.4, 3.0.0, 4.0.0 Reporter: Rajesh Balamohan DelegationStore can be configured with in-memory/DB/ZK-based TokenStores. {{TokenStoreDelegationTokenSecretManager}} purges expired tokens (older than 24 hours) periodically, every hour by default. +Issue:+ When a large number of delegation tokens is present in ZK, {{TokenStoreDelegationTokenSecretManager::removeExpiredTokens}} can throw the following exception when connecting to ZK.
{noformat}
WARN [main-SendThread(xyz:2181)]: org.apache.zookeeper.ClientCnxn: Session 0x36a161083865cd9 for server xyz/1.2.3.4:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len68985070 is out of range!
        at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:112) ~[zookeeper-3.4.6.jar]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79) ~[zookeeper-3.4.6.jar]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[zookeeper-3.4.6.jar]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) [zookeeper-3.4.6.jar]
...
...
INFO [main-EventThread]: org.apache.curator.framework.state.ConnectionStateManager: State change: SUSPENDED
ERROR [Thread[Thread-13,5,main]]: org.apache.hadoop.hive.thrift.TokenStoreDelegationTokenSecretManager: ExpiredTokenRemover thread received unexpected exception.
org.apache.hadoop.hive.thrift.DelegationTokenStore$TokenStoreException: Error getting children for /hivedelegationMETASTORE/tokens
org.apache.hadoop.hive.thrift.DelegationTokenStore$TokenStoreException: Error getting children for /hivedelegationMETASTORE/tokens
        at org.apache.hadoop.hive.thrift.ZooKeeperTokenStore.zkGetChildren(ZooKeeperTokenStore.java:280) ~[hive-exec-x.y.z.jar]
        at org.apache.hadoop.hive.thrift.ZooKeeperTokenStore.getAllDelegationTokenIdentifiers(ZooKeeperTokenStore.java:413) ~[hive-exec-x.y.z.jar]
        at org.apache.hadoop.hive.thrift.TokenStoreDelegationTokenSecretManager.removeExpiredTokens(TokenStoreDelegationTokenSecretManager.java:238) ~[hive-exec-x.y.z.jar]
        at org.apache.hadoop.hive.thrift.TokenStoreDelegationTokenSecretManager$ExpiredTokenRemover.run(TokenStoreDelegationTokenSecretManager.java:309) [hive-exec-x.y.z.jar]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hivedelegationMETASTORE/tokens
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar]
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.6.jar]
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1590) ~[zookeeper-3.4.6.jar]
        at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:214) ~[curator-framework-2.7.1.jar:?]
        at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:203) ~[curator-framework-2.7.1.jar:?]
        at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.7.1.jar:?]
        at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:200) ~[curator-framework-2.7.1.jar:?]
        at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:191) ~[curator-framework-2.7.1.jar:?]
        at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:38) ~[curator-framework-2.7.1.jar:?]
        at org.apache.hadoop.hive.thrift.ZooKeeperTokenStore.zkGetChildren(ZooKeeperTokenStore.java:278) ~[hive-exec-x.y.z.jar]
        ... 4 more
{noformat}
When the packet length is greater than {{jute.maxBuffer}}, it ends up throwing this exception and reconnecting. However, the same ZK client is used for the {{addToken}} and {{removeToken}} calls, which run in different threads. This creates problems when creating/deleting tokens. 1. Issue in creating tokens: a new token is added while the ZK client is in the suspended state (due to the above-mentioned reason). The node is already created by Curator, but before it can verify this, the connection goes into a stale state. So the Curator framework retries and ends up with the following exception, and creating tokens fails often.
{noformat}
Caused by: org.apache.hadoop.hive.thrift.DelegationTokenStore$TokenStoreException: Error creating new node wi
[jira] [Created] (HIVE-22013) "Show table extended" should not compute table statistics
Rajesh Balamohan created HIVE-22013: --- Summary: "Show table extended" should not compute table statistics Key: HIVE-22013 URL: https://issues.apache.org/jira/browse/HIVE-22013 Project: Hive Issue Type: Bug Components: Hive Reporter: Rajesh Balamohan In some of the `show table extended` statements, the following codepath is invoked: [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/formatting/TextMetaDataFormatter.java#L421] [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/formatting/TextMetaDataFormatter.java#L449] [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/formatting/TextMetaDataFormatter.java#L468] 1. It is not clear why this invokes stats computation; should it be removed? 2. Even if #1 is needed, it would be broken when {{tblPath}} and {{partitionPaths}} differ (i.e. when they are on different filesystems, or configured via a router, etc.). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (HIVE-22039) Query with CBO crashes HS2 in corner cases
Rajesh Balamohan created HIVE-22039: --- Summary: Query with CBO crashes HS2 in corner cases Key: HIVE-22039 URL: https://issues.apache.org/jira/browse/HIVE-22039 Project: Hive Issue Type: Bug Components: CBO Affects Versions: 2.3.4, 3.1.1 Reporter: Rajesh Balamohan Here is a very simple repro for this case. Along with CBO, it crashes HS2: it runs into an infinite loop creating a very large number of RexCalls and finally OOMs. This is observed in 2.x and 3.x. With 4.x (master branch), it does not happen; master has {{calcite-core-1.19.0.jar}}, whereas 3.x has {{calcite-core-1.16.0.jar}}.
{noformat}
drop table if exists tableA;
drop table if exists tableB;
create table if not exists tableA(id int, reporting_date string) stored as orc;
create table if not exists tableB(id int, reporting_date string) partitioned by (datestr string) stored as orc;

explain
with tableA_cte as ( select id, reporting_date from tableA ),
tableA_cte_2 as ( select 0 as id, reporting_date from tableA ),
tableA_cte_5 as ( select * from tableA_cte union select * from tableA_cte_2 ),
tableB_cte_0 as ( select id, reporting_date from tableB where reporting_date = '2018-10-29' ),
tableB_cte_1 as ( select 0 as id, reporting_date from tableB where datestr = '2018-10-29' ),
tableB_cte_4 as ( select * from tableB_cte_0 union select * from tableB_cte_1 )
select a.id as id, b.reporting_date
from tableA_cte_5 a
join tableB_cte_4 b on (a.id = b.id and a.reporting_date = b.reporting_date);
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (HIVE-22102) Reduce HMS calls when creating HiveSession
Rajesh Balamohan created HIVE-22102: --- Summary: Reduce HMS calls when creating HiveSession Key: HIVE-22102 URL: https://issues.apache.org/jira/browse/HIVE-22102 Project: Hive Issue Type: Improvement Components: HiveServer2 Reporter: Rajesh Balamohan When a HiveSession is established, it ends up configuring session variables/settings. As part of this, it checks the database details. [https://github.com/apache/hive/blob/master/service/src/java/org/apache/hive/service/cli/session/HiveSessionImpl.java#L314] Even for the `default` DB, it ends up making this check. In corner cases, these calls turn out to be expensive. {noformat} 2019-08-13T03:16:57,189 INFO [b42ba57f-1740-4174-855d-4e3f08319ca5 HiveServer2-Handler-Pool: Thread-1552313] metadata.Hive: Total time spent in this metastore function was greater than 1000ms : getDatabase_(String, )=13265 {noformat} We can simply skip this check when the database is `DEFAULT_DATABASE_NAME` (default). This may not be an issue for CachedStore. -- This message was sent by Atlassian Jira (v7.6.14#76016)
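The proposed short-circuit can be sketched as below; only the constant name mirrors the real code, while {{getDatabase}} here is a hypothetical stand-in for the HMS client call:

```java
public class SkipDefaultDbCheckSketch {
    static final String DEFAULT_DATABASE_NAME = "default";
    static int hmsCalls = 0;

    // Stand-in for the metastore client's getDatabase(dbName) round trip.
    static void getDatabase(String dbName) { hmsCalls++; }

    static void configureSessionDb(String dbName) {
        // The default DB always exists; skip the metastore round trip.
        if (DEFAULT_DATABASE_NAME.equalsIgnoreCase(dbName)) {
            return;
        }
        getDatabase(dbName);
    }

    public static void main(String[] args) {
        configureSessionDb("default");
        configureSessionDb("DEFAULT");
        configureSessionDb("sales");
        System.out.println(hmsCalls); // only the non-default DB hits HMS
    }
}
```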
[jira] [Created] (HIVE-22214) Explain vectorization should disable user level explain
Rajesh Balamohan created HIVE-22214: --- Summary: Explain vectorization should disable user level explain Key: HIVE-22214 URL: https://issues.apache.org/jira/browse/HIVE-22214 Project: Hive Issue Type: Improvement Components: Hive Reporter: Rajesh Balamohan -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22246) Beeline reflector should handle map types
Rajesh Balamohan created HIVE-22246: --- Summary: Beeline reflector should handle map types Key: HIVE-22246 URL: https://issues.apache.org/jira/browse/HIVE-22246 Project: Hive Issue Type: Bug Components: Beeline Reporter: Rajesh Balamohan Since the beeline {{Reflector}} does not handle Map types, it ends up converting values from {{beeline.properties}} to "null" and throws an NPE with "beeline --hivevar x=1 --hivevar y=1". -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22269) Missing stats in the operator with "hive.optimize.sort.dynamic.partition" (SortedDynPartitionOptimizer) misestimates reducer count
Rajesh Balamohan created HIVE-22269: --- Summary: Missing stats in the operator with "hive.optimize.sort.dynamic.partition" (SortedDynPartitionOptimizer) misestimates reducer count Key: HIVE-22269 URL: https://issues.apache.org/jira/browse/HIVE-22269 Project: Hive Issue Type: Bug Components: Statistics Reporter: Rajesh Balamohan {{hive.optimize.sort.dynamic.partition=true}} introduces a new stage to reduce the number of writes in the dynamic partitioning use case. Earlier, {{SortedDynPartitionOptimizer}} added this new operator via {{Optimizer.java}}, and the stats for the newly added operator were populated via {{StatsRulesProcFactory$ReduceSinkStatsRule}}. However, this changed with HIVE-20703: the logic moved to {{TezCompiler}} for a cost-based decision. Though the operator gets added correctly, its stats do not get added (as it runs after runStatsAnnotation()). This causes the reducer count to be misestimated in the query.
{noformat}
e.g. For the following query, reducer_2 would be estimated as "2" instead of "1009". This causes a huge delay in the runtime.
explain
from tpcds_xtext_1000.store_sales ss
insert overwrite table store_sales partition (ss_sold_date_sk)
  select ss.ss_sold_time_sk, ss.ss_item_sk, ss.ss_customer_sk, ss.ss_cdemo_sk, ss.ss_hdemo_sk,
    ss.ss_addr_sk, ss.ss_store_sk, ss.ss_promo_sk, ss.ss_ticket_number, ss.ss_quantity,
    ss.ss_wholesale_cost, ss.ss_list_price, ss.ss_sales_price, ss.ss_ext_discount_amt,
    ss.ss_ext_sales_price, ss.ss_ext_wholesale_cost, ss.ss_ext_list_price, ss.ss_ext_tax,
    ss.ss_coupon_amt, ss.ss_net_paid, ss.ss_net_paid_inc_tax, ss.ss_net_profit, ss.ss_sold_date_sk
  where ss.ss_sold_date_sk is not null
insert overwrite table store_sales partition (ss_sold_date_sk)
  select ss.ss_sold_time_sk, ss.ss_item_sk, ss.ss_customer_sk, ss.ss_cdemo_sk, ss.ss_hdemo_sk,
    ss.ss_addr_sk, ss.ss_store_sk, ss.ss_promo_sk, ss.ss_ticket_number, ss.ss_quantity,
    ss.ss_wholesale_cost, ss.ss_list_price, ss.ss_sales_price, ss.ss_ext_discount_amt,
    ss.ss_ext_sales_price, ss.ss_ext_wholesale_cost, ss.ss_ext_list_price, ss.ss_ext_tax,
    ss.ss_coupon_amt, ss.ss_net_paid, ss.ss_net_paid_inc_tax, ss.ss_net_profit, ss.ss_sold_date_sk
  where ss.ss_sold_date_sk is null
distribute by ss.ss_item_sk;
{noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22316) Avoid hostname resolution in LlapInputFormat
Rajesh Balamohan created HIVE-22316: --- Summary: Avoid hostname resolution in LlapInputFormat Key: HIVE-22316 URL: https://issues.apache.org/jira/browse/HIVE-22316 Project: Hive Issue Type: Improvement Components: llap Reporter: Rajesh Balamohan Attachments: Screenshot 2019-10-10 at 10.13.48 AM.png Attaching profiler output, in which hostname resolution showed up when running a short query. It would be good to make the hostname static final. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22379) Reduce db lookups during dynamic partition loading
Rajesh Balamohan created HIVE-22379: --- Summary: Reduce db lookups during dynamic partition loading Key: HIVE-22379 URL: https://issues.apache.org/jira/browse/HIVE-22379 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan {{HiveAlterHandler::alterPartitions}} could look up all partition details via a single {{getPartition}} call instead of multiple calls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
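The batching idea can be sketched as follows; the {{PartitionStore}} interface and method names are hypothetical, and the point is simply one round trip keyed by partition name instead of N:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchPartitionLookupSketch {
    // Hypothetical store interface: one call returning many partitions.
    interface PartitionStore {
        Map<String, String> getPartitionsByNames(List<String> names);
    }

    static int roundTrips = 0;

    static Map<String, String> lookupAll(PartitionStore store, List<String> names) {
        roundTrips++;                              // single call for the whole batch,
        return store.getPartitionsByNames(names);  // not one call per partition
    }

    public static void main(String[] args) {
        PartitionStore store = names -> {
            Map<String, String> out = new HashMap<>();
            for (String n : names) out.put(n, "details-of-" + n);
            return out;
        };
        List<String> names = new ArrayList<>(List.of("ds=1", "ds=2", "ds=3"));
        Map<String, String> parts = lookupAll(store, names);
        System.out.println(parts.size() + " partitions, " + roundTrips + " round trip");
    }
}
```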
[jira] [Created] (HIVE-22383) `alterPartitions` is invoked twice during dynamic partition load causing runtime delay
Rajesh Balamohan created HIVE-22383: --- Summary: `alterPartitions` is invoked twice during dynamic partition load causing runtime delay Key: HIVE-22383 URL: https://issues.apache.org/jira/browse/HIVE-22383 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan First invocation in {{Hive::loadDynamicPartitions}}: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L2978 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L2638 Second invocation in {{BasicStatsTask::aggregateStats}}: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java#L335 This leads to a good amount of delay in dynamic partition loading. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22385) Repl: Perf fixes
Rajesh Balamohan created HIVE-22385: --- Summary: Repl: Perf fixes Key: HIVE-22385 URL: https://issues.apache.org/jira/browse/HIVE-22385 Project: Hive Issue Type: Improvement Components: repl Reporter: Rajesh Balamohan Creating this high-level ticket for tracking repl perf fixes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22386) Repl: Optimise ReplDumpTask::bootStrapDump
Rajesh Balamohan created HIVE-22386: --- Summary: Repl: Optimise ReplDumpTask::bootStrapDump Key: HIVE-22386 URL: https://issues.apache.org/jira/browse/HIVE-22386 Project: Hive Issue Type: Sub-task Reporter: Rajesh Balamohan {{ReplDumpTask::bootStrapDump}} dumps one table at a time within a database. This data is written in separate folders per table. This can be optimized to write in parallel. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22387) Repl: Reduce FS lookups in repl bootstrap
Rajesh Balamohan created HIVE-22387: --- Summary: Repl: Reduce FS lookups in repl bootstrap Key: HIVE-22387 URL: https://issues.apache.org/jira/browse/HIVE-22387 Project: Hive Issue Type: Sub-task Components: repl Reporter: Rajesh Balamohan During bootstrap, {{dbRoot}} is obtained per database. This need not be validated for every table dump (in {{TableExport.Paths}}). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22389) Repl: Optimise ReplDumpTask.incrementalDump
Rajesh Balamohan created HIVE-22389: --- Summary: Repl: Optimise ReplDumpTask.incrementalDump Key: HIVE-22389 URL: https://issues.apache.org/jira/browse/HIVE-22389 Project: Hive Issue Type: Sub-task Components: repl Reporter: Rajesh Balamohan -- This message was sent by Atlassian Jira (v8.3.4#803005)