[jira] [Updated] (HIVE-7544) Changes related to TEZ-1288 (FastTezSerialization)
[ https://issues.apache.org/jira/browse/HIVE-7544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-7544:
-----------------------------------
    Attachment: HIVE-7544.tez-branch.2.patch

Uploading the rebased patch for the tez branch.

> Changes related to TEZ-1288 (FastTezSerialization)
> --------------------------------------------------
>
>                 Key: HIVE-7544
>                 URL: https://issues.apache.org/jira/browse/HIVE-7544
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Tez
>    Affects Versions: 0.14.0
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: HIVE-7544.1.patch, HIVE-7544.tez-branch.2.patch
>
> Add ability to make use of TezBytesWritableSerialization.
> NO PRECOMMIT TESTS

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Resolved] (HIVE-7910) Enhance natural order scheduler to prevent downstream vertex from monopolizing the cluster resources
[ https://issues.apache.org/jira/browse/HIVE-7910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan resolved HIVE-7910.
------------------------------------
    Resolution: Won't Fix

Apologies, this was meant for the Tez project. Closing this bug.

> Enhance natural order scheduler to prevent downstream vertex from
> monopolizing the cluster resources
> -----------------------------------------------------------------
>
>                 Key: HIVE-7910
>                 URL: https://issues.apache.org/jira/browse/HIVE-7910
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>              Labels: performance
>
> M2 M7
>  \ /
> (sg) \/
> R3 / (b)
>  \ /
> (b) \ /
>  \ /
>  M5
>  |
>  R6
>
> Please refer to the attachment (task runtime SVG). In this case, M5 got
> scheduled much earlier than R3 (R3 is shown in green in the diagram) and
> retained lots of containers. R3 got fewer containers to work with.
> Attaching the output from the status monitor when the job ran; Map_5 has
> taken up almost all containers, whereas Reducer_3 got a fraction of the
> capacity.
> Map_2: 1/1  Map_5: 0(+373)/1000  Map_7: 1/1  Reducer_3: 0/8000        Reducer_6: 0/1
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 0/8000        Reducer_6: 0/1
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 0(+1)/8000    Reducer_6: 0/1
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 14(+7)/8000   Reducer_6: 0/1
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 63(+14)/8000  Reducer_6: 0/1
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 159(+22)/8000 Reducer_6: 0/1
> Map_2: 1/1  Map_5: 0(+374)/1000  Map_7: 1/1  Reducer_3: 308(+29)/8000 Reducer_6: 0/1
> ...
> Creating this JIRA as a placeholder for scheduler enhancement. One
> possibility could be to schedule fewer tasks in downstream vertices,
> based on the information available for the upstream vertex.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Created] (HIVE-7910) Enhance natural order scheduler to prevent downstream vertex from monopolizing the cluster resources
Rajesh Balamohan created HIVE-7910:
--------------------------------------

             Summary: Enhance natural order scheduler to prevent downstream vertex from monopolizing the cluster resources
                 Key: HIVE-7910
                 URL: https://issues.apache.org/jira/browse/HIVE-7910
             Project: Hive
          Issue Type: Bug
            Reporter: Rajesh Balamohan

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Created] (HIVE-8071) hive shell tries to write hive-exec.jar for each run
Rajesh Balamohan created HIVE-8071:
--------------------------------------

             Summary: hive shell tries to write hive-exec.jar for each run
                 Key: HIVE-8071
                 URL: https://issues.apache.org/jira/browse/HIVE-8071
             Project: Hive
          Issue Type: Bug
            Reporter: Rajesh Balamohan
            Assignee: Rajesh Balamohan

For every run of the hive CLI there is a delay during shell startup:

14/07/31 23:07:19 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
14/07/31 23:07:19 INFO tez.DagUtils: Hive jar directory is hdfs://mac-10:8020/user/gopal/apps/2014-Jul-31/hive/
14/07/31 23:07:19 INFO tez.DagUtils: Localizing resource because it does not exist: file:/home/gopal/tez-autobuild/dist/hive/lib/hive-exec-0.14.0-SNAPSHOT.jar to dest: hdfs://mac-10:8020/user/gopal/apps/2014-Jul-31/hive/hive-exec-0.14.0-SNAPSHOTde1f82f0b5561d3db9e3080dfb2897210a3bda4ca5e7b14e881e381115837fd8.jar
14/07/31 23:07:19 INFO tez.DagUtils: Looks like another thread is writing the same file will wait.
14/07/31 23:07:19 INFO tez.DagUtils: Number of wait attempts: 5. Wait interval: 5000
14/07/31 23:07:19 INFO tez.DagUtils: Resource modification time: 1406870512963
14/07/31 23:07:20 INFO tez.TezSessionState: Opening new Tez Session (id: 02d6b558-44cc-4182-b2f2-6a37ffdd25d2, scratch dir: hdfs://mac-10:8020/tmp/hive-gopal/_tez_session_dir/02d6b558-44cc-4182-b2f2-6a37ffdd25d2)

Traced this to a method which does PRIVATE LRs - this is marked as PRIVATE even if it is from a common install dir.

{code}
public LocalResource localizeResource(Path src, Path dest, Configuration conf)
    throws IOException {
  FileSystem destFS = dest.getFileSystem(conf);
  return createLocalResource(destFS, dest, LocalResourceType.FILE,
      LocalResourceVisibility.PRIVATE);
}
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
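A minimal plain-JDK sketch of the idea behind the issue above, not Hive's actual fix: derive the resource visibility from whether the jar comes from a shared install directory, instead of hard-coding PRIVATE. The `Visibility` enum and `chooseVisibility` helper are hypothetical stand-ins for YARN's `LocalResourceVisibility` and the selection logic.

```java
import java.util.Set;

public class ResourceVisibility {
    // Hypothetical stand-in for YARN's LocalResourceVisibility enum.
    enum Visibility { PUBLIC, PRIVATE }

    /**
     * Pick PUBLIC visibility when the jar comes from a shared install
     * directory, so YARN can localize it once per node instead of once
     * per user; fall back to PRIVATE otherwise.
     */
    static Visibility chooseVisibility(String jarPath, Set<String> sharedDirs) {
        for (String dir : sharedDirs) {
            if (jarPath.startsWith(dir)) {
                return Visibility.PUBLIC;
            }
        }
        return Visibility.PRIVATE;
    }

    public static void main(String[] args) {
        Set<String> shared = Set.of("/apps/hive/install/");
        System.out.println(chooseVisibility("/apps/hive/install/hive-exec.jar", shared)); // PUBLIC
        System.out.println(chooseVisibility("/home/gopal/hive-exec.jar", shared));        // PRIVATE
    }
}
```

With PUBLIC visibility the localized jar can be shared across users and sessions, avoiding the repeated upload and the "another thread is writing the same file" wait seen in the log.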
[jira] [Updated] (HIVE-8071) hive shell tries to write hive-exec.jar for each run
[ https://issues.apache.org/jira/browse/HIVE-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-8071:
-----------------------------------
    Attachment: HIVE-8071.1.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HIVE-8071) hive shell tries to write hive-exec.jar for each run
[ https://issues.apache.org/jira/browse/HIVE-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-8071:
-----------------------------------
    Status: Patch Available  (was: Open)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (HIVE-8158) Optimize writeValue/setValue in VectorExpressionWriterFactory (in VectorReduceSinkOperator codepath)
Rajesh Balamohan created HIVE-8158:
--------------------------------------

             Summary: Optimize writeValue/setValue in VectorExpressionWriterFactory (in VectorReduceSinkOperator codepath)
                 Key: HIVE-8158
                 URL: https://issues.apache.org/jira/browse/HIVE-8158
             Project: Hive
          Issue Type: Bug
            Reporter: Rajesh Balamohan
            Assignee: Rajesh Balamohan

VectorReduceSinkOperator --> processOp --> makeValueWritable --> VectorExpressionWriterFactory --> writeValue(byte[], int, int) / setValue.

It appears that this goes through an additional layer of Text.encode/decode, causing CPU pressure (profiler output attached).

SettableStringObjectInspector / WritableStringObjectInspector has a "set(Object o, Text value)" method. It would be beneficial to use set(Object, Text) directly to save CPU cycles.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
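A plain-JDK sketch of why the direct set path above is cheaper. `TextLike` is a hypothetical stand-in for Hadoop's `Text`, not the real class: the slow path round-trips the bytes through a `String` (UTF-8 decode then re-encode), while the fast path just copies the byte range into a reusable buffer, which is what calling `set(Object, Text)` directly avoids.

```java
import java.nio.charset.StandardCharsets;

public class DirectSetSketch {
    // Minimal stand-in for Hadoop's Text: a reusable UTF-8 byte buffer.
    static final class TextLike {
        byte[] bytes = new byte[0];
        int length;

        // Slow path: decode to String, then re-encode (the extra layer).
        void setViaString(byte[] src, int off, int len) {
            String s = new String(src, off, len, StandardCharsets.UTF_8);
            byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
            bytes = encoded;
            length = encoded.length;
        }

        // Fast path: copy the byte range directly, no charset round trip.
        void setDirect(byte[] src, int off, int len) {
            if (bytes.length < len) {
                bytes = new byte[len];
            }
            System.arraycopy(src, off, bytes, 0, len);
            length = len;
        }
    }

    public static void main(String[] args) {
        byte[] row = "hello-world".getBytes(StandardCharsets.UTF_8);
        TextLike t = new TextLike();
        t.setDirect(row, 0, 5);
        System.out.println(new String(t.bytes, 0, t.length, StandardCharsets.UTF_8)); // hello
    }
}
```

Both paths produce the same bytes for valid UTF-8 input; the direct copy simply skips the per-row decode/encode work that showed up in the profiler.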
[jira] [Updated] (HIVE-8158) Optimize writeValue/setValue in VectorExpressionWriterFactory (in VectorReduceSinkOperator codepath)
[ https://issues.apache.org/jira/browse/HIVE-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-8158:
-----------------------------------
    Attachment: profiler_output.png

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HIVE-8158) Optimize writeValue/setValue in VectorExpressionWriterFactory (in VectorReduceSinkOperator codepath)
[ https://issues.apache.org/jira/browse/HIVE-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-8158:
-----------------------------------
    Attachment: HIVE-8158.1.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-7389) Reduce number of metastore calls in MoveTask (when loading dynamic partitions)
[ https://issues.apache.org/jira/browse/HIVE-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148659#comment-14148659 ]

Rajesh Balamohan commented on HIVE-7389:
----------------------------------------

[~hagleitn] Looks like I need to rebase the patch. I will upload it soon.

> Reduce number of metastore calls in MoveTask (when loading dynamic partitions)
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-7389
>                 URL: https://issues.apache.org/jira/browse/HIVE-7389
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.14.0
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>              Labels: performance
>         Attachments: HIVE-7389.1.patch, local_vm_testcase.txt
>
> When the number of dynamic partitions to be loaded is high, the time taken
> for 'MoveTask' is greater than that of the actual job in some scenarios. It
> would be possible to reduce overall runtime by reducing the number of calls
> made to the metastore from the MoveTask operation.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HIVE-7389) Reduce number of metastore calls in MoveTask (when loading dynamic partitions)
[ https://issues.apache.org/jira/browse/HIVE-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-7389:
-----------------------------------
    Attachment: HIVE-7389.2.patch

Rebasing the patch to trunk.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (HIVE-25827) Parquet file footer is read multiple times, when multiple splits are created in same file
Rajesh Balamohan created HIVE-25827:
---------------------------------------

             Summary: Parquet file footer is read multiple times, when multiple splits are created in same file
                 Key: HIVE-25827
                 URL: https://issues.apache.org/jira/browse/HIVE-25827
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan
         Attachments: image-2021-12-21-03-19-38-577.png

With large files, it is possible that multiple splits are created in the same file. With the current codebase, "ParquetRecordReaderBase" ends up reading the file footer for each split. This can be optimized so that the footer information is not read multiple times for the same file.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java#L160

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L91

!image-2021-12-21-03-19-38-577.png|width=1363,height=1256!

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
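A plain-JDK sketch of the caching idea above, not Hive's actual implementation: memoize the parsed footer per file path so that several splits of the same file pay the footer I/O only once. `Footer` and `readFooterFromFile` are hypothetical stand-ins; a production cache would also key on the file's modification time and bound its size.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class FooterCacheSketch {
    // Stand-in for a parsed Parquet footer (schema, row groups, stats...).
    record Footer(String filePath) {}

    private final Map<String, Footer> cache = new ConcurrentHashMap<>();
    final AtomicInteger footerReads = new AtomicInteger();

    // Pretend this is the expensive tail read + parse of the footer.
    private Footer readFooterFromFile(String filePath) {
        footerReads.incrementAndGet();
        return new Footer(filePath);
    }

    /** Every split asks for its file's footer; only the first request pays the I/O. */
    Footer footerFor(String filePath) {
        return cache.computeIfAbsent(filePath, this::readFooterFromFile);
    }
}
```

`computeIfAbsent` on a `ConcurrentHashMap` guarantees the loader runs at most once per key even when several record readers race on the same file.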
[jira] [Created] (HIVE-25845) Support ColumnIndexes for Parq files
Rajesh Balamohan created HIVE-25845:
---------------------------------------

             Summary: Support ColumnIndexes for Parq files
                 Key: HIVE-25845
                 URL: https://issues.apache.org/jira/browse/HIVE-25845
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan

https://issues.apache.org/jira/browse/PARQUET-1201

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java#L271-L273

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (HIVE-25913) Dynamic Partition Pruning Operator: Not working in iceberg tables
Rajesh Balamohan created HIVE-25913:
---------------------------------------

             Summary: Dynamic Partition Pruning Operator: Not working in iceberg tables
                 Key: HIVE-25913
                 URL: https://issues.apache.org/jira/browse/HIVE-25913
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan

Note that the "Dynamic Partitioning Event Operator" is missing under Map 3 for iceberg tables. This causes heavy IO in iceberg tables, leading to perf degradation.

{noformat}
ACID table
==========
explain select count(*) from store_sales, date_dim where d_month_seq between 1212 and 1212+11 and ss_store_sk is not null and ss_sold_date_sk=d_date_sk;

Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: hive_20220131032425_be2fab7f-7943-4aa1-bbdd-289139ea0f90:17
      Edges:
        Map 1 <- Map 3 (BROADCAST_EDGE)
        Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
      DagName: hive_20220131032425_be2fab7f-7943-4aa1-bbdd-289139ea0f90:17
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: ss_store_sk is not null (type: boolean)
                  Statistics: Num rows: 27503885621 Data size: 434880571744 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ss_store_sk is not null (type: boolean)
                    Statistics: Num rows: 26856185846 Data size: 424639398832 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: ss_sold_date_sk (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 26856185846 Data size: 214849486768 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: bigint)
                          1 _col0 (type: bigint)
                        input vertices:
                          1 Map 3
                        Statistics: Num rows: 5279977323 Data size: 42239818584 Basic stats: COMPLETE Column stats: COMPLETE
                        Group By Operator
                          aggregations: count()
                          minReductionHashAggr: 0.99
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            null sort order:
                            sort order:
                            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                            value expressions: _col0 (type: bigint)
            Execution mode: vectorized, llap
            LLAP IO: may be used (ACID table)
        Map 3
            Map Operator Tree:
                TableScan
                  alias: date_dim
                  filterExpr: (d_month_seq BETWEEN 1212 AND 1223 and d_date_sk is not null) (type: boolean)
                  Statistics: Num rows: 73049 Data size: 876588 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (d_month_seq BETWEEN 1212 AND 1223 and d_date_sk is not null) (type: boolean)
                    Statistics: Num rows: 359 Data size: 4308 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: d_date_sk (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 359 Data size: 2872 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: bigint)
                        null sort order: a
                        sort order: +
                        Map-reduce partition columns: _col0 (type: bigint)
                        Statistics: Num rows: 359 Data size: 2872 Basic stats: COMPLETE Column stats: COMPLETE
                      Select Operator
                        expressions: _col0 (type: bigint)
                        outputColumnNames: _col0
                        Statistics: Num rows: 359 Data size: 2872 Basic stats: COMPLETE Column stats: COMPLETE
                        Group By Operator
                          keys: _col0 (type: bigint)
                          minReductionHashAggr: 0.5013927
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 179 Data size: 1432 Basic stats: COMPLETE Column stats: COMPLETE
                          Dynamic Partitioning Event Operator
{noformat}
[jira] [Created] (HIVE-25927) Fix DataWritableReadSupport
Rajesh Balamohan created HIVE-25927:
---------------------------------------

             Summary: Fix DataWritableReadSupport
                 Key: HIVE-25927
                 URL: https://issues.apache.org/jira/browse/HIVE-25927
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan
         Attachments: Screenshot 2022-02-04 at 4.57.22 AM.png

!Screenshot 2022-02-04 at 4.57.22 AM.png|width=530,height=406!

Column matching takes O(n^2) operations.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (HIVE-25958) Optimise BasicStatsNoJobTask
Rajesh Balamohan created HIVE-25958:
---------------------------------------

             Summary: Optimise BasicStatsNoJobTask
                 Key: HIVE-25958
                 URL: https://issues.apache.org/jira/browse/HIVE-25958
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan

When a large number of files are present, analyzing a table (for stats) takes a lot longer, especially on cloud platforms. Each file is read sequentially for computing stats, which can be optimized.

{code:java}
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:293)
at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:506)
- locked <0x000642995b10> (a org.apache.hadoop.fs.s3a.S3AInputStream)
at org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:775)
- locked <0x000642995b10> (a org.apache.hadoop.fs.s3a.S3AInputStream)
at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:116)
at org.apache.orc.impl.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:574)
at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readFileData(RecordReaderUtils.java:282)
at org.apache.orc.impl.RecordReaderImpl.readAllDataStreams(RecordReaderImpl.java:1172)
at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1128)
at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1281)
at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1316)
at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:302)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:68)
at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:83)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.createReaderFromFile(OrcInputFormat.java:367)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.<init>(OrcInputFormat.java:276)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:2027)
at org.apache.hadoop.hive.ql.stats.BasicStatsNoJobTask$FooterStatCollector.run(BasicStatsNoJobTask.java:235)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

"HiveServer2-Background-Pool: Thread-5161" #5161 prio=5 os_prio=0 tid=0x7f271217d800 nid=0x21b7 waiting on condition [0x7f26fce88000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0006bee1b3a0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
at org.apache.hadoop.hive.ql.stats.BasicStatsNoJobTask.shutdownAndAwaitTermination(BasicStatsNoJobTask.java:426)
at org.apache.hadoop.hive.ql.stats.BasicStatsNoJobTask.aggregateStats(BasicStatsNoJobTask.java:338)
at org.apache.hadoop.hive.ql.stats.BasicStatsNoJobTask.process(BasicStatsNoJobTask.java:121)
at org.apache.hadoop.hive.ql.exec.StatsTask.execute(StatsTask.java:107)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:361)
at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:334)
at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:250)
{code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
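A plain-JDK sketch of the pattern the issue above points toward: collect per-file stats concurrently and rely on footer metadata only, instead of scanning data stripes per file. `FileStats` and `readFooterStats` are hypothetical stand-ins for what a columnar footer already carries; the real stripe-skipping logic lives in the ORC/Parquet readers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class StatsAggregationSketch {
    // Stand-in for the per-file stats a columnar file footer already carries.
    record FileStats(long rowCount, long rawDataSize) {}

    // Pretend this reads only the file footer/metadata, not the data stripes.
    static FileStats readFooterStats(String file) {
        return new FileStats(100, 4096);
    }

    /** Collect footer stats for all files concurrently and fold them into totals. */
    static FileStats aggregate(List<String> files, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<FileStats>> futures = new ArrayList<>();
            for (String f : files) {
                futures.add(pool.submit(() -> readFooterStats(f)));
            }
            long rows = 0, size = 0;
            for (Future<FileStats> fut : futures) {
                FileStats s;
                try {
                    s = fut.get();
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
                rows += s.rowCount();
                size += s.rawDataSize();
            }
            return new FileStats(rows, size);
        } finally {
            pool.shutdown();
        }
    }
}
```

On high-latency object stores, issuing the small footer reads concurrently (rather than one full-file read after another) is where most of the wall-clock win comes from.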
[jira] [Created] (HIVE-25981) Avoid checking for archived parts in analyze table
Rajesh Balamohan created HIVE-25981:
---------------------------------------

             Summary: Avoid checking for archived parts in analyze table
                 Key: HIVE-25981
                 URL: https://issues.apache.org/jira/browse/HIVE-25981
             Project: Hive
          Issue Type: Improvement
          Components: HiveServer2
            Reporter: Rajesh Balamohan

Analyze table on a large partitioned table is expensive due to unwanted checks on archived data.

{noformat}
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:3908)
- locked <0x0003d4c4c070> (a org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler)
at com.sun.proxy.$Proxy56.listPartitionsWithAuthInfo(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:3845)
at org.apache.hadoop.hive.ql.exec.ArchiveUtils.conflictingArchiveNameOrNull(ArchiveUtils.java:299)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.validate(SemanticAnalyzer.java:13579)
at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:241)
at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:196)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:615)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:561)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:555)
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:127)
at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:204)
at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:265)
at org.apache.hive.service.cli.operation.Operation.run(Operation.java:285)
{noformat}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (HIVE-26008) Dynamic partition pruning not sending right partitions with subqueries
Rajesh Balamohan created HIVE-26008:
---------------------------------------

             Summary: Dynamic partition pruning not sending right partitions with subqueries
                 Key: HIVE-26008
                 URL: https://issues.apache.org/jira/browse/HIVE-26008
             Project: Hive
          Issue Type: Improvement
          Components: HiveServer2
            Reporter: Rajesh Balamohan

DPP isn't working correctly when there are subqueries involved. Here is an example query (q83). Note that "date_dim" has another subquery involved. Due to this, the DPP operator ends up sending the entire "date_dim" to the fact tables. Because of this, the data scanned for the fact tables is much higher and the query runtime is increased.

For context, on a very small cluster, this query ran for 265 seconds; the rewritten query finished in 11 seconds! The fact table scan was 10 MB vs 10 GB.

{noformat}
HiveJoin(condition=[=($2, $5)], joinType=[inner])
  HiveJoin(condition=[=($0, $3)], joinType=[inner])
    HiveProject(cr_item_sk=[$1], cr_return_quantity=[$16], cr_returned_date_sk=[$26])
      HiveFilter(condition=[AND(IS NOT NULL($26), IS NOT NULL($1))])
        HiveTableScan(table=[[tpcds_bin_partitioned_orc_1, catalog_returns]], table:alias=[catalog_returns])
    HiveProject(i_item_sk=[$0], i_item_id=[$1])
      HiveFilter(condition=[AND(IS NOT NULL($1), IS NOT NULL($0))])
        HiveTableScan(table=[[tpcds_bin_partitioned_orc_1, item]], table:alias=[item])
  HiveProject(d_date_sk=[$0], d_date=[$2])
    HiveFilter(condition=[AND(IS NOT NULL($2), IS NOT NULL($0))])
      HiveTableScan(table=[[tpcds_bin_partitioned_orc_1, date_dim]], table:alias=[date_dim])
  HiveProject(d_date=[$0])
    HiveSemiJoin(condition=[=($1, $2)], joinType=[semi])
      HiveProject(d_date=[$2], d_week_seq=[$4])
        HiveFilter(condition=[AND(IS NOT NULL($4), IS NOT NULL($2))])
          HiveTableScan(table=[[tpcds_bin_partitioned_orc_1, date_dim]], table:alias=[date_dim])
      HiveProject(d_week_seq=[$4])
        HiveFilter(condition=[AND(IN($2, 1998-01-02:DATE, 1998-10-15:DATE, 1998-11-10:DATE), IS NOT NULL($4))])
          HiveTableScan(table=[[tpcds_bin_partitioned_orc_1, date_dim]], table:alias=[date_dim])
{noformat}

*Original Query & Plan:*

{noformat}
explain cbo
with sr_items as
 (select i_item_id item_id, sum(sr_return_quantity) sr_item_qty
  from store_returns, item, date_dim
  where sr_item_sk = i_item_sk
    and d_date in (select d_date
                   from date_dim
                   where d_week_seq in (select d_week_seq
                                        from date_dim
                                        where d_date in ('1998-01-02','1998-10-15','1998-11-10')))
    and sr_returned_date_sk = d_date_sk
  group by i_item_id),
 cr_items as
 (select i_item_id item_id, sum(cr_return_quantity) cr_item_qty
  from catalog_returns, item, date_dim
  where cr_item_sk = i_item_sk
    and d_date in (select d_date
                   from date_dim
                   where d_week_seq in (select d_week_seq
                                        from date_dim
                                        where d_date in ('1998-01-02','1998-10-15','1998-11-10')))
    and cr_returned_date_sk = d_date_sk
  group by i_item_id),
 wr_items as
 (select i_item_id item_id, sum(wr_return_quantity) wr_item_qty
  from web_returns, item, date_dim
  where wr_item_sk = i_item_sk
    and d_date in (select d_date
                   from date_dim
                   where d_week_seq in (select d_week_seq
                                        from date_dim
                                        where d_date in ('1998-01-02','1998-10-15','1998-11-10')))
    and wr_returned_date_sk = d_date_sk
  group by i_item_id)
select sr_items.item_id
      ,sr_item_qty
      ,sr_item_qty/(sr_item_qty+cr_item_qty+wr_item_qty)/3.0 * 100 sr_dev
      ,cr_item_qty
      ,cr_item_qty/(sr_item_qty+cr_item_qty+wr_item_qty)/3.0 * 100 cr_dev
      ,wr_item_qty
      ,wr_item_qty/(sr_item_qty+cr_item_qty+wr_item_qty)/3.0 * 100 wr_dev
      ,(sr_item_qty+cr_item_qty+wr_item_qty)/3.0 average
from sr_items, cr_items, wr_items
where sr_items.item_id=cr_items.item_id
  and sr_items.item_id=wr_items.item_id
order by sr_items.item_id, sr_item_qty
limit 100

INFO : Starting task [Stage-3:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20220307055109_88ad0cbd-bd40-45bc-92ae-ab15fa6b1da4); Time taken: 0.973 seconds
INFO : OK

Explain
CBO PLAN:
HiveSortLimit(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC], fetch=[100])
  HiveProject(item_id=[$0], sr_item_qty=[$4], sr_dev=[*(/(/($5, CAST(+(+($4, $1), $7)):DOUBLE), 3), 100)], cr_item_qty=[$1], cr_dev=[*(/(/($2, CAST(+(+($4, $1), $7)):DOUBLE), 3), 100)], wr_item_qty=[$7], wr_dev=[*(/(/($8, CAST(+(+($4, $1), $7)):DOUBLE), 3), 100)], average=[/(CAST(+(+($4, $1), $7)):DECIMAL(19, 0), 3:DECIMAL(1, 0))])
    HiveJoin(condition=[=($0, $6)], joinType=[inner])
      HiveJoin(condition=[=($3, $0)], joinType=[inner])
        HiveProject($f0=[$0], $f1=[$1], EXPR$0=[CAST($1):DOUBLE])
          HiveAggregate(group=[{4}], agg#0=[sum($1)])
            HiveSemiJoin(co
{noformat}
[jira] [Created] (HIVE-26013) Parquet predicate filters are not properly propagated to task configs at runtime
Rajesh Balamohan created HIVE-26013:
---------------------------------------

             Summary: Parquet predicate filters are not properly propagated to task configs at runtime
                 Key: HIVE-26013
                 URL: https://issues.apache.org/jira/browse/HIVE-26013
             Project: Hive
          Issue Type: Bug
            Reporter: Rajesh Balamohan

Hive's ParquetRecordReader sets the predicate filter in the config for the parquet libraries to read.

Ref: https://github.com/apache/hive/blob/master/ql%2Fsrc%2Fjava%2Forg%2Fapache%2Fhadoop%2Fhive%2Fql%2Fio%2Fparquet%2FParquetRecordReaderBase.java#L188

{code:java}
ParquetInputFormat.setFilterPredicate(conf, p);
{code}

This internally sets the "parquet.private.read.filter.predicate" variable in the config.

Ref: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fparquet%2Fhadoop%2FParquetInputFormat.java#L231

Config set in the compilation phase isn't visible to the tasks at runtime. This causes the filters to be lost, and tasks run with excessive IO.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
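A plain-JDK sketch of the failure mode described above, with a `Map` standing in for a Hadoop `Configuration`: the task's configuration is a snapshot taken at launch, so a property set on the driver-side config after (or outside of) that snapshot never reaches the task. The fix direction is to attach the serialized predicate to something that is actually shipped to tasks.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfPropagationSketch {
    // Stand-in for a Hadoop Configuration: just a map of properties.
    static Map<String, String> cloneForTask(Map<String, String> driverConf) {
        return new HashMap<>(driverConf); // snapshot taken at task launch
    }

    public static void main(String[] args) {
        Map<String, String> driverConf = new HashMap<>();
        Map<String, String> taskConf = cloneForTask(driverConf);

        // Setting the filter on the driver conf after the snapshot is taken
        // never reaches the task - the filter is silently lost.
        driverConf.put("parquet.private.read.filter.predicate", "serialized-filter");
        System.out.println(taskConf.containsKey("parquet.private.read.filter.predicate")); // false
    }
}
```

The property name is the real one from parquet-mr; the `cloneForTask` helper is illustrative only.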
[jira] [Created] (HIVE-26035) Move to directsql for ObjectStore::addPartitions
Rajesh Balamohan created HIVE-26035: --- Summary: Move to directsql for ObjectStore::addPartitions Key: HIVE-26035 URL: https://issues.apache.org/jira/browse/HIVE-26035 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan Currently {{addPartitions}} uses DataNucleus and is very slow for a large number of partitions. It would be good to move it to direct SQL. Lots of repeated SQL statements can be avoided as well (e.g. SDS, SERDE, TABLE_PARAMS). -- This message was sent by Atlassian Jira (v8.20.1#820001)
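As a rough illustration of the direct-SQL idea: the per-partition ORM round trips can collapse into one multi-row statement. The table and column names below are simplified stand-ins, not the exact metastore schema, and real code would bind parameters via a PreparedStatement rather than concatenate strings.

```java
import java.util.List;
import java.util.StringJoiner;

// Sketch: batch N partitions into a single INSERT instead of N
// DataNucleus round trips. PARTITIONS/TBL_ID/PART_NAME are simplified
// stand-ins for the real metastore schema, and production code would
// use a PreparedStatement with bound parameters, not string building.
public class DirectSqlBatch {

    static String buildPartitionInsert(long tblId, List<String> partNames) {
        StringJoiner rows = new StringJoiner(", ");
        for (String name : partNames) {
            rows.add("(" + tblId + ", '" + name + "')");
        }
        return "INSERT INTO PARTITIONS (TBL_ID, PART_NAME) VALUES " + rows;
    }

    public static void main(String[] args) {
        // One statement for all partitions; shared SDS/SERDE/TABLE_PARAMS rows
        // would similarly be written once, not once per partition.
        System.out.println(buildPartitionInsert(42,
                List.of("ds=2022-01-01", "ds=2022-01-02")));
    }
}
```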
[jira] [Created] (HIVE-26072) Enable vectorization for stats gathering (tablescan op)
Rajesh Balamohan created HIVE-26072: --- Summary: Enable vectorization for stats gathering (tablescan op) Key: HIVE-26072 URL: https://issues.apache.org/jira/browse/HIVE-26072 Project: Hive Issue Type: Bug Components: Hive Reporter: Rajesh Balamohan https://issues.apache.org/jira/browse/HIVE-24510 enabled vectorization for compute_bit_vector, but vectorization of the tablescan operator for stats gathering is still disabled by default. [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java#L2577] We need to enable vectorization for this; it can significantly reduce runtimes of analyze statements on large tables. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HIVE-26091) Support DecimalFilterPredicateLeafBuilder for parquet
Rajesh Balamohan created HIVE-26091: --- Summary: Support DecimalFilterPredicateLeafBuilder for parquet Key: HIVE-26091 URL: https://issues.apache.org/jira/browse/HIVE-26091 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/LeafFilterFactory.java#L41 It would be nice to have a DecimalFilterPredicateLeafBuilder. This would help support SARG pushdowns for DECIMAL columns.
{noformat}
2022-03-30 08:59:50,040 [ERROR] [TezChild] |read.ParquetFilterPredicateConverter|: fail to build predicate filter leaf with errors
org.apache.hadoop.hive.ql.metadata.HiveException: Conversion to Parquet FilterPredicate not supported for DECIMAL
org.apache.hadoop.hive.ql.metadata.HiveException: Conversion to Parquet FilterPredicate not supported for DECIMAL
 at org.apache.hadoop.hive.ql.io.parquet.LeafFilterFactory.getLeafFilterBuilderByType(LeafFilterFactory.java:223)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.buildFilterPredicateFromPredicateLeaf(ParquetFilterPredicateConverter.java:130)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:111)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:97)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:71)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:88)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.toFilterPredicate(ParquetFilterPredicateConverter.java:57)
 at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.setFilter(ParquetRecordReaderBase.java:184)
 at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:124)
 at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.(VectorizedParquetRecordReader.java:158)
 at org.apache.hadoop.hive.ql.io.parquet.VectorizedParquetInputFormat.getRecordReader(VectorizedParquetInputFormat.java:50)
 at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:87)
 at org.apache.hadoop.hive.ql.io.RecordReaderWrapper.create(RecordReaderWrapper.java:72)
 at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:429)
 at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
 at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
 at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
 at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
 at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:437)
 at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:282)
 at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:265)
 at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
 at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75)
 at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
 at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62)
 at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38)
 at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
 at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
 at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
 at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
{noformat}
-- This message was sent by Atlassian Jira (
[jira] [Created] (HIVE-26110) bulk insert into partitioned table creates lots of files in iceberg
Rajesh Balamohan created HIVE-26110: --- Summary: bulk insert into partitioned table creates lots of files in iceberg Key: HIVE-26110 URL: https://issues.apache.org/jira/browse/HIVE-26110 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan For example, create the web_returns table from tpcds in iceberg format and try to copy over the data from the regular table, i.e. "insert into web_returns_iceberg select * from web_returns". This inserts the data correctly; however, there are a lot of files present in each partition. IMO, the dynamic sort optimisation isn't working correctly, and this causes records not to be grouped by partition in the final phase. -- This message was sent by Atlassian Jira (v8.20.1#820001)
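What the sort optimisation is supposed to achieve can be sketched without Hive. The following is an illustrative model (not the Hive implementation): if rows reaching the final writers are grouped by partition key, each partition is handled by one writer and yields one file per task, regardless of how the input was interleaved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model: grouping incoming rows by partition key before
// writing means one writer (hence one file) per partition per task.
// Without the grouping, each task may open a writer per partition it
// happens to see, which is how a bulk insert ends up with many small files.
public class PartitionGrouping {

    // partition key -> rows; each map entry corresponds to one output file.
    static Map<String, List<String>> groupByPartition(List<String[]> rows) {
        Map<String, List<String>> files = new TreeMap<>();
        for (String[] row : rows) {          // row[0] = partition key, row[1] = payload
            files.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return files;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[]{"wr_returned_date_sk=2450816", "r1"},
                new String[]{"wr_returned_date_sk=2450817", "r2"},
                new String[]{"wr_returned_date_sk=2450816", "r3"});
        // Two partitions -> two "files", despite the interleaved input.
        System.out.println(groupByPartition(rows).keySet());
    }
}
```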
[jira] [Created] (HIVE-26115) Parquet footer is read 3 times when reading iceberg data
Rajesh Balamohan created HIVE-26115: --- Summary: Parquet footer is read 3 times when reading iceberg data Key: HIVE-26115 URL: https://issues.apache.org/jira/browse/HIVE-26115 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan Attachments: Screenshot 2022-04-05 at 10.08.27 AM.png, Screenshot 2022-04-05 at 10.08.35 AM.png, Screenshot 2022-04-05 at 10.08.50 AM.png, Screenshot 2022-04-05 at 10.09.03 AM.png !Screenshot 2022-04-05 at 10.08.27 AM.png|width=627,height=331! Here is the breakup of 3 footer reads per file. !Screenshot 2022-04-05 at 10.08.35 AM.png|width=1109,height=500! !Screenshot 2022-04-05 at 10.08.50 AM.png|width=1067,height=447! !Screenshot 2022-04-05 at 10.09.03 AM.png|width=827,height=303! HIVE-25827 already talks about the initial 2 footer reads per file. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HIVE-26128) Enabling dynamic runtime filtering in iceberg tables throws exception at runtime
Rajesh Balamohan created HIVE-26128: --- Summary: Enabling dynamic runtime filtering in iceberg tables throws exception at runtime Key: HIVE-26128 URL: https://issues.apache.org/jira/browse/HIVE-26128 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan E.g., TPCDS Q2 at 10 TB scale throws the following error when run with "hive.disable.unsafe.external.table.operations=false". The iceberg tables were created as external tables; setting "hive.disable.unsafe.external.table.operations=false" enables dynamic runtime filtering for them, but the query then throws the following error at runtime: {noformat} ]Vertex failed, vertexName=Map 6, vertexId=vertex_1649658279052__1_03, diagnostics=[Vertex vertex_1649658279052__1_03 [Map 6] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: date_dim initializer failed, vertex=vertex_1649658279052__1_03 [Map 6], java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:659) at java.util.ArrayList.get(ArrayList.java:435) at org.apache.iceberg.mr.hive.HiveIcebergFilterFactory.translateLeaf(HiveIcebergFilterFactory.java:114) at org.apache.iceberg.mr.hive.HiveIcebergFilterFactory.translate(HiveIcebergFilterFactory.java:86) at org.apache.iceberg.mr.hive.HiveIcebergFilterFactory.translate(HiveIcebergFilterFactory.java:80) at org.apache.iceberg.mr.hive.HiveIcebergFilterFactory.generateFilterExpression(HiveIcebergFilterFactory.java:59) at org.apache.iceberg.mr.hive.HiveIcebergInputFormat.getSplits(HiveIcebergInputFormat.java:92) at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:592) at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:900) at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:274) at org.apache.tez.dag.app.dag.RootInputInitializerManager.lambda$runInitializer$3(RootInputInitializerManager.java:199) at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898) at org.apache.tez.dag.app.dag.RootInputInitializerManager.runInitializer(RootInputInitializerManager.java:192) at org.apache.tez.dag.app.dag.RootInputInitializerManager.runInitializerAndProcessResult(RootInputInitializerManager.java:173) at org.apache.tez.dag.app.dag.RootInputInitializerManager.lambda$createAndStartInitializing$2(RootInputInitializerManager.java:167) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69) at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) ]Vertex killed, vertexName=Map 13, vertexId=vertex_1649658279052__1_07, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1649658279052__1_07 [Map 13] killed/failed due to:OTHER_VERTEX_FAILURE]Vertex killed, vertexName=Map 10, vertexId=vertex_1649658279052__1_06, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1649658279052__1_06 [Map 10] killed/failed due to:OTHER_VERTEX_FAILURE]Vertex killed, vertexName=Map 5, vertexId=vertex_1649658279052__1_04, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1649658279052__1_04 [Map 5] killed/failed due to:OTHER_VERTEX_FAILURE]Vertex killed, vertexName=Reducer 4, vertexId=vertex_1649658279052__1_11, diagnostics=[Vertex received Kill in NEW state., Vertex vertex_1649658279052__1_11 [Reducer 4] killed/failed due to:OTHER_VERTEX_FAILURE]Vertex killed, 
vertexName=Reducer 3, vertexId=vertex_1649658279052__1_10, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1649658279052__1_10 [Reducer 3] killed/failed due to:OTHER_VERTEX_FAILURE]Vertex killed, vertexName=Reducer 12, vertexId=vertex_1649658279052__1_09, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1649658279052__1_09 [Reducer 12] killed/failed due to:OTHER_VERTEX_FAILURE]Vertex killed, vertexName=Map 1, vertexId=vertex_1649658279052__1_08, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1649658279052__1_08 [Map 1] killed/failed due to:OT
[jira] [Created] (HIVE-26181) Add details on the number of partitions/entries in dynamic partition pruning
Rajesh Balamohan created HIVE-26181: --- Summary: Add details on the number of partitions/entries in dynamic partition pruning Key: HIVE-26181 URL: https://issues.apache.org/jira/browse/HIVE-26181 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan Related ticket: HIVE-26008 It will be good to print details on the number of partition pruning entries for debugging and for understanding the eff* of the query. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (HIVE-26185) Need support for metadataonly operations with iceberg (e.g select distinct on partition column)
Rajesh Balamohan created HIVE-26185: --- Summary: Need support for metadataonly operations with iceberg (e.g select distinct on partition column) Key: HIVE-26185 URL: https://issues.apache.org/jira/browse/HIVE-26185 Project: Hive Issue Type: Bug Components: HiveServer2 Reporter: Rajesh Balamohan
{noformat}
select distinct ss_sold_date_sk from store_sales
{noformat}
This query scans only 1800+ rows in Hive ACID, though it takes ages in the NullScanOptimiser during the compilation phase (https://issues.apache.org/jira/browse/HIVE-24262):
{noformat}
Hive ACID
INFO : Executing command(queryId=hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14): select distinct ss_sold_date_sk from store_sales
INFO : Compute 'ndembla-test2' is active.
INFO : Query ID = hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Subscribed to counters: [] for queryId: hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: select distinct ss_sold_date_s...store_sales (Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id application_1651102345385_)
INFO : Status: DAG finished successfully in 1.81 seconds
INFO : DAG ID: dag_1651102345385__5
INFO :
INFO : Query Execution Summary
INFO : --
INFO : OPERATION  DURATION
INFO : --
INFO : Compile Query  55.47s
INFO : Prepare Plan  2.32s
INFO : Get Query Coordinator (AM)  0.13s
INFO : Submit Plan  0.03s
INFO : Start DAG  0.09s
INFO : Run DAG  1.80s
INFO : --
INFO :
INFO : Task Execution Summary
INFO : --
INFO : VERTICES  DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
INFO : --
INFO : Map 1  1009.00  0  0  1,824  1,824
INFO : Reducer 2  0.00  0  0  1,824  0
INFO : --
INFO :
{noformat}
However, the same query scans *2.8 billion records* in iceberg format. This can be fixed. 
{noformat}
INFO : Executing command(queryId=hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72): select distinct ss_sold_date_sk from store_sales
INFO : Compute 'ndembla-test2' is active.
INFO : Query ID = hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Subscribed to counters: [] for queryId: hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: select distinct ss_sold_date_s...store_sales (Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id application_1651102345385_)
--
VERTICES  MODE  STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--
Map 1 ..  llap  SUCCEEDED  7141  7141  0  0  0  0
Reducer 2 ..  llap  SUCCEEDED  2  2  0  0  0  0
--
VERTICES: 02/02 [==>>] 100% ELAPSED TIME: 18.48 s
--
INFO : Status: DAG finished successfully in 17.97 seconds
INFO : DAG ID: dag_1651102345385__4
INFO :
INFO : Query Execution Summary
INFO : --
INFO : OPERATION  DURATION
INFO : --
INFO : Compile Query  1.81s
INFO : Prepare Plan  0.04s
INFO : Get Query Coordinator
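The metadata-only idea above can be sketched without Hive: "select distinct <partition column>" needs no row scan at all, because the distinct values are exactly the partition values the table metadata already tracks. Partition specs are modeled here as hypothetical "col=value" strings; real Hive/Iceberg partition metadata is richer.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Sketch of the metadata-only optimisation: the answer to
// "select distinct ss_sold_date_sk" is derivable from the partition
// list alone, with zero data-file reads.
public class MetadataOnlyDistinct {

    static Set<String> distinctFromPartitions(List<String> partitionSpecs, String column) {
        Set<String> values = new TreeSet<>();
        for (String spec : partitionSpecs) {
            String[] kv = spec.split("=", 2);
            if (kv[0].equals(column)) {
                values.add(kv[1]);   // value comes from metadata, not from data files
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> specs = List.of(
                "ss_sold_date_sk=2450816",
                "ss_sold_date_sk=2450817",
                "ss_sold_date_sk=2450816");
        System.out.println(distinctFromPartitions(specs, "ss_sold_date_sk"));
    }
}
```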
[jira] [Created] (HIVE-26194) Unable to interrupt query in the middle of long compilation
Rajesh Balamohan created HIVE-26194: --- Summary: Unable to interrupt query in the middle of long compilation Key: HIVE-26194 URL: https://issues.apache.org/jira/browse/HIVE-26194 Project: Hive Issue Type: Bug Components: HiveServer2 Reporter: Rajesh Balamohan *Issue:* * Certain queries can take lot longer time to compile, depending on the number of interactions with HMS. * When user tries to cancel such queries in the middle of compilation, it doesn't work. It interrupts the process only when the entire compilation phase is complete. * Example is given below (Q66 at 10 TB TPCDS) {noformat} . . . . . . . . . . . . . . . . . . . . . . .>,d_year . . . . . . . . . . . . . . . . . . . . . . .> ) . . . . . . . . . . . . . . . . . . . . . . .> ) x . . . . . . . . . . . . . . . . . . . . . . .> group by . . . . . . . . . . . . . . . . . . . . . . .> w_warehouse_name . . . . . . . . . . . . . . . . . . . . . . .>,w_warehouse_sq_ft . . . . . . . . . . . . . . . . . . . . . . .>,w_city . . . . . . . . . . . . . . . . . . . . . . .>,w_county . . . . . . . . . . . . . . . . . . . . . . .>,w_state . . . . . . . . . . . . . . . . . . . . . . .>,w_country . . . . . . . . . . . . . . . . . . . . . . .>,ship_carriers . . . . . . . . . . . . . . . . . . . . . . .>,year . . . . . . . . . . . . . . . . . . . . . . .> order by w_warehouse_name . . . . . . . . . . . . . . . . . . . . . . .> limit 100; Interrupting... Please be patient this may take some time. Interrupting... Please be patient this may take some time. Interrupting... Please be patient this may take some time. Interrupting... Please be patient this may take some time. Interrupting... Please be patient this may take some time. Interrupting... Please be patient this may take some time. ... ... ... ,w_city ,w_county ,w_state ,w_country ,ship_carriers ,year order by w_warehouse_name limit 100 INFO : Semantic Analysis Completed (retrial = false) ERROR : FAILED: command has been interrupted: after analyzing query. 
INFO : Compiling command(queryId=hive_20220502040541_14c76b6f-f6d2-4ab3-ad82-522f17ede63a) has been interrupted after 32.872 seconds <<< Notice that it interrupted only after the entire compilation was done, at ~32 seconds. Error: Query was cancelled. Illegal Operation state transition from CANCELED to ERROR (state=01000,code=0) {noformat} This becomes an issue in a busy cluster. Interrupt handling should be fixed in the compilation phase. -- This message was sent by Atlassian Jira (v8.20.7#820007)
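The fix direction is cooperative cancellation inside compilation itself. The sketch below is illustrative (plain Runnables standing in for the expensive metastore calls, not Hive's Driver code): the analyzer checks the interrupt flag between steps instead of noticing the cancel only once the whole phase finishes.

```java
import java.util.List;

// Sketch of cooperative cancellation during compilation: check the
// thread's interrupt flag between expensive steps (e.g. HMS calls),
// so a cancel takes effect promptly rather than after the full phase.
public class InterruptibleCompile {

    // Runs each step, bailing out quickly if the query was cancelled.
    static int analyze(List<Runnable> steps) throws InterruptedException {
        int done = 0;
        for (Runnable step : steps) {
            if (Thread.interrupted()) {   // clears the flag and reports cancellation
                throw new InterruptedException("query cancelled after " + done + " steps");
            }
            step.run();
            done++;
        }
        return done;
    }

    public static void main(String[] args) throws InterruptedException {
        // Without an interrupt, all steps run to completion.
        System.out.println(analyze(List.of(() -> {}, () -> {})));   // prints 2
    }
}
```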
[jira] [Created] (HIVE-26490) Iceberg: Residual expression is constructed for the task from multiple places causing CPU burn
Rajesh Balamohan created HIVE-26490: --- Summary: Iceberg: Residual expression is constructed for the task from multiple places causing CPU burn Key: HIVE-26490 URL: https://issues.apache.org/jira/browse/HIVE-26490 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan Attachments: Screenshot 2022-08-22 at 12.58.47 PM.jpg "HiveIcebergInputFormat.residualForTask(task, job)" is invoked from multiple places causing CPU burn. !Screenshot 2022-08-22 at 12.58.47 PM.jpg|width=918,height=932! -- This message was sent by Atlassian Jira (v8.20.10#820010)
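One possible shape of a fix (illustrative only, not the actual patch): memoize the residual expression per task, so the several call sites that each invoke residualForTask(task, job) today share one computed result instead of re-deriving it. Strings stand in for the task and expression types here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Sketch: cache the residual expression per task so repeated callers
// hit the cache instead of re-running the expensive derivation.
public class ResidualCache {

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    final AtomicInteger derivations = new AtomicInteger();   // exposed for testing

    String residualFor(String taskId, Function<String, String> derive) {
        return cache.computeIfAbsent(taskId, id -> {
            derivations.incrementAndGet();   // the expensive derivation runs once per task
            return derive.apply(id);
        });
    }

    public static void main(String[] args) {
        ResidualCache cache = new ResidualCache();
        Function<String, String> derive = id -> "residual-for-" + id;
        cache.residualFor("task-1", derive);
        cache.residualFor("task-1", derive);          // second caller hits the cache
        System.out.println(cache.derivations.get());  // prints 1
    }
}
```

ConcurrentHashMap.computeIfAbsent also makes the derivation safe if multiple threads ask for the same task concurrently.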
[jira] [Created] (HIVE-26491) Iceberg: Drop table should purge the data for V2 tables
Rajesh Balamohan created HIVE-26491: --- Summary: Iceberg: Drop table should purge the data for V2 tables Key: HIVE-26491 URL: https://issues.apache.org/jira/browse/HIVE-26491 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan
# Create an external table stored by iceberg in orc format, and convert it to the iceberg v2 table format via alter table statements. This should ideally set the "'external.table.purge'='true'" property by default, but that is missing for V2 tables.
# Insert data into it.
# Drop the table. This drops the metadata information, but retains the actual data.
Set "'external.table.purge'='true'" as the default for iceberg (if it hasn't been set yet). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26496) FetchOperator scans delete_delta folders multiple times causing slowness
Rajesh Balamohan created HIVE-26496: --- Summary: FetchOperator scans delete_delta folders multiple times causing slowness Key: HIVE-26496 URL: https://issues.apache.org/jira/browse/HIVE-26496 Project: Hive Issue Type: Bug Components: HiveServer2 Reporter: Rajesh Balamohan FetchOperator scans far more files/directories than needed. For example, here is the layout of a table which had a set of updates and deletes; a set of "delta" and "delete_delta" folders has been created.
{noformat}
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/base_001
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_002_002_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_003_003_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_004_004_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_005_005_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_006_006_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_007_007_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_008_008_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_009_009_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_010_010_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_011_011_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_012_012_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_013_013_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_014_014_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_015_015_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_016_016_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_017_017_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_018_018_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_019_019_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_020_020_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_021_021_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_022_022_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_002_002_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_003_003_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_004_004_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_005_005_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_006_006_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_007_007_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_008_008_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_009_009_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_010_010_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_011_011_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_012_012_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_013_013_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_014_014_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_015_015_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_016_016_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_017_017_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_018_018_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_019_019_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_020_020_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_021_021_
{noformat}
When a user runs *{color:#0747a6}{{select * from date_dim}}{color}* from beeline, FetchOperator tries to compute splits in "date_dim". This "base" and "delta" folders and computes 2
[jira] [Created] (HIVE-26507) Iceberg: In place metadata generation may not work for certain datatypes
Rajesh Balamohan created HIVE-26507: --- Summary: Iceberg: In place metadata generation may not work for certain datatypes Key: HIVE-26507 URL: https://issues.apache.org/jira/browse/HIVE-26507 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan "alter table" statements can be used for generating iceberg metadata information (i.e. for converting external tables -> iceberg tables). As part of this process, it also converts certain datatypes to iceberg-compatible types (e.g. char -> string); "iceberg.mr.schema.auto.conversion" enables this conversion. This can cause issues at runtime. Here is an example:
{noformat}
Before conversion (external table):
select count(*) from customer_demographics where cd_gender = 'F' and cd_marital_status = 'U' and cd_education_status = '2 yr Degree';
27440

After conversion (iceberg table):
select count(*) from customer_demographics where cd_gender = 'F' and cd_marital_status = 'U' and cd_education_status = '2 yr Degree';
0

select count(*) from customer_demographics where cd_gender = 'F' and cd_marital_status = 'U' and trim(cd_education_status) = '2 yr Degree';
27440
{noformat}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
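A plausible explanation for the behaviour above, consistent with the trim() workaround but an assumption rather than a confirmed root cause: CHAR(n) values are stored space-padded and compared ignoring trailing spaces, while plain strings are compared verbatim, so the padded bytes no longer match the unpadded literal after conversion. A self-contained sketch of the two comparison semantics:

```java
// Assumed-cause sketch: CHAR(n) pads stored values to n characters but
// compares ignoring trailing pad spaces; once the column is read back
// as a plain string, the padded value is compared byte-for-byte and the
// unpadded literal stops matching (only trim() restores the match).
public class CharVsString {

    static String charPad(String value, int n) {
        return String.format("%-" + n + "s", value);   // right-pad with spaces to length n
    }

    static boolean charEquals(String stored, String literal) {
        return stored.stripTrailing().equals(literal.stripTrailing());  // CHAR semantics
    }

    static boolean stringEquals(String stored, String literal) {
        return stored.equals(literal);                                  // string semantics
    }

    public static void main(String[] args) {
        String stored = charPad("2 yr Degree", 20);    // what a CHAR(20) column would hold
        System.out.println(charEquals(stored, "2 yr Degree"));            // prints true
        System.out.println(stringEquals(stored, "2 yr Degree"));          // prints false
        System.out.println(stringEquals(stored.trim(), "2 yr Degree"));   // prints true
    }
}
```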
[jira] [Created] (HIVE-26520) Improve dynamic partition pruning operator when subqueries are involved
Rajesh Balamohan created HIVE-26520: --- Summary: Improve dynamic partition pruning operator when subqueries are involved Key: HIVE-26520 URL: https://issues.apache.org/jira/browse/HIVE-26520 Project: Hive Issue Type: Improvement Components: HiveServer2 Reporter: Rajesh Balamohan Attachments: q58_test.pdf Dynamic partition pruning operator sends entire date_dim table and due to this, entire catalog_sales data is scanned causing huge IO and decoding cost. If dynamic partition pruning operator was created after the "date_dim" subquery has been evaluated, it would have saved huge IO cost. E.g It would have just taken 6-7 partition scans instead of 1800+ partitions. Consider the following simplified query as example {noformat} select count(*) from (select i_item_id item_id ,sum(cs_ext_sales_price) cs_item_rev from catalog_sales ,item ,date_dim where cs_item_sk = i_item_sk and d_date in (select d_date from date_dim where d_week_seq = (select d_week_seq from date_dim where d_date = '1998-02-21')) and cs_sold_date_sk = d_date_sk group by i_item_id) a; CBO PLAN: HiveAggregate(group=[{}], agg#0=[count()]) HiveProject(i_item_id=[$0]) HiveAggregate(group=[{4}]) HiveSemiJoin(condition=[=($6, $7)], joinType=[semi]) HiveJoin(condition=[=($2, $5)], joinType=[inner]) HiveJoin(condition=[=($0, $3)], joinType=[inner]) HiveProject(cs_item_sk=[$14], cs_ext_sales_price=[$22], cs_sold_date_sk=[$33]) HiveFilter(condition=[AND(IS NOT NULL($33), IS NOT NULL($14))]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_1_external, catalog_sales]], table:alias=[catalog_sales]) HiveProject(i_item_sk=[$0], i_item_id=[$1]) HiveFilter(condition=[IS NOT NULL($0)]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_1_external, item]], table:alias=[item]) HiveProject(d_date_sk=[$0], d_date=[$2]) HiveFilter(condition=[AND(IS NOT NULL($2), IS NOT NULL($0))]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_1_external, date_dim]], table:alias=[date_dim]) HiveProject(d_date=[$0]) HiveJoin(condition=[=($1, $3)], 
joinType=[inner]) HiveJoin(condition=[true], joinType=[inner]) HiveProject(d_date=[$2], d_week_seq=[$4]) HiveFilter(condition=[AND(IS NOT NULL($2), IS NOT NULL($4))]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_1_external, date_dim]], table:alias=[date_dim]) HiveProject(cnt=[$0]) HiveFilter(condition=[<=(sq_count_check($0), 1)]) HiveProject(cnt=[$0]) HiveAggregate(group=[{}], cnt=[COUNT()]) HiveFilter(condition=[=($2, 1998-02-21)]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_1_external, date_dim]], table:alias=[date_dim]) HiveProject(d_week_seq=[$4]) HiveFilter(condition=[AND(=($2, 1998-02-21), IS NOT NULL($4))]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_1_external, date_dim]], table:alias=[date_dim]) {noformat} I will attach the formatted plan for reference as well. If the planner generated the dynamic partition pruning event after "date_dim" got evaluated in "Map 7", it would have been very efficient. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26529) Fix VectorizedSupport support for DECIMAL_64 in HiveIcebergInputFormat
Rajesh Balamohan created HIVE-26529: --- Summary: Fix VectorizedSupport support for DECIMAL_64 in HiveIcebergInputFormat Key: HIVE-26529 URL: https://issues.apache.org/jira/browse/HIVE-26529 Project: Hive Issue Type: Improvement Components: HiveServer2 Reporter: Rajesh Balamohan For supporting vectorized reads in parquet, DECIMAL_64 support in ORC was disabled in HiveIcebergInputFormat. This causes regressions in queries. [https://github.com/apache/hive/blob/master/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergInputFormat.java#L182] It would be good to restore DECIMAL_64 support in the iceberg input format. -- This message was sent by Atlassian Jira (v8.20.10#820010)
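To see why losing DECIMAL_64 is costly, here is a conceptual sketch (plain BigDecimal/long, not Hive's internal classes): a decimal whose precision fits in 18 digits can be carried as a long scaled by 10^scale, so vectorized arithmetic stays in primitive longs instead of allocating a decimal object per value.

```java
import java.math.BigDecimal;

// Conceptual sketch of the DECIMAL_64 representation: store the decimal's
// unscaled value in a long and do column arithmetic entirely in longs,
// converting back to a decimal only at the end. Disabling DECIMAL_64
// forces the slower object-per-value representation.
public class Decimal64Sketch {

    static long toScaledLong(BigDecimal value, int scale) {
        return value.setScale(scale).unscaledValue().longValueExact();
    }

    static BigDecimal fromScaledLong(long raw, int scale) {
        return BigDecimal.valueOf(raw, scale);
    }

    public static void main(String[] args) {
        // Sum a "column" of decimals using only long arithmetic.
        long[] column = {
                toScaledLong(new BigDecimal("12.34"), 2),
                toScaledLong(new BigDecimal("0.66"), 2)};
        long sum = column[0] + column[1];
        System.out.println(fromScaledLong(sum, 2));   // prints 13.00
    }
}
```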
[jira] [Created] (HIVE-26532) Remove logger from critical path in VectorMapJoinInnerLongOperator::processBatch
Rajesh Balamohan created HIVE-26532: --- Summary: Remove logger from critical path in VectorMapJoinInnerLongOperator::processBatch Key: HIVE-26532 URL: https://issues.apache.org/jira/browse/HIVE-26532 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan Attachments: Screenshot 2022-09-12 at 10.03.43 AM.png !Screenshot 2022-09-12 at 10.03.43 AM.png|width=895,height=872! -- This message was sent by Atlassian Jira (v8.20.10#820010)
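The standard fix pattern for this class of problem can be sketched without the Hive operator (the Logger interface and batch logic below are illustrative): keep logging out of the per-row hot path, and guard any remaining per-batch message so it is only built when the level is enabled.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the fix pattern: no logging work per row in processBatch;
// at most one guarded, per-batch debug message.
public class HotPathLogging {

    interface Logger {
        boolean isDebugEnabled();
        void debug(String msg);
    }

    static long processBatch(long[] keys, Logger log) {
        if (log.isDebugEnabled()) {
            // Message built at most once per batch, and only when needed.
            log.debug("processBatch: " + keys.length + " rows");
        }
        long matches = 0;
        for (long key : keys) {
            if (key % 2 == 0) {   // stand-in for the hash-table probe
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        AtomicInteger debugCalls = new AtomicInteger();
        Logger disabled = new Logger() {
            public boolean isDebugEnabled() { return false; }
            public void debug(String msg) { debugCalls.incrementAndGet(); }
        };
        System.out.println(processBatch(new long[]{1, 2, 3, 4}, disabled));  // prints 2
        System.out.println(debugCalls.get());                                // prints 0
    }
}
```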
[jira] [Created] (HIVE-26540) Iceberg: Select queries after update/delete become expensive in reading contents
Rajesh Balamohan created HIVE-26540: --- Summary: Iceberg: Select queries after update/delete become expensive in reading contents Key: HIVE-26540 URL: https://issues.apache.org/jira/browse/HIVE-26540 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan
- Create the basic date_dim table in tpcds and store it in iceberg v2 format
- Update a few thousand records a couple of times
- Run a simple select query
{{select count ( * ) from date_dim_ice where d_qoy = 11 and d_dom=2 and d_fy_week_seq=3;}}
This takes 8-18 seconds, whereas ACID takes 1.5 seconds. The basic issue is that it reads files multiple times (i.e. both data and delete files). Lines of interest in IcebergInputFormat.java:
{noformat}
InternalRecordWrapper wrapper = new InternalRecordWrapper(readSchema.asStruct());
Evaluator filter = new Evaluator(readSchema.asStruct(), residual, caseSensitive);
return CloseableIterable.filter(iter, record -> filter.eval(wrapper.wrap((StructLike) record)));
{noformat}
{noformat}
case GENERIC:
  DeleteFilter deletes = new GenericDeleteFilter(table.io(), currentTask, table.schema(), readSchema);
  Schema requiredSchema = deletes.requiredSchema();
  return deletes.filter(openGeneric(currentTask, requiredSchema));
{noformat}
These get evaluated for each row in the data file, causing the delay. -- This message was sent by Atlassian Jira (v8.20.10#820010)
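The per-row cost described above can be modeled with a toy read path (illustrative only; positional deletes as a set of row positions, the residual as a predicate). One obvious mitigation, knowable once per task rather than per row, is to skip the evaluations entirely when the residual is trivially true or the delete set is empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.LongPredicate;

// Toy model of the iceberg read path: both the delete filter and the
// residual filter are applied per row; the fast paths below skip them
// when they cannot change the result.
public class ReadPath {

    static List<Long> read(List<Long> rows, Set<Integer> deletedPositions,
                           LongPredicate residual, boolean residualAlwaysTrue) {
        boolean hasDeletes = !deletedPositions.isEmpty();   // decided once, not per row
        List<Long> out = new ArrayList<>();
        for (int pos = 0; pos < rows.size(); pos++) {
            if (hasDeletes && deletedPositions.contains(pos)) {
                continue;                                   // row removed by a delete file
            }
            long value = rows.get(pos);
            if (!residualAlwaysTrue && !residual.test(value)) {
                continue;                                   // residual only when non-trivial
            }
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> rows = List.of(10L, 11L, 12L, 13L);
        // Position 1 deleted; residual keeps even values only.
        System.out.println(read(rows, Set.of(1), v -> v % 2 == 0, false));   // prints [10, 12]
    }
}
```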
[jira] [Created] (HIVE-26686) Iceberg: Having a lot of snapshots impacts runtime due to multiple loads of the table
Rajesh Balamohan created HIVE-26686: --- Summary: Iceberg: Having lot of snapshots impacts runtime due to multiple loads of the table Key: HIVE-26686 URL: https://issues.apache.org/jira/browse/HIVE-26686 Project: Hive Issue Type: Improvement Components: HiveServer2 Reporter: Rajesh Balamohan When a large number of snapshots is present in the manifest file, it adversely impacts the runtime of queries (e.g. 15 mts trickle feed). Having more snapshots slows down runtime in 2 additional places: 1. At the time of populating statistics, it loads the table details again (i.e. a refresh-table invocation). 2. In the hive metastore hook (HiveIcebergMetaHook::doPreAlterTable), during pre alter table. Need to check whether the entire table information, along with snapshot details, is needed for this. {noformat} at org.apache.hive.iceberg.com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeArray(JsonNodeDeserializer.java:437) at org.apache.hive.iceberg.com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:261) at org.apache.hive.iceberg.com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:68) at org.apache.hive.iceberg.com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:15) at org.apache.hive.iceberg.com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4218) at org.apache.hive.iceberg.com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3251) at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:264) at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:258) at org.apache.iceberg.BaseMetastoreTableOperations.lambda$refreshFromMetadataLocation$0(BaseMetastoreTableOperations.java:177) at org.apache.iceberg.BaseMetastoreTableOperations$$Lambda$685/0x000840e1b440.apply(Unknown Source) at 
org.apache.iceberg.BaseMetastoreTableOperations.lambda$refreshFromMetadataLocation$1(BaseMetastoreTableOperations.java:191) at org.apache.iceberg.BaseMetastoreTableOperations$$Lambda$686/0x000840e1a840.run(Unknown Source) at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:404) at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:214) at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:198) at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:190) at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:191) at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:176) at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:171) at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:153) at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:96) at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:79) at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:44) at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:116) at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:106) at org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.getBasicStatistics(HiveIcebergStorageHandler.java:309) at org.apache.hadoop.hive.ql.stats.BasicStatsTask$BasicStatsProcessor.(BasicStatsTask.java:138) at org.apache.hadoop.hive.ql.stats.BasicStatsTask.aggregateStats(BasicStatsTask.java:301) at org.apache.hadoop.hive.ql.stats.BasicStatsTask.process(BasicStatsTask.java:108) at org.apache.hadoop.hive.ql.exec.StatsTask.execute(StatsTask.java:107) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) at 
org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:360) at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:333) at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:250) at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:111) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:806) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:540) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:534) at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:232) at org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:89) at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork
[jira] [Created] (HIVE-26699) Iceberg: S3 fadvise can hurt JSON parsing significantly in DWX
Rajesh Balamohan created HIVE-26699: --- Summary: Iceberg: S3 fadvise can hurt JSON parsing significantly in DWX Key: HIVE-26699 URL: https://issues.apache.org/jira/browse/HIVE-26699 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan Hive reads the JSON metadata information (TableMetadataParser::read()) multiple times, e.g. during query compilation, AM split computation, stats computation, during commits, etc. With large JSON files (due to multiple inserts), it takes a lot longer on S3 with "fs.s3a.experimental.input.fadvise" set to "random" (e.g. on the order of 10x). To be on the safer side, it would be good to set this to "normal" mode in the configs when reading iceberg tables. -- This message was sent by Atlassian Jira (v8.20.10#820010)
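A minimal sketch of the suggested override, assuming it is placed in the deployment's hive-site.xml or core-site.xml (the key is the standard hadoop-aws one with values sequential/random/normal; where exactly to set it is deployment-specific):

```xml
<!-- Hypothetical site-config override: prefer "normal" (adaptive) fadvise
     so large sequential reads of Iceberg metadata JSON are not penalized. -->
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>normal</value>
</property>
```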
[jira] [Created] (HIVE-26714) Iceberg delete files are read twice during query processing causing delays
Rajesh Balamohan created HIVE-26714: --- Summary: Iceberg delete files are read twice during query processing causing delays Key: HIVE-26714 URL: https://issues.apache.org/jira/browse/HIVE-26714 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan Attachments: Screenshot 2022-11-08 at 9.37.17 PM.png Delete positions are read twice during query processing, causing delays at runtime. !Screenshot 2022-11-08 at 9.37.17 PM.png|width=707,height=629! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26874) Iceberg: Positional delete files are not cached
Rajesh Balamohan created HIVE-26874: --- Summary: Iceberg: Positional delete files are not cached Key: HIVE-26874 URL: https://issues.apache.org/jira/browse/HIVE-26874 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan With iceberg v2 (MOR mode), "positional delete" files are not cached, causing runtime delays. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26913) HiveVectorizedReader::parquetRecordReader should reuse footer information
Rajesh Balamohan created HIVE-26913: --- Summary: HiveVectorizedReader::parquetRecordReader should reuse footer information Key: HIVE-26913 URL: https://issues.apache.org/jira/browse/HIVE-26913 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan HiveVectorizedReader::parquetRecordReader should reuse the details of the parquet footer instead of reading it again. It reads the parquet footer here: [https://github.com/apache/hive/blob/master/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/vector/HiveVectorizedReader.java#L230-L232] It then reads the footer again when constructing the vectorized record reader: [https://github.com/apache/hive/blob/master/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/vector/HiveVectorizedReader.java#L249] [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/VectorizedParquetInputFormat.java#L50] Check the codepath of VectorizedParquetRecordReader::setupMetadataAndParquetSplit: [https://github.com/apache/hive/blob/6b0139188aba6a95808c8d1bec63a651ec9e4bdc/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java#L180] It should be possible to share "ParquetMetadata" in VectorizedParquetRecordReader. -- This message was sent by Atlassian Jira (v8.20.10#820010)
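One way to share the footer, sketched as self-contained Java (FooterInfo and memoize are illustrative stand-ins, not the actual Hive/Parquet types): wrap the expensive footer read in a memoizing supplier so that both consumers (split setup and vectorized reader construction) see the same parsed metadata, and the file is read only once.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class FooterReuseSketch {
    static final AtomicInteger footerReads = new AtomicInteger();

    // Stand-in for the parsed footer; the real type would be
    // org.apache.parquet.hadoop.metadata.ParquetMetadata.
    record FooterInfo(long rowCount) {}

    // Wrap an expensive loader so it runs at most once and its result is
    // shared by every caller.
    static <T> Supplier<T> memoize(Supplier<T> loader) {
        return new Supplier<T>() {
            private T value;
            @Override public synchronized T get() {
                if (value == null) value = loader.get();
                return value;
            }
        };
    }

    public static void main(String[] args) {
        Supplier<FooterInfo> footer = memoize(() -> {
            footerReads.incrementAndGet();      // simulates the actual file read
            return new FooterInfo(1000L);
        });
        long rows = footer.get().rowCount();    // first consumer triggers the read
        long again = footer.get().rowCount();   // second consumer reuses it
        System.out.println(rows + " " + again + " " + footerReads.get()); // 1000 1000 1
    }
}
```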
[jira] [Created] (HIVE-26917) Upgrade parquet to 1.12.3
Rajesh Balamohan created HIVE-26917: --- Summary: Upgrade parquet to 1.12.3 Key: HIVE-26917 URL: https://issues.apache.org/jira/browse/HIVE-26917 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26927) Iceberg: Add support for set_current_snapshotid
Rajesh Balamohan created HIVE-26927: --- Summary: Iceberg: Add support for set_current_snapshotid Key: HIVE-26927 URL: https://issues.apache.org/jira/browse/HIVE-26927 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan Hive currently supports a "rollback" feature. Once rolled back, it is not possible to move from an older snapshot to a newer snapshot; it ends up throwing an {color:#0747a6}"org.apache.iceberg.exceptions.ValidationException: Cannot roll back to snapshot, not an ancestor of the current state:" {color}error. It would be good to support a "set_current_snapshot" function to move to different snapshot ids. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26928) LlapIoImpl::getParquetFooterBuffersFromCache throws exception when metadata cache is disabled
Rajesh Balamohan created HIVE-26928: --- Summary: LlapIoImpl::getParquetFooterBuffersFromCache throws exception when metadata cache is disabled Key: HIVE-26928 URL: https://issues.apache.org/jira/browse/HIVE-26928 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan When metadata / LLAP cache is disabled, "iceberg + parquet" throws the following error. It should check for "metadatacache" correctly or fix it in LlapIoImpl. {noformat} Caused by: java.lang.NullPointerException: Metadata cache must not be null at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:897) at org.apache.hadoop.hive.llap.io.api.impl.LlapIoImpl.getParquetFooterBuffersFromCache(LlapIoImpl.java:467) at org.apache.iceberg.mr.hive.vector.HiveVectorizedReader.parquetRecordReader(HiveVectorizedReader.java:227) at org.apache.iceberg.mr.hive.vector.HiveVectorizedReader.reader(HiveVectorizedReader.java:162) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.apache.iceberg.common.DynMethods$UnboundMethod.invokeChecked(DynMethods.java:65) at org.apache.iceberg.common.DynMethods$UnboundMethod.invoke(DynMethods.java:77) at org.apache.iceberg.common.DynMethods$StaticMethod.invoke(DynMethods.java:196) at org.apache.iceberg.mr.mapreduce.IcebergInputFormat$IcebergRecordReader.openVectorized(IcebergInputFormat.java:331) at org.apache.iceberg.mr.mapreduce.IcebergInputFormat$IcebergRecordReader.open(IcebergInputFormat.java:377) at org.apache.iceberg.mr.mapreduce.IcebergInputFormat$IcebergRecordReader.nextTask(IcebergInputFormat.java:270) at org.apache.iceberg.mr.mapreduce.IcebergInputFormat$IcebergRecordReader.initialize(IcebergInputFormat.java:266) 
at org.apache.iceberg.mr.mapred.AbstractMapredIcebergRecordReader.<init>(AbstractMapredIcebergRecordReader.java:40) at org.apache.iceberg.mr.hive.vector.HiveIcebergVectorizedRecordReader.<init>(HiveIcebergVectorizedRecordReader.java:41) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26944) FileSinkOperator shouldn't check for compactiontable for every row being processed
Rajesh Balamohan created HIVE-26944: --- Summary: FileSinkOperator shouldn't check for compactiontable for every row being processed Key: HIVE-26944 URL: https://issues.apache.org/jira/browse/HIVE-26944 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan Attachments: Screenshot 2023-01-16 at 10.32.24 AM.png -- This message was sent by Atlassian Jira (v8.20.10#820010)
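The fix implied by the summary above can be sketched in self-contained Java (an illustrative class, not the actual FileSinkOperator): evaluate the "is this a compaction table" property once at operator initialization, and have the per-row path read only a cached boolean.

```java
public class FileSinkSketch {
    // Counts how many times the table property is actually inspected.
    static int propertyLookups = 0;

    // Stand-in for the per-row check being hoisted.
    static boolean isCompactionTable(String tableProps) {
        propertyLookups++;
        return tableProps.contains("compaction");
    }

    private final boolean compactionTable; // decided once, at init time
    int rowsWritten = 0;

    FileSinkSketch(String tableProps) {
        this.compactionTable = isCompactionTable(tableProps);
    }

    void process(int row) {
        // Per-row path reads the cached boolean; no property lookup per row.
        if (!compactionTable) {
            rowsWritten++;
        }
    }

    public static void main(String[] args) {
        FileSinkSketch sink = new FileSinkSketch("plain-table");
        for (int i = 0; i < 1_000; i++) sink.process(i);
        System.out.println(sink.rowsWritten + " " + propertyLookups); // 1000 1
    }
}
```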
[jira] [Created] (HIVE-26950) (CTLT) Create external table like V2 table is not preserving table properties
Rajesh Balamohan created HIVE-26950: --- Summary: (CTLT) Create external table like V2 table is not preserving table properties Key: HIVE-26950 URL: https://issues.apache.org/jira/browse/HIVE-26950 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan # Create an external iceberg V2 table, e.g. t1 # "create external table t2 like t1" <--- This ends up creating a V1 table: "format-version=2" is not retained, and "'format'='iceberg/parquet'" is also not retained. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26951) Setting details in PositionDeleteInfo takes up lot of CPU cycles
Rajesh Balamohan created HIVE-26951: --- Summary: Setting details in PositionDeleteInfo takes up lot of CPU cycles Key: HIVE-26951 URL: https://issues.apache.org/jira/browse/HIVE-26951 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan Attachments: Screenshot 2023-01-17 at 11.29.29 AM.png, Screenshot 2023-01-17 at 11.29.36 AM.png !Screenshot 2023-01-17 at 11.29.29 AM.png|width=898,height=532! !Screenshot 2023-01-17 at 11.29.36 AM.png|width=1000,height=591! This was observed with merge-into statements. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26974) CTL from iceberg table should copy partition fields correctly
Rajesh Balamohan created HIVE-26974: --- Summary: CTL from iceberg table should copy partition fields correctly Key: HIVE-26974 URL: https://issues.apache.org/jira/browse/HIVE-26974 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan # Create an iceberg table. Ensure it has a partition field. # Run "create external table like x" # The table created in #2 misses out on creating the relevant partition field. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26975) MERGE: Wrong reducer estimate causing smaller files to be created
Rajesh Balamohan created HIVE-26975: --- Summary: MERGE: Wrong reducer estimate causing smaller files to be created Key: HIVE-26975 URL: https://issues.apache.org/jira/browse/HIVE-26975 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan * "Merge into" estimates the wrong number of reducers, causing a large number of small files to be created (e.g. 400+ files of 3+ MB each). * This can be reproduced by writing data into the "store_sales" table in iceberg format from another source table (using merge-into). ** e.g. running the following a few times creates the wrong number of reduce tasks, causing a lot of small files to be created in the iceberg table. {noformat} MERGE INTO store_sales_t t using ssv s ON ( t.ss_item_sk = s.ss_item_sk AND t.ss_customer_sk = s.ss_customer_sk AND t.ss_sold_date_sk = "2451181" AND ( ( Floor(( s.ss_item_sk ) / 1000) * 1000 ) BETWEEN 1000 AND 2000 ) AND s.ss_ext_discount_amt < 0.0 ) WHEN matched AND t.ss_ext_discount_amt IS NULL THEN UPDATE SET ss_ext_discount_amt = 0.0 WHEN NOT matched THEN INSERT ( ss_sold_time_sk, ss_item_sk, ss_customer_sk, ss_cdemo_sk, ss_hdemo_sk, ss_addr_sk, ss_store_sk, ss_promo_sk, ss_ticket_number, ss_quantity, ss_wholesale_cost, ss_list_price, ss_sales_price, ss_ext_discount_amt, ss_ext_sales_price, ss_ext_wholesale_cost, ss_ext_list_price, ss_ext_tax, ss_coupon_amt, ss_net_paid, ss_net_paid_inc_tax, ss_net_profit, ss_sold_date_sk ) VALUES ( s.ss_sold_time_sk, s.ss_item_sk, s.ss_customer_sk, s.ss_cdemo_sk, s.ss_hdemo_sk, s.ss_addr_sk, s.ss_store_sk, s.ss_promo_sk, s.ss_ticket_number, s.ss_quantity, s.ss_wholesale_cost, s.ss_list_price, s.ss_sales_price, s.ss_ext_discount_amt, s.ss_ext_sales_price, s.ss_ext_wholesale_cost, s.ss_ext_list_price, s.ss_ext_tax, s.ss_coupon_amt, s.ss_net_paid, s.ss_net_paid_inc_tax, s.ss_net_profit, "2451181") {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26978) Stale "Runtime stats" causes poor query planning
Rajesh Balamohan created HIVE-26978: --- Summary: Stale "Runtime stats" causes poor query planning Key: HIVE-26978 URL: https://issues.apache.org/jira/browse/HIVE-26978 Project: Hive Issue Type: Improvement Components: HiveServer2 Reporter: Rajesh Balamohan Attachments: Screenshot 2023-01-24 at 10.23.16 AM.png * Runtime stats can be stored in hiveserver or in the metastore via "hive.query.reexecution.stats.persist.scope". * Even though the table is dropped and recreated, it ends up showing the old stats via "RUNTIME" stats. Here is an example (note that the table is empty, but gets dataSize and numRows from RUNTIME stats). * This causes a suboptimal plan for "MERGE INTO" queries by creating a CUSTOM_EDGE instead of a broadcast edge. !Screenshot 2023-01-24 at 10.23.16 AM.png|width=2053,height=753! -- This message was sent by Atlassian Jira (v8.20.10#820010)
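One possible shape of a fix, as a self-contained Java sketch (TableKey/Stats are hypothetical; the real runtime-stats store is keyed differently): include the table's creation time in the cache key, so stats recorded against a dropped table can never match its recreated namesake.

```java
import java.util.HashMap;
import java.util.Map;

public class RuntimeStatsCacheSketch {
    // Hypothetical cache key: table name plus creation time.
    record TableKey(String name, long createTime) {}
    record Stats(long numRows, long dataSize) {}

    static final Map<TableKey, Stats> runtimeStats = new HashMap<>();

    public static void main(String[] args) {
        // Stats recorded against the original table instance.
        TableKey v1 = new TableKey("db.t", 1000L);
        runtimeStats.put(v1, new Stats(2_000_000L, 9_999_999L));

        // After drop + recreate the createTime differs, so the stale entry
        // no longer matches and the planner falls back to fresh stats.
        TableKey v2 = new TableKey("db.t", 2000L);
        System.out.println(runtimeStats.containsKey(v1) + " "
            + runtimeStats.containsKey(v2)); // true false
    }
}
```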
[jira] [Created] (HIVE-26997) Iceberg: Vectorization gets disabled at runtime in merge-into statements
Rajesh Balamohan created HIVE-26997: --- Summary: Iceberg: Vectorization gets disabled at runtime in merge-into statements Key: HIVE-26997 URL: https://issues.apache.org/jira/browse/HIVE-26997 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan Attachments: explain_merge_into.txt *Query:* Think of the "ssv" table as a table containing trickle-feed data in the following query. "store_sales_delete_1" is the destination table. {noformat} MERGE INTO tpcds_1000_iceberg_mor_v4.store_sales_delete_1 t USING tpcds_1000_update.ssv s ON (t.ss_item_sk = s.ss_item_sk AND t.ss_customer_sk=s.ss_customer_sk AND t.ss_sold_date_sk = "2451181" AND ((Floor((s.ss_item_sk) / 1000) * 1000) BETWEEN 1000 AND 2000) AND s.ss_ext_discount_amt < 0.0) WHEN matched AND t.ss_ext_discount_amt IS NULL THEN UPDATE SET ss_ext_discount_amt = 0.0 WHEN NOT matched THEN INSERT (ss_sold_time_sk, ss_item_sk, ss_customer_sk, ss_cdemo_sk, ss_hdemo_sk, ss_addr_sk, ss_store_sk, ss_promo_sk, ss_ticket_number, ss_quantity, ss_wholesale_cost, ss_list_price, ss_sales_price, ss_ext_discount_amt, ss_ext_sales_price, ss_ext_wholesale_cost, ss_ext_list_price, ss_ext_tax, ss_coupon_amt, ss_net_paid, ss_net_paid_inc_tax, ss_net_profit, ss_sold_date_sk) VALUES (s.ss_sold_time_sk, s.ss_item_sk, s.ss_customer_sk, s.ss_cdemo_sk, s.ss_hdemo_sk, s.ss_addr_sk, s.ss_store_sk, s.ss_promo_sk, s.ss_ticket_number, s.ss_quantity, s.ss_wholesale_cost, s.ss_list_price, s.ss_sales_price, s.ss_ext_discount_amt, s.ss_ext_sales_price, s.ss_ext_wholesale_cost, s.ss_ext_list_price, s.ss_ext_tax, s.ss_coupon_amt, s.ss_net_paid, s.ss_net_paid_inc_tax, s.ss_net_profit, "2451181") {noformat} *Issue:* # The Map phase is not getting vectorized due to the "PARTITION__SPEC__ID" column {noformat} Map notVectorizedReason: Select expression for SELECT operator: Virtual column PARTITION__SPEC__ID is not supported {noformat} 2. The "Reducer 2" stage isn't vectorized. 
{noformat} Reduce notVectorizedReason: exception: java.lang.RuntimeException: Full Outer Small Table Key Mapping duplicate column 0 in ordered column map {0=(value column: 30, type info: int), 1=(value column: 31, type info: int)} when adding value column 53, type into int stack trace: org.apache.hadoop.hive.ql.exec.vector.VectorColumnOrderedMap.add(VectorColumnOrderedMap.java:102), org.apache.hadoop.hive.ql.exec.vector.VectorColumnSourceMapping.add(VectorColumnSourceMapping.java:41), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.canSpecializeMapJoin(Vectorizer.java:3865), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.validateAndVectorizeOperator(Vectorizer.java:5246), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.doProcessChild(Vectorizer.java:988), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.doProcessChildren(Vectorizer.java:874), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.validateAndVectorizeOperatorTree(Vectorizer.java:841), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer.access$2400(Vectorizer.java:251), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.validateAndVectorizeReduceOperators(Vectorizer.java:2298), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.validateAndVectorizeReduceOperators(Vectorizer.java:2246), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.validateAndVectorizeReduceWork(Vectorizer.java:2224), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.convertReduceWork(Vectorizer.java:2206), org.apache.hadoop.hive.ql.optimizer.physical.Vectorizer$VectorizationDispatcher.dispatch(Vectorizer.java:1038), org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111), org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180), ... {noformat} I have attached the explain plan for this, which has details on this. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27003) Iceberg: Vectorization missed out for update/delete due to virtual columns
Rajesh Balamohan created HIVE-27003: --- Summary: Iceberg: Vectorization missed out for update/delete due to virtual columns Key: HIVE-27003 URL: https://issues.apache.org/jira/browse/HIVE-27003 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan Attachments: delete_iceberg_vect.txt, update_iceberg_vect.txt Vectorization is missed out during table scan due to the addition of virtual columns during scans. I will attach the plan details here with. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27005) Iceberg: Col stats are not used in queries
Rajesh Balamohan created HIVE-27005: --- Summary: Iceberg: Col stats are not used in queries Key: HIVE-27005 URL: https://issues.apache.org/jira/browse/HIVE-27005 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan Attachments: col_stats.txt 1. Though insert queries compute col stats during runtime, the stats are not persisted in HMS during the final call. 2. Due to #1, col stats are not available during runtime for hive queries. This includes col stats, NDV etc. So unless users explicitly run "analyze table" statements, queries can have suboptimal plans. E.g. see the attached col_stats.txt (note that there are no col stats being used). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27010) Reduce compilation time
Rajesh Balamohan created HIVE-27010: --- Summary: Reduce compilation time Key: HIVE-27010 URL: https://issues.apache.org/jira/browse/HIVE-27010 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan Context: Post HIVE-24645, compilation time for queries has increased. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27013) Provide an option to enable iceberg manifest caching via table properties
Rajesh Balamohan created HIVE-27013: --- Summary: Provide an option to enable iceberg manifest caching via table properties Key: HIVE-27013 URL: https://issues.apache.org/jira/browse/HIVE-27013 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan I tried the following, thinking that it would work with iceberg manifest caching, but it didn't: {noformat} alter table store_sales set tblproperties('io.manifest.cache-enabled'='true'); {noformat} Creating this ticket as a placeholder to fix the same. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27014) Iceberg: getSplits/planTasks should filter out relevant folders instead of scanning entire table
Rajesh Balamohan created HIVE-27014: --- Summary: Iceberg: getSplits/planTasks should filter out relevant folders instead of scanning entire table Key: HIVE-27014 URL: https://issues.apache.org/jira/browse/HIVE-27014 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan With dynamic partition pruning, only the relevant folders in fact tables are scanned. In tez, DynamicPartitionPruner sets the relevant filters. In iceberg, these filters are applied only after "Table::planTasks()" is invoked. This forces the entire table metadata to be scanned, with the unwanted partitions thrown away afterwards, which makes split computation expensive (e.g. for store_sales, it has to look at all 1800+ partitions and throw away the unwanted ones). For short-running queries, split computation takes 3-5+ seconds. Creating this ticket as a placeholder to make use of the relevant filters from DPP. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27049) Iceberg: Provide current snapshot version in show-create-table
Rajesh Balamohan created HIVE-27049: --- Summary: Iceberg: Provide current snapshot version in show-create-table Key: HIVE-27049 URL: https://issues.apache.org/jira/browse/HIVE-27049 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan It would be helpful to show the "current snapshot" id in the "show create table" statement, to make debugging easier. Otherwise, the user has to explicitly query the metadata or read the JSON file to get this info. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27050) Iceberg: MOR: Restrict reducer extrapolation to contain number of small files being created
Rajesh Balamohan created HIVE-27050: --- Summary: Iceberg: MOR: Restrict reducer extrapolation to contain number of small files being created Key: HIVE-27050 URL: https://issues.apache.org/jira/browse/HIVE-27050 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan Scenario: # Create a simple table in iceberg (MOR mode), e.g. store_sales_delete_1 # Insert some data into it. # Run an update statement as follows ## "update store_sales_delete_1 set ss_sold_time_sk=699060 where ss_sold_time_sk=69906" Hive estimates the number of reducers as "1". But due to "hive.tez.max.partition.factor", which defaults to "2.0", it doubles the number of reducers. To put it in perspective, it creates very small positional delete files spread across different reducers. This causes problems during reading, as all of these files have to be opened. # When iceberg MOR tables are involved in updates/deletes/merges, disable "hive.tez.max.partition.factor", or set it to "1.0" irrespective of the user setting. # Have explicit logs for easier debugging; the user shouldn't be confused about why the setting is not taking effect. -- This message was sent by Atlassian Jira (v8.20.10#820010)
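The arithmetic described above, as a tiny self-contained Java sketch (the method names are illustrative, not Hive APIs): with an estimate of 1 reducer and the default factor of 2.0, extrapolation yields 2 reducers; clamping the factor to 1.0 on MOR write paths keeps it at 1, i.e. a single positional delete file instead of several tiny ones.

```java
public class ReducerFactorSketch {
    // Mirrors the described behaviour: the base estimate is scaled up by
    // hive.tez.max.partition.factor.
    static int extrapolatedReducers(int estimated, double maxPartitionFactor) {
        return (int) Math.ceil(estimated * maxPartitionFactor);
    }

    // Proposed behaviour for MOR write paths: clamp the factor to 1.0 so a
    // single-reducer estimate stays a single reducer.
    static int morWriteReducers(int estimated, double maxPartitionFactor) {
        return extrapolatedReducers(estimated, Math.min(maxPartitionFactor, 1.0));
    }

    public static void main(String[] args) {
        System.out.println(extrapolatedReducers(1, 2.0)); // default today: 2
        System.out.println(morWriteReducers(1, 2.0));     // clamped: 1
    }
}
```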
[jira] [Created] (HIVE-27084) Iceberg: Stats are not populated correctly during query compilation
Rajesh Balamohan created HIVE-27084: --- Summary: Iceberg: Stats are not populated correctly during query compilation Key: HIVE-27084 URL: https://issues.apache.org/jira/browse/HIVE-27084 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan - Table stats are not properly used/computed during the query compilation phase. - Here is an example: compare the scan data-size estimates in the two plans below; the query with the extra filter gets a larger estimate than the regular query. This is just an example; real-world queries can get bad query plans because of this. {{303658262936 with the extra filter, vs 10470974584 without}} {noformat} explain select count(*) from store_sales where ss_sold_date_sk=2450822 and ss_wholesale_cost > 0.0 Explain STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez DagId: hive_20230216065808_80d68e3f-3a6b-422b-9265-50bc707ae3c6:48 Edges: Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE) DagName: hive_20230216065808_80d68e3f-3a6b-422b-9265-50bc707ae3c6:48 Vertices: Map 1 Map Operator Tree: TableScan alias: store_sales filterExpr: ((ss_sold_date_sk = 2450822) and (ss_wholesale_cost > 0)) (type: boolean) Statistics: Num rows: 2755519629 Data size: 303658262936 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((ss_sold_date_sk = 2450822) and (ss_wholesale_cost > 0)) (type: boolean) Statistics: Num rows: 5 Data size: 550 Basic stats: COMPLETE Column stats: NONE Select Operator Statistics: Num rows: 5 Data size: 550 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: count() minReductionHashAggr: 0.99 mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator null sort order: sort order: Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: bigint) Execution mode: vectorized, llap LLAP IO: all inputs (cache only) Reducer 2 Execution mode: 
vectorized, llap Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink 58 rows selected (0.73 seconds) explain select count(*) from store_sales where ss_sold_date_sk=2450822 INFO : Starting task [Stage-3:EXPLAIN] in serial mode INFO : Completed executing command(queryId=hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7); Time taken: 0.061 seconds INFO : OK Explain STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez DagId: hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7:49 Edges: Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE) DagName: hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7:49 Vertices: Map 1 Map Operator Tree: TableScan alias: store_sales filterExpr: (ss_sold_date_sk = 2450822) (type: boolean) Statistics: Num rows: 2755519629 Data size: 10470974584 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (ss_sold_date_sk = 2450822) (type: boolean) Statistics: Num rows: 5 Data size: 18 Basic stats: COMPLETE Column stats: NONE Select Operator Statistics: Num rows: 5 Data size: 18 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: count() minReductionHashAggr: 0.99
[jira] [Created] (HIVE-27099) Iceberg: select count(*) from table queries all data
Rajesh Balamohan created HIVE-27099: --- Summary: Iceberg: select count(*) from table queries all data Key: HIVE-27099 URL: https://issues.apache.org/jira/browse/HIVE-27099 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan select count(*) is scanning all data. Though complete basic stats are available, it launched a Tez job that wasn't needed. The second issue is that it ended up scanning the ENTIRE 148 GB dataset, which is not required at all; it should have got the counts from the parquet files themselves. Ideally, the total record count would come from the manifests themselves. Data is stored in parquet format in external tables. This may be broken for parquet, as for ORC it is able to read less data (footer info). 1. Consider fixing count( * ) for parquet. 2. Check if it is possible to read stats from iceberg manifests after #1. {noformat} explain select count(*) from store_sales; Explain STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez DagId: hive_20230223031934_2abeb3b9-8c18-4ff7-a8f9-df7368010189:5 Edges: Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE) DagName: hive_20230223031934_2abeb3b9-8c18-4ff7-a8f9-df7368010189:5 Vertices: Map 1 Map Operator Tree: TableScan alias: store_sales Statistics: Num rows: 2879966589 Data size: 195666988943 Basic stats: COMPLETE Column stats: COMPLETE Select Operator Statistics: Num rows: 2879966589 Data size: 195666988943 Basic stats: COMPLETE Column stats: COMPLETE Group By Operator aggregations: count() minReductionHashAggr: 0.5 mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator null sort order: sort order: Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col0 (type: bigint) Execution mode: vectorized Reducer 2 Execution mode: vectorized Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) mode: mergepartial outputColumnNames: _col0 
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink 53 rows selected (1.454 seconds) 0: jdbc:hive2://ve0:218> select count(*) from store_sales; INFO : Query ID = hive_20230223031940_9ff5d61d-1fe2-4476-a561-7820e4a3a5f8 INFO : Total jobs = 1 INFO : Launching Job 1 out of 1 INFO : Starting task [Stage-1:MAPRED] in serial mode INFO : Subscribed to counters: [] for queryId: hive_20230223031940_9ff5d61d-1fe2-4476-a561-7820e4a3a5f8 INFO : Session is already open INFO : Dag name: select count(*) from store_sales (Stage-1) INFO : Status: Running (Executing on YARN cluster with App id application_1676286357243_0061) -- VERTICES MODESTATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED -- Map 1 .. container SUCCEEDED76776700 0 0 Reducer 2 .. container SUCCEEDED 1 100 0 0 -- VERTICES: 02/02 [==>>] 100% ELAPSED TIME: 54.94 s -- INFO : Status: DAG finished successfully in 54.85 seconds INFO : INFO : Query Execution Summary INFO : -- INFO : OPERATIONDURATION INFO : -- INFO : Compile Query
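The "ideal situation" above (answering count(*) from the manifests) can be sketched as follows. This is a hypothetical illustration, not Hive's or Iceberg's actual API: Iceberg manifest entries carry a per-file record count, so the total can be summed without touching the data files, falling back to a scan only when row-level delete files make the metadata alone insufficient.

```python
# Sketch (assumed field names, not the Iceberg library API): derive
# count(*) from per-file record counts stored in manifest metadata.

def count_from_manifests(manifest_entries):
    """Sum per-file record counts; return None (fall back to a scan)
    if any file has row-level deletes attached."""
    total = 0
    for entry in manifest_entries:
        if entry.get("delete_file", False):
            return None  # metadata alone cannot answer the query
        total += entry["record_count"]
    return total

# Toy manifest listing; the totals add up to the plan's 2879966589 rows.
entries = [
    {"path": "part-0.parquet", "record_count": 1_000},
    {"path": "part-1.parquet", "record_count": 2_879_965_589},
]
total = count_from_manifests(entries)
```

With such a path, the whole query would be answered from a handful of small metadata files instead of a 767-task Tez vertex.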
[jira] [Created] (HIVE-27119) Iceberg: Delete from table generates lot of files
Rajesh Balamohan created HIVE-27119: --- Summary: Iceberg: Delete from table generates lot of files Key: HIVE-27119 URL: https://issues.apache.org/jira/browse/HIVE-27119 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan

"delete" generates a lot of files due to the way data is distributed to the reducers: the number of files per partition is driven by the number of reduce tasks. One workaround could be to explicitly control the number of reducers; creating this ticket to have a long-term fix.

{noformat}
explain delete from store_Sales where ss_customer_sk % 10 = 0;
INFO : Compiling command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b): explain delete from store_Sales where ss_customer_sk % 10 = 0
INFO : No Stats for tpcds_1000_iceberg_mor_v4@store_sales, Columns: ss_sold_time_sk, ss_cdemo_sk, ss_promo_sk, ss_ext_discount_amt, ss_ext_sales_price, ss_net_profit, ss_addr_sk, ss_ticket_number, ss_wholesale_cost, ss_item_sk, ss_ext_list_price, ss_sold_date_sk, ss_store_sk, ss_coupon_amt, ss_quantity, ss_list_price, ss_sales_price, ss_customer_sk, ss_ext_wholesale_cost, ss_net_paid, ss_ext_tax, ss_hdemo_sk, ss_net_paid_inc_tax
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b); Time taken: 0.704 seconds
INFO : Executing command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b): explain delete from store_Sales where ss_customer_sk % 10 = 0
INFO : Starting task [Stage-4:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b); Time taken: 0.005 seconds
INFO : OK

Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2
  Stage-3 depends on stages: Stage-0

STAGE
PLANS: Stage: Stage-1 Tez DagId: hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b:377 Edges: Reducer 2 <- Map 1 (SIMPLE_EDGE) DagName: hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b:377 Vertices: Map 1 Map Operator Tree: TableScan alias: store_sales filterExpr: ((ss_customer_sk % 10) = 0) (type: boolean) Statistics: Num rows: 2755519629 Data size: 3643899155232 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((ss_customer_sk % 10) = 0) (type: boolean) Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: PARTITION__SPEC__ID (type: int), PARTITION__HASH (type: bigint), FILE__PATH (type: string), ROW__POSITION (type: bigint), ss_sold_time_sk (type: int), ss_item_sk (type: int), ss_customer_sk (type: int), ss_cdemo_sk (type: int), ss_hdemo_sk (type: int), ss_addr_sk (type: int), ss_store_sk (type: int), ss_promo_sk (type: int), ss_ticket_number (type: bigint), ss_quantity (type: int), ss_wholesale_cost (type: decimal(7,2)), ss_list_price (type: decimal(7,2)), ss_sales_price (type: decimal(7,2)), ss_ext_discount_amt (type: decimal(7,2)), ss_ext_sales_price (type: decimal(7,2)), ss_ext_wholesale_cost (type: decimal(7,2)), ss_ext_list_price (type: decimal(7,2)), ss_ext_tax (type: decimal(7,2)), ss_coupon_amt (type: decimal(7,2)), ss_net_paid (type: decimal(7,2)), ss_net_paid_inc_tax (type: decimal(7,2)), ss_net_profit (type: decimal(7,2)), ss_sold_date_sk (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26 Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: int), _col1 (type: bigint), _col2 (type: string), _col3 (type: bigint) null sort order: sort order: Statistics: Num rows: 
1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE value expressions: _col4 (type: int), _col5 (type: int), _col6 (type: int), _col7 (type: int), _col8 (type: int), _col9 (type: int), _col10 (type: int), _col11 (type: int), _col12 (type: bigint), _col13 (type: int), _col14 (type: decimal(7,2)), _col15 (type: decimal(7,2)), _col16 (type: decimal(7,2)), _col17 (type
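The file-count blow-up described above can be simulated. This is a toy model, not Hive's writer code: when delete rows are hash-distributed across R reducers, each reducer that receives rows for a table partition opens its own output file for that partition, so the file count scales with partitions × reducers rather than with partitions.

```python
# Toy simulation: count distinct (reducer, partition) writer pairs,
# each of which produces at least one delete file.

def files_written(num_partitions, num_reducers, rows_per_partition):
    writers = set()
    for p in range(num_partitions):
        for row in range(rows_per_partition):
            reducer = hash((p, row)) % num_reducers  # hash distribution
            writers.add((reducer, p))                # this pair opens a file
    return len(writers)

# One reducer: one file per partition. Many reducers: ~partitions * reducers.
few = files_written(10, 1, 100)
many = files_written(10, 100, 5000)
```

Clustering delete rows by partition (or capping the reducer count) collapses `many` back toward `few`, which is the long-term fix the ticket asks for.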
[jira] [Created] (HIVE-27144) Alter table partitions need not DBNotificationListener for external tables
Rajesh Balamohan created HIVE-27144: --- Summary: Alter table partitions need not DBNotificationListener for external tables Key: HIVE-27144 URL: https://issues.apache.org/jira/browse/HIVE-27144 Project: Hive Issue Type: Improvement Components: HiveServer2 Reporter: Rajesh Balamohan

DBNotificationListener may not be needed for external tables. Even for "analyze table blah compute statistics for columns" on external partitioned tables, it invokes DBNotificationListener for every partition.

{noformat}
at org.datanucleus.store.query.Query.execute(Query.java:1726)
at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374)
at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:216)
at org.apache.hadoop.hive.metastore.ObjectStore.addNotificationEvent(ObjectStore.java:11774)
at jdk.internal.reflect.GeneratedMethodAccessor135.invoke(Unknown Source)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.18/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(java.base@11.0.18/Method.java:566)
at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:97)
at com.sun.proxy.$Proxy33.addNotificationEvent(Unknown Source)
at org.apache.hive.hcatalog.listener.DbNotificationListener.process(DbNotificationListener.java:1308)
at org.apache.hive.hcatalog.listener.DbNotificationListener.onAlterPartition(DbNotificationListener.java:458)
at org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier$14.notify(MetaStoreListenerNotifier.java:161)
at org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:328)
at org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:390)
at org.apache.hadoop.hive.metastore.HiveAlterHandler.alterPartitions(HiveAlterHandler.java:863)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.alter_partitions_with_environment_context(HiveMetaStore.java:6253)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.alter_partitions_req(HiveMetaStore.java:6201)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@11.0.18/Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@11.0.18/NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.18/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(java.base@11.0.18/Method.java:566)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:160)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:121)
at com.sun.proxy.$Proxy34.alter_partitions_req(Unknown Source)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$alter_partitions_req.getResult(ThriftHiveMetastore.java:21532)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$alter_partitions_req.getResult(ThriftHiveMetastore.java:21511)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:652)
at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:647)
at java.security.AccessController.doPrivileged(java.base@11.0.18/Native Method)
{noformat}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
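The proposed change amounts to gating listener notification on the table type. A minimal sketch, with hypothetical shapes rather than the metastore's actual API: per-partition ALTER events on EXTERNAL tables skip the DBNotificationListener round trips shown in the stack above.

```python
# Sketch (hypothetical API, not HMS code): skip per-partition listener
# notifications for external tables.

def notify_alter_partitions(table, partitions, listeners):
    """Returns the number of listener invocations fired."""
    fired = 0
    if table.get("tableType") == "EXTERNAL_TABLE":
        # assumption: no replication/event consumer depends on these events
        return fired
    for _ in partitions:
        for listener in listeners:
            listener(table)
            fired += 1
    return fired

managed = {"tableType": "MANAGED_TABLE"}
external = {"tableType": "EXTERNAL_TABLE"}
noop = lambda t: None
```

For an analyze over thousands of external partitions, this turns thousands of NOTIFICATION_LOG inserts into zero.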
[jira] [Created] (HIVE-27159) Filters are not pushed down for decimal format in Parquet
Rajesh Balamohan created HIVE-27159: --- Summary: Filters are not pushed down for decimal format in Parquet Key: HIVE-27159 URL: https://issues.apache.org/jira/browse/HIVE-27159 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan

Decimal filters are not created and pushed down to Parquet readers. This causes latency and unwanted row processing in query execution: it throws an exception at runtime and ends up processing far more rows than needed. E.g. Q13.

{noformat}
Parquet: (Map 1)
INFO : Task Execution Summary
INFO : --------------------------------------------------------------------------------------------
INFO : VERTICES     DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
INFO : --------------------------------------------------------------------------------------------
INFO : Map 1            31254.00             0            0    549,181,950             133
INFO : Map 3                0.00             0            0         73,049             365
INFO : Map 4             2027.00             0            0      6,000,000       1,689,919
INFO : Map 5                0.00             0            0          7,200           1,440
INFO : Map 6              517.00             0            0      1,920,800         493,920
INFO : Map 7                0.00             0            0          1,002           1,002
INFO : Reducer 2        18716.00             0            0            133               0
INFO : --------------------------------------------------------------------------------------------

ORC:
INFO : Task Execution Summary
INFO : --------------------------------------------------------------------------------------------
INFO : VERTICES     DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
INFO : --------------------------------------------------------------------------------------------
INFO : Map 1             6556.00             0            0    267,146,063             152
INFO : Map 3                0.00             0            0         10,000             365
INFO : Map 4             2014.00             0            0      6,000,000       1,689,919
INFO : Map 5                0.00             0            0          7,200           1,440
INFO : Map 6              504.00             0            0      1,920,800         493,920
INFO : Reducer 2         3159.00             0            0            152               0
INFO : --------------------------------------------------------------------------------------------
{noformat}

{noformat}
Map 1
    Map Operator Tree:
        TableScan
          alias: store_sales
          filterExpr: (ss_hdemo_sk is not null and ss_addr_sk is not null and ss_cdemo_sk is not null and ss_store_sk is not null and ((ss_sales_price >= 100) or (ss_sales_price <= 150) or (ss_sales_price >= 50) or (ss_sales_price <= 100) or (ss_sales_price >= 150) or (ss_sales_price <= 200)) and ((ss_net_profit >= 100) or (ss_net_profit <= 200) or (ss_net_profit >= 150) or (ss_net_profit <= 300) or (ss_net_profit >= 50) or (ss_net_profit <= 250))) (type: boolean)
          probeDecodeDetails: cacheKey:HASH_MAP_MAPJOIN_112_container, bigKeyColName:ss_hdemo_sk, smallTablePos:1, keyRatio:5.042575832290721E-6
          Statistics: Num rows: 2750380056 Data size: 1321831086472 Basic stats:
COMPLETE Column stats: COMPLETE Filter Operator predicate: (ss_hdemo_sk is not null and ss_addr_sk is not null and ss_cdemo_sk is not null and ss_store_sk is not null and ((ss_sales_price >= 100) or (ss_sales_price <= 150) or (ss_sales_price >= 50) or (ss_sales_price <= 100) or (ss_sales_price >= 150) or (ss_sales_price <= 200)) and ((ss_net_profit >= 100) or (ss_net_profit <= 200) or (ss_net_profit >= 150) or (ss_net_profit <= 300) or (ss_net_profit >= 50) or (ss_net_profit <= 250))) (type: boolean) Statistics: Num rows: 2500252205 Data size: 1201619783884 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: ss_cdemo_sk (type: bigint), ss_hdemo_sk (type: bigint), ss_addr_sk (type: bigint), ss_store_sk (type: bigint), ss_quantity (type: int), ss_ext_sales_price (type: decimal(7,2)), ss_ext_wholesale_cost (type: decimal(7,2)), ss_sold_date_sk (type: bigint), ss_net_profit BETWEEN 100 AND 200 (type: boolean), ss_net_profit BETWEEN 150 AND 300 (type: boolean), ss_net_profit BETWEEN 50 AND 250 (type: boolean), ss_sales_price BETWEEN 100 AND 150 (type: boolean), ss_sales_price BETWEEN 50 AND 100 (type: boolean), ss_sales_price BETWEEN 150 AND 200 (type: boolean)
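The gap can be pictured with a toy leaf-filter builder; the names below are illustrative, not Hive's actual FilterPredicateLeafBuilder code. If the builder yields no predicate for DECIMAL columns, nothing is pushed to the Parquet reader and every row group is read; mapping the decimal literal to its unscaled integer representation (Parquet stores decimals as scaled integers) would let the filter be built. The hard-coded scale here is an assumption for illustration.

```python
# Toy model of the missing decimal pushdown (illustrative names only).
from decimal import Decimal

SUPPORTED = {"int", "bigint", "float", "double", "string"}

def build_leaf_filter(col_type, op, literal):
    if col_type not in SUPPORTED:
        return None  # decimal falls through today: no pushdown, full scan
    return (op, literal)

def build_leaf_filter_fixed(col_type, op, literal, scale=2):
    if col_type == "decimal":
        # assumption: compare on the unscaled value at the column's scale
        unscaled = int(Decimal(str(literal)).scaleb(scale))
        return (op, unscaled)
    return build_leaf_filter(col_type, op, literal)
```

The task summaries above show the cost of the `None` branch: the Parquet scan reads 549M rows where ORC, with working pushdown, reads 267M.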
[jira] [Created] (HIVE-27183) Iceberg: Table information is loaded multiple times
Rajesh Balamohan created HIVE-27183: --- Summary: Iceberg: Table information is loaded multiple times Key: HIVE-27183 URL: https://issues.apache.org/jira/browse/HIVE-27183 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan

HMS::getTable invokes "HiveIcebergMetaHook::postGetTable", which internally loads the Iceberg table again. If this isn't needed, or is needed only for show-create-table, do not load the table again.

{noformat}
at jdk.internal.misc.Unsafe.park(java.base@11.0.18/Native Method)
- parking to wait for <0x00066f84eef0> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.park(java.base@11.0.18/LockSupport.java:194)
at java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.18/CompletableFuture.java:1796)
at java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.18/ForkJoinPool.java:3128)
at java.util.concurrent.CompletableFuture.waitingGet(java.base@11.0.18/CompletableFuture.java:1823)
at java.util.concurrent.CompletableFuture.get(java.base@11.0.18/CompletableFuture.java:1998)
at org.apache.hadoop.util.functional.FutureIO.awaitFuture(FutureIO.java:77)
at org.apache.iceberg.hadoop.HadoopInputFile.newStream(HadoopInputFile.java:196)
at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:263)
at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:258)
at org.apache.iceberg.BaseMetastoreTableOperations.lambda$refreshFromMetadataLocation$0(BaseMetastoreTableOperations.java:177)
at org.apache.iceberg.BaseMetastoreTableOperations$$Lambda$609/0x000840e18040.apply(Unknown Source)
at org.apache.iceberg.BaseMetastoreTableOperations.lambda$refreshFromMetadataLocation$1(BaseMetastoreTableOperations.java:191)
at org.apache.iceberg.BaseMetastoreTableOperations$$Lambda$610/0x000840e18440.run(Unknown Source)
at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:404)
at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:214)
at
org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:198) at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:190) at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:191) at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:176) at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:171) at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:153) at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:96) at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:79) at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:44) at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:115) at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:105) at org.apache.iceberg.mr.hive.IcebergTableUtil.lambda$getTable$1(IcebergTableUtil.java:99) at org.apache.iceberg.mr.hive.IcebergTableUtil$$Lambda$552/0x000840d59840.apply(Unknown Source) at org.apache.iceberg.mr.hive.IcebergTableUtil.lambda$getTable$4(IcebergTableUtil.java:111) at org.apache.iceberg.mr.hive.IcebergTableUtil$$Lambda$557/0x000840d58c40.get(Unknown Source) at java.util.Optional.orElseGet(java.base@11.0.18/Optional.java:369) at org.apache.iceberg.mr.hive.IcebergTableUtil.getTable(IcebergTableUtil.java:108) at org.apache.iceberg.mr.hive.IcebergTableUtil.getTable(IcebergTableUtil.java:69) at org.apache.iceberg.mr.hive.IcebergTableUtil.getTable(IcebergTableUtil.java:73) at org.apache.iceberg.mr.hive.HiveIcebergMetaHook.postGetTable(HiveIcebergMetaHook.java:931) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.executePostGetTableHook(HiveMetaStoreClient.java:2638) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:2624) at 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:267) at jdk.internal.reflect.GeneratedMethodAccessor137.invoke(Unknown Source) at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.18/DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(java.base@11.0.18/Method.java:566) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:216) at com.sun.proxy.$Proxy56.getTable(Unknown Source) at jdk.internal.reflect.GeneratedMethodAccessor137.invoke(Unknown Source) at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.18/DelegatingMetho
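One way to avoid the duplicate metadata read is to memoize table loads for the duration of a query. A minimal sketch, where `load_table` is a stand-in for `Catalogs.loadTable` rather than the real API:

```python
# Sketch: cache Iceberg table loads so a postGetTable-style hook does not
# re-read metadata that was just loaded. Not the actual Iceberg/Hive API.
import functools

LOADS = []  # records every expensive metadata read

def load_table(name):
    LOADS.append(name)  # expensive: reads the metadata JSON from storage
    return {"name": name, "schema": "..."}

@functools.lru_cache(maxsize=128)
def load_table_cached(name):
    # Cache key is the fully-qualified name; a real fix would also key on
    # the snapshot/metadata location and invalidate on commit.
    return tuple(sorted(load_table(name).items()))

load_table_cached("db.store_sales")  # first call: actual load
load_table_cached("db.store_sales")  # second call: served from cache
```

The design question the ticket raises is exactly the invalidation noted in the comment: the cache must not serve stale metadata after a concurrent commit.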
[jira] [Created] (HIVE-27184) Add class name profiling option in ProfileServlet
Rajesh Balamohan created HIVE-27184: --- Summary: Add class name profiling option in ProfileServlet Key: HIVE-27184 URL: https://issues.apache.org/jira/browse/HIVE-27184 Project: Hive Issue Type: Improvement Components: HiveServer2 Reporter: Rajesh Balamohan With async-profiler "-e classname.method", it is possible to profile specific events. Currently ProfileServlet supports events like cpu, alloc, lock etc. It would be good to enhance it to support method-name profiling as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27188) Explore usage of FilterApi.in(C column, Set values) in Parquet instead of nested OR
Rajesh Balamohan created HIVE-27188: --- Summary: Explore usage of FilterApi.in(C column, Set values) in Parquet instead of nested OR Key: HIVE-27188 URL: https://issues.apache.org/jira/browse/HIVE-27188 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan

The following query can throw a StackOverflowError with "-Xss256K". Currently Hive generates a nested OR filter: [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/FilterPredicateLeafBuilder.java#L43-L52] Instead, we need to explore the possibility of using FilterApi.in(C column, Set<T> values) in Parquet.

{noformat}
drop table if exists test;
create external table test (i int) stored as parquet;
insert into test values (1),(2),(3);
select count(*) from test where i in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243);
{noformat}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
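The depth problem behind the overflow can be shown with toy predicate objects (tuples here, not parquet-mr's FilterApi types): building `IN (v1..vn)` as nested `or(or(or(...)))` produces a tree whose depth grows linearly with the value count, which deep recursive processing then overflows on a 256K stack, while a set-backed IN predicate stays flat.

```python
# Toy predicates: nested-OR depth grows with n; a set-backed IN does not.

def nested_or(values):
    pred = ("eq", values[0])
    for v in values[1:]:
        pred = ("or", pred, ("eq", v))   # left-deep chain, one level per value
    return pred

def depth(pred):
    return 1 + depth(pred[1]) if pred[0] == "or" else 1

def in_set(values):
    return ("in", frozenset(values))     # depth 1, regardless of len(values)

vals = list(range(1, 244))  # the 243-value IN list from the repro
```

With 243 values the nested tree is 243 levels deep; with tens of thousands of values even default stacks overflow, which is why a flat set-based predicate is the safer representation.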
[jira] [Created] (HIVE-20816) FastHiveDecimal throws Exception (RuntimeException: Unexpected #3)
Rajesh Balamohan created HIVE-20816: --- Summary: FastHiveDecimal throws Exception (RuntimeException: Unexpected #3) Key: HIVE-20816 URL: https://issues.apache.org/jira/browse/HIVE-20816 Project: Hive Issue Type: Improvement Affects Versions: 2.3.2 Reporter: Rajesh Balamohan

{noformat}
with t1 as ( ... ... )
select id, max(abs(c1)) from t1 group by id;
{noformat}

throws the following exception

{noformat}
g.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unexpected #3
at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1126)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
...
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unexpected #3
at org.apache.hadoop.hive.ql.exec.GroupByOperator.flush(GroupByOperator.java:1084)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1123)
... 18 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unexpected #3
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:397)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:1047)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.flush(GroupByOperator.java:1067)
... 19 more
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20886) Fix NPE: GenericUDFLower
Rajesh Balamohan created HIVE-20886: --- Summary: Fix NPE: GenericUDFLower Key: HIVE-20886 URL: https://issues.apache.org/jira/browse/HIVE-20886 Project: Hive Issue Type: Improvement Components: Hive Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan {noformat} create table if not exists test1(uuid array); select lower(uuid) from test1; Error: Error while compiling statement: FAILED: NullPointerException null (state=42000,code=4) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20928) NPE in StatsUtils for complex type
Rajesh Balamohan created HIVE-20928: --- Summary: NPE in StatsUtils for complex type Key: HIVE-20928 URL: https://issues.apache.org/jira/browse/HIVE-20928 Project: Hive Issue Type: Bug Components: Hive Affects Versions: 2.3.4 Reporter: Rajesh Balamohan

{noformat}
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.stats.StatsUtils.getWritableSize(StatsUtils.java:1147)
at org.apache.hadoop.hive.ql.stats.StatsUtils.getSizeOfMap(StatsUtils.java:1108)
at org.apache.hadoop.hive.ql.stats.StatsUtils.getSizeOfComplexTypes(StatsUtils.java:978)
at org.apache.hadoop.hive.ql.stats.StatsUtils.getAvgColLenOf(StatsUtils.java:916)
at org.apache.hadoop.hive.ql.stats.StatsUtils.getColStatisticsFromExpression(StatsUtils.java:1374)
at org.apache.hadoop.hive.ql.stats.StatsUtils.getColStatisticsFromExprMap(StatsUtils.java:1197)
at org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$GroupByStatsRule.process(StatsRulesProcFactory.java:1009)
at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
at org.apache.hadoop.hive.ql.lib.LevelOrderWalker.walk(LevelOrderWalker.java:143)
at org.apache.hadoop.hive.ql.lib.LevelOrderWalker.startWalking(LevelOrderWalker.java:122)
at org.apache.hadoop.hive.ql.optimizer.stats.annotation.AnnotateWithStatistics.transform(AnnotateWithStatistics.java:78)
at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.runStatsAnnotation(SparkCompiler.java:240)
{noformat}

The issue should exist in master as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20974) TezTask should set task exception on failures
Rajesh Balamohan created HIVE-20974: --- Summary: TezTask should set task exception on failures Key: HIVE-20974 URL: https://issues.apache.org/jira/browse/HIVE-20974 Project: Hive Issue Type: Improvement Components: Hive Reporter: Rajesh Balamohan TezTask logs the error as "Failed to execute tez graph" and proceeds further. The "TaskRunner.runSequential()" code would therefore not be able to get these exceptions from TezTask, and if any failure hooks are configured, these exceptions wouldn't show up in them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
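The fix being asked for is to record the failure on the task object instead of only logging it, so the sequential runner and any failure hooks can observe the root cause. A minimal sketch with illustrative class names, not Hive's actual code:

```python
# Sketch: propagate the failure via the task object (cf. setException).

class Task:
    def __init__(self):
        self.exception = None

class TezLikeTask(Task):
    def execute(self):
        try:
            raise RuntimeError("Failed to execute tez graph")  # simulated DAG failure
        except Exception as e:
            self.exception = e  # the proposed fix: record before returning
            return 1            # non-zero return code, as today

def run_sequential(task):
    """Returns the failure message now visible to hooks, or None on success."""
    rc = task.execute()
    if rc != 0 and task.exception is not None:
        return str(task.exception)
    return None
```

Without the `self.exception = e` line, the runner sees only the non-zero return code and the hooks get no throwable to report.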
[jira] [Created] (HIVE-21102) Optimize SparkPlanGenerator for getInputPaths (emptyFile checks)
Rajesh Balamohan created HIVE-21102: --- Summary: Optimize SparkPlanGenerator for getInputPaths (emptyFile checks) Key: HIVE-21102 URL: https://issues.apache.org/jira/browse/HIVE-21102 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21104) PTF with nested structure throws ClassCastException
Rajesh Balamohan created HIVE-21104: --- Summary: PTF with nested structure throws ClassCastException Key: HIVE-21104 URL: https://issues.apache.org/jira/browse/HIVE-21104 Project: Hive Issue Type: Bug Components: Hive Reporter: Rajesh Balamohan

{noformat}
DROP TABLE IF EXISTS dummy;
CREATE TABLE dummy (i int);
INSERT INTO TABLE dummy VALUES (1);

DROP TABLE IF EXISTS struct_table_example;
CREATE TABLE struct_table_example (a int, s1 struct<f1:boolean,f2:string,f3:int,f4:int>) STORED AS ORC;
INSERT INTO TABLE struct_table_example SELECT 1, named_struct('f1', false, 'f2', 'test', 'f3', 3, 'f4', 4) FROM dummy;

select s1.f1, s1.f2, rank() over (partition by s1.f2 order by s1.f4) from struct_table_example;
{noformat}

This throws the following error:

{noformat}
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"test","reducesinkkey1":4},"value":{"_col1":{"f1":false,"f2":"test","f3":3,"f4":4}}}
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:297)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
... 14 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"test","reducesinkkey1":4},"value":{"_col1":{"f1":false,"f2":"test","f3":3,"f4":4}}}
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:287)
...
16 more Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryStruct cannot be cast to org.apache.hadoop.io.IntWritable at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector.getPrimitiveJavaObject(WritableIntObjectInspector.java:46) at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:412) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank.copyToStandardObject(GenericUDAFRank.java:219) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$GenericUDAFAbstractRankEvaluator.iterate(GenericUDAFRank.java:154) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:192) at org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.processRow(WindowingTableFunction.java:407) at org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.processRow(PTFOperator.java:325) at org.apache.hadoop.hive.ql.exec.PTFOperator.process(PTFOperator.java:139) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897) at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356) ... 17 more ]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:0, Vertex vertex_1546783872011_263870_1_01 [Reducer 2] killed/failed due to:OWN_TASK_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0 at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:196) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100) at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:79) (state=08S01,code=2) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21136) Kryo exception : Unable to create serializer for class AtomicReference
Rajesh Balamohan created HIVE-21136: --- Summary: Kryo exception : Unable to create serializer for class AtomicReference Key: HIVE-21136 URL: https://issues.apache.org/jira/browse/HIVE-21136 Project: Hive Issue Type: Bug Components: Serializers/Deserializers Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan {noformat} Caused by: org.apache.hive.com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Unable to create serializer "org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer" for class: java.util.concurrent.atomic.AtomicReference Serialization trace: _tableInfo (org.codehaus.jackson.sym.BytesToNameCanonicalizer) _rootByteSymbols (org.codehaus.jackson.JsonFactory) jsonFactory (brickhouse.udf.json.ToJsonUDF) genericUDF (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc) chidren (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc) colExprMap (org.apache.hadoop.hive.ql.exec.GroupByOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.LateralViewJoinOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.LateralViewForwardOperator) childOperators (org.apache.hadoop.hive.ql.exec.LateralViewJoinOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.LateralViewForwardOperator) childOperators (org.apache.hadoop.hive.ql.exec.LateralViewJoinOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.LateralViewForwardOperator) childOperators (org.apache.hadoop.hive.ql.exec.GroupByOperator) reducer (org.apache.hadoop.hive.ql.plan.ReduceWork) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) at 
org.apache.hive.com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:759) at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObjectOrNull(SerializationUtilities.java:199) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:132) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708) at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:214) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708) at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:214) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125) at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790) at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readClassAndObject(SerializationUtilities.java:176) at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134) at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708) at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:214) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125) at 
org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790) at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readClassAndObject(SerializationUtilities.java:176) at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:161) at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:39) at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708) at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:214) at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125) at org.apache.hive.com.esotericsoftw
[jira] [Created] (HIVE-21162) MetaStoreListenerNotifier events can get fired even when exceptions are thrown
Rajesh Balamohan created HIVE-21162: --- Summary: MetaStoreListenerNotifier events can get fired even when exceptions are thrown Key: HIVE-21162 URL: https://issues.apache.org/jira/browse/HIVE-21162 Project: Hive Issue Type: Bug Components: Standalone Metastore Reporter: Rajesh Balamohan [https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L3870] When the same partition is added twice, it ends up throwing {{PartitionAlreadyExistsException}}. However, by that point the listeners have already been notified. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
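The ordering problem described above can be sketched outside of Hive. This is a minimal illustration, not the actual metastore code: the {{Listener}} interface and {{addPartition}} method below are hypothetical stand-ins for MetaStoreListenerNotifier and the add-partition path, showing the intended fix of notifying listeners only after the operation succeeds.

```java
import java.util.ArrayList;
import java.util.List;

public class NotifyAfterSuccess {
    // Hypothetical stand-in for the metastore listener machinery.
    interface Listener { void onEvent(String event); }

    static final List<String> store = new ArrayList<>();
    static final List<String> notified = new ArrayList<>();
    static final List<Listener> listeners = new ArrayList<>();

    static void addPartition(String part) {
        if (store.contains(part)) {
            // Fail before the notification loop below is ever reached.
            throw new IllegalStateException("PartitionAlreadyExists: " + part);
        }
        store.add(part);
        // Notify only after the add has actually succeeded.
        for (Listener l : listeners) {
            l.onEvent("ADD_PARTITION:" + part);
        }
    }

    public static void main(String[] args) {
        listeners.add(notified::add);
        addPartition("ds=2019-01-01");
        try {
            addPartition("ds=2019-01-01"); // duplicate: throws, no extra event
        } catch (IllegalStateException expected) {
            // swallowed for the demo
        }
        System.out.println(notified); // one event despite two add attempts
    }
}
```

The point of the sketch is the ordering: the duplicate-partition failure must fire before any listener sees an event for it.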
[jira] [Created] (HIVE-21312) FSStatsAggregator::connect is slow
Rajesh Balamohan created HIVE-21312: --- Summary: FSStatsAggregator::connect is slow Key: HIVE-21312 URL: https://issues.apache.org/jira/browse/HIVE-21312 Project: Hive Issue Type: Improvement Components: Statistics Reporter: Rajesh Balamohan -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21331) Metastore should throw exception back if it is not able to delete the folder
Rajesh Balamohan created HIVE-21331: --- Summary: Metastore should throw exception back if it is not able to delete the folder Key: HIVE-21331 URL: https://issues.apache.org/jira/browse/HIVE-21331 Project: Hive Issue Type: Improvement Components: Metastore Reporter: Rajesh Balamohan [https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L2678] In one case, the table got deleted from HMS, but the data was not deleted. On investigating, `deleteDir` does not throw the exception back. The real exception gets logged (in this case a user-quota-limit-exceeded exception), but the managed table gets dropped without its data being deleted. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
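A minimal sketch of the proposed behavior follows; {{FsShim}} and the {{deleteDir}} signature are illustrative names, not the actual Warehouse/FileSystem API. The idea is that the delete failure is wrapped and rethrown so the drop-table path can abort before metadata is removed, instead of the failure only being logged.

```java
import java.io.IOException;

public class DeleteDirSketch {
    // Illustrative stand-in for the FileSystem call used by the metastore.
    interface FsShim { boolean delete(String path) throws IOException; }

    static String lastCause = "";

    // Before the fix, the catch block only logged; the caller carried on
    // and dropped the table metadata anyway. Rethrowing surfaces the real
    // cause (e.g. a quota-exceeded error) to the caller.
    static void deleteDir(FsShim fs, String path) throws IOException {
        try {
            if (!fs.delete(path)) {
                throw new IOException("delete returned false for " + path);
            }
        } catch (IOException e) {
            throw new IOException("Unable to delete directory: " + path, e);
        }
    }

    public static void main(String[] args) {
        FsShim quotaFs = p -> { throw new IOException("quota exceeded"); };
        try {
            deleteDir(quotaFs, "/warehouse/t1");
        } catch (IOException e) {
            lastCause = e.getCause().getMessage();
        }
        System.out.println(lastCause); // quota exceeded
    }
}
```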
[jira] [Created] (HIVE-21431) Vectorization: ltrim throws ArrayIndexOutOfBounds in corner cases
Rajesh Balamohan created HIVE-21431: --- Summary: Vectorization: ltrim throws ArrayIndexOutOfBounds in corner cases Key: HIVE-21431 URL: https://issues.apache.org/jira/browse/HIVE-21431 Project: Hive Issue Type: Bug Components: Vectorization Affects Versions: 2.3.4 Reporter: Rajesh Balamohan In corner cases, {{ltrim}} with string columns throws ArrayIndexOutOfBoundsException with vectorization enabled. {{HIVE-19565}} seems to fix some corner cases. But in another corner case, {{length[]}} was all {{0}}, and this causes {{-1}} to be returned as the length set in the target vector. I will check if I can get an easier repro for this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21439) Provide an option to reduce lookup overhead for bucketed tables
Rajesh Balamohan created HIVE-21439: --- Summary: Provide an option to reduce lookup overhead for bucketed tables Key: HIVE-21439 URL: https://issues.apache.org/jira/browse/HIVE-21439 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan If a table is bucketed, `OpTraitsRulesProcFactory::TableScanRule` ends up verifying that the partitions have the same number of files as the number of buckets in the table. https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L185 For large tables, this turns out to be a very time-consuming operation. It would be good to have an option to bypass this check when needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21475) SparkClientUtilities::urlFromPathString should handle viewfs to avoid UDF ClassNotFoundException
Rajesh Balamohan created HIVE-21475: --- Summary: SparkClientUtilities::urlFromPathString should handle viewfs to avoid UDF ClassNotFoundException Key: HIVE-21475 URL: https://issues.apache.org/jira/browse/HIVE-21475 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21503) Vectorization: query with regex gives incorrect results with vectorization
Rajesh Balamohan created HIVE-21503: --- Summary: Vectorization: query with regex gives incorrect results with vectorization Key: HIVE-21503 URL: https://issues.apache.org/jira/browse/HIVE-21503 Project: Hive Issue Type: Bug Components: Vectorization Reporter: Rajesh Balamohan I see wrong results with vectorization; without vectorization, it works fine. Suspecting a minor issue in {{StringGroupColConcatCharScalar}}.
{noformat}
e.g.
WHEN x like '%radio%' THEN 'radio'
WHEN x like '%tv%' THEN 'tv'
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21520) Query "Submit plan" time reported is incorrect
Rajesh Balamohan created HIVE-21520: --- Summary: Query "Submit plan" time reported is incorrect Key: HIVE-21520 URL: https://issues.apache.org/jira/browse/HIVE-21520 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan Hive master branch + LLAP
{noformat}
Query Execution Summary
----------------------------------------------------------
OPERATION                                 DURATION
----------------------------------------------------------
Compile Query                                0.00s
Prepare Plan                                 0.00s
Get Query Coordinator (AM)                   0.00s
Submit Plan                         1553658149.89s
Start DAG                                    0.53s
Run DAG                                      0.43s
----------------------------------------------------------
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
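The magnitude of the bogus figure hints at the cause: 1553658149.89 s is a Unix epoch timestamp (late March 2019), which is exactly what end − start yields if the start timestamp is never recorded and stays 0. A hedged arithmetic sketch of that hypothesis (variable names are illustrative, not the actual perf-logger fields):

```java
public class SubmitPlanDurationSketch {
    static double reportedDurationSec(long startMs, long endMs) {
        return (endMs - startMs) / 1000.0;
    }

    public static void main(String[] args) {
        long endMs = 1553658149890L; // wall-clock when "Submit Plan" ended
        // Bug hypothesis: the start was never captured and defaulted to 0,
        // so the reported "duration" is the absolute epoch time itself.
        System.out.println(reportedDurationSec(0L, endMs));
        // With a real start timestamp the duration is sane again.
        System.out.println(reportedDurationSec(endMs - 250L, endMs)); // 0.25
    }
}
```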
[jira] [Created] (HIVE-21565) Utilities::isEmptyPath should throw back FNFE instead of returning true
Rajesh Balamohan created HIVE-21565: --- Summary: Utilities::isEmptyPath should throw back FNFE instead of returning true Key: HIVE-21565 URL: https://issues.apache.org/jira/browse/HIVE-21565 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan In case there is a {{viewfs}} configured and it ends up throwing FNFE, current codepath silently ignores the error and ends up creating an empty file. {noformat} at org.apache.hadoop.fs.viewfs.InodeTree.resolve(InodeTree.java:403) at org.apache.hadoop.fs.viewfs.ViewFileSystem.listStatus(ViewFileSystem.java:374) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1497) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1537) at org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2350) at org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2343) at org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3128) at org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3092) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:303) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:226) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109) at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:346) at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358) at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21622) Provide an option to invoke `ReflectionUtil::newInstance` without storing in constructor_cache
Rajesh Balamohan created HIVE-21622: --- Summary: Provide an option to invoke `ReflectionUtil::newInstance` without storing in constructor_cache Key: HIVE-21622 URL: https://issues.apache.org/jira/browse/HIVE-21622 Project: Hive Issue Type: Bug Components: HiveServer2 Reporter: Rajesh Balamohan Attachments: Screenshot 2019-04-17 at 2.17.21 PM.png In certain cases, UDFs are dynamically registered/deregistered often. This can clutter the "constructor_cache" of "ReflectionUtil" and cause memory pressure. !Screenshot 2019-04-17 at 2.17.21 PM.png! It would be good to provide an option to invoke ReflectionUtil without hitting the constructor cache. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
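The option being asked for amounts to plain reflection with no retained cache entry. A minimal sketch, under the assumption that the UDF has a no-arg constructor; the method name is hypothetical, not the actual ReflectionUtil API:

```java
import java.lang.reflect.Constructor;

public class UncachedNewInstance {
    // Unlike the cached newInstance path, nothing is stored in a
    // class-keyed cache here, so a temporary UDF's class (and its
    // classloader) stays garbage-collectable after use.
    static <T> T newInstanceUncached(Class<T> cls) {
        try {
            Constructor<T> ctor = cls.getDeclaredConstructor();
            ctor.setAccessible(true);
            return ctor.newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("Could not instantiate " + cls, e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = newInstanceUncached(StringBuilder.class);
        sb.append("ok");
        System.out.println(sb); // ok
    }
}
```

The trade-off is obvious: each call pays the `getDeclaredConstructor` lookup, which is why a cache exists in the first place; an opt-out only makes sense on paths where the class may be transient.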
[jira] [Created] (HIVE-21684) tmp table space directory should be removed on session close
Rajesh Balamohan created HIVE-21684: --- Summary: tmp table space directory should be removed on session close Key: HIVE-21684 URL: https://issues.apache.org/jira/browse/HIVE-21684 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan `_tmp_space.db` folder should be deleted on session close. {noformat} org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException): The directory item limit of... {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21778) CBO: "Struct is not null" gets evaluated as `nullable` always causing pushdown miss in the query
Rajesh Balamohan created HIVE-21778: --- Summary: CBO: "Struct is not null" gets evaluated as `nullable` always causing pushdown miss in the query Key: HIVE-21778 URL: https://issues.apache.org/jira/browse/HIVE-21778 Project: Hive Issue Type: Bug Components: CBO Affects Versions: 2.3.5 Reporter: Rajesh Balamohan
{noformat}
drop table if exists test_struct;
CREATE external TABLE test_struct
(
  f1 string,
  demo_struct struct,
  datestr string
);

set hive.cbo.enable=true;
explain select * from etltmp.test_struct where datestr='2019-01-01' and demo_struct is not null;

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: test_struct
          filterExpr: (datestr = '2019-01-01') (type: boolean) <- Note that demo_struct filter is not added here
          Filter Operator
            predicate: (datestr = '2019-01-01') (type: boolean)
            Select Operator
              expressions: f1 (type: string), demo_struct (type: struct), '2019-01-01' (type: string)
              outputColumnNames: _col0, _col1, _col2
              ListSink

set hive.cbo.enable=false;
explain select * from etltmp.test_struct where datestr='2019-01-01' and demo_struct is not null;

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: test_struct
          filterExpr: ((datestr = '2019-01-01') and demo_struct is not null) (type: boolean) <- Note that demo_struct filter is added when CBO is turned off
          Filter Operator
            predicate: ((datestr = '2019-01-01') and demo_struct is not null) (type: boolean)
            Select Operator
              expressions: f1 (type: string), demo_struct (type: struct), '2019-01-01' (type: string)
              outputColumnNames: _col0, _col1, _col2
              ListSink
{noformat}
In CalcitePlanner::genFilterRelNode, the following code fails to evaluate this filter.
{noformat}
RexNode factoredFilterExpr =
    RexUtil.pullFactors(cluster.getRexBuilder(), convertedFilterExpr);
{noformat}
Note that if we add `demo_struct.f1` to the predicate, the filter ends up being pushed correctly. Suspecting {code}RexCall::isAlwaysTrue{code} is evaluating to true in this case. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21971) HS2 leaks classloaders due to `ReflectionUtils::CONSTRUCTOR_CACHE` with temporary functions + GenericUDF
Rajesh Balamohan created HIVE-21971: --- Summary: HS2 leaks classloaders due to `ReflectionUtils::CONSTRUCTOR_CACHE` with temporary functions + GenericUDF Key: HIVE-21971 URL: https://issues.apache.org/jira/browse/HIVE-21971 Project: Hive Issue Type: Bug Components: HiveServer2 Affects Versions: 2.3.4 Reporter: Rajesh Balamohan https://issues.apache.org/jira/browse/HIVE-10329 helped in moving away from hadoop's ReflectionUtils constructor cache issue (https://issues.apache.org/jira/browse/HADOOP-10513). However, there are corner cases where hadoop's {{ReflectionUtils}} is in use, and this causes a gradual build-up of memory in HS2. I have observed this in Hive 2.3, but the codepath in master has not changed much. The easiest way to repro would be to add a temp function which extends {{GenericUDF}}. In {{FunctionRegistry::cloneGenericUDF}}, this ends up using {{org.apache.hadoop.util.ReflectionUtils.newInstance}}, which in turn lands in the CONSTRUCTOR_CACHE of ReflectionUtils.
{noformat}
CREATE TEMPORARY FUNCTION dummy AS 'com.hive.test.DummyGenericUDF' USING JAR 'file:///home/test/udf/dummy.jar';
select dummy();

        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:107)
        at org.apache.hadoop.hive.ql.exec.FunctionRegistry.cloneGenericUDF(FunctionRegistry.java:1353)
        at org.apache.hadoop.hive.ql.exec.FunctionInfo.getGenericUDF(FunctionInfo.java:122)
        at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:983)
        at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1359)
        at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
        at org.apache.hadoop.hive.ql.lib.ExpressionWalker.walk(ExpressionWalker.java:76)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
{noformat}
Note: Reflection-based invocation of hadoop's `ReflectionUtils::clear` was removed in 2.x. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21993) HS/HMS delegationstore with ZK can degrade performance when jute.maxBuffer is reached
Rajesh Balamohan created HIVE-21993: --- Summary: HS/HMS delegationstore with ZK can degrade performance when jute.maxBuffer is reached Key: HIVE-21993 URL: https://issues.apache.org/jira/browse/HIVE-21993 Project: Hive Issue Type: Bug Components: Hive Affects Versions: 2.3.4, 3.0.0, 4.0.0 Reporter: Rajesh Balamohan DelegationStore can be configured with in-memory/DB/ZK-based TokenStores. {{TokenStoreDelegationTokenSecretManager}} purges expired tokens (older than 24 hours) periodically, every hour by default. +Issue:+ When a large number of delegation tokens is present in ZK, {{TokenStoreDelegationTokenSecretManager::removeExpiredTokens}} can throw the following exception when connecting to ZK.
{noformat}
WARN [main-SendThread(xyz:2181)]: org.apache.zookeeper.ClientCnxn: Session 0x36a161083865cd9 for server xyz/1.2.3.4:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len68985070 is out of range!
        at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:112) ~[zookeeper-3.4.6.jar]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79) ~[zookeeper-3.4.6.jar]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[zookeeper-3.4.6.jar]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) [zookeeper-3.4.6.jar]
...
...
INFO [main-EventThread]: org.apache.curator.framework.state.ConnectionStateManager: State change: SUSPENDED
ERROR [Thread[Thread-13,5,main]]: org.apache.hadoop.hive.thrift.TokenStoreDelegationTokenSecretManager: ExpiredTokenRemover thread received unexpected exception.
org.apache.hadoop.hive.thrift.DelegationTokenStore$TokenStoreException: Error getting children for /hivedelegationMETASTORE/tokens
org.apache.hadoop.hive.thrift.DelegationTokenStore$TokenStoreException: Error getting children for /hivedelegationMETASTORE/tokens
        at org.apache.hadoop.hive.thrift.ZooKeeperTokenStore.zkGetChildren(ZooKeeperTokenStore.java:280) ~[hive-exec-x.y.z.jar]
        at org.apache.hadoop.hive.thrift.ZooKeeperTokenStore.getAllDelegationTokenIdentifiers(ZooKeeperTokenStore.java:413) ~[hive-exec-x.y.z.jar]
        at org.apache.hadoop.hive.thrift.TokenStoreDelegationTokenSecretManager.removeExpiredTokens(TokenStoreDelegationTokenSecretManager.java:238) ~[hive-exec-x.y.z.jar]
        at org.apache.hadoop.hive.thrift.TokenStoreDelegationTokenSecretManager$ExpiredTokenRemover.run(TokenStoreDelegationTokenSecretManager.java:309) [hive-exec-x.y.z.jar]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hivedelegationMETASTORE/tokens
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar]
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.6.jar]
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1590) ~[zookeeper-3.4.6.jar]
        at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:214) ~[curator-framework-2.7.1.jar:?]
        at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:203) ~[curator-framework-2.7.1.jar:?]
        at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.7.1.jar:?]
        at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:200) ~[curator-framework-2.7.1.jar:?]
        at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:191) ~[curator-framework-2.7.1.jar:?]
        at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:38) ~[curator-framework-2.7.1.jar:?]
        at org.apache.hadoop.hive.thrift.ZooKeeperTokenStore.zkGetChildren(ZooKeeperTokenStore.java:278) ~[hive-exec-x.y.z.jar]
        ... 4 more
{noformat}
When the packet length is greater than {{jute.maxBuffer}}, it ends up throwing this exception and reconnecting. However, the same ZK client is used for the {{addToken}} and {{removeToken}} calls, which run in different threads. This creates problems when creating/deleting tokens. 1. Issue in creating tokens: a new token is added while the ZK client is in the suspended state (due to the above-mentioned reason). The node is already created by Curator, but before it can verify this, the connection goes into a stale state. So the Curator framework retries and ends up with the following exception, and creating tokens fails often.
{noformat}
Caused by: org.apache.hadoop.hive.thrift.DelegationTokenStore$TokenStoreException: Error creating new node wi
[jira] [Created] (HIVE-22013) "Show table extended" should not compute table statistics
Rajesh Balamohan created HIVE-22013: --- Summary: "Show table extended" should not compute table statistics Key: HIVE-22013 URL: https://issues.apache.org/jira/browse/HIVE-22013 Project: Hive Issue Type: Bug Components: Hive Reporter: Rajesh Balamohan In some of the `show table extended` statements, the following codepath is invoked: [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/formatting/TextMetaDataFormatter.java#L421] [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/formatting/TextMetaDataFormatter.java#L449] [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/formatting/TextMetaDataFormatter.java#L468] 1. It is not clear why this invokes stats computation; should it be removed? 2. Even if #1 is needed, it would be broken when {{tblPath}} and {{partitionPaths}} differ (i.e. when they are on different filesystems, or configured via a router, etc.). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (HIVE-22039) Query with CBO crashes HS2 in corner cases
Rajesh Balamohan created HIVE-22039: --- Summary: Query with CBO crashes HS2 in corner cases Key: HIVE-22039 URL: https://issues.apache.org/jira/browse/HIVE-22039 Project: Hive Issue Type: Bug Components: CBO Affects Versions: 2.3.4, 3.1.1 Reporter: Rajesh Balamohan Here is a very simple repro for this case. Along with CBO, it crashes HS2: it runs into an infinite loop creating a very large number of RexCalls and finally OOMs. This is observed in 2.x and 3.x. With 4.x (master branch), it does not happen; master has {{calcite-core-1.19.0.jar}}, whereas 3.x has {{calcite-core-1.16.0.jar}}.
{noformat}
drop table if exists tableA;
drop table if exists tableB;
create table if not exists tableA(id int, reporting_date string) stored as orc;
create table if not exists tableB(id int, reporting_date string) partitioned by (datestr string) stored as orc;

explain
with tableA_cte as ( select id, reporting_date from tableA ),
tableA_cte_2 as ( select 0 as id, reporting_date from tableA ),
tableA_cte_5 as ( select * from tableA_cte union select * from tableA_cte_2 ),
tableB_cte_0 as ( select id, reporting_date from tableB where reporting_date = '2018-10-29' ),
tableB_cte_1 as ( select 0 as id, reporting_date from tableB where datestr = '2018-10-29' ),
tableB_cte_4 as ( select * from tableB_cte_0 union select * from tableB_cte_1 )
select a.id as id, b.reporting_date
from tableA_cte_5 a
join tableB_cte_4 b on (a.id = b.id and a.reporting_date = b.reporting_date);
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (HIVE-22102) Reduce HMS calls when creating HiveSession
Rajesh Balamohan created HIVE-22102: --- Summary: Reduce HMS calls when creating HiveSession Key: HIVE-22102 URL: https://issues.apache.org/jira/browse/HIVE-22102 Project: Hive Issue Type: Improvement Components: HiveServer2 Reporter: Rajesh Balamohan When a HiveSession is established, it ends up configuring session variables/settings. As part of this, it checks the database details. [https://github.com/apache/hive/blob/master/service/src/java/org/apache/hive/service/cli/session/HiveSessionImpl.java#L314] Even for the `default` DB, it ends up making this check. In corner cases, these calls turn out to be expensive. {noformat} 2019-08-13T03:16:57,189 INFO [b42ba57f-1740-4174-855d-4e3f08319ca5 HiveServer2-Handler-Pool: Thread-1552313] metadata.Hive: Total time spent in this metastore function was greater than 1000ms : getDatabase_(String, )=13265 {noformat} We can simply skip this check when the database is `DEFAULT_DATABASE_NAME` (default). This may not be an issue for CachedStore. -- This message was sent by Atlassian Jira (v7.6.14#76016)
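The proposed short-circuit can be sketched as below; only the constant name mirrors the real code, while {{getDatabase}} here is a hypothetical stand-in for the HMS client call:

```java
public class SkipDefaultDbCheckSketch {
    static final String DEFAULT_DATABASE_NAME = "default";
    static int hmsCalls = 0;

    // Stand-in for the metastore client's getDatabase(dbName) round trip.
    static void getDatabase(String dbName) { hmsCalls++; }

    static void configureSessionDb(String dbName) {
        // The default DB always exists; skip the metastore round trip.
        if (DEFAULT_DATABASE_NAME.equalsIgnoreCase(dbName)) {
            return;
        }
        getDatabase(dbName);
    }

    public static void main(String[] args) {
        configureSessionDb("default");
        configureSessionDb("DEFAULT");
        configureSessionDb("sales");
        System.out.println(hmsCalls); // only the non-default DB hits HMS
    }
}
```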
[jira] [Created] (HIVE-22214) Explain vectorization should disable user level explain
Rajesh Balamohan created HIVE-22214: --- Summary: Explain vectorization should disable user level explain Key: HIVE-22214 URL: https://issues.apache.org/jira/browse/HIVE-22214 Project: Hive Issue Type: Improvement Components: Hive Reporter: Rajesh Balamohan -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22246) Beeline reflector should handle map types
Rajesh Balamohan created HIVE-22246: --- Summary: Beeline reflector should handle map types Key: HIVE-22246 URL: https://issues.apache.org/jira/browse/HIVE-22246 Project: Hive Issue Type: Bug Components: Beeline Reporter: Rajesh Balamohan Since the beeline {{Reflector}} does not handle Map types, it ends up converting values from {{beeline.properties}} to "null" and throws an NPE with "beeline --hivevar x=1 --hivevar y=1". -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22269) Missing stats in the operator with "hive.optimize.sort.dynamic.partition" (SortedDynPartitionOptimizer) misestimates reducer count
Rajesh Balamohan created HIVE-22269: --- Summary: Missing stats in the operator with "hive.optimize.sort.dynamic.partition" (SortedDynPartitionOptimizer) misestimates reducer count Key: HIVE-22269 URL: https://issues.apache.org/jira/browse/HIVE-22269 Project: Hive Issue Type: Bug Components: Statistics Reporter: Rajesh Balamohan {{hive.optimize.sort.dynamic.partition=true}} introduces a new stage to reduce the number of writes in the dynamic partitioning use case. Earlier, {{SortedDynPartitionOptimizer}} added this new operator via {{Optimizer.java}}, and the stats for the newly added operator were populated via {{StatsRulesProcFactory$ReduceSinkStatsRule}}. However, this changed with HIVE-20703: the logic moved to {{TezCompiler}} for a cost-based decision. Though the operator gets added correctly, its stats do not get added (as it runs after runStatsAnnotation()). This causes the reducer count to be misestimated in the query.
{noformat}
e.g. For the following query, reducer_2 would be estimated as "2" instead of "1009". This causes a huge delay in the runtime.
explain
from tpcds_xtext_1000.store_sales ss
insert overwrite table store_sales partition (ss_sold_date_sk)
  select ss.ss_sold_time_sk, ss.ss_item_sk, ss.ss_customer_sk, ss.ss_cdemo_sk, ss.ss_hdemo_sk,
    ss.ss_addr_sk, ss.ss_store_sk, ss.ss_promo_sk, ss.ss_ticket_number, ss.ss_quantity,
    ss.ss_wholesale_cost, ss.ss_list_price, ss.ss_sales_price, ss.ss_ext_discount_amt,
    ss.ss_ext_sales_price, ss.ss_ext_wholesale_cost, ss.ss_ext_list_price, ss.ss_ext_tax,
    ss.ss_coupon_amt, ss.ss_net_paid, ss.ss_net_paid_inc_tax, ss.ss_net_profit, ss.ss_sold_date_sk
  where ss.ss_sold_date_sk is not null
insert overwrite table store_sales partition (ss_sold_date_sk)
  select ss.ss_sold_time_sk, ss.ss_item_sk, ss.ss_customer_sk, ss.ss_cdemo_sk, ss.ss_hdemo_sk,
    ss.ss_addr_sk, ss.ss_store_sk, ss.ss_promo_sk, ss.ss_ticket_number, ss.ss_quantity,
    ss.ss_wholesale_cost, ss.ss_list_price, ss.ss_sales_price, ss.ss_ext_discount_amt,
    ss.ss_ext_sales_price, ss.ss_ext_wholesale_cost, ss.ss_ext_list_price, ss.ss_ext_tax,
    ss.ss_coupon_amt, ss.ss_net_paid, ss.ss_net_paid_inc_tax, ss.ss_net_profit, ss.ss_sold_date_sk
  where ss.ss_sold_date_sk is null
distribute by ss.ss_item_sk;
{noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22316) Avoid hostname resolution in LlapInputFormat
Rajesh Balamohan created HIVE-22316: --- Summary: Avoid hostname resolution in LlapInputFormat Key: HIVE-22316 URL: https://issues.apache.org/jira/browse/HIVE-22316 Project: Hive Issue Type: Improvement Components: llap Reporter: Rajesh Balamohan Attachments: Screenshot 2019-10-10 at 10.13.48 AM.png Attaching profiler output, in which hostname resolution showed up when running a short query. It would be good to make the hostname static final. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22379) Reduce db lookups during dynamic partition loading
Rajesh Balamohan created HIVE-22379: --- Summary: Reduce db lookups during dynamic partition loading Key: HIVE-22379 URL: https://issues.apache.org/jira/browse/HIVE-22379 Project: Hive Issue Type: Improvement Reporter: Rajesh Balamohan {{HiveAlterHandler::alterPartitions}} could look up all partition details via a single {{getPartition}} call instead of multiple calls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
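The batching idea can be sketched as follows; the {{PartitionStore}} interface and method names are hypothetical, and the point is simply one round trip keyed by partition name instead of N:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchPartitionLookupSketch {
    // Hypothetical store interface: one call returning many partitions.
    interface PartitionStore {
        Map<String, String> getPartitionsByNames(List<String> names);
    }

    static int roundTrips = 0;

    static Map<String, String> lookupAll(PartitionStore store, List<String> names) {
        roundTrips++;                              // single call for the whole batch,
        return store.getPartitionsByNames(names);  // not one call per partition
    }

    public static void main(String[] args) {
        PartitionStore store = names -> {
            Map<String, String> out = new HashMap<>();
            for (String n : names) out.put(n, "details-of-" + n);
            return out;
        };
        List<String> names = new ArrayList<>(List.of("ds=1", "ds=2", "ds=3"));
        Map<String, String> parts = lookupAll(store, names);
        System.out.println(parts.size() + " partitions, " + roundTrips + " round trip");
    }
}
```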
[jira] [Created] (HIVE-22383) `alterPartitions` is invoked twice during dynamic partition load causing runtime delay
Rajesh Balamohan created HIVE-22383: --- Summary: `alterPartitions` is invoked twice during dynamic partition load causing runtime delay Key: HIVE-22383 URL: https://issues.apache.org/jira/browse/HIVE-22383 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan First invocation in {{Hive::loadDynamicPartitions}}: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L2978 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L2638 Second invocation in {{BasicStatsTask::aggregateStats}}: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java#L335 This leads to a good amount of delay in dynamic partition loading. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22385) Repl: Perf fixes
Rajesh Balamohan created HIVE-22385: --- Summary: Repl: Perf fixes Key: HIVE-22385 URL: https://issues.apache.org/jira/browse/HIVE-22385 Project: Hive Issue Type: Improvement Components: repl Reporter: Rajesh Balamohan Creating this high-level ticket for tracking repl perf fixes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22386) Repl: Optimise ReplDumpTask::bootStrapDump
Rajesh Balamohan created HIVE-22386: --- Summary: Repl: Optimise ReplDumpTask::bootStrapDump Key: HIVE-22386 URL: https://issues.apache.org/jira/browse/HIVE-22386 Project: Hive Issue Type: Sub-task Reporter: Rajesh Balamohan {{ReplDumpTask::bootStrapDump}} dumps one table at a time within a database. This data is written in separate folders per table. This can be optimized to write in parallel. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22387) Repl: Reduce FS lookups in repl bootstrap
Rajesh Balamohan created HIVE-22387: --- Summary: Repl: Reduce FS lookups in repl bootstrap Key: HIVE-22387 URL: https://issues.apache.org/jira/browse/HIVE-22387 Project: Hive Issue Type: Sub-task Components: repl Reporter: Rajesh Balamohan During bootstrap, {{dbRoot}} is obtained per database. This need not be validated for every table dump (in {{TableExport.Paths}}). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22389) Repl: Optimise ReplDumpTask.incrementalDump
Rajesh Balamohan created HIVE-22389: --- Summary: Repl: Optimise ReplDumpTask.incrementalDump Key: HIVE-22389 URL: https://issues.apache.org/jira/browse/HIVE-22389 Project: Hive Issue Type: Sub-task Components: repl Reporter: Rajesh Balamohan -- This message was sent by Atlassian Jira (v8.3.4#803005)