[jira] [Commented] (HIVE-24819) CombineHiveInputFormat format seems to be returning row count in the multiple of Maps

okumin (Jira) Thu, 25 Feb 2021 07:24:04 -0800


    [ 
https://issues.apache.org/jira/browse/HIVE-24819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290989#comment-17290989
 ]


okumin commented on HIVE-24819:
-------------------------------

[~jitender], I believe the user mailing list below is the right place to ask 
questions like this.

[https://hive.apache.org/mailing_lists.html]

 

As to your issue, I don't think you need to use 
`org.apache.hadoop.hive.ql.io.CombineHiveInputFormat` with Tez.

CombineHiveInputFormat is capable of putting fragmented files/blocks together. 
As I understand it, it is mainly used with execution engines other than Tez.

Tez has a similar, and hopefully more powerful, feature called Tez grouping. It 
groups blocks by itself.

[https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works]

 

So I recommend using HiveInputFormat and tweaking parallelism with 
`tez.grouping.min-size` and `tez.grouping.max-size`.
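
As a rough sketch of that recommendation, the session settings could look like 
the following. The two byte sizes are illustrative assumptions only, not tuned 
or recommended values; the right numbers depend on your data and cluster. The 
query is the one from the report.

```sql
-- Use the plain input format and let Tez grouping combine splits.
set hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Assumed example bounds for a grouped split, in bytes (16 MB and 128 MB).
-- Smaller sizes tend to give more tasks, larger sizes fewer.
set tez.grouping.min-size=16777216;
set tez.grouping.max-size=134217728;

-- Query taken from the report, for checking the returned count.
select count(1) from dbname.personal_data_rc tablesample(1000 rows);
```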

To be honest, I don't know why TABLESAMPLE returns the wrong result.

> CombineHiveInputFormat format seems to be returning row count in the multiple 
> of Maps 
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-24819
>                 URL: https://issues.apache.org/jira/browse/HIVE-24819
>             Project: Hive
>          Issue Type: Bug
>         Environment: Apache Hive (version 3.1.0.3.1.0.0-78)
> Driver: Hive JDBC (version 3.1.0.3.1.0.0-78)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.0.3.1.0.0-78 by Apache Hive
>            Reporter: Jitender Kumar
>            Priority: Critical
>
> Hi Team,
> This is the first time I am writing a bug using apache Jira, so pardon me if 
> I am unintentionally breaking any protocols. 
> I am facing the following issue (on a multi-node cluster) when I set 
> hive.tez.input.format to  
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat. 
> Just for demonstration purposes, I will be executing the following query for 
> multiple cases. 
> _select count(1) from dbname.personal_data_rc tablesample(1000 rows);_
> *Case1*
> mapred.map.tasks=2
> hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
> *Output*
> 1000
> *Case 2*
> mapred.map.tasks=2
> hive.tez.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
> *Output*
> 2000
> *Case 3*
> mapred.map.tasks=3
> hive.tez.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
> *Output*
> 3000
> After 3 maps set as default, the output remains the same, i.e. a multiple of 3. 
> Can you help me understand why, if I have TABLESAMPLE set to 1000 rows, it is 
> giving me more rows? Is there any other property that must be used 
> with CombineHiveInputFormat, or is it an issue with CombineHiveInputFormat 
> only? 
> I have tried to look for a solution, but in the end I had to come here. Please 
> share your inputs ASAP, as one of our clients is looking for a solution or 
> explanation regarding this. 
> For now, as a workaround, we have changed it to the following.  
> *hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat*
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
