[ https://issues.apache.org/jira/browse/HIVE-24819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290989#comment-17290989 ]
okumin commented on HIVE-24819:
-------------------------------
[~jitender], I believe the user mailing list below is the right place to ask
questions like this.
[https://hive.apache.org/mailing_lists.html]
As to your issue, I don't think you need to use
`org.apache.hadoop.hive.ql.io.CombineHiveInputFormat` with Tez.
CombineHiveInputFormat is capable of combining fragmented files/blocks.
As I understand it, it is mainly intended for execution engines other than Tez.
Tez has a similar, and hopefully more powerful, feature called Tez grouping,
which groups blocks on its own.
[https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works]
So I recommend using HiveInputFormat and tweaking parallelism with
`tez.grouping.min-size` and `tez.grouping.max-size`.
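As a rough sketch of that suggestion, the session settings could look like the following. The byte values are illustrative assumptions rather than tuned recommendations, and the query is the one from the issue description.

```sql
-- Use the plain HiveInputFormat and let Tez grouping combine the splits.
SET hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
-- Lower bound on the size of a grouped split (16 MiB here, an assumed value).
SET tez.grouping.min-size=16777216;
-- Upper bound on the size of a grouped split (1 GiB here, an assumed value).
SET tez.grouping.max-size=1073741824;

SELECT count(1) FROM dbname.personal_data_rc TABLESAMPLE(1000 ROWS);
```

Smaller grouping sizes should generally yield more tasks and larger ones fewer, so the two bounds can be adjusted until the parallelism looks reasonable for the table.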
To be honest, I don't know why TABLESAMPLE returns the wrong result.
> CombineHiveInputFormat format seems to be returning row count in the multiple
> of Maps
> --------------------------------------------------------------------------------------
>
> Key: HIVE-24819
> URL: https://issues.apache.org/jira/browse/HIVE-24819
> Project: Hive
> Issue Type: Bug
> Environment: Apache Hive (version 3.1.0.3.1.0.0-78)
> Driver: Hive JDBC (version 3.1.0.3.1.0.0-78)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.0.3.1.0.0-78 by Apache Hive
> Reporter: Jitender Kumar
> Priority: Critical
>
> Hi Team,
> This is the first time I am filing a bug in Apache Jira, so pardon me if
> I am unintentionally breaking any protocols.
> I am facing the following issue (on a multi-node cluster) when I set
> hive.tez.input.format to
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.
> Just for demonstration purposes, I will be executing the following query for
> multiple cases.
> _select count(1) from dbname.personal_data_rc tablesample(1000 rows);_
> *Case 1*
> mapred.map.tasks=2
> hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
> *Output*
> 1000
> *Case 2*
> mapred.map.tasks=2
> hive.tez.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
> *Output*
> 2000
> *Case 3*
> mapred.map.tasks=3
> hive.tez.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
> *Output*
> 3000
> After 3 maps set as default, the output remains the same, i.e. a multiple of 3.
> Can you help me understand why, if I have TABLESAMPLE set to 1000 rows, it is
> giving me more rows? Is there any other property that must be used
> with CombineHiveInputFormat, or is it an issue with CombineHiveInputFormat
> only?
> I have tried to look for a solution, but in the end I had to come here. Please
> share your inputs ASAP, as one of our clients is looking for a solution or
> explanation regarding this.
> For now, as a workaround, we have changed it to the following.
> *hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat*
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)