[
https://issues.apache.org/jira/browse/HIVE-16498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stamatis Zampetakis updated HIVE-16498:
---------------------------------------
Fix Version/s: (was: 1.2.0)
I cleared the fixVersion field since this ticket is not resolved. Please review
this ticket and, if the fix is already committed to a specific version, please
set the version accordingly and mark the ticket as RESOLVED.
According to the JIRA guidelines
(https://cwiki.apache.org/confluence/display/Hive/HowToContribute) the
fixVersion should be set only when the issue is resolved/closed.
> [Tez] ReduceRecordProcessor has no check to see if all the operators are done
> or not and is reading complete data
> -----------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-16498
> URL: https://issues.apache.org/jira/browse/HIVE-16498
> Project: Hive
> Issue Type: Bug
> Affects Versions: 1.2.0, 1.3.0
> Reporter: Adesh Kumar Rao
> Priority: Major
> Attachments: HIVE-16498.1.patch
>
>
> ReduceRecordProcessor does not check whether the reducer (Operator) is done,
> so it keeps reading data that is no longer needed.
> It can be reproduced with a reduce-side join.
> The data for large_table is generated by the following shell script, and a
> table can be created from the resulting file `large.txt`:
> {code:bash}
> for (( j = 1; j <= 20; j++ ))
> do
>   for (( i = 1; i <= 1000000; i++ ))
>   do
>     echo "$i,$j" >> large.txt
>   done
> done
> {code}
> {code:sql}
> create external table large_table (i int, j int)
> row format delimited fields terminated by ','
> location "hdfs://<some-hdfs-location>";
> -- Disable automatic MapJoin conversion so that a reduce-side join is used
> set hive.auto.convert.join=false;
> select * from large_table a join large_table b on a.j = b.j limit 100;
> {code}
> The join query above gets stuck reading all the data from the table (because
> the check is missing) and does not seem to finish in a reasonable time, unlike
> MR or even Tez with MapJoin enabled.
> For reference, the same query takes around 5-6 minutes on MR and 2-3 minutes
> with MapJoin on Tez.
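The effect of the missing check described in the report can be illustrated with a minimal Java sketch. This is not Hive's code and not the attached patch: `EarlyExitSketch`, `LimitedOperator`, `isDone()` and `run()` are hypothetical stand-ins, and the "operator" simply stops forwarding rows after a limit, loosely mimicking `limit 100`. The sketch only shows how many input rows are read with and without a done-check in the reading loop.

```java
import java.util.Iterator;
import java.util.stream.IntStream;

/** Hypothetical illustration only; none of these names come from Hive. */
public class EarlyExitSketch {

    /** Stand-in for a reducer operator that needs no more rows after a limit. */
    static final class LimitedOperator {
        private final int limit;
        private int forwarded = 0;

        LimitedOperator(int limit) {
            this.limit = limit;
        }

        /** Forwards the row only while the limit has not been reached. */
        void process(int row) {
            if (forwarded < limit) {
                forwarded++;
            }
        }

        /** True once the operator has forwarded all the rows it needs. */
        boolean isDone() {
            return forwarded >= limit;
        }
    }

    /** Feeds rows to the operator and returns how many rows were read from the input. */
    static long run(Iterator<Integer> input, LimitedOperator op, boolean checkDone) {
        long rowsRead = 0;
        while (input.hasNext()) {
            if (checkDone && op.isDone()) {
                break; // the operator needs nothing more, so stop reading
            }
            op.process(input.next());
            rowsRead++;
        }
        return rowsRead;
    }

    public static void main(String[] args) {
        long withoutCheck =
            run(IntStream.range(0, 1_000_000).iterator(), new LimitedOperator(100), false);
        long withCheck =
            run(IntStream.range(0, 1_000_000).iterator(), new LimitedOperator(100), true);
        System.out.println("rows read without done-check: " + withoutCheck);
        System.out.println("rows read with done-check:    " + withCheck);
    }
}
```

Without the check the loop drains the entire input even though the operator stopped accepting rows long ago; with the check it stops as soon as the operator reports that it is done.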
--
This message was sent by Atlassian Jira
(v8.20.10#820010)