[
https://issues.apache.org/jira/browse/HIVE-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15011856#comment-15011856
]
Siddharth Seth commented on HIVE-11265:
---------------------------------------
Should've updated this earlier.
The 10 instances being run are actually on a 20-node cluster, with the data
distributed across all 20 nodes. With an even data distribution, roughly 50% of
the data sits on nodes which do not run LLAP instances - and hence becomes
non-local. The scheduler sends requests which cannot be matched for locality to
random nodes - which would explain some fragments going all over the place.
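To illustrate the effect, here's a standalone simulation (not LLAP or Tez
scheduler code; the node and split counts are taken from the numbers in this
issue, and the fall-back-to-a-random-daemon behaviour is the assumption being
modelled):
{code:java}
import java.util.*;

// Illustrative simulation only (not LLAP/Tez scheduler code): 20 data nodes, LLAP daemons
// on 10 of them, 140 splits as in the description below. A split whose preferred (HDFS)
// host runs a daemon is scheduled there; otherwise the scheduler falls back to a random
// daemon, so that split can land on a different node every run.
public class LocalityFallbackSim {
    public static void main(String[] args) {
        final int NODES = 20, LLAP_NODES = 10, SPLITS = 140, RUNS = 10;
        Random rnd = new Random(42);

        // Each split's preferred host is fixed once; data placement does not change between runs.
        int[] preferredHost = new int[SPLITS];
        for (int s = 0; s < SPLITS; s++) {
            preferredHost[s] = rnd.nextInt(NODES);
        }

        // Track which LLAP nodes each split gets assigned to across repeated runs.
        List<Set<Integer>> assignedNodes = new ArrayList<>();
        for (int s = 0; s < SPLITS; s++) {
            assignedNodes.add(new HashSet<>());
        }

        for (int run = 0; run < RUNS; run++) {
            for (int s = 0; s < SPLITS; s++) {
                int node = preferredHost[s] < LLAP_NODES
                        ? preferredHost[s]           // local: a daemon runs on the preferred host
                        : rnd.nextInt(LLAP_NODES);   // non-local: random daemon
                assignedNodes.get(s).add(node);
            }
        }

        long stableSplits = assignedNodes.stream().filter(nodes -> nodes.size() == 1).count();
        double avgDistinct = assignedNodes.stream().mapToInt(Set::size).average().orElse(0);
        System.out.printf("Splits that always ran on the same node: %d of %d (~%.0f%%)%n",
                stableSplits, SPLITS, 100.0 * stableSplits / SPLITS);
        System.out.printf("Average distinct nodes per split over %d runs: %.1f%n", RUNS, avgDistinct);
    }
}
{code}
With these numbers the simulation ends up with roughly half the splits staying
on a single node and an average of about 3-4 distinct nodes per split over 10
runs, which lines up with what the description below reports.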
There's another jira (or there will be one, if it doesn't already exist) to
allow for alternate placement policies instead of relying on HDFS locations.
TEZ-2879 is what allows this.
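As a rough sketch of what such an alternate placement policy could look like
(purely illustrative, not the actual Tez/Hive API; the class and host names are
made up), a split could be mapped deterministically to a daemon by hashing its
path and offset instead of consulting HDFS block locations:
{code:java}
import java.util.Arrays;
import java.util.List;

// Rough sketch of an alternate placement policy (purely illustrative, not the actual
// Tez/Hive API; class and host names are made up): map each split deterministically to a
// daemon by hashing its path and offset instead of consulting HDFS block locations, so the
// same split always prefers the same node and cache affinity is kept even when the data
// lives on a non-LLAP host.
public class HashedSplitPlacement {
    private final List<String> llapHosts; // active LLAP daemon hosts, assumed known up front

    public HashedSplitPlacement(List<String> llapHosts) {
        this.llapHosts = llapHosts;
    }

    // Preferred daemon for a split identified by file path and start offset.
    public String preferredHost(String path, long startOffset) {
        int hash = (path + ":" + startOffset).hashCode();
        return llapHosts.get(Math.floorMod(hash, llapHosts.size()));
    }

    public static void main(String[] args) {
        HashedSplitPlacement policy = new HashedSplitPlacement(
                Arrays.asList("llap-node-01", "llap-node-02", "llap-node-03"));
        // The same (path, offset) pair maps to the same daemon, run after run.
        System.out.println(policy.preferredHost("/warehouse/store_sales/part-0001.orc", 0L));
        System.out.println(policy.preferredHost("/warehouse/store_sales/part-0001.orc", 0L));
    }
}
{code}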
Resolving this as done.
> LLAP: investigate locality issues
> ---------------------------------
>
> Key: HIVE-11265
> URL: https://issues.apache.org/jira/browse/HIVE-11265
> Project: Hive
> Issue Type: Sub-task
> Reporter: Sergey Shelukhin
> Assignee: Siddharth Seth
>
> Running q27 with split-waves 0.9 on 10 nodes x 16 executors, I get 140
> mappers reading store_sales, and ~5 more assorted vertices.
> When running the query repeatedly, one would expect good locality, i.e. the
> same splits (files+stripes) being processed on the same nodes most of the
> time.
> However, this is only the case for 40-50% of the stripes in my experience.
> When the query is run 10 times in a row, an average split (file+stripe) is
> read on ~4 machines. Some are actually read on a different machine every run
> :)
> This affects cache hit ratio.
> Understandably in real scenarios we won't get 100% locality, but we should
> not be getting bad locality in simple cases like this.