[ 
https://issues.apache.org/jira/browse/HIVE-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-11265:
------------------------------------
    Description: 
Running q27 with split-waves 0.9 on 10 nodes x 16 executors, I get 140 mappers 
reading store_sales, and ~5 more assorted vertices.
When running the query repeatedly, one would expect good locality, i.e. the 
same splits (files+stripes) being processed on the same nodes most of the time.
However, this is only the case for 40-50% of the stripes in my experience. When 
the query is run 10 times in a row, an average split (file+stripe) is read on 
~4 machines. Some are actually read on a different machine every run :)

This affects cache hit ratio.
Understandably in real scenarios we won't get 100% locality, but we should not 
be getting bad locality in simple cases like this.

  was:
Running q27 with split-waves 0.9 on 10 nodes x 16 executors, I get 140 mappers 
reading store_sales, and 5~ more assorted vertices.
When running the query repeatedly, one would expect good locality, i.e. the 
same splits (files+stripes) being processed on the same nodes most of the time.
However, this is only the case for 40-50% of the stripes in my experience. When 
the query is run 10 times in a row, an average split (file+stripe) is read on 
~4 machine. Some are actually read on a different machine every run :)

This affects cache hit ratio.
Understandably in real scenarios we won't get 100% locality, but we should not 
be getting bad locality in simple cases like this.


> LLAP: investigate locality issues
> ---------------------------------
>
>                 Key: HIVE-11265
>                 URL: https://issues.apache.org/jira/browse/HIVE-11265
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergey Shelukhin
>            Assignee: Siddharth Seth
>
> Running q27 with split-waves 0.9 on 10 nodes x 16 executors, I get 140 
> mappers reading store_sales, and ~5 more assorted vertices.
> When running the query repeatedly, one would expect good locality, i.e. the 
> same splits (files+stripes) being processed on the same nodes most of the 
> time.
> However, this is only the case for 40-50% of the stripes in my experience. 
> When the query is run 10 times in a row, an average split (file+stripe) is 
> read on ~4 machines. Some are actually read on a different machine every run 
> :)
> This affects cache hit ratio.
> Understandably in real scenarios we won't get 100% locality, but we should 
> not be getting bad locality in simple cases like this.
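
The locality numbers quoted above (share of stripes that stay on one node, average number of machines per split, splits that move on every run) could be computed along the following lines. This is only a minimal sketch: it assumes the split-to-node assignments have already been collected into (run_id, split_id, node) tuples, which is a hypothetical format and not something LLAP emits in this shape. The function name and the sample data are also made up for illustration.

```python
from collections import defaultdict


def locality_stats(observations):
    """Summarise split locality across repeated runs of one query.

    observations: iterable of (run_id, split_id, node) tuples, where
    split_id identifies a file+stripe. This input format is an assumption.
    """
    nodes_by_split = defaultdict(set)
    runs_by_split = defaultdict(set)
    for run_id, split_id, node in observations:
        nodes_by_split[split_id].add(node)
        runs_by_split[split_id].add(run_id)

    total = len(nodes_by_split)
    if total == 0:
        raise ValueError("no observations given")

    # Average number of distinct machines each split was read on.
    avg_nodes = sum(len(nodes) for nodes in nodes_by_split.values()) / total
    # Splits that were always read on the same machine.
    stable = sum(1 for nodes in nodes_by_split.values() if len(nodes) == 1)
    # Splits seen in several runs and read on a different machine each time.
    always_moved = sum(
        1
        for split_id, nodes in nodes_by_split.items()
        if len(runs_by_split[split_id]) > 1
        and len(nodes) == len(runs_by_split[split_id])
    )
    return {
        "avg_nodes_per_split": avg_nodes,
        "stable_fraction": stable / total,
        "always_moved_splits": always_moved,
    }


# Made-up sample: three splits observed over three runs.
observations = [
    (1, "f1:0", "node1"), (2, "f1:0", "node1"), (3, "f1:0", "node1"),
    (1, "f1:1", "node1"), (2, "f1:1", "node2"), (3, "f1:1", "node1"),
    (1, "f2:0", "node1"), (2, "f2:0", "node2"), (3, "f2:0", "node3"),
]
stats = locality_stats(observations)
print(stats)
```

With real data, a low stable_fraction together with an avg_nodes_per_split well above 1 would correspond to the behaviour described in the report.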



