[ https://issues.apache.org/jira/browse/HIVE-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin updated HIVE-11265: ------------------------------------ Description: Running q27 with split-waves 0.9 on 10 nodes x 16 executors, I get 140 mappers reading store_sales, and 5~ more assorted vertices. When running the query repeatedly, one would expect good locality, i.e. the same splits (files+stripes) being processed on the same nodes most of the time. However, this is only the case for 40-50% of the stripes in my experience. When the query is run 10 times in a row, an average split (file+stripe) is read on ~4 machines. Some are actually read on a different machine every run :) This affects cache hit ratio. Understandably in real scenarios we won't get 100% locality, but we should not be getting bad locality in simple cases like this. was: Running q27 with split-waves 0.9 on 10 nodes x 16 executors, I get 140 mappers reading store_sales, and 5~ more assorted vertices. When running the query repeatedly, one would expect good locality, i.e. the same splits (files+stripes) being processed on the same nodes most of the time. However, this is only the case for 40-50% of the stripes in my experience. When the query is run 10 times in a row, an average split (file+stripe) is read on ~4 machine. Some are actually read on a different machine every run :) This affects cache hit ratio. Understandably in real scenarios we won't get 100% locality, but we should not be getting bad locality in simple cases like this. > LLAP: investigate locality issues > --------------------------------- > > Key: HIVE-11265 > URL: https://issues.apache.org/jira/browse/HIVE-11265 > Project: Hive > Issue Type: Sub-task > Reporter: Sergey Shelukhin > Assignee: Siddharth Seth > > Running q27 with split-waves 0.9 on 10 nodes x 16 executors, I get 140 > mappers reading store_sales, and 5~ more assorted vertices. > When running the query repeatedly, one would expect good locality, i.e. the > same splits (files+stripes) being processed on the same nodes most of the > time. > However, this is only the case for 40-50% of the stripes in my experience. > When the query is run 10 times in a row, an average split (file+stripe) is > read on ~4 machines. Some are actually read on a different machine every run > :) > This affects cache hit ratio. > Understandably in real scenarios we won't get 100% locality, but we should > not be getting bad locality in simple cases like this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)