[
https://issues.apache.org/jira/browse/HIVE-23947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170772#comment-17170772
]
Ádám Szita commented on HIVE-23947:
-----------------------------------
Committed to master. Thanks for reviewing [~pvary].
> Cache affinity is unset for text files read by LLAP
> ---------------------------------------------------
>
> Key: HIVE-23947
> URL: https://issues.apache.org/jira/browse/HIVE-23947
> Project: Hive
> Issue Type: Bug
> Components: llap
> Reporter: Ádám Szita
> Assignee: Ádám Szita
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> LLAP relies on HostAffinitySplitLocationProvider to always route the same
> splits to the same LLAP daemons. With such a consistent distribution of data
> among the nodes we can achieve a good cache hit ratio and thus good
> performance.
> For text files this is almost never guaranteed:
> HostAffinitySplitLocationProvider is never used, because HS2 does not set the
> cache affinity flag in the job conf for text inputformat content during
> compile. The launched Tez AM will have to rely on HDFS location information
> to route the splits (and therefore tasks) to the executor nodes. This
> location information might not overlap well with where the actual daemons
> are, or, in the S3 case, the Tez AM will mostly choose executors at random.
> As a result, the cache hit ratio rarely reaches 100%: each time we re-run
> the same query, some disk/S3 reads still occur, which causes poor
> performance. This persists until the same content has been populated into
> all the daemons (after running the query tens or hundreds of times).
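
To illustrate why affinity-based routing helps the cache hit ratio, here is a
minimal sketch of deterministic split-to-daemon mapping. It is not the actual
HostAffinitySplitLocationProvider code: the class SplitAffinitySketch, the
method pickDaemon, the daemon names and the file path are all hypothetical,
and a plain CRC32 modulo is assumed where the real provider may use a
different hashing scheme.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

// Illustrative only: a hypothetical stand-in for affinity-based routing.
public class SplitAffinitySketch {

    // Maps a split, identified by file path and start offset, to one daemon.
    // The same (path, offset) pair always yields the same daemon as long as
    // the daemon list is unchanged, so repeated queries hit a warm cache.
    public static String pickDaemon(String path, long startOffset, List<String> daemons) {
        CRC32 crc = new CRC32();
        crc.update((path + ":" + startOffset).getBytes(StandardCharsets.UTF_8));
        // CRC32.getValue() is a non-negative long, so the modulo is a valid index.
        int index = (int) (crc.getValue() % daemons.size());
        return daemons.get(index);
    }

    public static void main(String[] args) {
        // Hypothetical daemon names and file path.
        List<String> daemons = Arrays.asList("llap-0", "llap-1", "llap-2");
        String first = pickDaemon("/warehouse/t/part-0.txt", 0L, daemons);
        String second = pickDaemon("/warehouse/t/part-0.txt", 0L, daemons);
        System.out.println("same daemon on re-run: " + first.equals(second));
    }
}
```

Without such a mapping (the text file case described above), the target of a
split depends on HDFS block locations or on an effectively random choice, so
a re-run can land on a daemon that has not cached the data yet. Note that a
plain modulo reshuffles most splits when the daemon list changes; a
production scheme would likely need something more stable.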
--
This message was sent by Atlassian Jira
(v8.3.4#803005)