[
https://issues.apache.org/jira/browse/TEZ-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331241#comment-17331241
]
Rajesh Balamohan commented on TEZ-4245:
---------------------------------------
Not yet, [~jeagles]. I need to think through more corner cases to ensure
that it doesn't regress any other use case.
> Optimise split grouping when locality information is set to null/empty
> ----------------------------------------------------------------------
>
> Key: TEZ-4245
> URL: https://issues.apache.org/jira/browse/TEZ-4245
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Priority: Major
> Attachments: TEZ-4245.1.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In object stores like S3, locality information always shows up as "localhost".
> Having this information in the input split slows down scheduling, as explained
> in HIVE-14060 (https://issues.apache.org/jira/browse/HIVE-14060). Systems like
> Hive remove "localhost" information from splits.
>
> Splits without any locality information (localhost/null/empty)
> should be treated equally, so that split grouping can do meaningful grouping
> based on cluster size. This is to avoid creating small split groups, which
> can significantly increase runtime due to sequential processing (i.e., the same
> map task gets lots of inputs and the system ends up spending time in
> open/seek/close calls on object stores).
>
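
For illustration only, here is a minimal sketch of the idea in the description
above: treat null, empty, and "localhost"-only locations as "no locality", then
size groups from a target group count. The class LocalitySketch, both method
names, and the desiredGroups parameter (standing in for a value derived from
cluster size) are hypothetical. This is not the Tez split grouper API and not
the contents of TEZ-4245.1.patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Tez.
final class LocalitySketch {
    private LocalitySketch() {}

    // Returns true when a split carries no usable locality hint:
    // a null array, an empty array, or only null/empty/"localhost" entries.
    static boolean hasNoLocality(String[] locations) {
        if (locations == null || locations.length == 0) {
            return true;
        }
        for (String host : locations) {
            if (host != null && !host.isEmpty()
                    && !"localhost".equalsIgnoreCase(host)) {
                return false; // at least one real host name
            }
        }
        return true;
    }

    // Spreads numSplits locality-free splits over at most desiredGroups
    // groups, as evenly as possible. desiredGroups is an assumed input
    // that a caller would derive from cluster size.
    static List<Integer> groupSizes(int numSplits, int desiredGroups) {
        List<Integer> sizes = new ArrayList<>();
        if (numSplits <= 0 || desiredGroups <= 0) {
            return sizes;
        }
        int groups = Math.min(numSplits, desiredGroups);
        int base = numSplits / groups;
        int extra = numSplits % groups; // first 'extra' groups get one more
        for (int i = 0; i < groups; i++) {
            sizes.add(base + (i < extra ? 1 : 0));
        }
        return sizes;
    }
}
```

With this sketch, splits whose locations are null, empty, or {"localhost"} all
fall into the same locality-free pool, so grouping depends only on the target
group count rather than on a misleading host name.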
--
This message was sent by Atlassian Jira
(v8.3.4#803005)