Ngone51 commented on pull request #28911: URL: https://github.com/apache/spark/pull/28911#issuecomment-656545231
Ah, I get your point, and I can imagine how it may affect the current locality preference. Let's take an example to see if we're on the same page. Say we have executor1 and executor2 on node1, and executor3 and executor4 on node2. executor1 and executor2 each hold 10 bytes of shuffle data (from task1 and task2 respectively), while executor3 and executor4 each hold 40 bytes (from task3 and task4 respectively). Assume all the shuffle data is for the same reduce partition.

With the current implementation of `getLocationsWithLargestOutputs`, we only count an executor's host as a preferred location when [shuffle data for a given reduce partition on this executor] / [total shuffle data] >= `fractionThreshold` (default 0.2). So in this case, only node2 is considered a preferred location, because 40 / (10 + 10 + 40 + 40) = 0.4 >= 0.2, while node1 is not, because 10 / (10 + 10 + 40 + 40) = 0.1 < 0.2. However, node1 could also become a preferred location if we aggregated the shuffle data sizes on the same host first, since then (10 + 10) / (10 + 10 + 40 + 40) = 0.2 >= 0.2.

It looks reasonable to me. cc @attilapiros @tgravescs @jiangxb1987 @holdenk Any ideas?
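To make the comparison concrete, here is a minimal standalone sketch of the two aggregation strategies using the numbers from the example above. This is hypothetical illustration code with made-up names, not the actual `MapOutputTracker.getLocationsWithLargestOutputs` implementation:

```scala
// Hypothetical sketch: per-executor vs. per-host threshold check for
// locality preference. Not the real Spark code, just the example's numbers.
object LocalityPreferenceSketch {
  // (executorId, host, shuffle bytes for the reduce partition)
  val outputs = Seq(
    ("executor1", "node1", 10L),
    ("executor2", "node1", 10L),
    ("executor3", "node2", 40L),
    ("executor4", "node2", 40L)
  )

  val fractionThreshold = 0.2
  val totalBytes = outputs.map(_._3).sum // 100

  // Current behavior: each executor's bytes are compared to the total
  // individually, so node1's two 10-byte outputs each fall below 0.2.
  def preferredHostsPerExecutor: Set[String] =
    outputs.collect {
      case (_, host, bytes) if bytes.toDouble / totalBytes >= fractionThreshold => host
    }.toSet

  // Proposed behavior: sum the bytes of all executors on the same host
  // before applying the threshold, so node1 reaches 20 / 100 = 0.2.
  def preferredHostsPerHost: Set[String] =
    outputs.groupBy(_._2).collect {
      case (host, execs) if execs.map(_._3).sum.toDouble / totalBytes >= fractionThreshold => host
    }.toSet

  def main(args: Array[String]): Unit = {
    println(s"per-executor: $preferredHostsPerExecutor") // Set(node2)
    println(s"per-host:     $preferredHostsPerHost")     // Set(node1, node2)
  }
}
```

Running it prints `Set(node2)` for the per-executor check and `Set(node1, node2)` for the per-host check, matching the arithmetic above.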
