Ngone51 commented on pull request #28911: URL: https://github.com/apache/spark/pull/28911#issuecomment-656545231
Ah, I get your point, and I can imagine how it may affect the current locality preference. Let's take an example to see if we're on the same page. Say we have executor1 and executor2 on node1, and executor3 and executor4 on node2. executor1 and executor2 each hold 10 bytes of shuffle data (from task1 and task2 respectively), while executor3 and executor4 each hold 40 bytes (from task3 and task4 respectively). Assume all the shuffle data is for the same reduce partition.

With the current implementation of `getLocationsWithLargestOutputs`, we only count an executor's host as a preferred location when [shuffle data for a given reduce partition on this executor] / [total shuffle data] >= `fractionThreshold` (default 0.2). So in this case, only node2 is considered a preferred location, because 40 / (10 + 10 + 40 + 40) = 0.4 >= 0.2, while node1 is not, because 10 / (10 + 10 + 40 + 40) = 0.1 < 0.2. However, node1 could also become a preferred location if we aggregated the shuffle data sizes on the same host first, since then (10 + 10) / (10 + 10 + 40 + 40) = 0.2 >= 0.2.

It looks reasonable to me. cc @attilapiros @tgravescs @jiangxb1987 @holdenk Any ideas?
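To make the comparison concrete, here is a minimal standalone sketch of the two aggregation strategies using the numbers from the example above. This is hypothetical illustration code with made-up names, not the actual `MapOutputTracker.getLocationsWithLargestOutputs` implementation:

```scala
// Hypothetical sketch: per-executor vs. per-host threshold check for
// locality preference. Not the real Spark code, just the example's numbers.
object LocalityPreferenceSketch {
  // (executorId, host, shuffle bytes for the reduce partition)
  val outputs = Seq(
    ("executor1", "node1", 10L),
    ("executor2", "node1", 10L),
    ("executor3", "node2", 40L),
    ("executor4", "node2", 40L)
  )

  val fractionThreshold = 0.2
  val totalBytes = outputs.map(_._3).sum // 100

  // Current behavior: each executor's bytes are compared to the total
  // individually, so node1's two 10-byte outputs each fall below 0.2.
  def preferredHostsPerExecutor: Set[String] =
    outputs.collect {
      case (_, host, bytes) if bytes.toDouble / totalBytes >= fractionThreshold => host
    }.toSet

  // Proposed behavior: sum the bytes of all executors on the same host
  // before applying the threshold, so node1 reaches 20 / 100 = 0.2.
  def preferredHostsPerHost: Set[String] =
    outputs.groupBy(_._2).collect {
      case (host, execs) if execs.map(_._3).sum.toDouble / totalBytes >= fractionThreshold => host
    }.toSet

  def main(args: Array[String]): Unit = {
    println(s"per-executor: $preferredHostsPerExecutor") // Set(node2)
    println(s"per-host:     $preferredHostsPerHost")     // Set(node1, node2)
  }
}
```

Running it prints `Set(node2)` for the per-executor check and `Set(node1, node2)` for the per-host check, matching the arithmetic above.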
