Ngone51 commented on pull request #31876:
URL: https://github.com/apache/spark/pull/31876#issuecomment-812542883


   (Sorry for the delay, was busy with internal stuff..)
   
   So I have removed all the methods from the interface `Location`. Now, the
casting to `BlockManagerId` happens in these 4 places:
   
   a) ShuffleBlockFetcherIterator
   
   The casts here should be extracted into a Spark-native shuffle reader, so this
should be fine.
   
   b) DAGScheduler/MapOutputTracker
   
   * use the `host` or `executorId` from `BlockManagerId` to manage shuffle map 
outputs, e.g.,
   
   `removeOutputsOnHost(...)`
   `removeOutputsOnExecutor(...)`
   
   * use the `host` from `BlockManagerId` as the preferred location, e.g.,
   `getPreferredLocationsForShuffle`
   `getMapLocation`
   
   
   c) TaskSetManager
   Uses both the `host` and `executorId` to update the `HealthTracker`
   
   d) JsonProtocol
   Converts the `BlockManagerId` into JSON
   
   For cases b, c, and d, I'll try to get rid of the casting in later commits. One
feasible way is to use pattern matching to skip other `Location` types. At the same
time, I'm still thinking about whether there is a better way to unify the behavior
of locations. For example, for storage like HDFS, which doesn't have a specific host,
we could probably use "*" to represent it. And for `executorId`, although some
storage doesn't have a meaningful value, each map task does have a corresponding
executorId (though I kind of agree that adding `executorId` would be confusing).
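   To illustrate the pattern-matching idea, here's a minimal sketch. `Location` and
`BlockManagerId` stand in for the types in this PR; `RemoteStorageLocation` and
`preferredHost` are hypothetical names made up for this example:

   ```scala
   // Simplified stand-ins for the PR's interface and Spark's BlockManagerId.
   trait Location
   case class BlockManagerId(executorId: String, host: String, port: Int) extends Location
   // A hypothetical non-BlockManagerId location, e.g. shuffle data on HDFS.
   case class RemoteStorageLocation(uri: String) extends Location

   // Pattern match to act only on BlockManagerId locations and skip others,
   // instead of casting the Location unconditionally.
   def preferredHost(loc: Location): Option[String] = loc match {
     case bmId: BlockManagerId => Some(bmId.host)
     case _                    => None // e.g. HDFS-backed storage: no host preference
   }
   ```

   With this shape, callers like `getPreferredLocationsForShuffle` could simply get
no locality preference for non-`BlockManagerId` locations rather than failing a cast.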
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
