Ngone51 commented on pull request #31876: URL: https://github.com/apache/spark/pull/31876#issuecomment-812542883
(Sorry for the delay, I was busy with internal stuff.) So I have removed all the methods from the interface `Location`. Now, the casting to `BlockManagerId` happens in these 4 places:

a) `ShuffleBlockFetcherIterator`
   The castings here should be extracted into a Spark-native shuffle reader, so this one should be fine.

b) `DAGScheduler`/`MapOutputTracker`
   * uses the `host` or `executorId` from `BlockManagerId` to manage shuffle map outputs, e.g., `removeOutputsOnHost(...)`, `removeOutputsOnExecutor(...)`
   * uses the `host` from `BlockManagerId` as the preferred location, e.g., `getPreferredLocationsForShuffle`, `getMapLocation`

c) `TaskSetManager`
   Uses both `host` and `executorId` to update the `HealthTracker`.

d) `JsonProtocol`
   Converts the `BlockManagerId` into JSON.

For cases b, c, and d, I'll try to get rid of the casting in later commits. One feasible way is to use a pattern match that skips other `Location` implementations, as sketched below. At the same time, I'm still thinking about whether there is a better way to unify the behavior of locations. E.g., for storage like HDFS, which doesn't have a specific host, we could probably use "*" to represent it. And as for `executorId`, although some storage doesn't have a meaningful value for it, each map task actually does have a corresponding executor ID (though I kind of agree that adding `executorId` would be confusing).
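A minimal sketch of the pattern-match idea for case c (all types below are simplified stand-ins for illustration; the real `BlockManagerId` and `HealthTracker` in Spark have different constructors and signatures):

```scala
// Simplified stand-in types -- NOT the real Spark classes.
trait Location                                                  // the new interface from this PR
case class BlockManagerId(host: String, executorId: String) extends Location

class HealthTracker {                                           // illustrative stand-in
  def updateFailures(host: String, executorId: String): Unit =
    println(s"recording failure on $host/$executorId")
}

object TaskSetManagerSketch {
  val healthTracker = new HealthTracker

  // Pattern match instead of an unconditional cast: only BlockManagerId-backed
  // locations update the tracker; any other Location is skipped safely.
  def updateHealthTracker(location: Location): Unit = location match {
    case bmId: BlockManagerId =>
      healthTracker.updateFailures(bmId.host, bmId.executorId)
    case _ => // e.g., a remote-storage location: nothing host-specific to record
  }
}
```

This would keep the `Location` interface free of `host`/`executorId` while letting the Spark-native code paths recover them whenever the concrete type is `BlockManagerId`.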
