hiboyang commented on pull request #31876:
URL: https://github.com/apache/spark/pull/31876#issuecomment-812963288


   > (Sorry for the delay, was busy with internal stuff..)
   > 
   > So I have removed all the methods from the interface `Location`. And now, 
the casting to `BlockMangerId` happens in these 4 places:
   > 
   > a) ShuffleBlockFetcherIterator
   > 
   > Castings here should be extracted to a Spark native shuffle reader, so it 
should be fine.
   > 
   > b) DAGScheduler/MapOutputTrakcer
   > 
   > * use the `host` or `executorId` from `BlockManagerId` to manage shuffle 
map outputs, e.g.,
   > 
   > `removeOutputsOnHost(...)`
   > `removeOutputsOnExecutor(...)`
   > 
   > * use the `host` from `BlockManagerId` as the preferred location, e.g.,
   >   `getPreferredLocationsForShuffle`
   >   `getMapLocation`
   > 
   > c) TaskSetManager
   > Using both `host` and `executorId` to update the `HealthyTracker`
   > 
   > d) JsonProtocol
   > convert the `BlockManagerId` into a Json
   > 
   > For cases b,c,d, I'll try to get rid of the casting in later commits. One 
feasible way is to use the pattern match to skip other Locations. At the same 
time, I'm still thinking if there would be a better way to unify the behavior 
of locations. e.g., for storage like HDFS, which doesn't have a specific host, 
we could probably use "*" to represent it. And for `executorId`, although some 
storage doesn't have a meaningful value, each map task actually does have a 
corresponding executorId (but kind of agree that adding `executorId` would be 
confusing ).
   
   
   
   > (Sorry for the delay, was busy with internal stuff..)
   > 
   > So I have removed all the methods from the interface `Location`. And now, 
the casting to `BlockMangerId` happens in these 4 places:
   > 
   > a) ShuffleBlockFetcherIterator
   > 
   > Castings here should be extracted to a Spark native shuffle reader, so it 
should be fine.
   > 
   > b) DAGScheduler/MapOutputTrakcer
   > 
   > * use the `host` or `executorId` from `BlockManagerId` to manage shuffle 
map outputs, e.g.,
   > 
   > `removeOutputsOnHost(...)`
   > `removeOutputsOnExecutor(...)`
   > 
   > * use the `host` from `BlockManagerId` as the preferred location, e.g.,
   >   `getPreferredLocationsForShuffle`
   >   `getMapLocation`
   > 
   > c) TaskSetManager
   > Using both `host` and `executorId` to update the `HealthyTracker`
   > 
   > d) JsonProtocol
   > convert the `BlockManagerId` into a Json
   > 
   > For cases b,c,d, I'll try to get rid of the casting in later commits. One 
feasible way is to use the pattern match to skip other Locations. At the same 
time, I'm still thinking if there would be a better way to unify the behavior 
of locations. e.g., for storage like HDFS, which doesn't have a specific host, 
we could probably use "*" to represent it. And for `executorId`, although some 
storage doesn't have a meaningful value, each map task actually does have a 
corresponding executorId (but kind of agree that adding `executorId` would be 
confusing ).
   
   Thanks for the change! It might be tricky to deal with casting in b,c,d and 
skip other location types.
   
   Echoing on previous suggestion to expose `BlockManagerId` and make it 
extensible, do we want to try that approach? We could rename `BlockManagerId` 
to some other generic name like `BlockLocation`.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to