akpatnam25 commented on PR #2669:
URL: https://github.com/apache/celeborn/pull/2669#issuecomment-2285272083

   Thanks @FMX!! Yes, I understand the concern. This is basically only since 
our rack configs will never change.  I described more below 
:slightly_smiling_face:
   In our environment, there are 2 ways for the master to know the 
`networkLocation`:
   - Query external REST API for each worker to get the `networkLocation`  -- 
this is an expensive call for 400+ workers
   - Worker sends its `networkLocation` as a part of the registration process
   
   Given this, we chose the 2nd approach above. Essentially, when a worker 
sends its `networkLocation`, that `networkLocation` will never change for us. 
The current logic in the code is that if the `networkLocation` upon reading the 
snapshot is default, then it will again try to [query the external API at the 
Master](https://github.com/apache/celeborn/blob/main/master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java#L342-L346)
 again once more (this is not a desired code path, but is basically the backup 
in case the `networkLocation` is null in the snapshot). This means that 
`networkLocation` must be persisted into Ratis which is basically the cause of 
this PR. cc @mridulm 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to