akpatnam25 commented on PR #2669: URL: https://github.com/apache/celeborn/pull/2669#issuecomment-2285272083
Thanks @FMX!! Yes, I understand the concern. This is basically only since our rack configs will never change. I described more below :slightly_smiling_face: In our environment, there are 2 ways for the master to know the `networkLocation`: - Query external REST API for each worker to get the `networkLocation` -- this is an expensive call for 400+ workers - Worker sends its `networkLocation` as a part of the registration process Given this, we chose the 2nd approach above. Essentially, when a worker sends its `networkLocation`, that `networkLocation` will never change for us. The current logic in the code is that if the `networkLocation` upon reading the snapshot is default, then it will again try to [query the external API at the Master](https://github.com/apache/celeborn/blob/main/master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java#L342-L346) again once more (this is not a desired code path, but is basically the backup in case the `networkLocation` is null in the snapshot). This means that `networkLocation` must be persisted into Ratis which is basically the cause of this PR. cc @mridulm -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
