s0nskar commented on PR #2373: URL: https://github.com/apache/celeborn/pull/2373#issuecomment-2058422163
> a) If recomputation happens, we should fail the stage and not allow retries - this will prevent data loss.
> b) We should recommend enabling replication to leverage this feature - this minimizes the risk of data loss which would trigger recomputation.

@mridulm @cfmcgrady Enabling replication is expensive, and some users might not want to enable it. Another way to handle this could be to fail the stage only if it has any skew partition read. That way the restriction applies only to stages where there is skew and leaves stages and apps without skew unaffected, which can increase overall reliability for a large percentage of apps. We can make it configurable to give users more control. WDYT?
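
To make the proposal concrete, here is a minimal sketch of the decision, assuming we can tell at recomputation time whether the stage has any skew partition read. Every name below (`SkewAwareRetryPolicy`, `failOnSkewRecompute`, `stageHasSkewPartitionRead`) is hypothetical and only for illustration; none of them is an existing Celeborn or Spark API or config key.

```java
// Hypothetical sketch: decide what to do when a stage needs recomputation.
// None of these names exist in Celeborn or Spark; they only illustrate the idea.
final class SkewAwareRetryPolicy {

    enum Action { FAIL_STAGE, ALLOW_RETRY }

    // Assumed user-facing switch (the "make it configurable" part).
    private final boolean failOnSkewRecompute;

    SkewAwareRetryPolicy(boolean failOnSkewRecompute) {
        this.failOnSkewRecompute = failOnSkewRecompute;
    }

    // stageHasSkewPartitionRead: assumed to be true when at least one
    // partition of the stage is read as a skew partition.
    Action onRecompute(boolean stageHasSkewPartitionRead) {
        if (failOnSkewRecompute && stageHasSkewPartitionRead) {
            // Retrying could lose data for skew reads, so fail fast.
            return Action.FAIL_STAGE;
        }
        // No skew read (or the check is disabled): normal retry is allowed.
        return Action.ALLOW_RETRY;
    }
}
```

With the switch on, only stages that actually read a skew partition lose the ability to retry; everything else keeps the current behavior.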
