HeartSaVioR commented on PR #56542: URL: https://github.com/apache/spark/pull/56542#issuecomment-4725657429
@yugan95 First of all, welcome to the Apache Spark community, and thanks for your first contribution!

I'm neither a PMC member nor a maintainer of the SQL area, but given that this change spans multiple modules with a huge code diff while addressing a specific use case, I wonder whether we should build consensus on the direction first. Apache Spark has a process for this: https://spark.apache.org/improvement-proposals.html

The main purpose is to build consensus in the community that the improvement is something we want to adopt. The Heilmeier questions are not meant to elicit a detailed design, but a high-level design is appreciated (this change obviously warrants one, since it introduces a new distributed data exchange over RPC).

We also probably need a much clearer answer about when users will benefit from this change, especially since it is opt-in rather than opt-out. The 5TB vs 2GB example in the JIRA ticket doesn't feel like a very general case, or it might need more data about the trade-off between the cost saved by eliminating the shuffle and the cost of retrieving data via remote RPC instead of pre-loading the whole shard after the shuffle. If you were a user, which criteria would warrant enabling this feature?

The process requires one PMC member to act as shepherd. If you don't have one to contact, you can probably start on the dev@ mailing list with the shepherd left empty, and I assume you can find a volunteer once there is consensus that the proposal is in "good to go" shape.

Thanks again!

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
