Re: [PR] [SPARK-57487][SQL] Support distributed map join for medium-sized tables via SQL hint [spark]

via GitHub Tue, 16 Jun 2026 20:40:03 -0700


HeartSaVioR commented on PR #56542:
URL: https://github.com/apache/spark/pull/56542#issuecomment-4725657429


   @yugan95 
   First of all, welcome to the Apache Spark community, and thanks for your 
first contribution!
   
   I'm neither a PMC member nor a maintainer of the SQL area, but given the 
large scope of the change (it spans multiple modules with a huge code diff 
while addressing a specific use case), I wonder whether we should build 
consensus on the direction beforehand.
   
   Apache Spark has a process for this - 
https://spark.apache.org/improvement-proposals.html
   
   The main purpose is to build consensus in the community that the 
improvement is something we want to adopt. The Heilmeier questions aren't meant 
to bring up a detailed design, but a high-level design is appreciated (this 
change obviously warrants one, since it introduces a new distributed data 
exchange over RPC). We probably also need a much clearer answer about "when" 
users will benefit from this change, especially since it is opt-in rather than 
opt-out. The 5TB vs 2GB example in the JIRA ticket doesn't feel like a very 
general case, or it might need more data about the trade-off between the cost 
of eliminating the shuffle and the cost of retrieving data via remote RPC 
instead of pre-loading the whole shard after the shuffle. If you were a user, 
which criteria would warrant enabling this feature?
   
   The process requires one PMC member to act as shepherd. If you don't have 
one to contact, you can probably start on the dev@ mailing list with the 
shepherd left empty; I assume you can find a volunteer as long as there is 
consensus that your proposal is in "good to go" shape.
   
   Thanks again!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

