yugan95 commented on PR #56542:
URL: https://github.com/apache/spark/pull/56542#issuecomment-4726978777

   @HeartSaVioR Thanks for the suggestion on the SPIP process.
   
   A couple of clarifications on the scope and entry point:
   
   This feature is double-gated — it requires `spark.shard.enabled=true` at 
application launch to start the shard infrastructure, and an explicit SQL hint 
(`/*+ DISTMAPJOIN(table) */`) per query to activate the join strategy. Both are 
off by default, so existing behavior is completely unaffected. The planner 
never considers it automatically. There is no cost-based selection. Users opt 
in per query, per table.
   
   The use case is the gap between broadcast join and shuffle join: the build 
side is too large to broadcast (tens of GB) but the probe side is very large 
(PB-scale in our case — a minor correction to the JIRA ticket which mentions 
5PB, not 5TB). Shuffling the probe side at that scale is expensive. Distributed 
map join avoids the probe-side shuffle by partitioning the build side into 
remotely queryable shards, with a Bloom filter to reduce RPC volume. This has 
been running stably in our production environment.
   
   I'm happy to draft a SPIP proposal if the community feels that's the right 
process for this scope. Also glad to provide more detailed benchmarks on the 
trade-off between shuffle cost and RPC cost across different data profiles. 
Open to either path.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to