Jing Zhang created HUDI-7241:
--------------------------------
Summary: Spark SQL chooses broadcast hash join for very large HUDI
relation
Key: HUDI-7241
URL: https://issues.apache.org/jira/browse/HUDI-7241
Project: Apache Hudi
Issue Type: Improvement
Reporter: Jing Zhang
Attachments: image-2023-12-20-12-32-25-783.png
After applying HUDI-6941 in our internal HUDI version (based on the 0.14.0
release), the execution plan frequently selects "broadcast hash join" to
broadcast a large HUDI data source.
!image-2023-12-20-12-32-25-783.png!
I tried to investigate the cause of this issue.
In those cases, HadoopFsRelation is used to read the HUDI source, and Spark's
JoinSelection calls HadoopFsRelation#sizeInBytes to estimate the relation
size when deciding whether to use a broadcast join.
HadoopFsRelation#sizeInBytes in turn calls HoodieFileIndex#sizeInBytes. But at
that moment, no partitions have been loaded yet, because Hudi's file-index
implementation uses the lazy file-listing mode by default. So
FileIndex#cachedAllInputFileSlices is an empty map, and
HadoopFsRelation#sizeInBytes returns 0, which leads to the suboptimal join
plan.
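A minimal sketch (not the actual Spark or Hudi source) of why a zero size estimate flips the strategy. The threshold value matches Spark's default for spark.sql.autoBroadcastJoinThreshold (10 MB); the function names here are hypothetical illustrations, not real APIs.

```python
# Default for spark.sql.autoBroadcastJoinThreshold (10 MB).
AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024

def estimated_size(cached_file_slices):
    # Mirrors the idea behind HoodieFileIndex#sizeInBytes: sum the sizes
    # of the listed file slices. With lazy listing, nothing is cached
    # yet, so the map is empty and the sum is 0.
    return sum(cached_file_slices.values())

def choose_join_strategy(cached_file_slices):
    # Spark's JoinSelection broadcasts a side whose estimated size is
    # at or below the threshold; an estimate of 0 always qualifies,
    # no matter how large the table really is.
    if estimated_size(cached_file_slices) <= AUTO_BROADCAST_THRESHOLD:
        return "broadcast hash join"
    return "sort merge join"

# Lazy listing: no partitions loaded yet -> size estimated as 0.
print(choose_join_strategy({}))
# Eager listing of a genuinely large table (50 GB in one file slice).
print(choose_join_strategy({"part=1/slice-1": 50 * 1024 ** 3}))
```

With an empty cache the sketch picks "broadcast hash join" even for a 50 GB table, which is exactly the suboptimal plan described above.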
After applying HUDI-6941, more cases enable the lazy listing mode by default,
so the issue has become more frequent.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)