Jing Zhang created HUDI-7241:
--------------------------------

             Summary: spark sql chooses broadcast hash join for very large Hudi 
relation
                 Key: HUDI-7241
                 URL: https://issues.apache.org/jira/browse/HUDI-7241
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Jing Zhang
         Attachments: image-2023-12-20-12-32-25-783.png

After applying HUDI-6941 to our internal Hudi version (based on 0.14.0), the 
execution plan frequently selects "broadcast hash join" to broadcast a large 
Hudi data source.
 !image-2023-12-20-12-32-25-783.png! 
I investigated the cause of this issue.
In those cases, HadoopFsRelation is used to read the Hudi source, and Spark's 
JoinSelection calls HadoopFsRelation#sizeInBytes to estimate the relation size 
when deciding whether to use a broadcast join. HadoopFsRelation#sizeInBytes in 
turn calls HoodieFileIndex#sizeInBytes. But at that moment no partitions have 
been loaded yet, because Hudi's file index uses the lazy file-listing mode by 
default. As a result, FileIndex#cachedAllInputFileSlices is an empty map, 
HadoopFsRelation#sizeInBytes returns 0, and a suboptimal join plan is chosen.
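The failure mode above can be sketched as follows. This is a simplified Python 
model of Spark's size-based broadcast check (the real logic lives in Scala in 
JoinSelection); the function name and the decision shape are illustrative, but 
the 10 MB default for spark.sql.autoBroadcastJoinThreshold matches Spark's 
documented default:

```python
# Spark's default for spark.sql.autoBroadcastJoinThreshold is 10 MB.
AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024

def can_broadcast(size_in_bytes: int) -> bool:
    # Simplified model: a relation qualifies for broadcast when its
    # estimated size is non-negative and within the threshold.
    # A reported size of 0 therefore qualifies.
    return 0 <= size_in_bytes <= AUTO_BROADCAST_JOIN_THRESHOLD

# With no partitions loaded, the lazy file index reports 0 bytes,
# so even a very large Hudi table looks broadcastable:
print(can_broadcast(0))                  # True -> broadcast hash join chosen
print(can_broadcast(100 * 1024 * 1024))  # False once the real size is known
```

This illustrates why an empty cachedAllInputFileSlices map is enough to flip 
the plan: the estimate of 0 bytes is indistinguishable from a genuinely tiny 
relation.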
After applying HUDI-6941, more cases enable the lazy listing mode by default, 
so the issue occurs more frequently.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
