Hi All: I find impala choose join algorithm by comparing data transmission size between broad cast and shuffle join while generating physical execution plan. what I am confused is why impala choose broadcast as default implement(such as table do not compute stats) ?
In my experience, shuffle join maybe the better choice, and some of my queries use broadcast join between two subquery with huge resultset and the query costs has difference up to ten times (8s and 80s). I think user should always compute stats for every partition, do you guys have some good suggestion about this. Thanks a lot
