Hi, I hope this answers your question.

You can hint the broadcast in SQL as detailed here:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins-broadcast.html
(thanks Jacek :) )
I'd recommend creating a temporary table with the trimming you use in the
join (for clarity). Also keep in mind that the DataFrame API methods are
often more powerful and readable than using Spark SQL directly (as happens
with the broadcast case, although it depends on personal preference).
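Something along these lines (an untested sketch, keeping your A/B names and
assuming both are registered as tables or views and `spark` is your session):

```
import org.apache.spark.sql.functions.{broadcast, col, trim}

// SQL hint (Spark 2.2+): ask the planner to broadcast the small table A.
val bySql = spark.sql("""
  SELECT /*+ BROADCAST(A) */
         A.*, B.nsf_cards_ratio * 1.00 / A.nsf_on_entry AS nsf_ratio_to_pop
  FROM B
  LEFT JOIN A
    ON trim(A.country) = trim(B.cntry_code)
""")

// Same idea with the DataFrame API: broadcast() marks A for broadcasting,
// and trimming the keys into dedicated columns keeps the join condition
// readable (and easy to reuse in a temporary view).
val a = spark.table("A").withColumn("country_key", trim(col("country")))
val b = spark.table("B").withColumn("cntry_key", trim(col("cntry_code")))
val joined = b.join(broadcast(a), b("cntry_key") === a("country_key"), "left")
```

Note that A sits on the right side of the left join, which is the side Spark
can broadcast for a left outer join.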

Regards,

Ruben

-- 
Rubén Berenguel

On 29 July 2019 at 07:12:30, zhangliyun (kelly...@126.com) wrote:

Hi all:
I want to ask a question about broadcast joins in Spark SQL.


```
select A.*, B.nsf_cards_ratio * 1.00 / A.nsf_on_entry as nsf_ratio_to_pop
from B
left join A
  on trim(A.country) = trim(B.cntry_code);
```
Here A is a small table with only 8 rows, but somehow the statistics of
table A are wrong.
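(My assumption is that the optimizer takes A's size from statistics in the
Hive metastore, which could be recomputed with something like:)

```
// Assumption: table sizes come from Hive metastore statistics, which can be
// recomputed like this (substitute A's real table name):
spark.sql("ANALYZE TABLE A COMPUTE STATISTICS")
```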

A join B comes out as a sort-merge join, while the join key
(trim(A.country) = trim(B.cntry_code)) only has several values (nearly 21
countries). Is there any way to force Spark SQL to use a broadcast join? I
cannot simply enlarge spark.sql.autoBroadcastJoinThreshold, as I do not know
the exact size Spark SQL computes for the table.
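(For reference, this is the setting I mean; I do not know what value would
be safe:)

```
// Tables whose estimated size is below this threshold are broadcast
// automatically; 100 MB here is an arbitrary example, since I do not know
// what size Spark computes for A.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)
```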

I tried to print the physical plan, but it did not show the table size, and
I do not know how large spark.sql.autoBroadcastJoinThreshold would have to
be to turn the sort-merge join into a broadcast join.
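(I printed the plans roughly like this, assuming the query string from above
is in `query`:)

```
// Prints the parsed, analyzed, optimized and physical plans (output below);
// none of them include table sizes. `query` holds the SQL from above.
spark.sql(query).explain(true)
```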


```
== Parsed Logical Plan ==
'Project [ArrayBuffer(cc_base_part1).*, (('cc_base_part1.nsf_cards_ratio * 1.00) / 'cc_rank_agg.nsf_on_entry) AS nsf_ratio_to_pop#369]
+- 'Join LeftOuter, ('trim('cc_base_part1.country) = 'trim('cc_rank_agg.cntry_code))
   :- 'UnresolvedRelation `cc_base_part1`
   +- 'UnresolvedRelation `cc_rank_agg`

== Analyzed Logical Plan ==
cust_id: string, country: string, cc_id: decimal(38,0), bin_hmac: string, credit_card_created_date: string, card_usage: smallint, cc_category: string, cc_size: string, nsf_risk: string, nsf_cards_ratio: decimal(18,2), dt: string, nsf_ratio_to_pop: decimal(38,6)
Project [cust_id#372, country#373, cc_id#374, bin_hmac#375, credit_card_created_date#376, card_usage#377, cc_category#378, cc_size#379, nsf_risk#380, nsf_cards_ratio#381, dt#382, CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(nsf_cards_ratio#381 as decimal(18,2))) * promote_precision(cast(1.00 as decimal(18,2)))), DecimalType(22,4)) as decimal(38,16))) / promote_precision(cast(nsf_on_entry#386 as decimal(38,16)))), DecimalType(38,6)) AS nsf_ratio_to_pop#369]
+- Join LeftOuter, (trim(country#373, None) = trim(cntry_code#383, None))
   :- SubqueryAlias cc_base_part1
   :  +- HiveTableRelation `fpv365h`.`cc_base_part1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [cust_id#372, country#373, cc_id#374, bin_hmac#375, credit_card_created_date#376, card_usage#377, cc_category#378, cc_size#379, nsf_risk#380, nsf_cards_ratio#381], [dt#382]
   +- SubqueryAlias cc_rank_agg
      +- HiveTableRelation `fpv365h`.`cc_rank_agg`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [cntry_code#383, num_tot_cards#384L, num_nsf_cards#385L, nsf_on_entry#386], [dt#387]
```

Does Spark have any command to show the table size when printing the
physical plan? I would appreciate any help with this.


Best regards

Kelly Zhang
