RE: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-24 Thread Yong Zhang
in JIRA SPARK-13383 Thanks Yong From: java8...@hotmail.com To: dav...@databricks.com CC: user@spark.apache.org Subject: RE: Spark 1.5.2, why the broadcast join shuffle so much data in the last step Date: Wed, 23 Mar 2016 20:30:42 -0400 Sounds good. I will manual merge this patch on 1.6.1

RE: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Yong Zhang
Sounds good. I will manual merge this patch on 1.6.1, and test again for my case tomorrow on my environment and will update later. Thanks Yong > Date: Wed, 23 Mar 2016 16:20:23 -0700 > Subject: Re: Spark 1.5.2, why the broadcast join shuffle so much data in the > last step &g

Re: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Ted Yu
gt; [date_time#25L,visid_low#461L,visid_high#460L,account_id#976] > > +- Project [soid_e1#30 AS > > account_id#976,visid_high#460L,visid_low#461L,date_time#25L,ip#127] > > +- Filter (instr(event_list#105,202) > 0) > > +-

Re: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Davies Liu
,soid_e1#30] >> > >> > join2.explain(true) >> > == Physical Plan == >> > Filter (date_time#25L > date_time#513L) >> > BroadcastHashJoin [visid_high#948L,visid_low#949L], >> > [visid_high#460L,visid_low#461L], BuildRight >> > Scan ParquetRe

RE: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Yong Zhang
do anything with "broadcast" problem. Thanks Yong > Date: Wed, 23 Mar 2016 10:14:19 -0700 > Subject: Re: Spark 1.5.2, why the broadcast join shuffle so much data in the > last step > From: dav...@databricks.com > To: java8...@hotmail.com > CC: user@spark.apache.org >

Re: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Davies Liu
t; When using the broadcast join, the job still generates 3 stages, same as > SortMergeJoin, but I am not sure this makes sense. > Ideally, in "Broadcast", the first stage scan the "trialRaw" data, using the > filter (instr(event_list#105,202) > 0), which BTW will filter out 99% of > data, then