in
JIRA SPARK-13383
Thanks
Yong
From: java8...@hotmail.com
To: dav...@databricks.com
CC: user@spark.apache.org
Subject: RE: Spark 1.5.2, why the broadcast join shuffle so much data in the
last step
Date: Wed, 23 Mar 2016 20:30:42 -0400
Sounds good.
I will manual merge this patch on 1.6.1
Sounds good.
I will manual merge this patch on 1.6.1, and test again for my case tomorrow on
my environment and will update later.
Thanks
Yong
> Date: Wed, 23 Mar 2016 16:20:23 -0700
> Subject: Re: Spark 1.5.2, why the broadcast join shuffle so much data in the
> last step
&g
gt; [date_time#25L,visid_low#461L,visid_high#460L,account_id#976]
> > +- Project [soid_e1#30 AS
> > account_id#976,visid_high#460L,visid_low#461L,date_time#25L,ip#127]
> > +- Filter (instr(event_list#105,202) > 0)
> > +-
,soid_e1#30]
>> >
>> > join2.explain(true)
>> > == Physical Plan ==
>> > Filter (date_time#25L > date_time#513L)
>> > BroadcastHashJoin [visid_high#948L,visid_low#949L],
>> > [visid_high#460L,visid_low#461L], BuildRight
>> > Scan ParquetRe
do
anything with "broadcast" problem.
Thanks
Yong
> Date: Wed, 23 Mar 2016 10:14:19 -0700
> Subject: Re: Spark 1.5.2, why the broadcast join shuffle so much data in the
> last step
> From: dav...@databricks.com
> To: java8...@hotmail.com
> CC: user@spark.apache.org
>
t; When using the broadcast join, the job still generates 3 stages, same as
> SortMergeJoin, but I am not sure this makes sense.
> Ideally, in "Broadcast", the first stage scan the "trialRaw" data, using the
> filter (instr(event_list#105,202) > 0), which BTW will filter out 99% of
> data, then