Re: Auto BroadcastJoin optimization failed in latest Spark
With Liancheng's suggestion, I've tried setting spark.sql.hive.convertMetastoreParquet to false, but ANALYZE with NOSCAN still returns -1 for rawDataSize.

Jianshi

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
Re: Auto BroadcastJoin optimization failed in latest Spark
If I run ANALYZE without NOSCAN, then Hive can successfully get the size:

parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417764589, COLUMN_STATS_ACCURATE=true, totalSize=0, numRows=1156, rawDataSize=76296}

Is Hive's PARQUET support broken?

Jianshi
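For reference, the parameters:{...} map printed by DESC EXTENDED can be pulled apart with a few lines of Python. This is only a hypothetical helper for inspecting the stats fields quoted in this thread — the parser itself is an assumption, not anything Spark or Hive ships:

```python
def parse_table_parameters(s: str) -> dict:
    """Parse the parameters:{k=v, ...} fragment from DESC EXTENDED output
    into a dict, converting numeric values to int where possible."""
    inner = s.split("{", 1)[1].rsplit("}", 1)[0]
    params = {}
    for pair in inner.split(","):
        k, _, v = pair.strip().partition("=")
        params[k] = int(v) if v.lstrip("-").isdigit() else v
    return params

# Stats reported above after ANALYZE without NOSCAN (full scan):
full_scan = parse_table_parameters(
    "parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417764589, "
    "COLUMN_STATS_ACCURATE=true, totalSize=0, numRows=1156, rawDataSize=76296}"
)
# The full scan did populate the row-level stats:
print(full_scan["numRows"])      # 1156
print(full_scan["rawDataSize"])  # 76296
```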
Re: Auto BroadcastJoin optimization failed in latest Spark
Sorry for the late follow-up.

I used Hao's DESC EXTENDED command and found some clues:

new (broadcast-broken Spark build):
parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}

old (broadcast-working Spark build):
parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}

It looks like the table size computation failed in the latest version.

I've run the analyze command:

ANALYZE TABLE $table COMPUTE STATISTICS NOSCAN

And the tables are created from Parquet files, e.g.:

CREATE EXTERNAL TABLE table1 (
  code int,
  desc string
)
STORED AS PARQUET
LOCATION '/user/jianshuang/data/dim_tables/table1.parquet'

Does anyone know what went wrong?

Thanks,
Jianshi
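The two parameter dumps above show why the planner behaves differently: it needs some positive size estimate before it can consider broadcasting a table. The fallback logic below is only a sketch of that idea, with values taken from the DESC EXTENDED output in this thread — it is not Spark's actual code path:

```python
def effective_table_size(params):
    """Pick a usable size estimate from Hive table parameters:
    prefer totalSize, fall back to rawDataSize; values of -1 or 0
    mean 'unknown'. This mirrors the idea, not Spark's implementation."""
    for key in ("totalSize", "rawDataSize"):
        if params.get(key, -1) > 0:
            return params[key]
    return None  # no usable estimate; auto broadcast cannot fire

# Values from the DESC EXTENDED output above:
new_build = {"totalSize": 0, "numRows": -1, "rawDataSize": -1}  # broadcast broken
old_build = {"totalSize": 56166}                                # broadcast working

print(effective_table_size(new_build))  # None -> size unknown, no broadcast
print(effective_table_size(old_build))  # 56166
```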
RE: Auto BroadcastJoin optimization failed in latest Spark
Hi Jianshi,

I couldn't reproduce that with the latest MASTER, and I always get a BroadcastHashJoin for managed tables (in .csv files) in my testing. Are there any external tables in your case?

In general, there are a couple of things you can try first (with HiveContext):

1) ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (apply that to all of the tables)

2) SET spark.sql.autoBroadcastJoinThreshold=xxx; (set the threshold to a greater value; it is 1024*1024*10 by default, so make sure the largest dimension table's size in bytes is less than this)

3) Always put the main table (the biggest table) left-most among the inner joins.

DESC EXTENDED tablename;  -- prints the detailed table information, including the statistics (the field "totalSize")

EXPLAIN EXTENDED query;  -- prints the detailed physical plan

Let me know if you still have problems.

Hao
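To make step 2) concrete: the auto broadcast decision is essentially a size-versus-threshold comparison. A back-of-the-envelope check in Python, using the 1024*1024*10 default from Hao's note and the totalSize values seen in this thread (the helper name is made up, and the exact comparison Spark uses may differ):

```python
DEFAULT_AUTO_BROADCAST_JOIN_THRESHOLD = 1024 * 1024 * 10  # 10 MiB default

def can_broadcast(table_size_bytes, threshold=DEFAULT_AUTO_BROADCAST_JOIN_THRESHOLD):
    """A table is eligible for auto broadcast join only if its estimated
    size is known (positive) and does not exceed the configured threshold."""
    return 0 < table_size_bytes <= threshold

# totalSize values seen earlier in this thread:
print(can_broadcast(56166))  # True  -- the old build's ~56 KB dimension table
print(can_broadcast(-1))     # False -- size unknown (stats missing), no broadcast
```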
Re: Auto BroadcastJoin optimization failed in latest Spark
Hi Hao,

I'm using inner joins, as broadcast join didn't work for left joins (thanks for the links to the latest improvements).

And I'm using HiveContext, and it worked in a previous build (10/12) when joining 15 dimension tables.

Jianshi
RE: Auto BroadcastJoin optimization failed in latest Spark
Are all of your join keys the same? And I guess the join types are all "left" joins; https://github.com/apache/spark/pull/3362 is probably what you need.

Also, Spark SQL doesn't currently support multi-way joins (or multi-way broadcast joins); https://github.com/apache/spark/pull/3270 should be another optimization for this.

From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
Sent: Wednesday, November 26, 2014 4:36 PM
To: user
Subject: Auto BroadcastJoin optimization failed in latest Spark

Hi,

I've confirmed that the latest Spark with either Hive 0.12 or 0.13.1 fails to optimize the auto broadcast join in my query. I have a query that joins a huge fact table with 15 tiny dimension tables.

I'm currently using an older version of Spark, built on Oct. 12.

Has anyone else met a similar situation?

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/