Re: Auto BroadcastJoin optimization failed in latest Spark
With Liancheng's suggestion, I've tried setting spark.sql.hive.convertMetastoreParquet to false, but ANALYZE with NOSCAN still returns -1 for rawDataSize.

Jianshi

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
Re: Auto BroadcastJoin optimization failed in latest Spark
If I run ANALYZE without NOSCAN, then Hive can successfully get the size:

parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417764589, COLUMN_STATS_ACCURATE=true, totalSize=0, numRows=1156, rawDataSize=76296}

Is Hive's PARQUET support broken?

Jianshi
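For reference, the parameters:{...} map printed by DESC EXTENDED can be pulled apart with a few lines of Python. This is only a hypothetical helper for inspecting the stats fields quoted in this thread — the parser itself is an assumption, not anything Spark or Hive ships:

```python
def parse_table_parameters(s: str) -> dict:
    """Parse the parameters:{k=v, ...} fragment from DESC EXTENDED output
    into a dict, converting numeric values to int where possible."""
    inner = s.split("{", 1)[1].rsplit("}", 1)[0]
    params = {}
    for pair in inner.split(","):
        k, _, v = pair.strip().partition("=")
        params[k] = int(v) if v.lstrip("-").isdigit() else v
    return params

# Stats reported above after ANALYZE without NOSCAN (full scan):
full_scan = parse_table_parameters(
    "parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417764589, "
    "COLUMN_STATS_ACCURATE=true, totalSize=0, numRows=1156, rawDataSize=76296}"
)
# The full scan did populate the row-level stats:
print(full_scan["numRows"])      # 1156
print(full_scan["rawDataSize"])  # 76296
```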
Re: Auto BroadcastJoin optimization failed in latest Spark
Sorry for the late follow-up.

I used Hao's DESC EXTENDED command and found some clues:

new (broadcast-broken Spark build):
parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}

old (broadcast-working Spark build):
parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}

It looks like the table size computation failed in the latest version.

I've run the analyze command:

ANALYZE TABLE $table COMPUTE STATISTICS NOSCAN

And the tables are created from Parquet files, e.g.:

CREATE EXTERNAL TABLE table1 (
  code int,
  desc string
)
STORED AS PARQUET
LOCATION '/user/jianshuang/data/dim_tables/table1.parquet'

Does anyone know what went wrong?

Thanks,
Jianshi
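The two parameter dumps above show why the planner behaves differently: it needs some positive size estimate before it can consider broadcasting a table. The fallback logic below is only a sketch of that idea, with values taken from the DESC EXTENDED output in this thread — it is not Spark's actual code path:

```python
def effective_table_size(params):
    """Pick a usable size estimate from Hive table parameters:
    prefer totalSize, fall back to rawDataSize; values of -1 or 0
    mean 'unknown'. This mirrors the idea, not Spark's implementation."""
    for key in ("totalSize", "rawDataSize"):
        if params.get(key, -1) > 0:
            return params[key]
    return None  # no usable estimate; auto broadcast cannot fire

# Values from the DESC EXTENDED output above:
new_build = {"totalSize": 0, "numRows": -1, "rawDataSize": -1}  # broadcast broken
old_build = {"totalSize": 56166}                                # broadcast working

print(effective_table_size(new_build))  # None -> size unknown, no broadcast
print(effective_table_size(old_build))  # 56166
```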
RE: Auto BroadcastJoin optimization failed in latest Spark
Hi Jianshi,

I couldn't reproduce that with the latest MASTER, and I always get a BroadcastHashJoin for managed tables (in .csv files) in my testing. Are there any external tables in your case?

In general, there are a couple of things you can try first (with HiveContext):

1) ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (apply that to all of the tables)

2) SET spark.sql.autoBroadcastJoinThreshold=xxx; (set the threshold to a greater value; it is 1024*1024*10 by default, so make sure the largest dimension table's size in bytes is less than this)

3) Always put the main table (the biggest table) left-most among the inner joins.

DESC EXTENDED tablename;  -- prints the detailed table information, including the statistics (the field "totalSize")

EXPLAIN EXTENDED query;  -- prints the detailed physical plan

Let me know if you still have problems.

Hao
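To make step 2) concrete: the auto broadcast decision is essentially a size-versus-threshold comparison. A back-of-the-envelope check in Python, using the 1024*1024*10 default from Hao's note and the totalSize values seen in this thread (the helper name is made up, and the exact comparison Spark uses may differ):

```python
DEFAULT_AUTO_BROADCAST_JOIN_THRESHOLD = 1024 * 1024 * 10  # 10 MiB default

def can_broadcast(table_size_bytes, threshold=DEFAULT_AUTO_BROADCAST_JOIN_THRESHOLD):
    """A table is eligible for auto broadcast join only if its estimated
    size is known (positive) and does not exceed the configured threshold."""
    return 0 < table_size_bytes <= threshold

# totalSize values seen earlier in this thread:
print(can_broadcast(56166))  # True  -- the old build's ~56 KB dimension table
print(can_broadcast(-1))     # False -- size unknown (stats missing), no broadcast
```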
Re: Auto BroadcastJoin optimization failed in latest Spark
Hi Hao,

I'm using inner joins, as broadcast join didn't work for left joins (thanks for the links to the latest improvements).

And I'm using HiveContext, and it worked in a previous build (10/12) when joining 15 dimension tables.

Jianshi
RE: Auto BroadcastJoin optimization failed in latest Spark
Are all of your join keys the same? And I guess the join types are all "left" joins; https://github.com/apache/spark/pull/3362 is probably what you need.

Also, Spark SQL doesn't currently support multi-way joins (or multi-way broadcast joins); https://github.com/apache/spark/pull/3270 should be another optimization for this.

From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
Sent: Wednesday, November 26, 2014 4:36 PM
To: user
Subject: Auto BroadcastJoin optimization failed in latest Spark

Hi,

I've confirmed that the latest Spark with either Hive 0.12 or 0.13.1 fails to optimize the auto broadcast join in my query. I have a query that joins a huge fact table with 15 tiny dimension tables.

I'm currently using an older version of Spark, built on Oct. 12.

Has anyone else met a similar situation?

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/