[ https://issues.apache.org/jira/browse/SPARK-35837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384375#comment-17384375 ]
Sean R. Owen commented on SPARK-35837: -------------------------------------- What would this look like in Spark though? how do you surface recommendations? > Recommendations for Common Query Problems > ----------------------------------------- > > Key: SPARK-35837 > URL: https://issues.apache.org/jira/browse/SPARK-35837 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 3.2.0 > Reporter: Yuming Wang > Priority: Major > > Teradata supports [Recommendations for Common Query > Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg]. > We can implement a similar feature. > 1. Detect the most skew values for join. The user decides whether these are > needed. > 2. Detect the most skew values for window function. The user decides whether > these are needed. > 3. Detect the bucket read, for example, Analyzer add a cast to bucket column. > 4. Recommend the user add a filter condition to the partition column of the > partition table. > 5. Check the condition of join, for example, the result of cast string to > double may be incorrect. > For example: > {code:sql} > 0: jdbc:hive2://localhost:10000/default> EXPLAIN RECOMMENDATION > 0: jdbc:hive2://localhost:10000/default> SELECT a.*, > 0: jdbc:hive2://localhost:10000/default> CASE > 0: jdbc:hive2://localhost:10000/default> WHEN ( NOT ( a.exclude = 1 > 0: jdbc:hive2://localhost:10000/default> AND a.cobrand > = 6 > 0: jdbc:hive2://localhost:10000/default> AND > a.primary_app_id IN ( 1462, 2878, 2571 ) ) ) > 0: jdbc:hive2://localhost:10000/default> AND ( > a.valid_page_count = 1 ) THEN 1 > 0: jdbc:hive2://localhost:10000/default> ELSE 0 > 0: jdbc:hive2://localhost:10000/default> END AS is_singlepage, > 0: jdbc:hive2://localhost:10000/default> ca.bsns_vrtcl_name > 0: jdbc:hive2://localhost:10000/default> FROM (SELECT * > 0: jdbc:hive2://localhost:10000/default> FROM (SELECT *, > 0: jdbc:hive2://localhost:10000/default> 'VI' AS > page_type > 0: jdbc:hive2://localhost:10000/default> FROM tbl1 > 0: jdbc:hive2://localhost:10000/default> UNION > 0: jdbc:hive2://localhost:10000/default> SELECT *, > 0: jdbc:hive2://localhost:10000/default> 'SRP' AS > page_type > 0: jdbc:hive2://localhost:10000/default> FROM tbl2 > 0: jdbc:hive2://localhost:10000/default> UNION > 0: jdbc:hive2://localhost:10000/default> SELECT *, > 0: jdbc:hive2://localhost:10000/default> 'My Garage' > AS page_type > 0: jdbc:hive2://localhost:10000/default> FROM tbl3 > 0: jdbc:hive2://localhost:10000/default> UNION > 0: jdbc:hive2://localhost:10000/default> SELECT *, > 0: jdbc:hive2://localhost:10000/default> 'Motors > Homepage' AS page_type > 0: jdbc:hive2://localhost:10000/default> FROM tbl4) t > 0: jdbc:hive2://localhost:10000/default> WHERE session_start_dt > BETWEEN ( '2020-01-01' ) AND ( > 0: jdbc:hive2://localhost:10000/default> > CURRENT_DATE() - 10 )) a > 0: jdbc:hive2://localhost:10000/default> LEFT JOIN (SELECT item_id, > 0: jdbc:hive2://localhost:10000/default> > item_site_id, > 0: jdbc:hive2://localhost:10000/default> auct_end_dt, > 0: jdbc:hive2://localhost:10000/default> > leaf_categ_id > 0: jdbc:hive2://localhost:10000/default> FROM tbl5 > 0: jdbc:hive2://localhost:10000/default> WHERE auct_end_dt > >= ( '2020-01-01' )) itm > 0: jdbc:hive2://localhost:10000/default> ON a.item_id = > itm.item_id > 0: jdbc:hive2://localhost:10000/default> LEFT JOIN tbl6 ca > 0: jdbc:hive2://localhost:10000/default> ON itm.leaf_categ_id = > ca.leaf_categ_id > 0: jdbc:hive2://localhost:10000/default> AND > itm.item_site_id = ca.site_id; > +-----------------------------------------------------------------------------------------------------------------------------------------+--+ > | result > | > +-----------------------------------------------------------------------------------------------------------------------------------------+--+ > | 1. Detect the most skew values for join > | > | Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) > AND (cast(item_site_id#1444 as decimal(9,0)) = site_id#3022)) | > | table: tbl5 > | > | leaf_categ_id, item_site_id, count > | > | 171243, 0, 115412614 > | > | 176984, 3, 81003252 > | > | 176985, 3, 75035585 > | > | 179680, 77, 71831618 > | > | 171243, 3, 69821716 > | > | Check join: Join LeftOuter, (item_id#3244 = cast(item_id#1439 as > decimal(20,0))) | > | table: tbl1 > | > | item_id, count > | > | null, 254738836 > | > | 232644544775, 35530 > | > | 372028681865, 24189 > | > | 124014973844, 21725 > | > | 303476035378, 17075 > | > | > | > | 2. Detect the most skew values for window function > | > | > | > | 3. Detect the bucket read > | > | Check join: SortMergeJoin [item_id#3244], [cast(item_id#1439 as > decimal(20,0))], LeftOuter | > | The data type do not match for bucket read, consider cast `item_id` to > decimal(18,0). | > | > | > | 4. Add a filter condition to partition column > | > | > | > | 5. Check the dangerous join condition > | > | > | > +-----------------------------------------------------------------------------------------------------------------------------------------+--+ > 27 rows selected (152.685 seconds) > {code} > The optimized query is: > {code:sql} > SELECT a.*, > CASE > WHEN ( NOT ( a.exclude = 1 > AND a.cobrand = 6 > AND a.primary_app_id IN ( 1462, 2878, 2571 ) ) ) > AND ( a.valid_page_count = 1 ) THEN 1 > ELSE 0 > END AS is_singlepage, > ca.bsns_vrtcl_name > FROM (SELECT * > FROM (SELECT *, > 'VI' AS page_type > FROM tbl1 where item_id is not null -- Checked with user, the > null values is not needed. Add a filter condition. > UNION > SELECT *, > 'SRP' AS page_type > FROM tbl2 > UNION > SELECT *, > 'My Garage' AS page_type > FROM tbl3 > UNION > SELECT *, > 'Motors Homepage' AS page_type > FROM tbl4) t > WHERE session_start_dt BETWEEN ( '2020-01-01' ) AND ( > CURRENT_DATE() - 10 )) a > LEFT JOIN (SELECT item_id, > item_site_id, > auct_end_dt, > leaf_categ_id > FROM tbl5 > WHERE auct_end_dt >= ( '2020-01-01' )) itm > ON cast(a.item_id as decimal(18, 0)) = cast(itm.item_id as > decimal(18, 0)) -- Checked with user. cast to decimal(18, 0) is safe. > LEFT JOIN tbl6 ca > ON itm.leaf_categ_id = ca.leaf_categ_id > AND itm.item_site_id = ca.site_id > {code} > > With these optimizations, the query execution time is reduced from 42 minutes > to 6.3 minutes. > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org