[
https://issues.apache.org/jira/browse/SPARK-35837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384375#comment-17384375
]
Sean R. Owen commented on SPARK-35837:
--------------------------------------
What would this look like in Spark though? how do you surface recommendations?
> Recommendations for Common Query Problems
> -----------------------------------------
>
> Key: SPARK-35837
> URL: https://issues.apache.org/jira/browse/SPARK-35837
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Yuming Wang
> Priority: Major
>
> Teradata supports [Recommendations for Common Query
> Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg].
> We can implement a similar feature.
> 1. Detect the most skew values for join. The user decides whether these are
> needed.
> 2. Detect the most skew values for window function. The user decides whether
> these are needed.
> 3. Detect the bucket read, for example, Analyzer add a cast to bucket column.
> 4. Recommend the user add a filter condition to the partition column of the
> partition table.
> 5. Check the condition of join, for example, the result of cast string to
> double may be incorrect.
> For example:
> {code:sql}
> 0: jdbc:hive2://localhost:10000/default> EXPLAIN RECOMMENDATION
> 0: jdbc:hive2://localhost:10000/default> SELECT a.*,
> 0: jdbc:hive2://localhost:10000/default> CASE
> 0: jdbc:hive2://localhost:10000/default> WHEN ( NOT ( a.exclude = 1
> 0: jdbc:hive2://localhost:10000/default> AND a.cobrand
> = 6
> 0: jdbc:hive2://localhost:10000/default> AND
> a.primary_app_id IN ( 1462, 2878, 2571 ) ) )
> 0: jdbc:hive2://localhost:10000/default> AND (
> a.valid_page_count = 1 ) THEN 1
> 0: jdbc:hive2://localhost:10000/default> ELSE 0
> 0: jdbc:hive2://localhost:10000/default> END AS is_singlepage,
> 0: jdbc:hive2://localhost:10000/default> ca.bsns_vrtcl_name
> 0: jdbc:hive2://localhost:10000/default> FROM (SELECT *
> 0: jdbc:hive2://localhost:10000/default> FROM (SELECT *,
> 0: jdbc:hive2://localhost:10000/default> 'VI' AS
> page_type
> 0: jdbc:hive2://localhost:10000/default> FROM tbl1
> 0: jdbc:hive2://localhost:10000/default> UNION
> 0: jdbc:hive2://localhost:10000/default> SELECT *,
> 0: jdbc:hive2://localhost:10000/default> 'SRP' AS
> page_type
> 0: jdbc:hive2://localhost:10000/default> FROM tbl2
> 0: jdbc:hive2://localhost:10000/default> UNION
> 0: jdbc:hive2://localhost:10000/default> SELECT *,
> 0: jdbc:hive2://localhost:10000/default> 'My Garage'
> AS page_type
> 0: jdbc:hive2://localhost:10000/default> FROM tbl3
> 0: jdbc:hive2://localhost:10000/default> UNION
> 0: jdbc:hive2://localhost:10000/default> SELECT *,
> 0: jdbc:hive2://localhost:10000/default> 'Motors
> Homepage' AS page_type
> 0: jdbc:hive2://localhost:10000/default> FROM tbl4) t
> 0: jdbc:hive2://localhost:10000/default> WHERE session_start_dt
> BETWEEN ( '2020-01-01' ) AND (
> 0: jdbc:hive2://localhost:10000/default>
> CURRENT_DATE() - 10 )) a
> 0: jdbc:hive2://localhost:10000/default> LEFT JOIN (SELECT item_id,
> 0: jdbc:hive2://localhost:10000/default>
> item_site_id,
> 0: jdbc:hive2://localhost:10000/default> auct_end_dt,
> 0: jdbc:hive2://localhost:10000/default>
> leaf_categ_id
> 0: jdbc:hive2://localhost:10000/default> FROM tbl5
> 0: jdbc:hive2://localhost:10000/default> WHERE auct_end_dt
> >= ( '2020-01-01' )) itm
> 0: jdbc:hive2://localhost:10000/default> ON a.item_id =
> itm.item_id
> 0: jdbc:hive2://localhost:10000/default> LEFT JOIN tbl6 ca
> 0: jdbc:hive2://localhost:10000/default> ON itm.leaf_categ_id =
> ca.leaf_categ_id
> 0: jdbc:hive2://localhost:10000/default> AND
> itm.item_site_id = ca.site_id;
> +-----------------------------------------------------------------------------------------------------------------------------------------+--+
> | result
> |
> +-----------------------------------------------------------------------------------------------------------------------------------------+--+
> | 1. Detect the most skew values for join
> |
> | Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020)
> AND (cast(item_site_id#1444 as decimal(9,0)) = site_id#3022)) |
> | table: tbl5
> |
> | leaf_categ_id, item_site_id, count
> |
> | 171243, 0, 115412614
> |
> | 176984, 3, 81003252
> |
> | 176985, 3, 75035585
> |
> | 179680, 77, 71831618
> |
> | 171243, 3, 69821716
> |
> | Check join: Join LeftOuter, (item_id#3244 = cast(item_id#1439 as
> decimal(20,0))) |
> | table: tbl1
> |
> | item_id, count
> |
> | null, 254738836
> |
> | 232644544775, 35530
> |
> | 372028681865, 24189
> |
> | 124014973844, 21725
> |
> | 303476035378, 17075
> |
> |
> |
> | 2. Detect the most skew values for window function
> |
> |
> |
> | 3. Detect the bucket read
> |
> | Check join: SortMergeJoin [item_id#3244], [cast(item_id#1439 as
> decimal(20,0))], LeftOuter |
> | The data type do not match for bucket read, consider cast `item_id` to
> decimal(18,0). |
> |
> |
> | 4. Add a filter condition to partition column
> |
> |
> |
> | 5. Check the dangerous join condition
> |
> |
> |
> +-----------------------------------------------------------------------------------------------------------------------------------------+--+
> 27 rows selected (152.685 seconds)
> {code}
> The optimized query is:
> {code:sql}
> SELECT a.*,
> CASE
> WHEN ( NOT ( a.exclude = 1
> AND a.cobrand = 6
> AND a.primary_app_id IN ( 1462, 2878, 2571 ) ) )
> AND ( a.valid_page_count = 1 ) THEN 1
> ELSE 0
> END AS is_singlepage,
> ca.bsns_vrtcl_name
> FROM (SELECT *
> FROM (SELECT *,
> 'VI' AS page_type
> FROM tbl1 where item_id is not null -- Checked with user, the
> null values is not needed. Add a filter condition.
> UNION
> SELECT *,
> 'SRP' AS page_type
> FROM tbl2
> UNION
> SELECT *,
> 'My Garage' AS page_type
> FROM tbl3
> UNION
> SELECT *,
> 'Motors Homepage' AS page_type
> FROM tbl4) t
> WHERE session_start_dt BETWEEN ( '2020-01-01' ) AND (
> CURRENT_DATE() - 10 )) a
> LEFT JOIN (SELECT item_id,
> item_site_id,
> auct_end_dt,
> leaf_categ_id
> FROM tbl5
> WHERE auct_end_dt >= ( '2020-01-01' )) itm
> ON cast(a.item_id as decimal(18, 0)) = cast(itm.item_id as
> decimal(18, 0)) -- Checked with user. cast to decimal(18, 0) is safe.
> LEFT JOIN tbl6 ca
> ON itm.leaf_categ_id = ca.leaf_categ_id
> AND itm.item_site_id = ca.site_id
> {code}
>
> With these optimizations, the query execution time is reduced from 42 minutes
> to 6.3 minutes.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]