[ https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896046#comment-15896046 ]

gagan taneja commented on SPARK-19145:
--------------------------------------

17/03/04 19:05:32 TRACE HiveSessionState$$anon$1: 
=== Applying Rule org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings ===

Before
'Project [*]
!+- 'Filter (time#88 >= 2017-01-02 19:53:51)
    +- SubqueryAlias person_view, `rule_test`.`person_view`
       +- Project [gen_attr_0#83 AS name#86, gen_attr_1#84 AS age#87, gen_attr_2#85 AS time#88]
          +- SubqueryAlias person_table
             +- Project [gen_attr_0#83, gen_attr_1#84, gen_attr_2#85]
                +- Filter (gen_attr_1#84 > 10)
                   +- SubqueryAlias gen_subquery_0
                      +- Project [name#89 AS gen_attr_0#83, age#90 AS gen_attr_1#84, time#91 AS gen_attr_2#85]
                         +- MetastoreRelation rule_test, person_table

After
'Project [*]
+- Filter (cast(time#88 as string) >= 2017-01-02 19:53:51)
   +- SubqueryAlias person_view, `rule_test`.`person_view`
      +- Project [gen_attr_0#83 AS name#86, gen_attr_1#84 AS age#87, gen_attr_2#85 AS time#88]
         +- SubqueryAlias person_table
            +- Project [gen_attr_0#83, gen_attr_1#84, gen_attr_2#85]
               +- Filter (gen_attr_1#84 > 10)
                  +- SubqueryAlias gen_subquery_0
                     +- Project [name#89 AS gen_attr_0#83, age#90 AS gen_attr_1#84, time#91 AS gen_attr_2#85]
                        +- MetastoreRelation rule_test, person_table
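The trace above shows PromoteStrings rewriting the column side of the comparison to cast(time#88 as string). As a toy illustration of why that rewrite is costly (plain Python, not Spark internals), the string form does per-row formatting and lexicographic comparison, whereas parsing the literal once at plan time would leave a plain integer comparison per row:

```python
from datetime import datetime, timezone

BOUND_STR = "2017-01-02 19:53:51"

def passes_string_filter(ts_micros: int) -> bool:
    # Per-row work: format the stored epoch-microsecond value as a
    # string, then compare lexicographically (the shape of the
    # rewritten filter in the trace above).
    dt = datetime.fromtimestamp(ts_micros / 1_000_000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S") >= BOUND_STR

# Parse the literal once (what the optimizer could do at plan time;
# UTC is assumed here, while Spark would use the session time zone).
BOUND_MICROS = int(
    datetime.strptime(BOUND_STR, "%Y-%m-%d %H:%M:%S")
    .replace(tzinfo=timezone.utc)
    .timestamp()
) * 1_000_000

def passes_long_filter(ts_micros: int) -> bool:
    # Per-row work: a single integer comparison.
    return ts_micros >= BOUND_MICROS

# Both filters select the same rows; only the per-row cost differs.
for micros in (BOUND_MICROS - 1_000_000, BOUND_MICROS, BOUND_MICROS + 1_000_000):
    assert passes_string_filter(micros) == passes_long_filter(micros)
```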

> Timestamp to String casting is slowing the query significantly
> --------------------------------------------------------------
>
>                 Key: SPARK-19145
>                 URL: https://issues.apache.org/jira/browse/SPARK-19145
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: gagan taneja
>
> I have a time series table with a timestamp column.
> The following query
> SELECT COUNT(*) AS `count`
>                FROM `default`.`table`
>                WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 50000
> is significantly SLOWER than 
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','YYYY-MM-DD HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','YYYY-MM-DD HH24:MI:SS−0800') LIMIT 50000
> After investigation I found that in the first query the time column is cast to
> String before the filter is applied.
> In the second query no such cast is performed; the filter compares long values.
> Below is the generated physical plan for the slower execution, followed by the
> physical plan for the faster execution.
> SELECT COUNT(*) AS `count`
>                FROM `default`.`table`
>                WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 50000
> == Physical Plan ==
> CollectLimit 50000
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>    +- Exchange SinglePartition
>       +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3339L])
>          +- *Project
>             +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 19:53:51))
>                +- *FileScan parquet default.cstat[time#3314] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: struct<time:timestamp>
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','YYYY-MM-DD HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','YYYY-MM-DD HH24:MI:SS−0800') LIMIT 50000
> == Physical Plan ==
> CollectLimit 50000
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
>    +- Exchange SinglePartition
>       +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3287L])
>          +- *Project
>             +- *Filter ((isnotnull(time#3262) && (time#3262 >= 1483404831000000)) && (time#3262 <= 1484009631000000))
>                +- *FileScan parquet default.cstat[time#3262] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time), GreaterThanOrEqual(time,2017-01-02 19:53:51.0), LessThanOrEqual(time,2017-01-09..., ReadSchema: struct<time:timestamp>
> In Impala both queries run efficiently, with no performance difference.
> Spark should be able to parse the date string and convert it to a Long/Timestamp
> while generating the optimized logical plan, so that both queries would have
> similar performance.
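The proposed fix amounts to folding the string literals into Spark's internal epoch-microsecond representation at plan time, producing the long-valued filter seen in the fast plan. A minimal sketch in plain Python (not Spark code; the absolute values depend on the session time zone, so only the 7-day spacing of the two bounds is checked against the plan's long literals):

```python
from datetime import datetime, timezone

def literal_to_epoch_micros(ts: str, tz=timezone.utc) -> int:
    """Parse a 'yyyy-MM-dd HH:mm:ss' literal into epoch microseconds,
    the internal representation Spark uses for TimestampType."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    return int(dt.timestamp()) * 1_000_000

lo = literal_to_epoch_micros("2017-01-02 19:53:51")
hi = literal_to_epoch_micros("2017-01-09 19:53:51")

# With both bounds folded to longs, the filter becomes the integer
# comparison from the fast plan: time >= lo AND time <= hi.
# The bounds are exactly 7 days apart, matching the spacing of the
# plan's literals 1483404831000000 and 1484009631000000 (whose absolute
# values reflect the reporter's session time zone, not UTC).
assert hi - lo == 7 * 24 * 3600 * 1_000_000
```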



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
