alamb commented on issue #12509:
URL: https://github.com/apache/datafusion/issues/12509#issuecomment-2366775581

   I have been thinking about this, and I came up with a third option which is 
to "push the casting into the scan"
   
   Consider this plan for q28:
   
   ```
   
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------$
   | plan_type     | plan                                                       
                                                                                
                                                                                
                                                                                
      $
   
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------$
   | logical_plan  | Sort: l DESC NULLS FIRST, fetch=25                         
                                                                                
                                                                                
                                                                                
      $
   |               |   Projection: 
regexp_replace(hits_partitioned.Referer,Utf8("^https?://(?:www\.)?([^/]+)/.*$"),Utf8("\1"))
 AS k, avg(character_length(hits_partitioned.Referer)) AS l, count(*) AS c, 
min(hits_partitioned.Referer)                                                   
                                            $
   |               |     Filter: count(*) > Int64(100000)                       
                                                                                
                                                                                
                                                                                
      $
   |               |       Aggregate: groupBy=[[regexp_replace(__common_expr_1 
AS hits_partitioned.Referer, Utf8("^https?://(?:www\.)?([^/]+)/.*$"), 
Utf8("\1"))]], aggr=[[avg(CAST(character_length(__common_expr_1 AS 
hits_partitioned.Referer) AS Float64)), count(Int64(1)) AS count(*), 
min(hits_partitioned.Referer)]]          $
   |               |         Projection: CAST(hits_partitioned.Referer AS Utf8) 
AS __common_expr_1, hits_partitioned.Referer                                    
                                                                                
                                                                                
      $
   |               |           Filter: hits_partitioned.Referer != 
BinaryView("")                                                                  
                                                                                
                                                                                
                   $
   |               |             TableScan: hits_partitioned 
projection=[Referer], partial_filters=[hits_partitioned.Referer != 
BinaryView("")]  
   ```
   
   The
   ```
    Projection: CAST(hits_partitioned.Referer AS Utf8) AS __common_expr_1, 
hits_partitioned.Referer
   ```
   
   Is what is causing a non trivial slowdown.
   
   The issue is that `hits_partitioned.Referer` is read as a `BinaryView`
   
   The problem is that BinaryView --> Utf8View conversion is much slower than 
reading Utf8View directly out of the parquet file due to the Utf8 optimization 
described by @XiangpengHao  in "Section 2.1: From binary to strings" of the 
[string view 
blog](https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/).
   
   
   ## Option 3: Implement push down casting  (maybe as an Analyzer rule??) 
   
   The theory here is that some readers( such as the parquet reader) can 
produce the data more effiicently in a particular format than creating it first 
in one format before datafusion casts it to another.
   
   So the plan above would basically push the cast down so the parquet reader 
read the `hits_partitioned.Referer` as a `Utf8View` to begin with
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to