alamb opened a new issue, #12509:
URL: https://github.com/apache/datafusion/issues/12509

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   While working on enabling `StringView` by default in 
https://github.com/apache/datafusion/pull/12092 I noticed that some of the 
clickbench queries got 10% slower and looked into it. 
   
   The plan looks like this:
   
   ```sql
   DataFusion CLI v41.0.0
   
+---------------+----------------------------------------------------------------------------------------------------$
   | plan_type     | plan                                                       
                                        $
   
+---------------+----------------------------------------------------------------------------------------------------$
   | physical_plan | AggregateExec: mode=Final, gby=[], aggr=[count(*)]         
                                        $
   |               |   CoalescePartitionsExec                                   
                                        $
   |               |     AggregateExec: mode=Partial, gby=[], aggr=[count(*)]   
                                        $
   |               |       ProjectionExec: expr=[]                              
                                        $
   |               |         CoalesceBatchesExec: target_batch_size=8192        
                                        $
   |               |           FilterExec: CAST(URL@0 AS Utf8View) LIKE 
%google%                                        $
   |               |             ParquetExec: file_groups={16 groups: 
[[Users/andrewlamb/Software/datafusion/benchmarks/$
   
+---------------+----------------------------------------------------------------------------------------------------$
   2 row(s) fetched.
   Elapsed 0.065 seconds.
   ```
   
   When looking at the flamegraphs, you can see the `CAST` spends a huge amount 
of time validating utf8 (more time than actually evaluating the `LIKE` 
predicate actually):
   <img width="1585" alt="Screenshot 2024-09-17 at 11 06 34 AM" 
src="https://github.com/user-attachments/assets/3a97574f-da5b-4e5c-9de5-555ebadb79ef";>
   
   
   Here are the full flamegraphs for comparison:
   
[q20-flamegraph-main](https://github.com/user-attachments/assets/5cb0f83b-ce56-4a98-b0f7-c79c28d48b15)
   
[q20-flamegraph-stringview](https://github.com/user-attachments/assets/8ed966d5-be8e-4994-8809-e8e271e978f0)
   
   
   I belive the issue is here:
   ```
   |               |           FilterExec: CAST(URL@0 AS Utf8View) LIKE %google%
   ```
   
   This filter first *CAST`s the URL column to Utf8View and then evaluates  
`LIKE`
   
   Converting `BinaryArray` --> `StringArray`as is done without StringView is 
relatively faster because it is done with a single large function call
   
   However, converting `BinaryViewArrar` --> `StringViewArray` is not as it 
makes many small function calls. The parquet reader has a special optimization 
for this as descsribed in "Section 2.1: From binary to strings" of the [Using 
StringView / German Style Strings to Make Queries Faster: Part 1 - Reading 
Parquet](https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/)
 from  @XiangpengHao 
   
   **Describe the solution you'd like**
   I would like this query to go as fast / faster with Utf8View / BinaryView 
enabled.
   
   Bonus points if it went faster even without Utf-8 enabled
   
   **Describe alternatives you've considered**
   
   ### Option 1: `LIKE` for binary
   One option is to skip validating  UTF8 entirely and evaluate `LIKE` directly 
on binary. This would mean if the column is read as binary we could cast the 
argument `'%google%'` to binary and then evaluate `LIKE` directly on the binary 
column. This would skip validaitng utf8 completely
   
   Unfortunately, it appears that the `like` kernel is only implemented for 
`StringArray` and `StringViewArray` at the moment, not BinaryArray:  
https://docs.rs/arrow-string/53.0.0/src/arrow_string/like.rs.html#110-149
   
   Another related option would be to potentially special case the `LIKE` 
rewite in this case for just prefix / contians / suffix -- in this case rewrite 
`<binary> LIKE <const that starts and ends with '%'`> --> `<binary> CONTAINS 
<string>` 
   
   ### Option 2: resolve the column as `Utf8` rather than `Binary`
   
   For some reason the schema of `hits.parquet` (the single file from 
ClickBench) has the `URL` column (and others) as `Utf8` (strings) but the 
`hits_partitioned` file resolves it as Binary. \
   
   We could change the schema resolution logicic to resolve the column as a 
String instead. 
   
   This option is probably slower than option 1 but I think it is more inline 
with what the intended semantics (these columns contain logical stirngs) and 
the parquet reader includes the fast read path for such strings and would be 
more general.
   
   I will file a separate ticket to track this idea
   
   **Additional context**
   <!--
   Add any other context or screenshots about the feature request here.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to