alamb opened a new issue, #10921:
URL: https://github.com/apache/datafusion/issues/10921

   ### Is your feature request related to a problem or challenge?
   
   Part of https://github.com/apache/datafusion/issues/10918
   
   In order to take advantage of the parquet reader generating StringViewArrays 
( https://github.com/apache/arrow-rs/issues/5530 from @ariesdevil (❤️ ) ) we 
need to make sure DataFusion doesn't immediately cast the array back to 
`StringArray`, which would undo the benefits.
   
   ```
                   ▲                       
   ┌ ─ ─ ─ ─ ─ ─ ┐ │   After filtering,    
     StringArray   │   any unfiltered rows 
   └ ─ ─ ─ ─ ─ ─ ┘ │   are gathered via    
         ...       │   the `take` kernel   
                   │                       
    ┌────────────────────────────┐         
    │                            │         
    │         FilterExec         │         
    │                            │         
    └────────────────────────────┘         
                   ▲                       
   ┌ ─ ─ ─ ─ ─ ─ ┐ │                       
     StringArray   │                       
   └ ─ ─ ─ ─ ─ ─ ┘ │   Reading String data 
                   │   from a Parquet file 
         ...       │   results in          
                   │   StringArrays passed 
   ┌ ─ ─ ─ ─ ─ ─ ┐ │                       
     StringArray   │                       
   └ ─ ─ ─ ─ ─ ─ ┘ │                       
                   │                       
    ┌────────────────────────────┐         
    │                            │         
    │        ParquetExec         │         
    │                            │         
    └────────────────────────────┘         
                                           
                                           
                                           
         Current situation                 
   ```
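
To make the benefit concrete, here is a minimal, self-contained sketch of why keeping view-based strings through a filter is cheaper than gathering into a fresh `StringArray`. It is a toy model, not the arrow-rs implementation: `ToyStringViews` and `View` are hypothetical names, and the layout is simplified (real Arrow views for long strings also carry a 4-byte prefix and a buffer index, and buffers are shared by reference counting rather than moved).

```rust
/// One simplified "view": strings of at most 12 bytes are stored inline,
/// longer ones refer to a range in a shared data buffer.
#[derive(Clone, Copy, Debug, PartialEq)]
enum View {
    Inline { len: u8, bytes: [u8; 12] },
    Ref { len: u32, offset: u32 },
}

/// Toy stand-in for a StringViewArray with a single data buffer (hypothetical type).
struct ToyStringViews {
    views: Vec<View>,
    data: Vec<u8>,
}

impl ToyStringViews {
    fn from_strs(values: &[&str]) -> Self {
        let mut views = Vec::with_capacity(values.len());
        let mut data = Vec::new();
        for v in values {
            let b = v.as_bytes();
            if b.len() <= 12 {
                // Short string: copy the bytes into the view itself.
                let mut bytes = [0u8; 12];
                bytes[..b.len()].copy_from_slice(b);
                views.push(View::Inline { len: b.len() as u8, bytes });
            } else {
                // Long string: append to the data buffer and record where it lives.
                views.push(View::Ref { len: b.len() as u32, offset: data.len() as u32 });
                data.extend_from_slice(b);
            }
        }
        Self { views, data }
    }

    fn get(&self, i: usize) -> &str {
        let bytes = match &self.views[i] {
            View::Inline { len, bytes } => &bytes[..*len as usize],
            View::Ref { len, offset } => {
                let start = *offset as usize;
                &self.data[start..start + *len as usize]
            }
        };
        std::str::from_utf8(bytes).expect("views are built from valid UTF-8")
    }

    /// Filtering copies only the fixed-size views; the string bytes in
    /// `data` are reused as-is instead of being gathered into a new buffer.
    fn filter(self, mask: &[bool]) -> Self {
        let views = self
            .views
            .iter()
            .zip(mask)
            .filter(|(_, keep)| **keep)
            .map(|(v, _)| *v)
            .collect();
        Self { views, data: self.data }
    }
}
```

In this model, filtering touches only the small view entries. Casting back to a contiguous `StringArray` would instead require copying every surviving string's bytes, which is the cost the issue wants to avoid.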
   
   
   ### Describe the solution you'd like
   
   To support a phased rollout of this feature, I recommend we focus initially 
on only the first filtering operation.
   
   Specifically, get to the point where the parquet reader reads data out as 
StringViewArray, like this:
   
   ```
                   ▲              
   ┌ ─ ─ ─ ─ ─ ─ ┐ │              
     StringArray   │              
   └ ─ ─ ─ ─ ─ ─ ┘ │              
         ...       │              
                   │              
    ┌────────────────────────────┐
    │                            │
    │         FilterExec         │
    │                            │
    └────────────────────────────┘
   ┌ ─ ─ ─ ─ ─ ─ ┐ ▲              
    StringViewArr  │              
   │     ay      │ │              
    ─ ─ ─ ─ ─ ─ ─  │              
         ...       │              
                   │              
   ┌ ─ ─ ─ ─ ─ ─ ┐ │              
    StringViewArr  │              
   │     ay      │ │              
    ─ ─ ─ ─ ─ ─ ─  │              
                   │              
    ┌────────────────────────────┐
    │                            │
    │        ParquetExec         │
    │                            │
    └────────────────────────────┘
                                  
                                  
                                  
         Intermediate             
         Situation 1: pass        
         StringViewArray          
         between ParquetExec      
         and FilterExec           
   ```
   
   ### Describe alternatives you've considered
   
   I suggest we:
   1. Make a [configuration 
setting](https://datafusion.apache.org/user-guide/configs.html) like "force 
StringViewArray" for reading parquet so we can test this. When this setting 
is enabled, DataFusion should configure the ParquetExec to produce 
`StringViewArray` for string columns regardless of the type stored in the 
parquet file.
   2. Then work on incrementally rolling out support / testing for various 
filter expressions (especially string functions like substring and 
https://github.com/apache/datafusion/issues/10919).
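
As a rough illustration of step 1, the sketch below shows one way such a setting could rewrite the schema the reader is asked to produce. Everything here is hypothetical and self-contained: `ToyType`, `ToyParquetOptions`, `force_string_view`, and `output_schema` are illustrative names, not DataFusion or arrow-rs APIs, and the real option name and plumbing may differ.

```rust
/// Toy subset of Arrow data types (not the real `arrow::datatypes::DataType`).
#[derive(Clone, Debug, PartialEq)]
enum ToyType {
    Int64,
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Hypothetical reader options; `force_string_view` is an illustrative name.
#[derive(Default)]
struct ToyParquetOptions {
    force_string_view: bool,
}

/// Compute the schema the scan should produce. With the option enabled,
/// every string column is requested as a view type, whatever the file stores;
/// non-string columns pass through unchanged.
fn output_schema(
    file_schema: &[(String, ToyType)],
    opts: &ToyParquetOptions,
) -> Vec<(String, ToyType)> {
    file_schema
        .iter()
        .map(|(name, ty)| {
            let ty = match ty {
                ToyType::Utf8 | ToyType::LargeUtf8 if opts.force_string_view => ToyType::Utf8View,
                other => other.clone(),
            };
            (name.clone(), ty)
        })
        .collect()
}
```

Keeping the option off by default, as in this sketch, would let the behavior be tested and benchmarked without changing existing plans.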
   
   ### Additional context
   
   _No response_


