crepererum opened a new issue, #4370:
URL: https://github.com/apache/arrow-datafusion/issues/4370

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   In InfluxDB IOx, some users query data with regex expressions that are so 
simple they don't really need a regex; (I guess) regexes are used for 
convenience or technical reasons (e.g. auto-generated expressions). For "regex 
match" and "regex not match", we have the following cases:
   
   | Case     | Example           | Description | Logical Rewrite (for "match") |
   | -------- | ----------------- | ----------- | ----------------------------- |
   | Empty    | `''`              | Match all   | `col IS NOT NULL` |
   | OR-chain | `'foo\|bar\|baz'` | Any of      | `(col = 'foo') OR (col = 'bar') OR (col = 'baz')`<br><br>`col IN ('foo', 'bar', 'baz')` |
   
   Now the fact that they are expressed as regexes instead of in a simple 
rewritten form has a bunch of performance consequences. First, these regex 
predicates are NOT considered for pruning (because how would you prune an 
arbitrary regex):
   
   
https://github.com/apache/arrow-datafusion/blob/e1204a5bf72c119123404463befb716adbdcff25/datafusion/core/src/physical_optimizer/pruning.rs#L818-L871
   
   Finally, they are NOT pushed down into `ParquetExec`.
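
For contrast, here is a minimal sketch of why the rewritten equality / `IN` form is easy to prune with min/max statistics. This is not DataFusion's pruning code; `ColumnStats` and `may_contain_any` are hypothetical names used only for illustration, and the statistics are assumed to be plain string bounds per row group.

```rust
/// Hypothetical min/max statistics of a string column for one row group.
struct ColumnStats<'a> {
    min: &'a str,
    max: &'a str,
}

/// Returns `true` if the row group may contain a row whose value equals one
/// of `literals`. If none of the literals lies inside `[min, max]`, the
/// whole row group can be skipped without reading it.
fn may_contain_any(stats: &ColumnStats<'_>, literals: &[&str]) -> bool {
    literals
        .iter()
        .any(|lit| stats.min <= *lit && *lit <= stats.max)
}
```

With bounds `"apple"` to `"cherry"`, the literals `foo` and `zoo` both fall outside the range, so that row group could be skipped for `col IN ('foo', 'zoo')`. No comparably cheap check exists for an arbitrary regex.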
   
   **Describe the solution you'd like**
   Transform simple regex expressions into their equivalent logical expression.
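
A rough sketch of the two cases from the table, working on plain strings rather than on DataFusion's expression types. All names here (`SimpleRegex`, `classify`, `rewrite_match`) are made up for illustration. It assumes the pattern has to match the whole value, as the table implies; with unanchored "search" semantics, only a pattern of the form `^(foo|bar|baz)$` would be equivalent to the equality chain.

```rust
/// Outcome of inspecting a regex pattern for one of the simple cases.
#[derive(Debug, PartialEq)]
enum SimpleRegex {
    /// Empty pattern: matches every non-null value.
    MatchAll,
    /// Alternation of plain literals, e.g. `foo|bar|baz`.
    AnyOf(Vec<String>),
    /// Anything else: keep the original regex predicate.
    NotSimple,
}

/// Characters (other than `|`) that carry special meaning in a regex.
const META: &[char] = &[
    '\\', '.', '+', '*', '?', '(', ')', '[', ']', '{', '}', '^', '$',
];

fn classify(pattern: &str) -> SimpleRegex {
    if pattern.is_empty() {
        return SimpleRegex::MatchAll;
    }
    // Be conservative: any metacharacter means "not simple".
    if pattern.chars().any(|c| META.contains(&c)) {
        return SimpleRegex::NotSimple;
    }
    let literals: Vec<String> = pattern.split('|').map(str::to_string).collect();
    // An empty alternative (e.g. `foo|`) matches everything, so bail out.
    if literals.iter().any(|l| l.is_empty()) {
        return SimpleRegex::NotSimple;
    }
    SimpleRegex::AnyOf(literals)
}

/// Builds the SQL text of the rewritten "match" predicate, or `None` if the
/// pattern is not one of the simple cases.
fn rewrite_match(col: &str, pattern: &str) -> Option<String> {
    match classify(pattern) {
        SimpleRegex::MatchAll => Some(format!("{col} IS NOT NULL")),
        SimpleRegex::AnyOf(literals) => {
            // Quote each literal, doubling embedded single quotes.
            let quoted: Vec<String> = literals
                .iter()
                .map(|l| format!("'{}'", l.replace('\'', "''")))
                .collect();
            Some(format!("{col} IN ({})", quoted.join(", ")))
        }
        SimpleRegex::NotSimple => None,
    }
}
```

For example, `rewrite_match("col", "foo|bar|baz")` yields `col IN ('foo', 'bar', 'baz')`, while a pattern such as `fo+|bar` returns `None` and would stay a regex predicate.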
   
   **Describe alternatives you've considered**
   Extend the pruning expression framework and `ParquetExec` to handle regexes. 
However, this seems unnecessarily complex and maybe even counterproductive, 
since regexes per se can be really expensive and complex to evaluate.
   
   **Additional context**
   \-
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
