alamb commented on PR #8631:
URL: 
https://github.com/apache/arrow-datafusion/pull/8631#issuecomment-1868084932

   I also verified that this PR improves the entire query from 
https://github.com/apache/arrow-datafusion/issues/8492, which also uses 
`regexp_replace`. 
   
   * `main`:  41754 rows in set (40 shown). Query took **146.106** seconds.
   * `this PR`: 41754 rows in set (40 shown). Query took **15.432** seconds.
   
   I double checked at it seems like `regexp_replace` already has a similar 
optimization to avoid recompiling the regexp (though it could be even more 
optimized for constants):  
https://github.com/apache/arrow-datafusion/blob/2ffda2a9a893455e55cd773d9dd4f426a61d8cd3/datafusion/physical-expr/src/regex_expressions.rs#L101
   
   ```sql
   File Edit Options Buffers Tools SQL Help
   SELECT
     month,
     ext,
     COUNT(DISTINCT project_name) AS project_count
   FROM (
     SELECT
       project_name,
       DATE_TRUNC('month', uploaded_on) AS month,
       NULLIF(
         REPLACE(
           REPLACE(
             REPLACE(
               REGEXP_REPLACE(
                 REGEXP_REPLACE(
                   REGEXP_MATCH(path, CONCAT('(', '.([a-z0-9]+)$', ')'))[2],
                   'cxx|cpp|cc|c|hpp|h',
                   'C/C++',
                   'g'
                 ),
                 '^f.*$',
                 'Fortran',
                 'g'
               ),
               'rs',
               'Rust'
             ),
             'go',
             'Go'
           ),
           'asm',
           'Assembly'
         ),
         ''
       ) AS ext
     FROM pypi
     WHERE COALESCE(
         ARRAY_LENGTH(
           REGEXP_MATCH(path, 
'.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
         ) > 0,
         FALSE
       )
       AND NOT COALESCE(ARRAY_LENGTH(REGEXP_MATCH(path, '(^|/)test(|s|ing)')) > 
0, FALSE)
       AND NOT STRPOS(path, '/site-packages/') > 0
   )
   WHERE ext IS NOT NULL
   GROUP BY month, ext
   ORDER BY month DESC, project_count DESC
   ```
   Here is the explain
   
   
   ```
   DataFusion CLI v34.0.0
   
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 
------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                      
                                                                                
     |
   
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 
------------------------------------------------------------------------------------+
   | logical_plan  | Sort: month DESC NULLS FIRST, project_count DESC NULLS 
FIRST                                                                           
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                          
                                                                                
     |
   |               |   Projection: month, ext, COUNT(alias1) AS project_count   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                      
                                                                                
     |
   |               |     Aggregate: groupBy=[[month, ext]], 
aggr=[[COUNT(alias1)]]                                                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                          
                                                                                
     |
   |               |       Aggregate: groupBy=[[month, ext, pypi.project_name 
AS alias1]], aggr=[[]]                                                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                        
                                                                                
     |
   |               |         Projection: pypi.project_name, 
date_trunc(Utf8("month"), pypi.uploaded_on) AS month, 
nullif(replace(replace(replace(regexp_replace(regexp_replace((regexp_match(pypi.path,
 Utf8("(.([a-z0-9]+)$)")))[Int64(2)], Utf8("cxx|cpp|cc|c|hpp|h"), 
Utf8("C/C++"), Utf8("g")), Utf8("^f.*$"), Utf8("Fortran"), Utf8("g")), 
Utf8("rs"), Utf8("Rust")), Utf8("go"), Utf8("Go")), Utf8("asm"), 
Utf8("Assembly")), Utf8("")) AS ext                                             
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                     
                                                                                
     |
   |               |           Filter: 
nullif(replace(replace(replace(regexp_replace(regexp_replace((regexp_match(pypi.path,
 Utf8("(.([a-z0-9]+)$)")))[Int64(2)], Utf8("cxx|cpp|cc|c|hpp|h"), 
Utf8("C/C++"), Utf8("g")), Utf8("^f.*$"), Utf8("Fortran"), Utf8("g")), 
Utf8("rs"), Utf8("Rust")), Utf8("go"), Utf8("Go")), Utf8("asm"), 
Utf8("Assembly")), Utf8("")) IS NOT NULL AND 
coalesce(array_length(regexp_match(pypi.path, 
Utf8(".(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"))) > UInt64(0), 
Boolean(false)) AND NOT coalesce(array_length(regexp_match(pypi.path, 
Utf8("(^|/)test(|s|ing)"))) > UInt64(0), Boolean(false)) AND strpos(pypi.path, 
Utf8("/site-packages/")) <= Int32(0)                                            
                                                                                
                                                                                
                                                                                
                                  
                                                                                
     |
   |               |             TableScan: pypi projection=[project_name, 
project_version, project_release, uploaded_on, path, archive_path, size, hash, 
skip_reason, lines, repository], 
partial_filters=[nullif(replace(replace(replace(regexp_replace(regexp_replace((regexp_match(pypi.path,
 Utf8("(.([a-z0-9]+)$)")))[Int64(2)], Utf8("cxx|cpp|cc|c|hpp|h"), 
Utf8("C/C++"), Utf8("g")), Utf8("^f.*$"), Utf8("Fortran"), Utf8("g")), 
Utf8("rs"), Utf8("Rust")), Utf8("go"), Utf8("Go")), Utf8("asm"), 
Utf8("Assembly")), Utf8("")) IS NOT NULL, 
coalesce(array_length(regexp_match(pypi.path, 
Utf8(".(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"))) > UInt64(0), 
Boolean(false)), NOT coalesce(array_length(regexp_match(pypi.path, 
Utf8("(^|/)test(|s|ing)"))) > UInt64(0), Boolean(false)), strpos(pypi.path, 
Utf8("/site-packages/")) <= Int32(0)]                                           
                                                                                
                                      
                                                                                
     |
   | physical_plan | SortPreservingMergeExec: [month@0 DESC,project_count@2 
DESC]                                                                           
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                          
                                                                                
     |
   |               |   SortExec: expr=[month@0 DESC,project_count@2 DESC]       
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                      
                                                                                
     |
   |               |     ProjectionExec: expr=[month@0 as month, ext@1 as ext, 
COUNT(alias1)@2 as project_count]                                               
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                       
                                                                                
     |
   |               |       AggregateExec: mode=FinalPartitioned, gby=[month@0 
as month, ext@1 as ext], aggr=[COUNT(alias1)]                                   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                        
                                                                                
     |
   |               |         CoalesceBatchesExec: target_batch_size=8192        
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                      
                                                                                
     |
   |               |           RepartitionExec: partitioning=Hash([month@0, 
ext@1], 16), input_partitions=16                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                          
                                                                                
     |
   |               |             AggregateExec: mode=Partial, gby=[month@0 as 
month, ext@1 as ext], aggr=[COUNT(alias1)]                                      
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                        
                                                                                
     |
   |               |               AggregateExec: mode=FinalPartitioned, 
gby=[month@0 as month, ext@1 as ext, alias1@2 as alias1], aggr=[]               
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                             
                                                                                
     |
   |               |                 CoalesceBatchesExec: 
target_batch_size=8192                                                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                            
                                                                                
     |
   |               |                   RepartitionExec: 
partitioning=Hash([month@0, ext@1, alias1@2], 16), input_partitions=16          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                              
                                                                                
     |
   |               |                     AggregateExec: mode=Partial, 
gby=[month@1 as month, ext@2 as ext, project_name@0 as alias1], aggr=[]         
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                
                                                                                
     |
   |               |                       ProjectionExec: expr=[project_name@0 
as project_name, date_trunc(month, uploaded_on@3) as month, 
nullif(replace(replace(replace(regexp_replace(regexp_replace((regexp_match(path@4,
 (.([a-z0-9]+)$))).[2], cxx|cpp|cc|c|hpp|h, C/C++, g), ^f.*$, Fortran, g), rs, 
Rust), go, Go), asm, Assembly), ) as ext]                                       
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                         
                                                                                
     |
   |               |                         CoalesceBatchesExec: 
target_batch_size=8192                                                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                    
                                                                                
     |
   |               |                           FilterExec: 
nullif(replace(replace(replace(regexp_replace(regexp_replace((regexp_match(path@4,
 (.([a-z0-9]+)$))).[2], cxx|cpp|cc|c|hpp|h, C/C++, g), ^f.*$, Fortran, g), rs, 
Rust), go, Go), asm, Assembly), ) IS NOT NULL AND 
coalesce(array_length(regexp_match(path@4, 
.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$)) > 0, false) AND NOT 
coalesce(array_length(regexp_match(path@4, (^|/)test(|s|ing))) > 0, false) AND 
strpos(path@4, /site-packages/) <= 0                                            
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                 
                                                                                
     |
   |               |                             ParquetExec: file_groups={16 
groups: 
[[Users/andrewlamb/Downloads/df_8492/pypi/index-0.parquet:0..245559486], 
[Users/andrewlamb/Downloads/df_8492/pypi/index-0.parquet:245559486..491118972], 
[Users/andrewlamb/Downloads/df_8492/pypi/index-0.parquet:491118972..736678458], 
[Users/andrewlamb/Downloads/df_8492/pypi/index-0.parquet:736678458..982237944], 
[Users/andrewlamb/Downloads/df_8492/pypi/index-0.parquet:982237944..1227797430],
 ...]}, projection=[project_name, project_version, project_release, 
uploaded_on, path, archive_path, size, hash, skip_reason, lines, repository], 
predicate=nullif(replace(replace(replace(regexp_replace(regexp_replace((regexp_match(path@4,
 (.([a-z0-9]+)$))).[2], cxx|cpp|cc|c|hpp|h, C/C++, g), ^f.*$, Fortran, g), rs, 
Rust), go, Go), asm, Assembly), ) IS NOT NULL AND 
coalesce(array_length(regexp_match(path@4, 
.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$)) > 0, false) AND NOT 
coalesce(array_length(regexp_mat
 ch(path@4, (^|/)test(|s|ing))) > 0, false) AND strpos(path@4, /site-packages/) 
<= 0 |
   |               |                                                            
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                      
                                                                                
     |
   
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 
------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.014 seconds.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to