alamb commented on PR #8631: URL: https://github.com/apache/arrow-datafusion/pull/8631#issuecomment-1868084932
I also verified that this PR improves the entire query from https://github.com/apache/arrow-datafusion/issues/8492, which also uses `regexp_replace`. * `main`: 41754 rows in set (40 shown). Query took **146.106** seconds. * `this PR`: 41754 rows in set (40 shown). Query took **15.432** seconds. I double checked at it seems like `regexp_replace` already has a similar optimization to avoid recompiling the regexp (though it could be even more optimized for constants): https://github.com/apache/arrow-datafusion/blob/2ffda2a9a893455e55cd773d9dd4f426a61d8cd3/datafusion/physical-expr/src/regex_expressions.rs#L101 ```sql File Edit Options Buffers Tools SQL Help SELECT month, ext, COUNT(DISTINCT project_name) AS project_count FROM ( SELECT project_name, DATE_TRUNC('month', uploaded_on) AS month, NULLIF( REPLACE( REPLACE( REPLACE( REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_MATCH(path, CONCAT('(', '.([a-z0-9]+)$', ')'))[2], 'cxx|cpp|cc|c|hpp|h', 'C/C++', 'g' ), '^f.*$', 'Fortran', 'g' ), 'rs', 'Rust' ), 'go', 'Go' ), 'asm', 'Assembly' ), '' ) AS ext FROM pypi WHERE COALESCE( ARRAY_LENGTH( REGEXP_MATCH(path, '.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$') ) > 0, FALSE ) AND NOT COALESCE(ARRAY_LENGTH(REGEXP_MATCH(path, '(^|/)test(|s|ing)')) > 0, FALSE) AND NOT STRPOS(path, '/site-packages/') > 0 ) WHERE ext IS NOT NULL GROUP BY month, ext ORDER BY month DESC, project_count DESC ``` Here is the explain ``` DataFusion CLI v34.0.0 +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ------------------------------------------------------------------------------------+ | logical_plan | Sort: month DESC NULLS FIRST, project_count DESC NULLS FIRST | | | Projection: month, ext, COUNT(alias1) AS project_count | | | Aggregate: groupBy=[[month, ext]], aggr=[[COUNT(alias1)]] | | | Aggregate: groupBy=[[month, ext, pypi.project_name AS alias1]], aggr=[[]] | | | Projection: pypi.project_name, date_trunc(Utf8("month"), pypi.uploaded_on) AS month, nullif(replace(replace(replace(regexp_replace(regexp_replace((regexp_match(pypi.path, Utf8("(.([a-z0-9]+)$)")))[Int64(2)], Utf8("cxx|cpp|cc|c|hpp|h"), Utf8("C/C++"), Utf8("g")), Utf8("^f.*$"), Utf8("Fortran"), Utf8("g")), Utf8("rs"), Utf8("Rust")), Utf8("go"), Utf8("Go")), Utf8("asm"), Utf8("Assembly")), Utf8("")) AS ext | | | Filter: nullif(replace(replace(replace(regexp_replace(regexp_replace((regexp_match(pypi.path, Utf8("(.([a-z0-9]+)$)")))[Int64(2)], Utf8("cxx|cpp|cc|c|hpp|h"), Utf8("C/C++"), Utf8("g")), Utf8("^f.*$"), Utf8("Fortran"), Utf8("g")), Utf8("rs"), Utf8("Rust")), Utf8("go"), Utf8("Go")), Utf8("asm"), Utf8("Assembly")), Utf8("")) IS NOT NULL AND coalesce(array_length(regexp_match(pypi.path, Utf8(".(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"))) > UInt64(0), Boolean(false)) AND NOT coalesce(array_length(regexp_match(pypi.path, Utf8("(^|/)test(|s|ing)"))) > UInt64(0), Boolean(false)) AND strpos(pypi.path, Utf8("/site-packages/")) <= Int32(0) | | | TableScan: pypi projection=[project_name, project_version, project_release, uploaded_on, path, archive_path, size, hash, skip_reason, lines, repository], partial_filters=[nullif(replace(replace(replace(regexp_replace(regexp_replace((regexp_match(pypi.path, Utf8("(.([a-z0-9]+)$)")))[Int64(2)], Utf8("cxx|cpp|cc|c|hpp|h"), Utf8("C/C++"), Utf8("g")), Utf8("^f.*$"), Utf8("Fortran"), Utf8("g")), Utf8("rs"), Utf8("Rust")), Utf8("go"), Utf8("Go")), Utf8("asm"), Utf8("Assembly")), Utf8("")) IS NOT NULL, coalesce(array_length(regexp_match(pypi.path, Utf8(".(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"))) > UInt64(0), Boolean(false)), NOT coalesce(array_length(regexp_match(pypi.path, Utf8("(^|/)test(|s|ing)"))) > UInt64(0), Boolean(false)), strpos(pypi.path, Utf8("/site-packages/")) <= Int32(0)] | | physical_plan | SortPreservingMergeExec: [month@0 DESC,project_count@2 DESC] | | | SortExec: expr=[month@0 DESC,project_count@2 DESC] | | | ProjectionExec: expr=[month@0 as month, ext@1 as ext, COUNT(alias1)@2 as project_count] | | | AggregateExec: mode=FinalPartitioned, gby=[month@0 as month, ext@1 as ext], aggr=[COUNT(alias1)] | | | CoalesceBatchesExec: target_batch_size=8192 | | | RepartitionExec: partitioning=Hash([month@0, ext@1], 16), input_partitions=16 | | | AggregateExec: mode=Partial, gby=[month@0 as month, ext@1 as ext], aggr=[COUNT(alias1)] | | | AggregateExec: mode=FinalPartitioned, gby=[month@0 as month, ext@1 as ext, alias1@2 as alias1], aggr=[] | | | CoalesceBatchesExec: target_batch_size=8192 | | | RepartitionExec: partitioning=Hash([month@0, ext@1, alias1@2], 16), input_partitions=16 | | | AggregateExec: mode=Partial, gby=[month@1 as month, ext@2 as ext, project_name@0 as alias1], aggr=[] | | | ProjectionExec: expr=[project_name@0 as project_name, date_trunc(month, uploaded_on@3) as month, nullif(replace(replace(replace(regexp_replace(regexp_replace((regexp_match(path@4, (.([a-z0-9]+)$))).[2], cxx|cpp|cc|c|hpp|h, C/C++, g), ^f.*$, Fortran, g), rs, Rust), go, Go), asm, Assembly), ) as ext] | | | CoalesceBatchesExec: target_batch_size=8192 | | | FilterExec: nullif(replace(replace(replace(regexp_replace(regexp_replace((regexp_match(path@4, (.([a-z0-9]+)$))).[2], cxx|cpp|cc|c|hpp|h, C/C++, g), ^f.*$, Fortran, g), rs, Rust), go, Go), asm, Assembly), ) IS NOT NULL AND coalesce(array_length(regexp_match(path@4, .(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$)) > 0, false) AND NOT coalesce(array_length(regexp_match(path@4, (^|/)test(|s|ing))) > 0, false) AND strpos(path@4, /site-packages/) <= 0 | | | ParquetExec: file_groups={16 groups: [[Users/andrewlamb/Downloads/df_8492/pypi/index-0.parquet:0..245559486], [Users/andrewlamb/Downloads/df_8492/pypi/index-0.parquet:245559486..491118972], [Users/andrewlamb/Downloads/df_8492/pypi/index-0.parquet:491118972..736678458], [Users/andrewlamb/Downloads/df_8492/pypi/index-0.parquet:736678458..982237944], [Users/andrewlamb/Downloads/df_8492/pypi/index-0.parquet:982237944..1227797430], ...]}, projection=[project_name, project_version, project_release, uploaded_on, path, archive_path, size, hash, skip_reason, lines, repository], predicate=nullif(replace(replace(replace(regexp_replace(regexp_replace((regexp_match(path@4, (.([a-z0-9]+)$))).[2], cxx|cpp|cc|c|hpp|h, C/C++, g), ^f.*$, Fortran, g), rs, Rust), go, Go), asm, Assembly), ) IS NOT NULL AND coalesce(array_length(regexp_match(path@4, .(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$)) > 0, false) AND NOT coalesce(array_length(regexp_mat ch(path@4, (^|/)test(|s|ing))) > 0, false) AND strpos(path@4, /site-packages/) <= 0 | | | | +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ------------------------------------------------------------------------------------+ 2 rows in set. Query took 0.014 seconds. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
