alamb opened a new issue, #8492: URL: https://github.com/apache/arrow-datafusion/issues/8492
### Describe the bug As reported by @cpcloud in https://github.com/ibis-project/ibis/pull/7703 The most relevant portions: **DataFusion** DataFusion never ran out of memory and had a memory profile similar to DuckDB: single digit GBs peak memory. However, it was still extremely slow compared to DuckDB, about 9-10 minutes to run the whole workload. Similarly to Polars I compared both the Ibis implementation and a hand-written SQL version (built from the generated Ibis code). Both had the same performance I also looked at perf top while the DataFusion workload was running and saw this:  ### To Reproduce TBD (first thing would be to get a datafusion only reproducer) Looks like the query, from https://github.com/ibis-project/ibis/pull/7703 is ```sql SELECT month, ext, COUNT(DISTINCT project_name) AS project_count FROM ( SELECT project_name, DATE_TRUNC('month', uploaded_on) AS month, NULLIF( REPLACE( REPLACE( REPLACE( REGEXP_REPLACE( REGEXP_REPLACE( REGEXP_MATCH(path, CONCAT('(', '\.([a-z0-9]+)$', ')'))[2], 'cxx|cpp|cc|c|hpp|h', 'C/C++', 'g' ), '^f.*$', 'Fortran', 'g' ), 'rs', 'Rust' ), 'go', 'Go' ), 'asm', 'Assembly' ), '' ) AS ext FROM pypi WHERE COALESCE( ARRAY_LENGTH( REGEXP_MATCH(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$') ) > 0, FALSE ) AND NOT COALESCE(ARRAY_LENGTH(REGEXP_MATCH(path, '(^|/)test(|s|ing)')) > 0, FALSE) AND NOT STRPOS(path, '/site-packages/') > 0 ) WHERE ext IS NOT NULL GROUP BY month, ext ORDER BY month DESC, project_count DESC ``` ### Expected behavior _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
