alamb opened a new issue, #8492:
URL: https://github.com/apache/arrow-datafusion/issues/8492

   ### Describe the bug
   
   As reported by @cpcloud  in https://github.com/ibis-project/ibis/pull/7703
   
   The most relevant portions:
   
   
   **DataFusion**
   DataFusion never ran out of memory and had a memory profile similar to 
DuckDB:
   single digit GBs peak memory.
   
   However, it was still extremely slow compared to DuckDB, about 9-10 minutes 
to
   run the whole workload.
   
   Similarly to Polars I compared both the Ibis implementation and a 
hand-written
   SQL version (built from the generated Ibis code). Both had the same 
performance
   
   I also looked at perf top while the DataFusion workload was running and saw 
this:
   
   
![289368886-a33bf90f-8877-43b0-8a4d-328f9721f264](https://github.com/apache/arrow-datafusion/assets/490673/bde9dc16-b7fb-4bde-9af6-b980f3f0e07a)
   
   
   ### To Reproduce
   
   TBD (first thing would be to get a datafusion only reproducer)
   
   Looks like the query, from https://github.com/ibis-project/ibis/pull/7703 is
   
   ```sql
   SELECT
     month,
     ext,
     COUNT(DISTINCT project_name) AS project_count
   FROM (
     SELECT
       project_name,
       DATE_TRUNC('month', uploaded_on) AS month,
       NULLIF(
         REPLACE(
           REPLACE(
             REPLACE(
               REGEXP_REPLACE(
                 REGEXP_REPLACE(
                   REGEXP_MATCH(path, CONCAT('(', '\.([a-z0-9]+)$', ')'))[2],
                   'cxx|cpp|cc|c|hpp|h',
                   'C/C++',
                   'g'
                 ),
                 '^f.*$',
                 'Fortran',
                 'g'
               ),
               'rs',
               'Rust'
             ),
             'go',
             'Go'
           ),
           'asm',
           'Assembly'
         ),
         ''
       ) AS ext
     FROM pypi
     WHERE COALESCE(
         ARRAY_LENGTH(
           REGEXP_MATCH(path, 
'\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
         ) > 0,
         FALSE
       )
       AND NOT COALESCE(ARRAY_LENGTH(REGEXP_MATCH(path, '(^|/)test(|s|ing)')) > 
0, FALSE)
       AND NOT STRPOS(path, '/site-packages/') > 0
   )
   WHERE ext IS NOT NULL
   GROUP BY month, ext
   ORDER BY month DESC, project_count DESC
   ```
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to