Josh Taylor created ARROW-10275:
-----------------------------------

             Summary: [Rust] [Datafusion] GROUP BY with a high cardinality 
doesn't seem to finish
                 Key: ARROW-10275
                 URL: https://issues.apache.org/jira/browse/ARROW-10275
             Project: Apache Arrow
          Issue Type: Bug
          Components: Rust - DataFusion
    Affects Versions: 2.0.0
         Environment: Ubuntu 20.04
            Reporter: Josh Taylor


GROUP BY queries on high-cardinality columns (columns with lots of unique 
values) don't seem to finish.

I've tried with both datafusion-cli and this:

[https://github.com/joshuataylor/parquet-group-by/blob/main/src/main.rs]

When grouping by O_ORDERKEY there are ~15 000 000 unique values, and the 
query seems to stall. I've also tried adding a LIMIT, but that doesn't help 
either.
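
To illustrate what "high cardinality" means for a grouped aggregate, here is a minimal, std-only Rust sketch. It is not DataFusion code and makes no claim about how DataFusion implements GROUP BY; the helper name `group_count` and the synthetic keys are assumptions for illustration only. It mimics a hash-based `COUNT(*) ... GROUP BY key`: when every key is unique, as with O_ORDERKEY, the number of groups equals the number of input rows.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of DataFusion): counts rows per key,
/// roughly what `SELECT key, COUNT(*) ... GROUP BY key` computes.
fn group_count(keys: &[i64]) -> HashMap<i64, u64> {
    let mut groups: HashMap<i64, u64> = HashMap::new();
    for &k in keys {
        // One map entry per distinct key; the count is bumped per row.
        *groups.entry(k).or_insert(0) += 1;
    }
    groups
}

fn main() {
    // Synthetic stand-in for a column like O_ORDERKEY: every value is unique.
    let keys: Vec<i64> = (0..1_000_000).collect();
    let groups = group_count(&keys);
    println!("rows: {}, groups: {}", keys.len(), groups.len());
}
```

A LIMIT applied to the result of such an aggregate would only trim the output after all groups have been built, which may be why adding one makes no difference here; that is a guess, not something confirmed against the DataFusion source.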

My parquet file: 
https://drive.google.com/file/d/1aCW7SW2rUVioSePduhgo_91F5-xDMyjp/view?usp=sharing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
