[GitHub] [arrow-datafusion] alamb opened a new issue #107: [Datafusion] GROUP BY with a high cardinality doesn't seem to finish

GitBox Mon, 26 Apr 2021 06:22:12 -0700


alamb opened a new issue #107:
URL: https://github.com/apache/arrow-datafusion/issues/107



   *Note*: migrated from original JIRA: 
https://issues.apache.org/jira/browse/ARROW-10275
   
   GROUP BY on a high-cardinality column (one with lots of unique values) 
doesn't seem to finish.
   
   I've tried with both datafusion-cli and this:
   
   [https://github.com/joshuataylor/parquet-group-by/blob/main/src/main.rs]
   
   When grouping by O_ORDERKEY there are ~15 000 000 unique values, so the 
query seems to stall. I've tried adding a LIMIT, but that doesn't help either.
   
   My parquet file: 
[https://drive.google.com/file/d/1aCW7SW2rUVioSePduhgo_91F5-xDMyjp/view?usp=sharing]
   
   datafusion-cli:
   {code:sql}
   CREATE EXTERNAL TABLE something STORED AS PARQUET LOCATION 'demo.parquet';
   select O_ORDERKEY from something group by O_ORDERKEY;
   {code}
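   
   The LIMIT attempt mentioned above was presumably of the following form. 
The limit value of 10 is an assumption for illustration; the exact query is 
not given in this report.
   {code:sql}
   -- Hypothetical LIMIT variant; the value 10 is assumed, not from the report.
   SELECT O_ORDERKEY FROM something GROUP BY O_ORDERKEY LIMIT 10;
   {code}
   Logically, LIMIT is applied after grouping. If the aggregation has to 
consume all of its input before emitting any rows (an assumption about the 
execution plan, not something confirmed here), the limit would not be 
expected to shorten the run.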
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
