alamb opened a new issue #107:
URL: https://github.com/apache/arrow-datafusion/issues/107


   *Note*: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-10275
   
   GROUP BY queries on high-cardinality columns (columns with lots of unique 
values) don't seem to finish.
   
   I've tried both datafusion-cli and this program:
   
   [https://github.com/joshuataylor/parquet-group-by/blob/main/src/main.rs]
   
   Grouping by O_ORDERKEY, which has ~15,000,000 unique values, seems to 
stall. Adding a LIMIT doesn't help either.
   
   My Parquet file: 
[https://drive.google.com/file/d/1aCW7SW2rUVioSePduhgo_91F5-xDMyjp/view?usp=sharing]
   
   datafusion-cli:
   {code:sql}
   CREATE EXTERNAL TABLE something STORED AS PARQUET LOCATION 'demo.parquet';
   SELECT O_ORDERKEY FROM something GROUP BY O_ORDERKEY;
   {code}
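
   For reference, here is a minimal sketch of what the query above is expected to return. It is plain Rust using only the standard library and is not DataFusion's implementation. With no aggregate expressions, `GROUP BY O_ORDERKEY` should behave like `SELECT DISTINCT O_ORDERKEY`: one output row per unique key, with grouping state that grows with the number of unique keys. The function name `group_by_keys` and the generated sample data are hypothetical stand-ins, not taken from the linked program or the Parquet file.

```rust
use std::collections::HashSet;

/// Returns the distinct keys, which is what `SELECT k FROM t GROUP BY k`
/// computes when there are no aggregate expressions.
/// First-seen order is an arbitrary choice here; SQL guarantees no order.
fn group_by_keys(keys: &[i64]) -> Vec<i64> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for &k in keys {
        // `insert` returns true only the first time a key is seen.
        if seen.insert(k) {
            out.push(k);
        }
    }
    out
}

fn main() {
    // Hypothetical stand-in for a high-cardinality column:
    // every value is unique, so every row forms its own group.
    let keys: Vec<i64> = (0..1_000_000).collect();
    let groups = group_by_keys(&keys);
    println!("{} input rows -> {} groups", keys.len(), groups.len());
}
```

   A single pass over the input with a hash set is enough here, so a query of this shape would normally be expected to finish even when almost every row is its own group.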
    


