Re: [I] Parallel NDJSON file reading [arrow-datafusion]

via GitHub Wed, 27 Dec 2023 07:49:40 -0800


marvinlanhenke commented on issue #8502:
URL: 
https://github.com/apache/arrow-datafusion/issues/8502#issuecomment-1870424777


   @alamb 
   
   I did some basic benchmarking.
   
   ## Methodology:
   
   1. Generated a 60-million-row NDJSON file (~3.7 GB)
   2. Ran the queries with datafusion-cli (before / after the changes)
   3. `create external table json_test stored as json location '/home/ml/data_60m.json';`
   4. `select * from json_test;` and `select * from json_test where a > 5;`
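
For anyone who wants to reproduce step 1, a test file of this kind could be generated with a short script like the one below. This is only a sketch: the function name `write_ndjson` and the columns `b` and `c` are assumptions for illustration. Only column `a` is known from the queries above, and the schema of the original file is not stated.

```python
import json
import random


def write_ndjson(path, num_rows, seed=0):
    """Write num_rows JSON objects to path, one object per line (NDJSON)."""
    rng = random.Random(seed)
    with open(path, "w", encoding="utf-8") as f:
        for i in range(num_rows):
            # "a" is the column the benchmark query filters on (a > 5).
            # "b" and "c" are hypothetical padding columns, not from the original file.
            row = {"a": rng.randint(0, 10), "b": i, "c": f"value-{i}"}
            f.write(json.dumps(row) + "\n")
```

Calling `write_ndjson("data_sample.json", 1_000)` produces a small sample; the benchmark above used roughly 60 million rows, which takes correspondingly longer to write.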
   
   ## Results:
   |query|before|after|
   |---|---|---|
   |`select * from json_test;`| ~24s|~24s|
   |`select * from json_test where a > 5;`| ~26s|~11s|
   
   When applying a filter and running `explain select * from json_test where a > 5;`,
   we can see the repartitioning happening (file_groups: 12).
   
   However, when simply running `select * from json_test;`,
   file_groups remains at 1 and we get no parallel reading.
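
To illustrate what splitting a single NDJSON file into several groups involves, here is a minimal sketch of byte-range partitioning aligned to line boundaries. It is not DataFusion's implementation; the helper names `split_ndjson` and `read_range` are hypothetical, and the sketch assumes each record ends with `\n`.

```python
import json
import os


def split_ndjson(path, num_partitions):
    """Return contiguous (start, end) byte ranges, each starting at a line start."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    boundaries = [0]
    with open(path, "rb") as f:
        for i in range(1, num_partitions):
            target = size * i // num_partitions
            if target <= boundaries[-1]:
                continue
            # Seek to just before the even split point and read to the end of
            # that line, so the boundary lands on the start of a record.
            f.seek(target - 1)
            f.readline()
            pos = f.tell()
            if boundaries[-1] < pos < size:
                boundaries.append(pos)
    boundaries.append(size)
    return list(zip(boundaries[:-1], boundaries[1:]))


def read_range(path, start, end):
    """Parse every record contained in the byte range [start, end)."""
    with open(path, "rb") as f:
        f.seek(start)
        chunk = f.read(end - start)
    return [json.loads(line) for line in chunk.splitlines() if line.strip()]
```

Each range can then be handed to a separate worker, since no record straddles two ranges. A file with very few or very long lines may yield fewer ranges than requested.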
   
   I think this issue relates to #6983.
   I haven't tested it with a DataFrame; however, the issue seems to remain, at
   least for datafusion-cli (tested with JSON and CSV).
   
   
![image](https://github.com/apache/arrow-datafusion/assets/62298609/870f3a0e-2096-4856-bd2e-8aaa30589b0a)
   
   
   
   


