[GitHub] [arrow-datafusion] alamb commented on pull request #5057: Parquet parallel scan

via GitHub Sat, 28 Jan 2023 05:23:47 -0800


alamb commented on PR #5057:
URL: https://github.com/apache/arrow-datafusion/pull/5057#issuecomment-1407398447


   My measurements suggest this setting can significantly improve performance on a single large Parquet file (over 2x in my test). 👨‍🍳 👌 Very nice.
   
   I tested this by generating a 9 GB Parquet file with https://github.com/tustvold/access-log-gen/
   
   Then using datafusion-cli:
   
   ```sql
   ❯ select avg(request_bytes), avg(response_bytes), avg(response_status), host
     from '/Users/alamb/Software/access-log-gen/logs.9G.parquet'
     group by host;
   ...
   927 rows in set. Query took 2.313 seconds.
   ```
   
   Then I enabled this setting and reran the same query:
   
   ```sql
   ❯ set datafusion.optimizer.repartition_file_scans = true;
   0 rows in set. Query took 0.000 seconds.
   ❯ select avg(request_bytes), avg(response_bytes), avg(response_status), host
     from '/Users/alamb/Software/access-log-gen/logs.9G.parquet'
     group by host;
   
   927 rows in set. Query took 0.962 seconds.
   ```
   
   😮 
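
For reference, the "over 2x" figure can be checked directly from the two timings reported by datafusion-cli. This is only a rough sketch of the arithmetic: each timing is a single run on one machine, and the variable and function names are illustrative, not part of DataFusion.

```python
# Timings copied from the two datafusion-cli runs above, in seconds.
baseline_s = 2.313       # before enabling repartition_file_scans
repartitioned_s = 0.962  # after `set datafusion.optimizer.repartition_file_scans = true`


def speedup(before: float, after: float) -> float:
    """Return how many times faster the `after` run is than the `before` run."""
    if before <= 0 or after <= 0:
        raise ValueError("elapsed times must be positive")
    return before / after


ratio = speedup(baseline_s, repartitioned_s)
# Fraction of the original wall-clock time that was eliminated.
saved_pct = (1 - repartitioned_s / baseline_s) * 100

print(f"speedup: {ratio:.2f}x, wall-clock time saved: {saved_pct:.1f}%")
```

With only one run per configuration, this is an estimate rather than a benchmark; repeating each query several times would give a steadier number.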


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
