alippai commented on pull request #8283:
URL: https://github.com/apache/arrow/pull/8283#issuecomment-699695827


   @andygrove I think you now understand all the issues I raised previously. The scheduler proposal and the recent comments regarding concurrency are superb; I think you are on track. Thanks for listening to my concerns earlier.
   
   My only note: in https://github.com/apache/arrow/pull/8283#issuecomment-699655553 you likely want to read a largish partition in one go. AFAIR HDFS creates Parquet chunks of roughly 128 MB. Reading ~100 MB Parquet files, or large columns holding tens of MBs of data, in a single read will likely increase throughput. On local disks, read sizes beyond a few MBs make no real difference, but with S3, HDFS, GPFS, or NFS it can be beneficial.
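   
   To make the idea concrete, here is a minimal Rust sketch (not from the PR; the file path and the 128 MiB buffer size are assumptions, the latter matching the HDFS block size mentioned above) of pulling a whole partition into memory in one buffered pass instead of many small reads:
   
   ```rust
   use std::fs::File;
   use std::io::{BufReader, Read};
   
   /// Read an entire partition file in one go. The 128 MiB buffer is an
   /// assumed value matching a typical HDFS block size; the right figure
   /// depends on the storage backend.
   fn read_partition(path: &str) -> std::io::Result<Vec<u8>> {
       let file = File::open(path)?;
       let len = file.metadata()?.len() as usize;
       // One large buffered read instead of many small ones: on local
       // disks buffers beyond a few MiB change little, but on S3/HDFS/NFS
       // each round trip is expensive, so fewer, larger reads help.
       let mut reader = BufReader::with_capacity(128 * 1024 * 1024, file);
       let mut buf = Vec::with_capacity(len);
       reader.read_to_end(&mut buf)?;
       Ok(buf)
   }
   
   fn main() -> std::io::Result<()> {
       // "part-0.parquet" is a hypothetical path, for illustration only.
       let bytes = read_partition("part-0.parquet")?;
       println!("read {} bytes in one pass", bytes.len());
       Ok(())
   }
   ```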
   
   I couldn't find how the TPC-H Parquet files you test with are structured; can you give me some pointers?

