andygrove commented on pull request #8283:
URL: https://github.com/apache/arrow/pull/8283#issuecomment-700199462


   That makes sense. We already have some funky channel and thread
   interaction in the DataFusion Parquet reader that we could probably adapt
   fairly easily. We could introduce a config setting for the maximum number
   of concurrent Parquet readers.
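   
   A rough sketch of that idea, using only the standard library rather than
   the actual DataFusion reader code: a fixed pool of worker threads pulls
   paths from a shared queue, so no more than `max_readers` files are open
   at once. The names `read_file`, `read_with_limit`, and `max_readers` are
   hypothetical, and `read_file` only simulates a scan.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for opening and scanning one Parquet file.
/// It sleeps briefly and returns a fake row count.
fn read_file(path: &str) -> usize {
    thread::sleep(Duration::from_millis(5));
    path.len()
}

/// Reads every file in `paths` with at most `max_readers` files open at once.
/// Returns the per-file results and the peak number of simultaneously open files.
fn read_with_limit(paths: Vec<String>, max_readers: usize) -> (Vec<(String, usize)>, usize) {
    let max_readers = max_readers.max(1);
    // Shared work queue: each worker takes the next path when it is free.
    let queue = Arc::new(Mutex::new(paths.into_iter()));
    let open_now = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    // Bounded channel: workers block if the consumer falls behind.
    let (tx, rx) = mpsc::sync_channel(max_readers);

    let mut workers = Vec::new();
    for _ in 0..max_readers {
        let queue = Arc::clone(&queue);
        let open_now = Arc::clone(&open_now);
        let peak = Arc::clone(&peak);
        let tx = tx.clone();
        workers.push(thread::spawn(move || loop {
            // The lock is released at the end of this statement.
            let next = queue.lock().unwrap().next();
            let path = match next {
                Some(p) => p,
                None => break, // queue drained
            };
            // Track how many files are open right now, and the peak.
            let n = open_now.fetch_add(1, Ordering::SeqCst) + 1;
            peak.fetch_max(n, Ordering::SeqCst);
            let rows = read_file(&path);
            open_now.fetch_sub(1, Ordering::SeqCst);
            if tx.send((path, rows)).is_err() {
                break; // receiver went away
            }
        }));
    }
    // Drop the original sender so the receiver ends once all workers finish.
    drop(tx);

    let results: Vec<(String, usize)> = rx.iter().collect();
    for w in workers {
        w.join().unwrap();
    }
    (results, peak.load(Ordering::SeqCst))
}
```

   With the 240-file data set mentioned below and `max_readers` set to, say,
   8, `read_with_limit` would return all 240 results without ever having more
   than 8 files open; the second return value reports the observed peak.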
   
   On Mon, Sep 28, 2020 at 12:12 PM Andrew Lamb <[email protected]>
   wrote:
   
   > When I run the TPC-H query, I am testing against a data set that has 240
   > Parquet files. If we just try to run everything at once with async/await
   > and let tokio do the scheduling, we will end up with 240 files open at
   > once, with reads happening against all of them, which is inefficient.
   >
   > One way to avoid this kind of resource usage explosion is for the Parquet
   > reader itself to limit the number of outstanding Tasks that it submits,
   > for example with a tokio channel or something similar.
   >
   > It seems to me the challenge is not really "scheduling" per se, but more
   > "resource allocation".
   



