thinkharderdev commented on issue #30: URL: https://github.com/apache/arrow-ballista/issues/30#issuecomment-1276062272
> > I'm a little confused here. Avoiding fetching 100s of parquet files is more like an optimizer issue?
>
> What I mean is that now, if you give a datafusion plan 5000 parquet files (maybe for a `SUM()` type query), it will likely try to start reading / decoding all 5000 files concurrently, even if the downstream operators can only consume a small fraction at once. This means the resources (file handles and/or memory buffers) for reading all 5000 are held open during the plan.
>
> It also means that if the plan terminates early (e.g. `... LIMIT 10`) a large amount of IO will be done / wasted.
>
> What we would like to happen is that a smaller subset are read and fully processed before new ones are opened. Of course, one challenge with doing this is we still need sufficient parallelism to hide IO latencies.

Within a given partition, `FileStream` will still process the files sequentially, right?
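For concreteness, here is a minimal sketch (assuming the current DataFusion `SessionContext` API; the table name and directory path are placeholders) of the kind of query being discussed. If `FileStream` walks the files in each partition sequentially, the number of files open at once should be roughly bounded by `target_partitions`; the concern above is the case where the scan eagerly opens/decodes far more files than a `LIMIT` query will ever consume:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Bound scan parallelism: at most 8 partitions are processed concurrently.
    let config = SessionConfig::new().with_target_partitions(8);
    let ctx = SessionContext::with_config(config);

    // Placeholder path: imagine a directory containing thousands of parquet files.
    ctx.register_parquet("t", "/data/many_parquet_files/", ParquetReadOptions::default())
        .await?;

    // Early-terminating query: only a handful of files actually need to be read,
    // so eagerly opening/decoding the rest would be wasted IO.
    let df = ctx.sql("SELECT * FROM t LIMIT 10").await?;
    let batches = df.collect().await?;
    println!("got {} record batches", batches.len());
    Ok(())
}
```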
