Sure - the use case is reading a large partitioned dataset consisting of tens or hundreds of Parquet files. A reader expects to scan through the data in partition-key order. To improve performance, though, we'd like to begin loading files N+1, N+2, ..., N+k while the consumer is still reading file N, so that it doesn't have to wait every time it opens a new file, and so that any latency or slowness on the backend is hidden. We also don't want a situation where file N+2 is ready but file N+1 isn't, because that doesn't help us: we still have to wait for N+1 to load.
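To make that concrete, here's a minimal sketch of the kind of ordered readahead we have in mind. It assumes plain pyarrow.parquet.read_table as the per-file loader and a thread pool for I/O; the names (scan_in_order, readahead) are just illustrative, not anything that exists in Arrow today.

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow.parquet as pq


def scan_in_order(paths, readahead=4, max_workers=4):
    """Yield one table per path, in order, prefetching up to `readahead` files ahead."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = []          # futures for files not yet handed to the consumer
        remaining = iter(paths)

        def fill():
            # Keep the readahead window full by submitting more reads.
            for path in remaining:
                pending.append(pool.submit(pq.read_table, path))
                if len(pending) >= readahead:
                    break

        fill()
        while pending:
            table = pending.pop(0).result()  # block only on the *next* file in order
            fill()                           # top the window back up
            yield table
```

The point is that results are consumed strictly in submission order: if file N+2 happens to finish before file N+1, it just stays buffered until its turn, and the consumer never blocks on anything other than the next file it actually needs.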
This is why I mention the project is quite similar to the Datasets project - Datasets likely covers all the functionality we would eventually need.

Best,
David

On 4/29/20, Antoine Pitrou <anto...@python.org> wrote:
>
> Le 29/04/2020 à 20:49, David Li a écrit :
>>
>> However, we noticed this doesn’t actually bring us the expected
>> benefits. Consider files A, B, and C being buffered in parallel; right
>> now, all I/O goes through an internal I/O pool, and so several
>> operations for each of the three files get added to the pool. However,
>> they get serviced in some random order, and so it’s possible for file
>> C to finish all its I/O operations before file B can. Then, a consumer
>> is unnecessarily stuck waiting for those to complete.
>
> It would be good if you explained your use case a bit more precisely.
> Are you expecting the files to be read in a particular order? If so, why?
>
> Regards
>
> Antoine.