Sure -

The use case is to read a large partitioned dataset, consisting of
tens or hundreds of Parquet files. A reader expects to scan through
the data in order of the partition key. However, to improve
performance, we'd like to begin loading files N+1, N+2, ..., N+k
while the consumer is still reading file N, so that it doesn't have to
wait every time it opens a new file, and to help hide any latency or
slowness on the backend. We also don't want to
be in a situation where file N+2 is ready but file N+1 isn't, because
that doesn't help us (we still have to wait for N+1 to load).
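
To make the ordering requirement concrete, here is a rough Python
sketch of the kind of ordered readahead we have in mind (hypothetical
names, not an existing Arrow API; read_file stands in for the actual
Parquet read): up to `window` files load concurrently, but results are
handed back strictly in submission order.

    import concurrent.futures
    from collections import deque

    def read_file(path):
        # Stand-in for the real read, e.g. pyarrow.parquet.read_table(path).
        with open(path, "rb") as f:
            return f.read()

    def ordered_readahead(paths, window=4):
        # Load up to `window` files concurrently, but always hand results
        # back in the original (partition key) order.
        with concurrent.futures.ThreadPoolExecutor(max_workers=window) as pool:
            paths = iter(paths)
            pending = deque()
            # Prime the readahead window.
            for path in paths:
                pending.append(pool.submit(read_file, path))
                if len(pending) >= window:
                    break
            # Block only on the oldest outstanding file, then top up the window.
            for path in paths:
                yield pending.popleft().result()
                pending.append(pool.submit(read_file, path))
            # Drain whatever is left once all files have been submitted.
            while pending:
                yield pending.popleft().result()

Even if file N+2's I/O happens to finish first, the consumer never
sees it before N+1 - which is exactly the property the shared I/O
pool doesn't give us today.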

This is why I mention the project is quite similar to the Datasets
project - Datasets likely covers all the functionality we would
eventually need.

Best,
David

On 4/29/20, Antoine Pitrou <anto...@python.org> wrote:
>
> On 29/04/2020 at 20:49, David Li wrote:
>>
>> However, we noticed this doesn’t actually bring us the expected
>> benefits. Consider files A, B, and C being buffered in parallel; right
>> now, all I/O goes through an internal I/O pool, and so several
>> operations for each of the three files get added to the pool. However,
>> they get serviced in some random order, and so it’s possible for file
>> C to finish all its I/O operations before file B does. A consumer is
>> then unnecessarily stuck waiting on file B, even though file C's data
>> is already buffered.
>
> It would be good if you explained your use case a bit more precisely.
> Are you expecting the files to be read in a particular order?  If so, why?
>
> Regards
>
> Antoine.
>
