On Wed, 5 Feb 2020 16:37:17 -0500 David Li <li.david...@gmail.com> wrote:

> As a separate step, prefetching/caching should also make use of a
> global (or otherwise shared) IO thread pool, so that parallel reads of
> different files implicitly coordinate work with each other as well.
> Then, you could queue up reads of several Parquet files, such that a
> slow network call for one file doesn't block progress for other files,
> without issuing reads for all of these files at once.
Typically you can solve this by having enough IO concurrency at once :-)
I'm not sure that sophisticated global coordination (based on which
algorithms?) would bring anything. Would you care to elaborate?

> It's unclear to me what readahead at the record batch level would
> accomplish - Parquet reads each column chunk in a row group as a
> whole, and if the row groups are large, then multiple record batches
> would fall in the same row group, so then we wouldn't gain any
> parallelism, no? (Admittedly, I'm not familiar with the internals
> here.)

Well, if each row group is read as a whole, then readahead can be
applied at the row group level (e.g. read K row groups in advance).

Regards

Antoine.
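
[Editor's illustration] The row-group readahead described above can be sketched as follows. This is a minimal, hypothetical illustration, not Arrow's actual implementation: `read_row_group` is a stand-in for whatever really fetches and decodes one row group, and the pool/queue shape is an assumption.

```python
# Hedged sketch of row-group-level readahead on a thread pool.
# Not Arrow's implementation; read_row_group() is a placeholder.
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def read_row_group(index):
    # Placeholder for the real (possibly slow, network-bound) read.
    return f"row group {index}"

def iter_row_groups(num_row_groups, pool, readahead=2):
    """Yield row groups in order, keeping up to `readahead`
    additional reads in flight beyond the one being consumed."""
    pending = deque()
    next_to_submit = 0
    while next_to_submit < num_row_groups or pending:
        # Keep the pipeline full: the current read plus `readahead` extra.
        while next_to_submit < num_row_groups and len(pending) <= readahead:
            pending.append(pool.submit(read_row_group, next_to_submit))
            next_to_submit += 1
        # Block only on the oldest outstanding read; later ones overlap it.
        yield pending.popleft().result()

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(iter_row_groups(5, pool, readahead=2))

print(results)
```

With K = `readahead`, a slow read of one row group overlaps with the reads of the next K, while results are still delivered in order.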