On Wed, 5 Feb 2020 16:37:17 -0500 David Li <li.david...@gmail.com> wrote:

> As a separate step, prefetching/caching should also make use of a
> global (or otherwise shared) IO thread pool, so that parallel reads of
> different files implicitly coordinate work with each other as well.
> Then, you could queue up reads of several Parquet files, such that a
> slow network call for one file doesn't block progress for other files,
> without issuing reads for all of these files at once.
Typically you can solve this by having enough IO concurrency at once :-)
I'm not sure that sophisticated global coordination (based on which
algorithms?) would bring anything. Would you care to elaborate?

> It's unclear to me what readahead at the record batch level would
> accomplish - Parquet reads each column chunk in a row group as a
> whole, and if the row groups are large, then multiple record batches
> would fall in the same row group, so then we wouldn't gain any
> parallelism, no? (Admittedly, I'm not familiar with the internals
> here.)

Well, if each row group is read as a whole, then readahead can be
applied at the row group level (e.g. read K row groups in advance).

Regards

Antoine.
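
[Editor's illustration] The row-group readahead described above can be sketched as follows. This is a minimal, hypothetical illustration, not Arrow's actual implementation: `read_row_group` is a stand-in for whatever really fetches and decodes one row group, and the pool/queue shape is an assumption.

```python
# Hedged sketch of row-group-level readahead on a thread pool.
# Not Arrow's implementation; read_row_group() is a placeholder.
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def read_row_group(index):
    # Placeholder for the real (possibly slow, network-bound) read.
    return f"row group {index}"

def iter_row_groups(num_row_groups, pool, readahead=2):
    """Yield row groups in order, keeping up to `readahead`
    additional reads in flight beyond the one being consumed."""
    pending = deque()
    next_to_submit = 0
    while next_to_submit < num_row_groups or pending:
        # Keep the pipeline full: the current read plus `readahead` extra.
        while next_to_submit < num_row_groups and len(pending) <= readahead:
            pending.append(pool.submit(read_row_group, next_to_submit))
            next_to_submit += 1
        # Block only on the oldest outstanding read; later ones overlap it.
        yield pending.popleft().result()

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(iter_row_groups(5, pool, readahead=2))

print(results)
```

With K = `readahead`, a slow read of one row group overlaps with the reads of the next K, while results are still delivered in order.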