On 18/03/2020 at 18:30, David Li wrote:
>> Instead of S3, you can use the Slow streams and Slow filesystem
>> implementations. It may better protect against varying external conditions.
>
> I think we'd want several different benchmarks - we want to ensure we
> don't regress local filesystem performance, and we also want to
> measure in an actual S3 environment. It would also be good to measure
> S3-compatible systems like Google's.
>
>>> - Use the coalescing inside the Parquet reader (even without a column
>>> filter hint - this would subsume PARQUET-1698)
>>
>> I'm assuming this would be done at the RowGroupReader level, right?
>
> Ideally we'd be able to coalesce across row groups as well, though
> maybe it'd be easier to start with within-row-group-only (I need to
> familiarize myself with the reader more).
>
>> I don't understand what the "advantage" would be. Can you elaborate?
>
> As Wes said, empirically you can get more bandwidth out of S3 with
> multiple concurrent HTTP requests. There is a cost to doing so
> (establishing a new connection takes time), hence why the coalescing
> tries to group small reads (to fully utilize one connection) and split
> large reads (to be able to take advantage of multiple connections).
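To make sure we are talking about the same thing, here is a rough, self-contained
sketch of what such coalescing and splitting could look like. The names
(ReadRange, CoalesceReadRanges, hole_size_limit, range_size_limit) and the
thresholds are made up for illustration; this is not the actual Arrow or Parquet
code.

```cpp
// Illustrative sketch only: merge nearby small reads, then split very large
// reads so they can be issued as several concurrent requests.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct ReadRange {
  int64_t offset;
  int64_t length;
};

std::vector<ReadRange> CoalesceReadRanges(std::vector<ReadRange> ranges,
                                          int64_t hole_size_limit,
                                          int64_t range_size_limit) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) {
              return a.offset < b.offset;
            });

  // Pass 1: coalesce neighboring reads when the gap between them is small
  // enough that reading the "hole" is cheaper than a separate request.
  std::vector<ReadRange> merged;
  for (const auto& r : ranges) {
    if (!merged.empty()) {
      auto& last = merged.back();
      const int64_t gap = r.offset - (last.offset + last.length);
      if (gap >= 0 && gap <= hole_size_limit) {
        last.length = (r.offset + r.length) - last.offset;
        continue;
      }
    }
    merged.push_back(r);
  }

  // Pass 2: split reads larger than range_size_limit so that several
  // connections can fetch them in parallel.
  std::vector<ReadRange> result;
  for (const auto& r : merged) {
    int64_t offset = r.offset;
    int64_t remaining = r.length;
    while (remaining > range_size_limit) {
      result.push_back({offset, range_size_limit});
      offset += range_size_limit;
      remaining -= range_size_limit;
    }
    result.push_back({offset, remaining});
  }
  return result;
}

int main() {
  // A few small column-chunk reads plus one very large one.
  std::vector<ReadRange> ranges = {
      {0, 1000}, {1200, 800}, {4096, 512}, {10000, 64 * 1024 * 1024}};
  const auto coalesced = CoalesceReadRanges(
      ranges, /*hole_size_limit=*/4096, /*range_size_limit=*/8 * 1024 * 1024);
  for (const auto& r : coalesced) {
    std::cout << "offset=" << r.offset << " length=" << r.length << "\n";
  }
  return 0;
}
```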
If that's S3-specific (or even AWS-specific), it might be better to do it
inside the S3 filesystem. For other filesystems, I don't think it makes sense
to split reads.

Regards,

Antoine.