On 18/03/2020 at 18:30, David Li wrote:
>> Instead of S3, you can use the Slow streams and Slow filesystem
>> implementations. It may better protect against varying external conditions.
>
> I think we'd want several different benchmarks - we want to ensure we
> don't regress local filesystem performance, and we also want to
> measure in an actual S3 environment. It would also be good to measure
> S3-compatible systems like Google's.
>
>>> - Use the coalescing inside the Parquet reader (even without a column
>>> filter hint - this would subsume PARQUET-1698)
>>
>> I'm assuming this would be done at the RowGroupReader level, right?
>
> Ideally we'd be able to coalesce across row groups as well, though
> maybe it'd be easier to start with within-row-group-only (I need to
> familiarize myself with the reader more).
>
>> I don't understand what the "advantage" would be. Can you elaborate?
>
> As Wes said, empirically you can get more bandwidth out of S3 with
> multiple concurrent HTTP requests. There is a cost to doing so
> (establishing a new connection takes time), hence why the coalescing
> tries to group small reads (to fully utilize one connection) and split
> large reads (to be able to take advantage of multiple connections).
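To make sure we are talking about the same thing, here is a rough, self-contained
sketch of what such coalescing and splitting could look like. The names
(ReadRange, CoalesceReadRanges, hole_size_limit, range_size_limit) and the
thresholds are made up for illustration; this is not the actual Arrow or Parquet
code.

```cpp
// Illustrative sketch only: merge nearby small reads, then split very large
// reads so they can be issued as several concurrent requests.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct ReadRange {
  int64_t offset;
  int64_t length;
};

std::vector<ReadRange> CoalesceReadRanges(std::vector<ReadRange> ranges,
                                          int64_t hole_size_limit,
                                          int64_t range_size_limit) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) {
              return a.offset < b.offset;
            });

  // Pass 1: coalesce neighboring reads when the gap between them is small
  // enough that reading the "hole" is cheaper than a separate request.
  std::vector<ReadRange> merged;
  for (const auto& r : ranges) {
    if (!merged.empty()) {
      auto& last = merged.back();
      const int64_t gap = r.offset - (last.offset + last.length);
      if (gap >= 0 && gap <= hole_size_limit) {
        last.length = (r.offset + r.length) - last.offset;
        continue;
      }
    }
    merged.push_back(r);
  }

  // Pass 2: split reads larger than range_size_limit so that several
  // connections can fetch them in parallel.
  std::vector<ReadRange> result;
  for (const auto& r : merged) {
    int64_t offset = r.offset;
    int64_t remaining = r.length;
    while (remaining > range_size_limit) {
      result.push_back({offset, range_size_limit});
      offset += range_size_limit;
      remaining -= range_size_limit;
    }
    result.push_back({offset, remaining});
  }
  return result;
}

int main() {
  // A few small column-chunk reads plus one very large one.
  std::vector<ReadRange> ranges = {
      {0, 1000}, {1200, 800}, {4096, 512}, {10000, 64 * 1024 * 1024}};
  const auto coalesced = CoalesceReadRanges(
      ranges, /*hole_size_limit=*/4096, /*range_size_limit=*/8 * 1024 * 1024);
  for (const auto& r : coalesced) {
    std::cout << "offset=" << r.offset << " length=" << r.length << "\n";
  }
  return 0;
}
```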
If that's S3-specific (or even AWS-specific), it might be better to do it
inside the S3 filesystem. For other filesystems, I don't think it makes sense
to split reads.

Regards,

Antoine.