For us it applies to S3-like systems, not only S3 itself; that said, it does make sense to limit it to certain filesystems. The behavior would be opt-in at the Parquet reader level, so the Datasets or Filesystem layer can take care of enabling the flag for the filesystems where it actually helps.
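To make that concrete, here is a minimal sketch of how the Datasets layer could flip such a flag. The set_pre_buffer property and the type_name() check are illustrative assumptions, not a committed API:

#include "arrow/filesystem/filesystem.h"
#include "parquet/properties.h"

// Hypothetical: enable read coalescing/pre-buffering only for
// high-latency object stores, where fewer, larger requests pay off.
parquet::ArrowReaderProperties MakeReaderProperties(
    const arrow::fs::FileSystem& fs) {
  parquet::ArrowReaderProperties properties;
  if (fs.type_name() == "s3") {  // or any S3-like filesystem
    properties.set_pre_buffer(true);
  }
  return properties;
}

Local filesystems would keep the default (no pre-buffering), so we don't regress the local case.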
I've filed these issues:

- ARROW-8151 to benchmark S3File+Parquet
  (https://issues.apache.org/jira/browse/ARROW-8151)
- ARROW-8152 to split large reads
  (https://issues.apache.org/jira/browse/ARROW-8152)
- PARQUET-1820 to use a column filter hint with coalescing
  (https://issues.apache.org/jira/browse/PARQUET-1820)

in addition to PARQUET-1698, which is just about pre-buffering the
entire row group (which we can now do with ARROW-7995).

Best,
David

On 3/18/20, Antoine Pitrou <anto...@python.org> wrote:
>
> On 18/03/2020 at 18:30, David Li wrote:
>>> Instead of S3, you can use the Slow streams and Slow filesystem
>>> implementations. It may better protect against varying external
>>> conditions.
>>
>> I think we'd want several different benchmarks - we want to ensure we
>> don't regress local filesystem performance, and we also want to
>> measure in an actual S3 environment. It would also be good to measure
>> S3-compatible systems like Google's.
>>
>>>> - Use the coalescing inside the Parquet reader (even without a column
>>>>   filter hint - this would subsume PARQUET-1698)
>>>
>>> I'm assuming this would be done at the RowGroupReader level, right?
>>
>> Ideally we'd be able to coalesce across row groups as well, though
>> maybe it'd be easier to start with within-row-group-only (I need to
>> familiarize myself with the reader more).
>>
>>> I don't understand what the "advantage" would be. Can you elaborate?
>>
>> As Wes said, empirically you can get more bandwidth out of S3 with
>> multiple concurrent HTTP requests. There is a cost to doing so
>> (establishing a new connection takes time), hence why the coalescing
>> tries to group small reads (to fully utilize one connection) and split
>> large reads (to be able to take advantage of multiple connections).
>
> If that's S3-specific (or even AWS-specific) it might better be done
> inside the S3 filesystem. For other filesystems I don't think it makes
> sense to split reads.
>
> Regards
>
> Antoine.
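P.S. For the archives, a rough sketch of the coalescing/splitting policy discussed above, in the spirit of ARROW-8152. The ReadRange struct, the function name, and the thresholds are illustrative only; the real heuristics would be informed by the ARROW-8151 benchmarks.

#include <algorithm>
#include <cstdint>
#include <vector>

struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Merge reads separated by small holes (to fully utilize one
// connection) and split oversized reads (so they can be issued as
// multiple concurrent HTTP requests).
std::vector<ReadRange> CoalesceAndSplit(std::vector<ReadRange> ranges,
                                        int64_t hole_size_limit,
                                        int64_t range_size_limit) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) {
              return a.offset < b.offset;
            });
  std::vector<ReadRange> merged;
  for (const auto& r : ranges) {
    if (!merged.empty()) {
      int64_t prev_end = merged.back().offset + merged.back().length;
      if (r.offset - prev_end <= hole_size_limit) {
        // Small gap: extending the previous request beats paying
        // connection latency for a second tiny read.
        merged.back().length =
            std::max(prev_end, r.offset + r.length) - merged.back().offset;
        continue;
      }
    }
    merged.push_back(r);
  }
  std::vector<ReadRange> out;
  for (const auto& m : merged) {
    int64_t offset = m.offset;
    int64_t remaining = m.length;
    while (remaining > range_size_limit) {
      out.push_back({offset, range_size_limit});
      offset += range_size_limit;
      remaining -= range_size_limit;
    }
    if (remaining > 0) out.push_back({offset, remaining});
  }
  return out;
}

With, say, hole_size_limit = 8 KiB and range_size_limit = 8 MiB, two 4 KiB column-chunk reads 1 KiB apart become a single request, while a 64 MiB read becomes eight requests that can run concurrently.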