Re: [I] Parquet deserialization speeds slower on Linux [arrow]

via GitHub Mon, 23 Oct 2023 02:23:55 -0700


fjetter commented on issue #38389:
URL: https://github.com/apache/arrow/issues/38389#issuecomment-1774777681

I ran a couple of py-spy profiles on a plain `pq.read_table` call downloading
from S3. I ran two tests, one with column projection and one with bulk reading.
Both show basically the same profile, but with different weighting of components.

This profile shows the case where I'm reading a file and selecting about
half its columns (a mix of different dtypes):

![image](https://github.com/apache/arrow/assets/8629629/a210e5fd-d557-41e9-bd29-702c4f1adcc9)

Note how the `read_table` request is split into three parts:

1. A HEAD request that infers whether the provided path is a file or a
directory
https://github.com/apache/arrow/blob/ac2d207611ce25c91fb9fc90d5eaff2933609660/python/pyarrow/parquet/core.py#L2481
This is latency-bound, which on S3 typically means 100-200 ms but can vary quite
strongly. This is a request we cannot cache or use to pre-fetch *any* kind of
data. In this example, this alone took 20% of the entire read time.
2. The initialization of the `FileSystemDataset` object. In native profiles,
this points to `arrow::dataset::Fragment::ReadPhysicalSchema`, so I assume this
is fetching the footer. This is probably unavoidable, but the request could at
least be used to pre-fetch some payload data. I'm not sure whether this is
actually done (I guess not, since the `buffer_size` kwarg is zero by default).
In this specific example, this is about 10% of the read time.
3. The final section is now the actual reading of the file.

So that's 30% of the read time where we're doing little or nothing? I'm not sure
at which point the pre-buffering can kick in or how it works. This does not show
up in my profile since it runs in the Arrow native thread pool.
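
The split can be written out as simple arithmetic. The sketch below uses only the shares quoted in this comment (20% for the HEAD request, about 10% for the dataset initialization). The phase names are labels chosen here for illustration. The implied total read time assumes that the HEAD request in this run fell within the typical 100-200 ms range, which the profile itself does not confirm.

```python
# Shares of the total read time, as quoted in the comment.
shares = {
    "head_request": 0.20,   # file-or-directory check
    "dataset_init": 0.10,   # FileSystemDataset init (presumably the footer fetch)
}
# Whatever remains is the actual reading of the file.
shares["actual_read"] = 1.0 - sum(shares.values())

# Time spent before any payload data is known to be transferred.
overhead = shares["head_request"] + shares["dataset_init"]


def implied_total_seconds(head_latency_s, head_share=shares["head_request"]):
    """Total read time implied by a given HEAD latency and its share of the total.

    ASSUMPTION: the HEAD latency of this particular run is not stated in the
    comment; the typical S3 range is plugged in purely for illustration.
    """
    return head_latency_s / head_share


low = implied_total_seconds(0.100)
high = implied_total_seconds(0.200)

print(f"overhead before the actual read: {overhead:.0%}")
print(f"actual read: {shares['actual_read']:.0%}")
print(f"implied total read time: {low:.2f} s to {high:.2f} s")
```

Under that assumption, removing the HEAD request alone would shorten the whole read by roughly its latency, independent of how much data is fetched afterwards.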

At the very least, this initial HEAD request appears to be costly, particularly
if we're fetching just a couple of columns from files that are already
small-ish. The file I was looking at is one of the TPC-H lineitem files, which
in our dataset version is 22.4 MiB.
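
To illustrate why a fixed-latency request weighs more on small or projected reads, here is a rough model in which read time is a fixed latency plus bytes divided by throughput. Only the 22.4 MiB file size comes from this comment. The 150 ms latency and the 100 MiB/s throughput are made-up illustrative values, not measurements, and the model ignores the footer fetch and any parallelism.

```python
MIB = 1024 * 1024


def head_share(bytes_read, head_latency_s, throughput_bytes_per_s):
    """Fraction of the total time spent on the fixed-latency HEAD request.

    Simplified model (an assumption, not how Arrow schedules I/O):
        total = head_latency + bytes_read / throughput
    """
    transfer_s = bytes_read / throughput_bytes_per_s
    return head_latency_s / (head_latency_s + transfer_s)


file_size = 22.4 * MIB    # file size from the comment
latency_s = 0.150         # ASSUMED, middle of the typical 100-200 ms range
throughput = 100 * MIB    # ASSUMED, illustrative only

share_full = head_share(file_size, latency_s, throughput)
share_half = head_share(file_size / 2, latency_s, throughput)  # ~half the columns

print(f"HEAD share, full read:   {share_full:.0%}")
print(f"HEAD share, half the columns: {share_half:.0%}")
```

In this model the latency term stays constant while the transfer term shrinks with the bytes read, so the HEAD request's share grows as fewer columns are selected.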

