fjetter commented on issue #38389:
URL: https://github.com/apache/arrow/issues/38389#issuecomment-1774777681

   I ran a couple of py-spy benchmarks on pure `pq.read_table` downloading from 
S3. I ran two tests, one with column projection and one with bulk reading. Both 
show basically the same profile but with different weighting of components.
   
   This profile shows the case where I'm reading a file and selecting about 
half its columns (a mix of different dtypes):
   
   
![image](https://github.com/apache/arrow/assets/8629629/a210e5fd-d557-41e9-bd29-702c4f1adcc9)
   
   Note how the `read_table` request is split into three parts:
   
   1. A HEAD request that infers whether the provided path is a file or a 
directory 
https://github.com/apache/arrow/blob/ac2d207611ce25c91fb9fc90d5eaff2933609660/python/pyarrow/parquet/core.py#L2481
 This is latency bound, which on S3 typically means 100-200ms but can vary quite 
strongly. This is a request we cannot cache or use to pre-fetch *any* kind of 
data. In this example, this alone took 20% of the entire read time.
   2. The initialization of the `FileSystemDataset` object. In native profiles, 
this points to `arrow::dataset::Fragment::ReadPhysicalSchema` so I assume this 
is fetching the footer. This is probably unavoidable, but at least this request 
could be used to pre-fetch some payload data. I'm not sure whether this is 
actually done (I guess not, since the `buffer_size` kwarg is zero by default). 
In this specific example, this is about 10% of the read.
   3. The final section is now the actual reading of the file.
   
   So, that's 30% where we're doing nothing/not a lot? I'm not sure at which 
point the pre-buffering can kick in or how this works. This stuff does not show 
up in my profile since it runs on the Arrow native thread pool.
   
   At least this initial HEAD request appears to be bad, particularly if we're 
fetching just a couple of columns from otherwise already small-ish files. The 
file I was looking at is one of the TPC-H lineitem files, which in our dataset 
version is 22.4 MiB in size.


