eeroel commented on issue #38591:
URL: https://github.com/apache/arrow/issues/38591#issuecomment-1797900711
Here's a reproducible example that uses `parquet.read_table` instead of
FileSystemDataset:
```
import pyarrow._s3fs

# Enable trace-level S3 logging before any S3 filesystem is created.
pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Trace)

from pyarrow.dataset import (
    ParquetFileFormat,
    ParquetFragmentScanOptions,
)
import pyarrow
import pyarrow.fs
import pyarrow.parquet

# Root the filesystem at the bucket so paths are relative to it.
fs = pyarrow.fs.S3FileSystem(region="us-east-2")
fs = pyarrow.fs.SubTreeFileSystem("ursa-labs-taxi-data", fs)

# Note: `format` is not passed to read_table below.
format = ParquetFileFormat(
    default_fragment_scan_options=ParquetFragmentScanOptions(pre_buffer=True),
)

pyarrow.parquet.read_table("2019/01/data.parquet", filesystem=fs)
```
This makes not just two HEAD requests but four in total (perhaps the first
two come from reading the schema?):
```
cat /tmp/foo2.log | grep HeaderOut
[DEBUG] 2023-11-07 06:30:22.913 CURL [0x1e4d11ec0] (HeaderOut) HEAD /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:23.059 CURL [0x1e4d11ec0] (HeaderOut) HEAD /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:23.190 CURL [0x1e4d11ec0] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:23.613 CURL [0x1e4d11ec0] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:23.904 CURL [0x1e4d11ec0] (HeaderOut) HEAD /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:24.028 CURL [0x17017f000] (HeaderOut) HEAD /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:24.158 CURL [0x17020b000] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:24.552 CURL [0x170297000] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:26.542 CURL [0x17017f000] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:29.142 CURL [0x17020b000] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
```
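To avoid counting by eye, the tally of requests in a log like the one above could be computed with a small script. This is only a sketch: the `count_requests` helper and the regular expression are illustrative assumptions, not part of pyarrow, and the pattern is based solely on the `(HeaderOut)` lines shown here.

```python
import re
from collections import Counter

# Assumed line shape: "... (HeaderOut) <METHOD> <path> HTTP/1.1".
# \s+ also spans a line break, so mail-wrapped log lines still match.
REQUEST_RE = re.compile(r"\(HeaderOut\)\s+([A-Z]+)\s+(\S+)\s+HTTP/")


def count_requests(log_text):
    """Return a Counter keyed by (method, path) for each logged request."""
    return Counter(REQUEST_RE.findall(log_text))


# Two lines taken from the log above, one of them wrapped across a line break.
sample = """\
[DEBUG] 2023-11-07 06:30:22.913 CURL [0x1e4d11ec0] (HeaderOut) HEAD
/2019/01/data.parquet HTTP/1.1
[DEBUG] 2023-11-07 06:30:23.190 CURL [0x1e4d11ec0] (HeaderOut) GET /2019/01/data.parquet HTTP/1.1
"""

counts = count_requests(sample)
for (method, path), n in sorted(counts.items()):
    print(method, path, n)
```

Running `count_requests(open("/tmp/foo2.log").read())` on the full trace should make it easier to compare HEAD and GET counts between runs, for example with and without `pre_buffer`.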