Fokko opened a new issue, #33972:
URL: https://github.com/apache/arrow/issues/33972
### Describe the enhancement requested
Hey all,
First of all, thanks everyone for working on PyArrow! Really loving it so
far. I'm currently working on PyIceberg, which loads Iceberg tables into
PyArrow. For those unfamiliar with Apache Iceberg: it is a table format that
focuses on huge tables (petabyte scale). PyIceberg makes your life easier
by taking care of statistics to boost performance and by handling all the
schema maintenance. For example, if you change the partitioning of an Iceberg
table, you don't have to rewrite all the files right away; you can do that
incrementally.
Now I'm running into some performance issues, and I noticed that PyArrow is
making more requests to S3 than necessary. I went down the rabbit hole and was
able to narrow it down to:
```python
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

ONE_MEGABYTE = 1024 * 1024

client_kwargs = {
    "endpoint_override": "http://localhost:9000",
    "access_key": "admin",
    "secret_key": "password",
}

parquet_format = ds.ParquetFileFormat(
    use_buffered_stream=True,
    pre_buffer=True,
    buffer_size=8 * ONE_MEGABYTE,
)

fs = S3FileSystem(**client_kwargs)

with fs.open_input_file(
    "warehouse/wh/nyc/taxis/data/tpep_pickup_datetime_day=2022-04-30/00003-4-89e0ad58-fb77-4512-8679-6f26d8d6ef28-00033.parquet"
) as fout:
    # First get the fragment
    fragment = parquet_format.make_fragment(fout, None)
    print(f"Schema: {fragment.physical_schema}")

    arrow_table = ds.Scanner.from_fragment(
        fragment=fragment,
    ).to_table()
```
I need the schema first because a column may have been renamed while the file
hasn't been rewritten against the latest schema. The same goes for filtering:
if you change a column name and the file still contains the old name, you'd
still like to leverage PyArrow's predicate pushdown so the data isn't loaded
into memory at all.
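For context, this is roughly what I'd like to end up with once the physical schema is known (just a sketch; `vendor_id` and the predicate are made-up placeholders, and `fragment` is the one from the snippet above):
```python
import pyarrow.dataset as ds

# `fragment` comes from the snippet above. "vendor_id" is a made-up column;
# in practice the name would be resolved against fragment.physical_schema,
# which may still carry a pre-rename column name.
arrow_table = ds.Scanner.from_fragment(
    fragment=fragment,
    columns=["vendor_id"],
    filter=ds.field("vendor_id") == 1,
).to_table()
```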
Looking at the MinIO logs, I can see that it makes four requests:
1. A HEAD request to check if the file exists
2. A read of the last 64kb of the Parquet file, to get the schema
3. Another read of the last 64kb of the Parquet file, again to get the schema
4. A nice beefy 1978578kb request to fetch the data
Looking at the tests, we shouldn't fetch the footer twice:
```python
# with default discovery, no metadata loaded
with assert_opens([fragment.path]):
    fragment.ensure_complete_metadata()
assert fragment.row_groups == [0, 1]

# second time -> use cached / no file IO
with assert_opens([]):
    fragment.ensure_complete_metadata()
```
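Based on that test, I'd expect something along these lines to avoid the duplicate footer read, though I haven't verified it against the MinIO logs (a sketch only; the path is a placeholder for the data file above):
```python
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(
    endpoint_override="http://localhost:9000",
    access_key="admin",
    secret_key="password",
)
parquet_format = ds.ParquetFileFormat(pre_buffer=True)

# Placeholder path, standing in for the Iceberg data file above.
path = "warehouse/wh/nyc/taxis/data/.../00033.parquet"

fragment = parquet_format.make_fragment(path, filesystem=fs)

# Parse the footer once; the row-group metadata should now be cached
# on the fragment.
fragment.ensure_complete_metadata()

# I would expect these to be served from the cached metadata,
# without a second footer read from S3.
print(f"Schema: {fragment.physical_schema}")
arrow_table = ds.Scanner.from_fragment(fragment=fragment).to_table()
```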
Any thoughts or advice? I went through the code a bit already, but my C++ is
a bit rusty.
### Component(s)
Python