ghuls commented on issue #39079:
URL: https://github.com/apache/arrow/issues/39079#issuecomment-1936041192
You are rescanning the whole subdirectory on every invocation of
`pyarrow.parquet.read_table`. Scanning the filesystem only once for Parquet
files with `pa.dataset.dataset` and then filtering the data on the key column
with a dataset scanner is much faster:
```python
In [14]: with timectx("scan for parquet files using dataset"):
    ...:     write_to_dataset_dataset = pa.dataset.dataset(
    ...:         "write_to_dataset/",
    ...:         partitioning=pa.dataset.partitioning(flavor="hive"),
    ...:     )
    ...:
scan for parquet files using dataset 40.98821000661701 ms

In [143]: with timectx("load partitions using dataset"):
     ...:     for key in keys:
     ...:         write_to_dataset_dataset.scanner(filter=(pc.field("key") == key)).to_table()
     ...:
load partitions using dataset 558.3812920376658 ms

In [144]: with timectx("load partitions using read_table"):
     ...:     for key in keys:
     ...:         pyarrow.parquet.read_table("write_to_dataset", filters=[("key", "==", key)])
     ...:
load partitions using read_table 5385.372799937613 ms
```
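
For reference, here is a self-contained sketch of the same comparison. It assumes a hive-partitioned dataset already written under `write_to_dataset/` with a `key` partition column; `timectx` and the `keys` list are hypothetical stand-ins for the helpers used in the session above and are not part of PyArrow:

```python
import time
from contextlib import contextmanager

import pyarrow.compute as pc
import pyarrow.dataset as ds
import pyarrow.parquet as pq


@contextmanager
def timectx(label):
    # Hypothetical timing helper: print elapsed wall-clock time in ms.
    start = time.perf_counter()
    yield
    print(label, (time.perf_counter() - start) * 1000, "ms")


keys = ["a", "b", "c"]  # assumed values of the "key" partition column

# Scan the filesystem once and reuse the dataset for every filtered read.
with timectx("scan for parquet files using dataset"):
    dataset = ds.dataset(
        "write_to_dataset/",
        partitioning=ds.partitioning(flavor="hive"),
    )

with timectx("load partitions using dataset"):
    for key in keys:
        dataset.scanner(filter=(pc.field("key") == key)).to_table()

# read_table rediscovers the files under the directory on every call,
# which is why this loop is much slower.
with timectx("load partitions using read_table"):
    for key in keys:
        pq.read_table("write_to_dataset", filters=[("key", "==", key)])
```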