CaselIT commented on issue #39079:
URL: https://github.com/apache/arrow/issues/39079#issuecomment-1936106918
For reference, on my PC your suggestion is on par with re-opening
the row group file on each iteration of the loop. It gives me the following times:
```py
import pyarrow.compute as pc
import pyarrow.dataset
import polars as pl

# timectx (a timing context manager) and keys are defined earlier in the thread.
with timectx("load partitions using read_table - read dataset once"):
    write_to_dataset_dataset = pyarrow.dataset.dataset(
        "write_to_dataset/",
        partitioning=pyarrow.dataset.partitioning(flavor="hive"),
    )
    for key in keys:
        pl.from_arrow(
            write_to_dataset_dataset.scanner(
                filter=(pc.field("key") == key)
            ).to_table()
        )
```
```
load partitions using read_table - read dataset once 5504.083399995579 ms
```
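For comparison, here is a minimal sketch of the per-key variant referenced above (re-opening the hive-partitioned files on every iteration via `pyarrow.parquet.read_table` with a pushed-down filter); the exact code used earlier in the thread may differ:

```py
import pyarrow.parquet as pq
import polars as pl

# Hypothetical sketch: re-read the partitioned dataset for each key,
# letting read_table push the partition filter down on every call.
with timectx("load partitions using read_table - re-open each time"):
    for key in keys:
        pl.from_arrow(
            pq.read_table("write_to_dataset/", filters=[("key", "==", key)])
        )
```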
So I still think this scheme has significant advantages compared to hive partitioning.