legout opened a new issue, #35332:
URL: https://github.com/apache/arrow/issues/35332
### Describe the usage question you have. Please include as many useful
details as possible.
I work with Parquet datasets a lot, and I wonder why loading a whole dataset
via `pyarrow.dataset.Dataset.to_table()` is (sometimes much) slower than
`pyarrow.parquet.read_table()`.
Here is some example code:
```python
import time

import pyarrow.dataset as pds
import pyarrow.parquet as pq
from fsspec import filesystem

fs = filesystem("s3")
path = "path/to/parquet_dataset"

def load_pds(path):
    s = time.time()
    table = pds.dataset(path, filesystem=fs).to_table()
    print(f"pds: Loading arrow table with shape {table.shape} took {time.time() - s:.2f} seconds.")
    return table

def load_pq(path):
    s = time.time()
    table = pq.read_table(path, filesystem=fs)
    print(f"pq: Loading arrow table with shape {table.shape} took {time.time() - s:.2f} seconds.")
    return table

table_pds = load_pds(path)
table_pq = load_pq(path)

assert table_pds.equals(table_pq)
```
When I run this on one of my datasets (27 Parquet files, 550 MB in total), I
get the following output:
```
pds: Loading arrow table with shape (130585966, 13) took 34.2 seconds.
pq: Loading arrow table with shape (130585966, 13) took 4.67 seconds.
```
Why is `pds` 7 times slower than `pq`?
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]