jorisvandenbossche opened a new issue, #38434:
URL: https://github.com/apache/arrow/issues/38434
Reproducer reading a single toy parquet file:
```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
x = np.random.randint(0, 100000, size=(1000000, 10))
df = pd.DataFrame(x)
table = pa.Table.from_pandas(df)
pq.write_table(table, "test.parquet")
pq.write_table(table, "test_row_groups.parquet", row_group_size=1024*64)
dataset = ds.dataset("test.parquet", format="parquet")
dataset_row_groups = ds.dataset("test_row_groups.parquet", format="parquet")
```
The first file has a single row group, the second has 16, for the same data.
Reading those with the dataset API:
```
In [3]: %timeit dataset.to_table()
38.3 ms ± 843 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit dataset.to_table(use_threads=False)
79.4 ms ± 6.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [5]: %timeit dataset_row_groups.to_table()
53.4 ms ± 3.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: %timeit dataset_row_groups.to_table(use_threads=False)
50.5 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
For the file with a single row group, there is a clear slowdown when not
using threads, while that is not the case for the file with multiple row
groups (and I can also confirm visually on my system monitor that all cores
are being used while running that benchmark).
Depending on whether the file has multiple row groups or not, the
parallelization might happen differently? (across row groups vs across
columns?)
Using the ParquetFile API, the `use_threads` keyword always has effect:
```
In [7]: %timeit pq.ParquetFile("test.parquet").read()
32.5 ms ± 812 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit pq.ParquetFile("test.parquet").read(use_threads=False)
85.3 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: %timeit pq.ParquetFile("test_row_groups.parquet").read()
48 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [10]: %timeit pq.ParquetFile("test_row_groups.parquet").read(use_threads=False)
184 ms ± 35.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
(the case with row groups is also considerably slower here, but I think that
is because this API always returns a concatenated Table instead of a Table
with chunks, and so does more work)