jorisvandenbossche commented on PR #39112:
URL: https://github.com/apache/arrow/pull/39112#issuecomment-1882722210
The `wide-dataframe` case seems to be a genuine perf regression (and not a flaky
outlier like the other listed cases). That might mean that for wide dataframes,
the new code path is slower than the legacy dataset reader (since with this
commit, the new code path is used even when `use_legacy_dataset=True` is
specified).
That seems consistent with the timing in the `use_legacy_dataset=False` case
of the wide-dataframe benchmark, as both benchmarks now show more or less the
same timing.
However, I can't reproduce this locally with pyarrow 14.0 (where the legacy
reader still exists):
```
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create a wide dataframe: 100 rows x 10,000 columns
dataframe = pd.DataFrame(np.random.rand(100, 10000))
table = pa.Table.from_pandas(dataframe)
pq.write_table(table, "test_wide_dataframe.parquet")
```
```
In [7]: %timeit -r 50 pq.read_table("test_wide_dataframe.parquet", use_legacy_dataset=True)
392 ms ± 4.67 ms per loop (mean ± std. dev. of 50 runs, 1 loop each)

In [8]: %timeit -r 50 pq.read_table("test_wide_dataframe.parquet", use_legacy_dataset=False)
350 ms ± 11.5 ms per loop (mean ± std. dev. of 50 runs, 1 loop each)
```
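To further isolate whether the dataset-based code path itself accounts for the
difference, one could also time the two underlying readers directly, bypassing
the `use_legacy_dataset` flag. A rough sketch (assuming `ParquetFile.read()`
approximates the legacy single-file path; it is a proxy, not the exact legacy
code path):

```
import timeit

import pyarrow.dataset as ds
import pyarrow.parquet as pq

path = "test_wide_dataframe.parquet"

# Legacy-style path: read the single file directly through ParquetFile
# (assumption: this approximates what the legacy reader did for a
# single file, without going through the dataset layer).
t_file = timeit.timeit(lambda: pq.ParquetFile(path).read(), number=20)

# New code path: go through the Dataset API, which pq.read_table now
# uses regardless of the use_legacy_dataset flag.
t_dataset = timeit.timeit(
    lambda: ds.dataset(path, format="parquet").to_table(), number=20
)

print(f"ParquetFile.read():  {t_file / 20:.3f} s per loop")
print(f"Dataset.to_table():  {t_dataset / 20:.3f} s per loop")
```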