jorisvandenbossche commented on PR #39112:
URL: https://github.com/apache/arrow/pull/39112#issuecomment-1882722210
The `wide-dataframe` case seems to be a genuine perf regression (and not a flaky
outlier like the other listed cases). That might mean that for wide dataframes,
the new code path is slower than the legacy dataset reader (since with this
commit, the new code path is used even when `use_legacy_dataset=True` is
specified).
That seems consistent with the timing in the `use_legacy_dataset=False` case
of the wide-dataframe benchmark, as both benchmarks now show more or less the
same timing.
However, I can't reproduce this locally with pyarrow 14.0 (where the legacy
reader still exists):
```
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create a wide dataframe: 100 rows x 10,000 columns
dataframe = pd.DataFrame(np.random.rand(100, 10000))
table = pa.Table.from_pandas(dataframe)
pq.write_table(table, "test_wide_dataframe.parquet")
```
```
In [7]: %timeit -r 50 pq.read_table("test_wide_dataframe.parquet", use_legacy_dataset=True)
392 ms ± 4.67 ms per loop (mean ± std. dev. of 50 runs, 1 loop each)

In [8]: %timeit -r 50 pq.read_table("test_wide_dataframe.parquet", use_legacy_dataset=False)
350 ms ± 11.5 ms per loop (mean ± std. dev. of 50 runs, 1 loop each)
```
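To further isolate whether the dataset-based code path itself accounts for the
difference, one could also time the two underlying readers directly, bypassing
the `use_legacy_dataset` flag. A rough sketch (assuming `ParquetFile.read()`
approximates the legacy single-file path; it is a proxy, not the exact legacy
code path):

```
import timeit

import pyarrow.dataset as ds
import pyarrow.parquet as pq

path = "test_wide_dataframe.parquet"

# Legacy-style path: read the single file directly through ParquetFile
# (assumption: this approximates what the legacy reader did for a
# single file, without going through the dataset layer).
t_file = timeit.timeit(lambda: pq.ParquetFile(path).read(), number=20)

# New code path: go through the Dataset API, which pq.read_table now
# uses regardless of the use_legacy_dataset flag.
t_dataset = timeit.timeit(
    lambda: ds.dataset(path, format="parquet").to_table(), number=20
)

print(f"ParquetFile.read():  {t_file / 20:.3f} s per loop")
print(f"Dataset.to_table():  {t_dataset / 20:.3f} s per loop")
```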