jorisvandenbossche opened a new issue, #38434:
URL: https://github.com/apache/arrow/issues/38434
Reproducer reading a single toy parquet file:
```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
x = np.random.randint(0, 100000, size=(1000000, 10))
df = pd.DataFrame(x)
table = pa.Table.from_pandas(df)
pq.write_table(table, "test.parquet")
pq.write_table(table, "test_row_groups.parquet", row_group_size=1024*64)
dataset = ds.dataset("test.parquet", format="parquet")
dataset_row_groups = ds.dataset("test_row_groups.parquet", format="parquet")
```
The first file has a single row group, the second has 16, for the same data.
Reading those with the dataset API:
```
In [3]: %timeit dataset.to_table()
38.3 ms ± 843 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit dataset.to_table(use_threads=False)
79.4 ms ± 6.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [5]: %timeit dataset_row_groups.to_table()
53.4 ms ± 3.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: %timeit dataset_row_groups.to_table(use_threads=False)
50.5 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
For the file with a single row group, there is a clear slowdown when not
using threads, while that is not the case for the file with multiple row
groups (and I can also confirm visually on my system monitor that all cores
are being used while running that benchmark).
Depending on whether the file has multiple row groups or not, the
parallelization might happen differently? (across row groups vs across
columns?)
Using the ParquetFile API, the `use_threads` keyword always has effect:
```
In [7]: %timeit pq.ParquetFile("test.parquet").read()
32.5 ms ± 812 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit pq.ParquetFile("test.parquet").read(use_threads=False)
85.3 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: %timeit pq.ParquetFile("test_row_groups.parquet").read()
48 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [10]: %timeit pq.ParquetFile("test_row_groups.parquet").read(use_threads=False)
184 ms ± 35.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
(the case with row groups is also considerably slower here, but I think that
is because this API always returns a concatenated Table instead of a Table
with chunks, and so does more work)