legout opened a new issue, #35332:
URL: https://github.com/apache/arrow/issues/35332
### Describe the usage question you have. Please include as many useful
details as possible.
I work with Parquet datasets a lot, and I wonder why loading a whole dataset
via `pyarrow.dataset.Dataset.to_table()` is (sometimes much) slower than
`pyarrow.parquet.read_table()`.
Here is some example code:
```python
import time

import pyarrow.dataset as pds
import pyarrow.parquet as pq
from fsspec import filesystem

fs = filesystem("s3")
path = "path/to/parquet_dataset"

def load_pds(path):
    s = time.time()
    table = pds.dataset(path, filesystem=fs).to_table()
    print(f"pds: Loading arrow table with shape {table.shape} took {time.time() - s:.2f} seconds.")
    return table

def load_pq(path):
    s = time.time()
    table = pq.read_table(path, filesystem=fs)
    print(f"pq: Loading arrow table with shape {table.shape} took {time.time() - s:.2f} seconds.")
    return table

table_pds = load_pds(path)
table_pq = load_pq(path)

assert table_pds.equals(table_pq)
```
When I run this on one of my datasets (27 Parquet files, 550 MB in total), I
get the following output:
```
pds: Loading arrow table with shape (130585966, 13) took 34.2 seconds.
pq: Loading arrow table with shape (130585966, 13) took 4.67 seconds.
```
Why is `pds` 7 times slower than `pq`?
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]