[I] Queries on tables via `register_dataset()` much slower than `register_parquet()` [arrow-datafusion-python]

via GitHub Wed, 14 Feb 2024 04:18:20 -0800


mattaubury opened a new issue, #584:
URL: https://github.com/apache/arrow-datafusion-python/issues/584


   **Describe the bug**
   Queries against tables registered with `register_dataset()` run about 80x slower than the same queries against tables registered with `register_parquet()`.
   
   **To Reproduce**
   ```
   import datafusion
   import pyarrow.dataset as ds
   from pathlib import Path
   
   ctx = datafusion.SessionContext()
   ctx.register_parquet("mytable", "*.parquet")
   ctx.register_dataset("mytable2", ds.dataset(list(Path(".").glob("*.parquet"))))
   ```
   Fast:
   ```
   %time ctx.sql('select file_date, sum("Price" * "Volume") from mytable group by file_date order by file_date').to_arrow_table()
   CPU times: user 2min 41s, sys: 3.35 s, total: 2min 45s
   Wall time: 2.49 s
   ```
   Slow:
   ```
   %time ctx.sql('select file_date, sum("Price" * "Volume") from mytable2 group by file_date order by file_date').to_arrow_table()
   CPU times: user 10min 51s, sys: 5min 40s, total: 16min 31s
   Wall time: 3min 18s
   ```
   
   **Expected behavior**
   I'd expect the two to perform similarly.
   
   **Additional context**
   The reason I'm using `ds.dataset` is that the actual files I'm interested in accessing are not conveniently globbable (they're spread across multiple directories). So ideally I'd be able to provide a list of files to `ctx.register_parquet()` instead of a simple glob.
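
For illustration, here is a minimal sketch of how such a file list might be gathered when the files sit in several directories. The helper name `collect_parquet_files` and the directory names in the note below are hypothetical; the sketch uses only the standard library and is not part of the reproduction above.

```python
from pathlib import Path


def collect_parquet_files(directories):
    """Return a sorted list of *.parquet paths found directly in each directory."""
    files = []
    for directory in directories:
        # Non-recursive match: only files directly inside each directory.
        files.extend(Path(directory).glob("*.parquet"))
    # Sort so the file order is deterministic across runs.
    return sorted(str(path) for path in files)
```

The resulting list could then be passed to `ds.dataset(...)` in place of the single glob in the reproduction, for example `ds.dataset(collect_parquet_files(["data/2023", "archive/2024"]))`, where both directory names are placeholders.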


