[GitHub] [arrow] jorisvandenbossche commented on pull request #35568: GH-33986: [Python] Sketch out a minimal protocol interface for datasets

via GitHub Wed, 07 Jun 2023 00:27:27 -0700


jorisvandenbossche commented on PR #35568:
URL: https://github.com/apache/arrow/pull/35568#issuecomment-1580094236


   > > @westonpace you are correct that this doesn't define how such dataset 
classes are built. That's left to the consumer, who will write their own 
classes that conform to this API.
   > 
   > That would seem an essential API if this protocol was meant to be used by 
"table formats" to prepare "queries simple enough for query engines to 
understand". So perhaps I am misunderstanding.
   > 
   > Is this protocol meant to be used by "query engines" to "query a table 
format library as if it were a dataset"?
   
   My assumption was that it is the _producers_ that implement the classes 
that conform to this API?
   
   How are the consumer and producer supposed to interact with this protocol?
   
   Taking duckdb as an example, the user can currently create a pyarrow 
object manually and then query it directly with duckdb:
   
   ```python
   import duckdb
   import pyarrow.dataset as ds
   
   pyarrow_dataset = ds.dataset(...)
   duckdb.sql("SELECT * FROM pyarrow_dataset WHERE ..")
   ```
   
   Is the idea that something similar would then work for any object supporting 
this protocol (on the assumption that duckdb relaxes its check for a pyarrow 
object to any object conforming to the protocol)? For example, with delta-lake:
   ```python
   import duckdb
   from deltalake import DeltaTable
   
   delta_table = DeltaTable("..")
   duckdb.sql("SELECT * FROM delta_table WHERE ..")
   ```
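
To make the question concrete, here is a minimal sketch of how I picture the two roles, assuming the producer implements the class and the consumer only checks for conformance. The names `DatasetLike`, `ToyTable` and `run_query` are hypothetical and are not taken from this PR, pyarrow, duckdb or deltalake; the sketch uses plain dicts instead of Arrow data so that it stays self-contained.

```python
from typing import Any, Iterable, Optional, Protocol, Sequence, runtime_checkable


@runtime_checkable
class DatasetLike(Protocol):
    """Hypothetical stand-in for the protocol discussed here (not the PR's actual API)."""

    def scanner(
        self, columns: Optional[Sequence[str]] = None, filter: Any = None
    ) -> Iterable[dict]:
        ...


class ToyTable:
    """Producer side (think: a table format library) implementing the protocol."""

    def __init__(self, rows):
        self._rows = list(rows)

    def scanner(self, columns=None, filter=None):
        # Apply the row filter first, then project the requested columns.
        for row in self._rows:
            if filter is None or filter(row):
                yield row if columns is None else {c: row[c] for c in columns}


def run_query(source, columns=None, filter=None):
    """Consumer side (think: a query engine): accept any conforming object."""
    # runtime_checkable only verifies that a `scanner` attribute exists,
    # not that its signature matches.
    if not isinstance(source, DatasetLike):
        raise TypeError(f"{type(source).__name__} does not implement scanner()")
    return list(source.scanner(columns=columns, filter=filter))


table = ToyTable([{"x": 1, "y": "a"}, {"x": 5, "y": "b"}])
result = run_query(table, columns=["y"], filter=lambda row: row["x"] > 2)
```

In this reading the consumer never constructs the dataset object itself; it only receives one from the user and calls the protocol methods on it.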
   
   But if this is the intended usage, I don't understand what the "builder API" 
(https://github.com/apache/arrow/pull/35568#pullrequestreview-1431322458) would 
be meant for? 
   
   > In other words, for a table format to use a query engine, it's not enough 
to pass a single query (e.g. filter / columns / whatever). We need to pass a 
query per file.
   
   @westonpace Why is that not sufficient? I think it is up to the table format 
to translate the single query into a query per file (and execute this)?

