[jira] [Comment Edited] (ARROW-16616) [Python] Allow lazy evaluation of filters in Dataset and add Dataset.filter method

Alessandro Molina (Jira) Wed, 22 Jun 2022 10:54:06 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557599#comment-17557599
 ]


Alessandro Molina edited comment on ARROW-16616 at 6/22/22 5:53 PM:
--------------------------------------------------------------------

I don't see much of a conflict with Ibis in the long term. Ibis is an interface 
to multiple query engines, and the queries you build there can be run against 
different targets. It serves the purpose of providing a single interface to an 
infrastructure-agnostic environment: write your Ibis code locally and deploy it 
against a production system that might run something very different from 
in-memory Acero.

PyArrow, by contrast, only exposes features that are already available in Arrow 
itself, which is something the other bindings do too (R and Java do expose 
access to the compute engine). What you write in PyArrow won't really be able 
to grow in the direction of scaling across multiple nodes, so it is better 
suited to quick data discovery without external dependencies than to the 
actual final product, which will probably want to be based on Ibis.


was (Author: amol-):
I don't see much of a conflict with Ibis in the long term. Ibis is an interface 
to multiple query engines and the queries you build there can be run against 
different targets. It serves the purpose of having a single interface to an 
infrastructure-agnostic environment. Write your Ibis code locally and deploy it 
against a production system that might run something very different from an 
in-memory Acero.

PyArrow instead only exposes the features that are already available in Arrow 
itself, and it's something all the other bindings are doing too (R and Java do 
expose access to the compute engine). What you write in PyArrow won't be able 
to really grow too much in the direction of scaling it, so it's better 
positioned for quick data discovery without having to involve external 
dependencies than for the actual final product, which will probably want to be 
based on Ibis.

> [Python] Allow lazy evaluation of filters in Dataset and add Dataset.filter 
> method
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-16616
>                 URL: https://issues.apache.org/jira/browse/ARROW-16616
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Python
>            Reporter: Alessandro Molina
>            Assignee: Alessandro Molina
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 9.0.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> To keep the {{Dataset}} API compatible with the {{Table}} one in terms of 
> analytics capabilities, we should add a {{Dataset.filter}} method. The 
> initial POC was based on {{_table_filter}}, but that required materialising 
> all the {{Dataset}} content after filtering, as it returned an 
> {{InMemoryDataset}}. 
> Given that {{Scanner}} can filter a dataset without actually materialising 
> the data until a final step happens, it would be good to have 
> {{Dataset.filter}} return some form of lazy dataset, where the filter is only 
> stored aside and the {{Scanner}} is created when data is actually retrieved.
> PS: Also update the {{test_dataset_filter}} test to use the 
> {{Dataset.filter}} method.
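
The deferred-filter idea described in the issue can be sketched in plain 
Python. This is only an illustration of the pattern, not PyArrow's 
implementation: the class LazyFilteredDataset, its methods, and the 
dict-based rows are all hypothetical stand-ins, and no PyArrow API is used. 
The point is that filter() merely records the predicate, and rows are only 
evaluated when the data is actually requested.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Row = Dict[str, int]
Predicate = Callable[[Row], bool]


class LazyFilteredDataset:
    """Hypothetical stand-in for a dataset that defers filtering."""

    def __init__(
        self,
        batches: Sequence[List[Row]],
        predicates: Tuple[Predicate, ...] = (),
    ) -> None:
        self._batches = batches
        self._predicates = predicates

    def filter(self, predicate: Predicate) -> "LazyFilteredDataset":
        # No data is touched here: the predicate is only stored aside.
        return LazyFilteredDataset(self._batches, self._predicates + (predicate,))

    def scan(self) -> Iterator[Row]:
        # The "scanner" step: predicates are applied only while iterating.
        for batch in self._batches:
            for row in batch:
                if all(pred(row) for pred in self._predicates):
                    yield row

    def to_rows(self) -> List[Row]:
        # Final step that materialises the filtered data.
        return list(self.scan())


# Record every row the predicate sees, to show when evaluation happens.
calls: List[int] = []


def greater_than_two(row: Row) -> bool:
    calls.append(row["x"])
    return row["x"] > 2


dataset = LazyFilteredDataset([[{"x": 1}, {"x": 2}], [{"x": 3}, {"x": 4}]])
filtered = dataset.filter(greater_than_two)
calls_before_scan = len(calls)  # filter() alone evaluated nothing
result = filtered.to_rows()     # evaluation happens here
```

Because filter() returns a new object rather than mutating the original, the 
unfiltered dataset in this sketch stays usable, and several filters can be 
chained before anything is materialised.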



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
