[
https://issues.apache.org/jira/browse/ARROW-16616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557599#comment-17557599
]
Alessandro Molina edited comment on ARROW-16616 at 6/22/22 5:53 PM:
--------------------------------------------------------------------
I don't see much of a conflict with Ibis in the long term. Ibis is an interface
to multiple query engines, and the queries you build there can be run against
different targets. It serves the purpose of providing a single interface to an
infrastructure-agnostic environment: write your Ibis code locally and deploy it
against a production system that might run something very different from
in-memory Acero.
PyArrow, by contrast, only exposes the features that are already available in
Arrow itself, which is something all the other bindings do too (R and Java do
expose access to the compute engine). What you write in PyArrow won't really be
able to grow much in the direction of scaling across multiple nodes, so it's
better positioned for quick data discovery without involving external
dependencies than for the actual final product, which will probably want to be
based on Ibis.
was (Author: amol-):
I don't see much of a conflict with Ibis in the long term. Ibis is an interface
to multiple query engines, and the queries you build there can be run against
different targets. It serves the purpose of providing a single interface to an
infrastructure-agnostic environment: write your Ibis code locally and deploy it
against a production system that might run something very different from
in-memory Acero.
PyArrow, by contrast, only exposes the features that are already available in
Arrow itself, which is something all the other bindings do too (R and Java do
expose access to the compute engine). What you write in PyArrow won't really be
able to grow much in the direction of scaling it, so it's better positioned for
quick data discovery without involving external dependencies than for the
actual final product, which will probably want to be based on Ibis.
> [Python] Allow lazy evaluation of filters in Dataset and add Dataset.filter
> method
> ---------------------------------------------------------------------------------
>
> Key: ARROW-16616
> URL: https://issues.apache.org/jira/browse/ARROW-16616
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: Python
> Reporter: Alessandro Molina
> Assignee: Alessandro Molina
> Priority: Major
> Labels: pull-request-available
> Fix For: 9.0.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> To keep the {{Dataset}} API compatible with the {{Table}} one in terms of
> analytics capabilities, we should add a {{Dataset.filter}} method. The
> initial POC was based on {{_table_filter}}, but that required materialising
> all the {{Dataset}} content after filtering, as it returned an
> {{InMemoryDataset}}.
> Given that {{Scanner}} can filter a dataset without actually materialising
> the data until a final step happens, it would be good to have
> {{Dataset.filter}} return some form of lazy dataset where the filter is only
> stored aside and the {{Scanner}} is created when data is actually retrieved.
> PS: Also update the {{test_dataset_filter}} test to use the {{Dataset.filter}}
> method.
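
The lazy behaviour described above can be sketched in plain Python. This is only
a conceptual illustration and not the PyArrow implementation: LazyDataset,
CountingSource, scan and to_rows are hypothetical names invented for this
example, and rows are modelled as plain dicts. The point is that filter() only
stores the predicate, and batches are read when data is actually retrieved.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Row = Dict[str, int]
Predicate = Callable[[Row], bool]


class CountingSource:
    """Hypothetical batch source that counts how many batches were read."""

    def __init__(self, batches: List[List[Row]]) -> None:
        self._batches = batches
        self.reads = 0

    def __iter__(self) -> Iterator[List[Row]]:
        for batch in self._batches:
            self.reads += 1
            yield batch


class LazyDataset:
    """Hypothetical stand-in for a dataset; NOT the pyarrow.dataset.Dataset class."""

    def __init__(self, batches: Iterable[List[Row]],
                 predicate: Optional[Predicate] = None) -> None:
        self._batches = batches
        self._predicate = predicate

    def filter(self, predicate: Predicate) -> "LazyDataset":
        # Only store the filter aside; no batch is read at this point.
        if self._predicate is None:
            combined = predicate
        else:
            previous = self._predicate
            combined = lambda row: previous(row) and predicate(row)
        return LazyDataset(self._batches, combined)

    def scan(self) -> Iterator[List[Row]]:
        # Plays the role of the scanner: the filter is applied batch by batch.
        for batch in self._batches:
            if self._predicate is None:
                yield batch
            else:
                yield [row for row in batch if self._predicate(row)]

    def to_rows(self) -> List[Row]:
        # The final step, where data is actually materialised.
        return [row for batch in self.scan() for row in batch]


source = CountingSource([[{"x": 1}, {"x": 5}], [{"x": 7}, {"x": 2}]])
filtered = (
    LazyDataset(source)
    .filter(lambda row: row["x"] > 2)
    .filter(lambda row: row["x"] < 7)
)
reads_before = source.reads   # nothing has been read yet
rows = filtered.to_rows()     # batches are read and filtered here
```

Chained filter() calls are combined with a logical AND in this sketch, so the
source is still scanned only once when the rows are finally requested.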
--
This message was sent by Atlassian Jira
(v8.20.7#820007)