[ 
https://issues.apache.org/jira/browse/ARROW-12060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406591#comment-17406591
 ] 

Joris Van den Bossche commented on ARROW-12060:
-----------------------------------------------

A quick demo of what a public "Expression.call" currently gives, using a 
compute kernel (log10) that is not directly exposed in the dataset Expression 
class:

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# write a small Feather dataset and open it again as a dataset
table = pa.table({'a': range(10)})
ds.write_dataset(table, "test_dataset_expression.parquet", format="feather")
dataset = ds.dataset("test_dataset_expression.parquet/", format="feather")

>>> f = ds.field("a")
# creating expressions
>>> ds.Expression.call("log10", [f])
<pyarrow.dataset.Expression log10(a)>
>>> ds.Expression.call("log10", [f]) > 1
<pyarrow.dataset.Expression (log10(a) > 1)>

# using it to project/filter datasets
>>> dataset.to_table(columns={
...     'a': ds.field("a"),
...     'a_log': ds.Expression.call("log10", [ds.field('a')]),
... }).to_pandas()
   a     a_log
0  0      -inf
1  1  0.000000
2  2  0.301030
3  3  0.477121
4  4  0.602060
5  5  0.698970
6  6  0.778151
7  7  0.845098
8  8  0.903090
9  9  0.954243
>>> dataset.to_table(
...     columns={'a': ds.field("a"),
...              'a_log': ds.Expression.call("log10", [ds.field('a')])},
...     filter=ds.Expression.call("log10", [ds.field('a')]) > 0.5,
... ).to_pandas()
   a     a_log
0  4  0.602060
1  5  0.698970
2  6  0.778151
3  7  0.845098
4  8  0.903090
5  9  0.954243
{code}

So using a compute kernel in pyarrow.dataset this way seems to work.

However, it doesn't give a very nice user experience: it is basically the 
equivalent of {{pc.call_function(..)}}, so e.g. {{pc.call_function("log10", 
[...])}} instead of {{pc.log10(...)}}. That also means that several of the 
niceties of the Python wrapper functions are not available (e.g. validation of 
some arguments, not having to pass the arguments as a list, passing options as 
keywords instead of as an options class instance, etc.).

Ideally, I think we would be able to use the compute wrappers directly, e.g. 
{{pc.log10(ds.field('a'))}}. Or what would be our preferred user API?
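
One possible shape for such an API, as a toy sketch in plain Python: the 
wrapper checks whether it was given an expression and, if so, builds a deferred 
call instead of computing eagerly. The {{Expr}} class and the {{field}} and 
{{log10}} helpers below are hypothetical stand-ins for illustration only, not 
pyarrow's classes or its actual implementation:

```python
import math


class Expr:
    """Hypothetical stand-in for a dataset expression (not pyarrow's class)."""

    def __init__(self, text):
        self.text = text

    @classmethod
    def call(cls, name, args):
        # Build a deferred function call; nothing is computed here.
        return cls(f"{name}({', '.join(a.text for a in args)})")

    def __repr__(self):
        return f"<Expr {self.text}>"


def field(name):
    """Hypothetical equivalent of a field reference."""
    return Expr(name)


def log10(x):
    """Toy wrapper: defer for expressions, compute eagerly for plain values."""
    if isinstance(x, Expr):
        return Expr.call("log10", [x])
    return [math.log10(v) for v in x]


deferred = log10(field("a"))   # an Expr describing log10(a)
eager = log10([1, 100])        # computed immediately
```

With this kind of dispatch the user writes the same {{log10(...)}} call for 
both arrays and expressions, at the cost of every wrapper needing to know 
about the expression type.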


> [Python] Enable calling compute functions on Expressions
> --------------------------------------------------------
>
>                 Key: ARROW-12060
>                 URL: https://issues.apache.org/jira/browse/ARROW-12060
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Major
>             Fix For: 6.0.0
>
>
> To expose the full power of dataset (projection/filter) expressions, we 
> should ensure that all compute kernels can be used in combination with 
> expressions.



