[ 
https://issues.apache.org/jira/browse/ARROW-9742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179274#comment-17179274
 ] 

Neville Dipale commented on ARROW-9742:
---------------------------------------

Hi [~jhorstmann], the scalar functions in the rust-dataframe library mainly 
call the Arrow compute functions. As we have implemented compute functions with 
an array being the smallest unit, I iterate over the chunked arrays and call the 
scalar functions on each array, before grouping the results again into a 
chunked array.
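
The per-chunk approach described above can be sketched roughly as follows. 
This is only an illustration using plain standard-library types: 
`ChunkedColumn`, `map_chunks` and `sqrt_kernel` are hypothetical names made up 
for this sketch, not the actual rust-dataframe or Arrow API.

```rust
/// Hypothetical stand-in for a chunked array: one logical column stored as
/// several contiguous arrays (chunks). Not the real Arrow/rust-dataframe type.
#[derive(Debug, PartialEq)]
struct ChunkedColumn {
    chunks: Vec<Vec<f64>>,
}

/// Hypothetical stand-in for an array-level compute kernel. The array is the
/// smallest unit the kernel operates on.
fn sqrt_kernel(values: &[f64]) -> Vec<f64> {
    values.iter().map(|v| v.sqrt()).collect()
}

impl ChunkedColumn {
    /// Apply an array-level kernel to every chunk, then regroup the results
    /// into a new chunked column with the same chunk boundaries.
    fn map_chunks(&self, kernel: impl Fn(&[f64]) -> Vec<f64>) -> ChunkedColumn {
        ChunkedColumn {
            chunks: self
                .chunks
                .iter()
                .map(|chunk| kernel(chunk.as_slice()))
                .collect(),
        }
    }
}
```

Calling `col.map_chunks(sqrt_kernel)` on a column with two chunks returns a 
column with two chunks of the same lengths. Because each chunk is processed 
independently, this shape is also what would make per-chunk parallelism 
possible later.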

I explored using Rayon for parallelising those compute functions, but it's not a 
priority (the project is really for me to explore ideas, with the goal being to 
create a lazy dataframe à la Spark).

There's scope to add a lot of compute functions to Arrow so that downstream 
users can reuse them, and so we can optimise performance from one place. I 
haven't yet seen interest in functions like trig, temporal functions (I have a 
Jira open for this as I tend to do a lot of datetime conversions), and other 
functions beyond what we have. I think DataFusion has some of these as UDFs, so 
it probably makes sense to keep them there for now.

Regarding performance, we've found some patterns that help with 
autovectorisation when writing compute functions. I think at the least we could 
write them up so that downstream users can follow them.

One common mistake I've seen is that we iterate through array values, checking 
if a slot is valid or null, and computing the function only if valid. An 
approach that works better is to ignore nulls when computing the values, and to 
derive the result's nulls from the validity mask.
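
To make the two patterns concrete, here is a minimal sketch using plain 
standard-library types. `NullableF64`, `add_branchy` and `add_masked` are 
hypothetical names for illustration only, and the `Vec<bool>` validity is a 
simplification: Arrow stores validity as a bit-packed bitmap.

```rust
/// Hypothetical, simplified nullable array. `validity[i] == false` means
/// slot `i` is null; the value stored in a null slot is unspecified.
struct NullableF64 {
    values: Vec<f64>,
    validity: Vec<bool>,
}

/// Branchy pattern: check validity slot by slot and compute only if valid.
/// Assumes both inputs have the same length.
fn add_branchy(a: &NullableF64, b: &NullableF64) -> NullableF64 {
    let len = a.values.len();
    let mut values = Vec::with_capacity(len);
    let mut validity = Vec::with_capacity(len);
    for i in 0..len {
        if a.validity[i] && b.validity[i] {
            values.push(a.values[i] + b.values[i]);
            validity.push(true);
        } else {
            values.push(0.0); // placeholder for a null slot
            validity.push(false);
        }
    }
    NullableF64 { values, validity }
}

/// Mask-based pattern: compute every slot without looking at nulls, then
/// combine the two validity masks in a separate pass. Both loops are
/// branch-free over contiguous data.
fn add_masked(a: &NullableF64, b: &NullableF64) -> NullableF64 {
    let values: Vec<f64> = a
        .values
        .iter()
        .zip(&b.values)
        .map(|(x, y)| x + y)
        .collect();
    let validity: Vec<bool> = a
        .validity
        .iter()
        .zip(&b.validity)
        .map(|(x, y)| *x && *y)
        .collect();
    NullableF64 { values, validity }
}
```

Both functions agree on which slots are null and on the values in valid slots; 
they may differ only in the unspecified contents of null slots. The mask-based 
version only suits functions that are safe to evaluate on whatever happens to 
sit in a null slot, which is an assumption of this sketch.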

> [Rust] Create one standard DataFrame API
> ----------------------------------------
>
>                 Key: ARROW-9742
>                 URL: https://issues.apache.org/jira/browse/ARROW-9742
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>            Reporter: Andy Grove
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
>  There was a discussion in last Arrow sync call about the fact that there are 
> numerous Rust DataFrame projects and it would be good to have one standard, 
> in the Arrow repo.
> I do think it would be good to have a DataFrame trait in Arrow, with an 
> implementation in DataFusion, and making it possible for other projects to 
> extend/replace the implementation e.g. for distributed compute, or for GPU 
> compute, as two examples. 
> [~jhorstmann] Does this capture what you were suggesting in the call?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
