[ https://issues.apache.org/jira/browse/ARROW-9742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179274#comment-17179274 ]
Neville Dipale commented on ARROW-9742:
---------------------------------------
Hi [~jhorstmann], the scalar functions in the rust-dataframe library mainly
call the Arrow compute functions. As we have implemented compute functions with
an array being the smallest unit, I iterate over the chunked arrays and call the
scalar functions on each array, before grouping the results back into a chunked
array.
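
As a rough illustration of that per-chunk pattern, here is a minimal sketch in
plain Rust. It does not use the Arrow crate: `Chunk`, `ChunkedColumn`, `apply`
and `sqrt_kernel` are hypothetical names made up for this example, standing in
for an Arrow array, a chunked array, and an array-level compute function.

```rust
/// Hypothetical stand-in for a single Arrow array: a contiguous run of values.
type Chunk = Vec<f64>;

/// Hypothetical stand-in for a chunked array (one column made of several arrays).
struct ChunkedColumn {
    chunks: Vec<Chunk>,
}

impl ChunkedColumn {
    /// Call an array-level kernel on every chunk, then regroup the results
    /// into a new chunked column with the same chunk boundaries.
    fn apply<F>(&self, kernel: F) -> ChunkedColumn
    where
        F: Fn(&[f64]) -> Chunk,
    {
        ChunkedColumn {
            chunks: self.chunks.iter().map(|c| kernel(c.as_slice())).collect(),
        }
    }
}

/// Example array-level kernel: element-wise square root.
fn sqrt_kernel(values: &[f64]) -> Chunk {
    values.iter().map(|v| v.sqrt()).collect()
}
```

Calling `column.apply(sqrt_kernel)` returns a new column with one output chunk
per input chunk. Each kernel call only sees its own chunk, so the calls are
independent of one another.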
I explored using Rayon for parallelising those compute functions, but it's not a
priority (the project is really for me to explore ideas, with the goal being to
create a lazy dataframe à la Spark).
There's scope to add a lot of compute functions to Arrow so that downstream
users can reuse them, and so we can optimise performance from one place. I
haven't yet seen interest in trigonometric functions, temporal functions (I have
a Jira open for these as I tend to do a lot of datetime conversions), or other
functions beyond what we have. I think DataFusion has some of these as UDFs, and
it probably makes sense to keep them there for now.
Regarding performance, we've found some patterns that help with
autovectorisation when writing compute functions. At the least, I think we could
write them up so that downstream users can follow them.
One common mistake I've seen is that we iterate through array values, checking
whether each slot is valid or null, and computing the function only if it is
valid. An approach that works is to ignore nulls while computing over the
values, and to calculate the result's nulls from the validity mask.
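
To make the difference concrete, here is a minimal sketch in plain Rust, again
without the Arrow crate. `NullableF64`, `add_checked` and `add_masked` are
hypothetical names for this example, and the validity mask is a `Vec<bool>`
for readability, whereas Arrow stores it as a packed bitmap. The sketch assumes
that values in null slots are unspecified but safe to compute on, which holds
for float addition but would not hold for, say, integer division.

```rust
/// Hypothetical stand-in for a nullable primitive array:
/// a values buffer plus a validity mask (true = valid, false = null).
struct NullableF64 {
    values: Vec<f64>,
    validity: Vec<bool>,
}

/// Branchy version: checks validity for every slot inside the loop.
fn add_checked(a: &NullableF64, b: &NullableF64) -> NullableF64 {
    let mut values = Vec::with_capacity(a.values.len());
    let mut validity = Vec::with_capacity(a.values.len());
    for i in 0..a.values.len() {
        if a.validity[i] && b.validity[i] {
            values.push(a.values[i] + b.values[i]);
            validity.push(true);
        } else {
            // Placeholder value for a null slot.
            values.push(0.0);
            validity.push(false);
        }
    }
    NullableF64 { values, validity }
}

/// Mask-based version: compute every slot unconditionally, then derive the
/// output validity by combining the two input masks.
fn add_masked(a: &NullableF64, b: &NullableF64) -> NullableF64 {
    // No branch in the loop body; null slots simply hold a value that is never read.
    let values = a.values.iter().zip(&b.values).map(|(x, y)| x + y).collect();
    let validity = a
        .validity
        .iter()
        .zip(&b.validity)
        .map(|(x, y)| *x && *y)
        .collect();
    NullableF64 { values, validity }
}
```

Both functions agree on every valid slot and on the validity mask. The second
keeps the value loop free of per-slot checks, which is the shape that tends to
help the compiler vectorise it; whether it actually does depends on the kernel
and the target.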
> [Rust] Create one standard DataFrame API
> ----------------------------------------
>
> Key: ARROW-9742
> URL: https://issues.apache.org/jira/browse/ARROW-9742
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust
> Reporter: Andy Grove
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.0
>
> Time Spent: 2h 20m
> Remaining Estimate: 0h
>
> There was a discussion in the last Arrow sync call about the fact that there
> are numerous Rust DataFrame projects and that it would be good to have one
> standard, in the Arrow repo.
> I do think it would be good to have a DataFrame trait in Arrow, with an
> implementation in DataFusion, while making it possible for other projects to
> extend or replace the implementation, e.g. for distributed compute or for GPU
> compute.
> [~jhorstmann] Does this capture what you were suggesting in the call?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)