Jacob Baumbach created ARROW-12293:
--------------------------------------
Summary: [Rust][DataFusion] Word Count
Key: ARROW-12293
URL: https://issues.apache.org/jira/browse/ARROW-12293
Project: Apache Arrow
Issue Type: Wish
Components: Rust - DataFusion
Reporter: Jacob Baumbach
I am learning DataFusion and tried to do the canonical big data version of
hello world, word count, using DataFusion. I have been unsuccessful, and I am
wondering if word count is even currently possible with DataFusion.
Typically, word count involves a flat_map in which you split each string on
whitespace and flatten the resulting words into a single collection.
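As a plain-Rust point of reference (standard library only, no DataFusion), the
flat_map formulation looks roughly like the sketch below. `word_count` is an
illustrative name chosen here, not an existing Arrow or DataFusion API.

```rust
use std::collections::HashMap;

/// Count whitespace-separated words across a set of lines.
/// Illustrative helper only; not part of Arrow or DataFusion.
fn word_count(lines: &[&str]) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    // flat_map turns each line into zero or more words and
    // concatenates the per-line results into one stream of words.
    for word in lines.iter().flat_map(|line| line.split_whitespace()) {
        *counts.entry(word.to_string()).or_insert(0) += 1;
    }
    counts
}
```

Calling `word_count(&["hello world", "hello arrow"])` returns a map with one
entry per distinct word. The question in this issue is how to express the same
two steps (split, then flatten) over a DataFusion column.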
There are two issues I am running into:
1) Creating a UDF that goes from &str -> Vec<&str>. I cannot find an
`arrow::array` type that maps to a collection of strings, which is preventing
me from creating a UDF that can perform the split.
2) Assuming I could get (1) to work, I am not aware of a method similar to
flat_map that can be applied to a column. In SQL, I believe this is called
`explode`. I can't find it in the codebase, which makes me think flat_map-style
operations aren't possible.
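To make the two missing pieces concrete, here is a minimal standard-library
sketch of what they would have to compute. `StringListColumn`,
`split_whitespace_column`, and `explode` are hypothetical names invented for
this illustration; none of them is a type or function from the arrow or
datafusion crates. The struct only imitates the offsets-plus-child-values idea
used by Arrow's variable-size list layout.

```rust
/// Toy "list of strings" column: row i owns values[offsets[i]..offsets[i + 1]].
/// Hypothetical stand-in for a list-typed column, not an arrow crate type.
#[derive(Debug, PartialEq)]
struct StringListColumn {
    offsets: Vec<usize>,
    values: Vec<String>,
}

/// Hypothetical "split" step: produce one list of words per input row.
fn split_whitespace_column(rows: &[&str]) -> StringListColumn {
    let mut offsets = vec![0];
    let mut values = Vec::new();
    for row in rows {
        // Append this row's words to the shared child buffer...
        values.extend(row.split_whitespace().map(String::from));
        // ...and record where the row's list ends.
        offsets.push(values.len());
    }
    StringListColumn { offsets, values }
}

/// Hypothetical "explode" step: one (source row index, word) pair per list
/// element. Rows whose list is empty produce no output rows.
fn explode(column: &StringListColumn) -> Vec<(usize, &str)> {
    let mut out = Vec::new();
    for (row, bounds) in column.offsets.windows(2).enumerate() {
        for word in &column.values[bounds[0]..bounds[1]] {
            out.push((row, word.as_str()));
        }
    }
    out
}
```

With these two steps in place, word count reduces to grouping the exploded
words and counting them, which is an ordinary aggregation.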
My questions are:
Is word count currently possible in DataFusion? If so, how can you perform the
split, and how can you perform a flat_map? If word count cannot be done, what
would need to be implemented to make it possible?