Hi Arrow devs,

Small intro: I'm the main developer of Vaex, an out-of-core dataframe
library for Python - https://github.com/vaexio/vaex - and we're
looking into moving Vaex to use Apache Arrow for its data structures.
At the beginning of this year, we added string support in Vaex, which
required 64-bit offsets. Those were not available in Arrow back then,
so we added our own data structure for string arrays. Our first step
in moving to Apache Arrow is to see if we can use Arrow for the data
structure, and later on, move Vaex's string algorithms to Arrow.

(originally posted at https://github.com/apache/arrow/issues/5874)

In Vaex I can lazily concatenate dataframes without a memory copy. If I
want to implement this using a pa.ChunkedArray, users cannot
concatenate a dataframe whose string column has type pa.string with
one whose column has type pa.large_string, since all chunks of a
ChunkedArray must share a single type.

In short, there is no Arrow data structure to handle this 'mixed
chunked array', but I was wondering if this could change. The only way
out seems to be casting the chunks manually to a common type (although
that is currently blocked by
https://issues.apache.org/jira/browse/ARROW-6071).
I could work around this internally in Vaex, but I thought feedback
from building a DataFrame library on top of Arrow might be useful.
It also means I cannot expose the concatenated DataFrame as an Arrow
table.

Because of this, I am wondering if having two types (large_string and
string) is a good idea in the end, since it makes type checking
cumbersome (having to check two types each time). Could an option be
to have only one string type and one list type, with the width of the
indices/offsets obtainable at runtime? That would also make it easy to
support 16- and 8-bit offsets. It would make Arrow more flexible and
efficient, and I guess it would play better with pa.ChunkedArray.

Regards,

Maarten Breddels
