Hi Maarten,

I opened https://issues.apache.org/jira/browse/ARROW-7245 in part based on this.

I think that normalizing to a common type during concatenation is the
approach I would take. This requires widening the offsets buffer, but
not the data buffer -- which can be shared -- so it is not too
wasteful.
I would be surprised if normalizing string offsets during record batch
/ table concatenation showed up as a performance or memory use issue
relative to other kinds of operations -- in theory the
string->large_string promotion should be relatively exceptional (< 5%
of the time). I've found in performance tests that creating many
smaller array chunks is faster anyway due to interplay with the memory
allocator.
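
For example, assuming the string -> large_string cast from ARROW-6071
is available, the normalization in pyarrow terms would look something
like this (made-up example data):

    import pyarrow as pa

    # Two chunks with different offset widths
    small = pa.array(["a", "bb"], type=pa.string())         # 32-bit offsets
    large = pa.array(["ccc", "d"], type=pa.large_string())  # 64-bit offsets

    # Widen everything to large_string: only the offsets buffer is
    # rewritten; the character data itself can be shared
    normalized = [chunk.cast(pa.large_string()) for chunk in (small, large)]
    chunked = pa.chunked_array(normalized)  # all chunks now large_string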

Of course I think we should have string kernels for both 32-bit and
64-bit variants. Note that Gandiva already has significant string
kernel support (for 32-bit offsets at the moment) and there is
discussion about pre-compiling the LLVM IR into a shared library so
as not to introduce an LLVM runtime dependency. That way we could
maintain a single code path for string algorithms that can be used
both in a JIT-ed setting and in a pre-compiled / interpreted setting. See
https://issues.apache.org/jira/browse/ARROW-7083

Note that many analytic database engines (notably: Dremio, which is
natively Arrow-based) don't support exceeding the 2GB / 32-bit limit
at all, and it does not seem to be an impediment in practical use. We
have the Chunked* builder classes [1] in C++ to facilitate the
creation of chunked binary arrays where there is concern about
overflowing the 2GB limit.
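
To illustrate the idea, here is a rough Python sketch only -- the
real C++ builders do this internally, and the 2**31 - 1 limit below
is the 32-bit offset bound:

    import pyarrow as pa

    def build_chunked_strings(values, byte_limit=2**31 - 1):
        # Start a new chunk before the accumulated character data
        # would overflow the 32-bit offset limit
        chunks, current, nbytes = [], [], 0
        for v in values:
            size = len(v.encode("utf8"))
            if nbytes + size > byte_limit and current:
                chunks.append(pa.array(current, type=pa.string()))
                current, nbytes = [], 0
            current.append(v)
            nbytes += size
        if current:
            chunks.append(pa.array(current, type=pa.string()))
        return pa.chunked_array(chunks, type=pa.string())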

Others may have different opinions so I'll let them comment.

- Wes

[1]: 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_binary.h#L510

On Tue, Nov 26, 2019 at 7:44 AM Maarten Breddels
<maartenbredd...@gmail.com> wrote:
>
> Hi Arrow devs,
>
> Small intro: I'm the main developer of Vaex, an out-of-core
> dataframe library for Python - https://github.com/vaexio/vaex -
> and we're looking into moving Vaex to use Apache Arrow for the
> data structure.
> At the beginning of this year, we added string support in Vaex,
> which required 64-bit offsets. Those were not available back then,
> so we added our own data structure for string arrays. Our first
> step in moving to Apache Arrow is to see if we can use Arrow for
> the data structure, and later on, to move the string algorithms of
> Vaex to Arrow.
>
> (originally posted at https://github.com/apache/arrow/issues/5874)
>
> In Vaex I can lazily concatenate dataframes without copying memory.
> If I want to implement this using a pa.ChunkedArray, users cannot
> concatenate a dataframe that has a string column of pa.string type
> with a dataframe whose column is pa.large_string.
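>
> A minimal sketch of what fails (made-up example data):
>
>     import pyarrow as pa
>
>     a = pa.array(["x"], type=pa.string())
>     b = pa.array(["y"], type=pa.large_string())
>     # fails: a ChunkedArray requires a single common chunk type
>     combined = pa.chunked_array([a, b])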
>
> In short, there is no Arrow data structure to handle this 'mixed
> chunked array', but I was wondering if this could change. The only
> way out seems to be casting them manually to a common type
> (although that is currently blocked by
> https://issues.apache.org/jira/browse/ARROW-6071).
> Internally I could solve this in Vaex, but feedback from building a
> DataFrame library with Arrow might be useful. Also, it means I
> cannot expose the concatenated DataFrame as an Arrow table.
>
> Because of this, I am wondering if having two types (large_string
> and string) is a good idea in the end, since it makes type checking
> cumbersome (having to check two types each time). Could an option
> be that there is only one string type and one list type, with the
> width of the indices/offsets obtained at runtime? That would also
> make it easy to support 16-bit and 8-bit offsets. That would make
> Arrow more flexible and efficient, and I guess it would play better
> with pa.ChunkedArray.
>
> Regards,
>
> Maarten Breddels
