Re: Strategy for mixing large_string and string with chunked arrays

Maarten Breddels Tue, 17 Dec 2019 03:22:19 -0800

Op wo 27 nov. 2019 om 19:37 schreef Wes McKinney <wesmck...@gmail.com>:


> On Tue, Nov 26, 2019 at 9:40 AM Maarten Breddels
> <maartenbredd...@gmail.com> wrote:
> >
> > Op di 26 nov. 2019 om 15:02 schreef Wes McKinney <wesmck...@gmail.com>:
> >
> > > hi Maarten
> > >
> > > I opened https://issues.apache.org/jira/browse/ARROW-7245 in part
> based
> > > on this.
> > >
> > > I think that normalizing to a common type (which would require casting
> > > the offsets buffer, but not the data -- which can be shared -- so not
> > > too wasteful) during concatenation would be the approach I would take.
> > > I would be surprised if normalizing string offsets during record batch
> > > / table concatenation showed up as a performance or memory use issue
> > > relative to other kinds of operations -- in theory the
> > > string->large_string promotion should be relatively exceptional (< 5%
> > > of the time). I've found in performance tests that creating many
> > > smaller array chunks is faster anyway due to interplay with the memory
> > > allocator.
> > >
> >
> > Yes, I think it is rare, but it does mean that if a user wants to
> convert a
> > Vaex dataframe to an Arrow table it might use GB's of RAM (thinking ~1
> > billion rows). Ideally, it would use zero RAM (imagine concatenating many
> > large memory-mapped datasets).
> > I'm ok living with this limitation, but I wanted to raise it before v1.0
> > goes out.
> >
>
> The 1.0 release is about hardening the format and protocol, which
> wouldn't be affected by this discussion. The Binary/String and
> LargeBinary/LargeString are distinct memory layouts and so they need
> to be separate at the protocol level.
>
> At the C++ library / application level there's plenty that could be
> done if this turned out to be an issue. For example, an ExtensionType
> could be defined that allows the storage to be either 32-bit or
> 64-bit.
>

Ok, sounds good.


>
> >
> >
> > >
> > > Of course I think we should have string kernels for both 32-bit and
> > > 64-bit variants. Note that Gandiva already has significant string
> > > kernel support (for 32-bit offsets at the moment) and there is
> > > discussion about pre-compiling the LLVM IR into a shared library to
> > > not introduce an LLVM runtime dependency, so we could maintain a
> > > single code path for string algorithms that can be used both in a
> > > JIT-ed setting as well as pre-compiled / interpreted setting. See
> > > https://issues.apache.org/jira/browse/ARROW-7083
> >
> >
> > That is a very interesting approach, thanks for sharing that resource,
> I'll
> > consider that.
> >
> >
> > > Note that many analytic database engines (notably: Dremio, which is
> > > natively Arrow-based) don't support exceeding the 2GB / 32-bit limit
> > > at all and it does not seem to be an impedance in practical use. We
> > > have the Chunked* builder classes [1] in C++ to facilitate the
> > > creation of chunked binary arrays where there is concern about
> > > overflowing the 2GB limit.
> > >
> > > Others may have different opinions so I'll let them comment.
> > >
> >
> > Yes, I think in many cases it's not a problem at all. Also in vaex, all
> the
> > processing happens in chunks, and no chunk will ever be that large (for
> the
> > near future...).
> > In vaex, when exporting to hdf5, I always write in 1 chunk, and that's
> > where most of my issues show up.
>
> I see. Ideally one would architect around the chunked model since this
> seems to have the best overall performance and scalability qualities.
>

Note that I prefer non-chunked on disk, but chunked in memory (or, while
processing). I think that having 1 linear array gives you the ability to
chunk it up any way you prefer (to make cache hits optimal etc), while if
the on-disk chunking does not match the ideal chunking size, you may end up
processing small chunks or having to memory copies.
Where do you see performance differences between chunked/non-chunked, would
be interesting to know more about that.

cheers,

Maarten


>
> >
> > cheers,
> >
> > Maarten
> >
> >
> > >
> > > - Wes
> > >
> > > [1]:
> > >
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_binary.h#L510
> > >
> > > On Tue, Nov 26, 2019 at 7:44 AM Maarten Breddels
> > > <maartenbredd...@gmail.com> wrote:
> > > >
> > > > Hi Arrow devs,
> > > >
> > > > Small intro: I'm the main Vaex developer, an out of core dataframe
> > > > library for Python - https://github.com/vaexio/vaex -, and we're
> > > > looking into moving Vaex to use Apache Arrow for the data structure.
> > > > At the beginning of this year, we added string support in Vaex, which
> > > > required 64 bit offsets. Those were not available back then, so we
> > > > added our own data structure for string arrays. Our first step to
> move
> > > > to Apache Arrow is to see if we can use Arrow for the data structure,
> > > > and later on, move the strings algorithms of Vaex to Arrow.
> > > >
> > > > (originally posted at https://github.com/apache/arrow/issues/5874)
> > > >
> > > > In vaex I can lazily concatenate dataframes without memory copy. If I
> > > > want to implement this using a pa.ChunkedArray, users cannot
> > > > concatenate dataframes that have a string column with pa.string type
> > > > to a dataframe that has a column with pa.large_string.
> > > >
> > > > In short, there is no arrow data structure to handle this 'mixed
> > > > chunked array', but I was wondering if this could change. The only
> way
> > > > out seems to cast them manually to a common type (although blocked by
> > > > https://issues.apache.org/jira/browse/ARROW-6071).
> > > > Internally I could solve this in vaex, but feedback from building a
> > > > DataFrame library with arrow might be useful. Also, it means I cannot
> > > > expose the concatenated DataFrame as an arrow table.
> > > >
> > > > Because of this, I am wondering if having two types (large_string and
> > > > string) is a good idea in the end since it makes type checking
> > > > cumbersome (having to check two types each time).  Could an option be
> > > > that there is only 1 string and list type, and that the width of the
> > > > indices/offsets can be obtained at runtime? That would also make it
> > > > easy to support 16 and 8-bit offsets. That would make Arrow more
> > > > flexible and efficient, and I guess it would play better with
> > > > pa.ChunkedArray.
> > > >
> > > > Regards,
> > > >
> > > > Maarten Breddels
> > >
>

Re: Strategy for mixing large_string and string with chunked arrays

Reply via email to