On Tue, Nov 26, 2019 at 15:02, Wes McKinney <wesmck...@gmail.com> wrote:
> hi Maarten
>
> I opened https://issues.apache.org/jira/browse/ARROW-7245 in part based
> on this.
>
> I think that normalizing to a common type (which would require casting
> the offsets buffer, but not the data -- which can be shared -- so not
> too wasteful) during concatenation would be the approach I would take.
> I would be surprised if normalizing string offsets during record batch
> / table concatenation showed up as a performance or memory use issue
> relative to other kinds of operations -- in theory the
> string->large_string promotion should be relatively exceptional (< 5%
> of the time). I've found in performance tests that creating many
> smaller array chunks is faster anyway due to interplay with the memory
> allocator.

Yes, I think it is rare, but it does mean that if a user wants to convert
a Vaex dataframe to an Arrow table, it might use GBs of RAM (thinking ~1
billion rows). Ideally it would use zero RAM (imagine concatenating many
large memory-mapped datasets). I'm OK living with this limitation, but I
wanted to raise it before v1.0 goes out.
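To make that concrete, this is roughly what I understand the proposed
normalization to look like from the Python side. It is only a sketch: it
assumes the string -> large_string cast is actually available (which is
exactly what ARROW-6071 is about), and the table/column names are made up:

    import pyarrow as pa

    # Two tables whose "s" column only differs in offset width.
    t32 = pa.Table.from_arrays(
        [pa.array(["foo", "bar"], type=pa.string())], names=["s"])
    t64 = pa.Table.from_arrays(
        [pa.array(["baz"], type=pa.large_string())], names=["s"])

    # Normalize to the widest offset type before concatenating; only the
    # int32 offsets buffer needs rewriting, the character data can be
    # shared.
    target = pa.schema([("s", pa.large_string())])
    combined = pa.concat_tables([t.cast(target) for t in (t32, t64)])

    print(combined.column(0).num_chunks)  # 2: one chunk per input table

If the cast really only rewrites the offsets buffer, the extra memory for
a ~1 billion row column would be the new 64-bit offsets (on the order of
8 GB) rather than a copy of the string data itself, which is already much
better than a full copy.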
> Of course I think we should have string kernels for both 32-bit and
> 64-bit variants. Note that Gandiva already has significant string
> kernel support (for 32-bit offsets at the moment) and there is
> discussion about pre-compiling the LLVM IR into a shared library to
> not introduce an LLVM runtime dependency, so we could maintain a
> single code path for string algorithms that can be used both in a
> JIT-ed setting as well as a pre-compiled / interpreted setting. See
> https://issues.apache.org/jira/browse/ARROW-7083

That is a very interesting approach, thanks for sharing that resource;
I'll consider it.

> Note that many analytic database engines (notably: Dremio, which is
> natively Arrow-based) don't support exceeding the 2GB / 32-bit limit
> at all and it does not seem to be an impediment in practical use. We
> have the Chunked* builder classes [1] in C++ to facilitate the
> creation of chunked binary arrays where there is concern about
> overflowing the 2GB limit.
>
> Others may have different opinions so I'll let them comment.

Yes, I think in many cases it's not a problem at all. In Vaex, too, all
the processing happens in chunks, and no chunk will ever be that large
(for the near future...). In Vaex, however, when exporting to HDF5, I
always write in one chunk, and that's where most of my issues show up.

cheers,

Maarten

> - Wes
>
> [1]:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_binary.h#L510
>
> On Tue, Nov 26, 2019 at 7:44 AM Maarten Breddels
> <maartenbredd...@gmail.com> wrote:
> >
> > Hi Arrow devs,
> >
> > Small intro: I'm the main developer of Vaex, an out-of-core dataframe
> > library for Python - https://github.com/vaexio/vaex - and we're
> > looking into moving Vaex to use Apache Arrow for the data structure.
> > At the beginning of this year, we added string support in Vaex, which
> > required 64-bit offsets. Those were not available back then, so we
> > added our own data structure for string arrays. Our first step in
> > moving to Apache Arrow is to see if we can use Arrow for the data
> > structure, and later on move the string algorithms of Vaex to Arrow.
> >
> > (originally posted at https://github.com/apache/arrow/issues/5874)
> >
> > In Vaex I can lazily concatenate dataframes without memory copy.
> > If I want to implement this using a pa.ChunkedArray, users cannot
> > concatenate a dataframe that has a string column of type pa.string
> > with a dataframe whose column is of type pa.large_string.
> >
> > In short, there is no Arrow data structure to handle this 'mixed
> > chunked array', but I was wondering if this could change. The only
> > way out seems to be casting them manually to a common type (although
> > that is blocked by https://issues.apache.org/jira/browse/ARROW-6071).
> > Internally I could solve this in Vaex, but feedback from building a
> > dataframe library with Arrow might be useful. Also, it means I cannot
> > expose the concatenated dataframe as an Arrow table.
> >
> > Because of this, I am wondering if having two types (large_string and
> > string) is a good idea in the end, since it makes type checking
> > cumbersome (having to check two types each time). Could an option be
> > that there is only one string and one list type, and that the width
> > of the indices/offsets can be obtained at runtime? That would also
> > make it easy to support 16- and 8-bit offsets. That would make Arrow
> > more flexible and efficient, and I guess it would play better with
> > pa.ChunkedArray.
> >
> > Regards,
> >
> > Maarten Breddels
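P.S. For anyone who wants to see the 'mixed chunked array' issue from my
original mail in code form, a minimal reproduction looks roughly like
this (the exact error will depend on the pyarrow version, and the cast at
the end again assumes ARROW-6071 is resolved):

    import pyarrow as pa

    small = pa.array(["a", "b"], type=pa.string())   # int32 offsets
    large = pa.array(["c"], type=pa.large_string())  # int64 offsets

    # A ChunkedArray requires all chunks to share a single type, so
    # mixing a string chunk with a large_string chunk is rejected.
    try:
        pa.chunked_array([small, large])
    except Exception as exc:
        print("mixed chunks are rejected:", exc)

    # The manual workaround: cast to the widest type first.
    chunked = pa.chunked_array([small.cast(pa.large_string()), large])
    print(chunked.type)  # large_string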