Hi Arrow devs,

A short intro: I'm the main developer of Vaex, an out-of-core dataframe library for Python (https://github.com/vaexio/vaex), and we're looking into moving Vaex onto Apache Arrow for its data structures. At the beginning of this year we added string support to Vaex, which required 64-bit offsets. Those were not available in Arrow back then, so we added our own data structure for string arrays. Our first step toward Apache Arrow is to see whether we can use Arrow for the data structure itself, and later move Vaex's string algorithms to Arrow as well.
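For context, the only difference between the two Arrow string types is the width of the offsets buffer: pa.string uses 32-bit offsets, which caps the character data of a single array at roughly 2 GiB, while pa.large_string uses 64-bit offsets. A minimal pyarrow sketch (the buffer inspection is just for illustration):

    import pyarrow as pa

    small = pa.array(["foo", "bar"], type=pa.string())        # 32-bit offsets
    large = pa.array(["foo", "bar"], type=pa.large_string())  # 64-bit offsets

    # buffers() returns [validity, offsets, data]; for 2 values there are
    # 3 offsets, taking 4 bytes each vs 8 bytes each
    print(small.buffers()[1].size, large.buffers()[1].size)   # 12 24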
(Originally posted at https://github.com/apache/arrow/issues/5874.)

In Vaex I can lazily concatenate dataframes without copying memory. If I implement this on top of pa.ChunkedArray, users cannot concatenate a dataframe whose string column has type pa.string with one whose column has type pa.large_string: there is no Arrow data structure that can hold such a 'mixed' chunked array, and I was wondering if this could change. The only way out seems to be casting the chunks manually to a common type (although that cast is currently blocked by https://issues.apache.org/jira/browse/ARROW-6071). I could work around this internally in Vaex, but it also means I cannot expose the concatenated dataframe as an Arrow table, and feedback from building a dataframe library on top of Arrow might be useful to the project.

Because of this, I am wondering whether having two types (string and large_string) is a good idea in the end, since it makes type checking cumbersome: every check has to test for both types. Could an option be to have a single string type (and a single list type) whose index/offset width can be obtained at runtime? That would also make it easy to support 16- and 8-bit offsets, it would make Arrow more flexible and efficient, and I suspect it would play better with pa.ChunkedArray.
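To make the concatenation problem concrete, here is a small sketch of what fails today (the exact error messages depend on the pyarrow version):

    import pyarrow as pa

    t1 = pa.table({"s": pa.array(["a", "b"], type=pa.string())})
    t2 = pa.table({"s": pa.array(["c", "d"], type=pa.large_string())})

    try:
        pa.concat_tables([t1, t2])    # schemas differ only in offset width
    except pa.ArrowInvalid as exc:
        print(exc)

    try:
        # a ChunkedArray cannot mix chunks of the two string types either
        pa.chunked_array([t1.column("s").chunk(0), t2.column("s").chunk(0)])
    except pa.ArrowInvalid as exc:
        print(exc)

The obvious way out would be t1.column("s").cast(pa.large_string()) before concatenating, but that string/large_string cast is exactly what ARROW-6071 tracks.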
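And this is the double check that every consumer currently ends up writing (is_any_string is just a hypothetical helper name, not an existing API):

    import pyarrow as pa

    def is_any_string(t: pa.DataType) -> bool:
        # with two offset widths, every string check must test two types
        return pa.types.is_string(t) or pa.types.is_large_string(t)

With a single string type whose offset width is a runtime property, this would collapse to one check.

Regards,

Maarten Breddels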