Hi Korn,

Thanks a lot for your comments.

Your comments make sense to me. Allowing non-consecutive memory segments would break some good design choices of Arrow.

However, there are widespread user requirements for non-consecutive memory segments. I am wondering how we can help such users. What advice can we give them?

Memory copy/move can be a solution, but is there a better one? Is there a third alternative? Can we virtualize the non-consecutive memory segments into a consecutive one? (Although some performance overhead is unavoidable.)

What do you think? Let's brainstorm.

Best,
Liya Fan

On Thu, Jul 11, 2019 at 8:05 PM Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Liya,
>
> I'm quite -1 on this type as Arrow is about efficient columnar structures.
> We have opened the standard also to matrix-like types but always keep the
> constraint of consecutive memory. Now also adding types where memory is no
> longer consecutive but spread in the heap will make the scope of the
> project much wider (It seems that we then just turn into a general
> serialization framework).
>
> One of the ideas of a common standard is that some need to make
> compromises. I think in this case it is a necessary compromise to not allow
> all kind of string representations.
>
> Uwe
>
> On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> > Hi all,
> >
> > We are thinking of providing varchar/varbinary vectors with a different
> > memory layout which exists in a wide range of systems. The memory layout is
> > different from that of VarCharVector in the following ways:
> >
> > 1. Instead of storing (start offset, end offset), the new layout stores
> >    (start offset, length)
> > 2. The content of varchars may not be in a consecutive memory region.
> >    Instead, it can be in arbitrary memory address.
> >
> > Due to these differences in memory layout, it incurs performance overhead
> > when converting data between existing systems and VarCharVectors.
> >
> > The above difference 1 seems insignificant, while difference 2 is difficult
> > to overcome. However, the scenario of difference 2 is prevalent in
> > practice: for example we store strings in a series of memory segments.
> > Whenever a segment is full, we request a new one. However, these memory
> > segments may not be consecutive, because other processes/threads are also
> > requesting/releasing memory segments in the meantime.
> >
> > So we are wondering if it is possible to support such memory layout in
> > Arrow. I think there are more systems that are trying to adopting Arrow,
> > but are hindered by such difficulty.
> >
> > Would you please give your valuable feedback?
> >
> > Best,
> >
> > Liya Fan
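
The two layouts discussed in this thread, and the copy needed to go from one to the other, can be sketched roughly as below. This is a minimal illustration in plain Python and does not use Arrow's API. The function names `to_contiguous` and `get_value` and the `(segment index, start offset, length)` entry shape are assumptions made for the example; the target side follows the consecutive layout described above, with n + 1 offsets into a single data buffer.

```python
from typing import List, Tuple

# Hypothetical entry shape for the segmented layout:
# (segment index, start offset within that segment, length).
Entry = Tuple[int, int, int]


def to_contiguous(segments: List[bytes], entries: List[Entry]) -> Tuple[List[int], bytes]:
    """Copy values scattered across segments into one buffer with n + 1 offsets."""
    offsets = [0]
    data = bytearray()
    for seg_idx, start, length in entries:
        # This per-value copy is the conversion overhead raised in the thread.
        data += segments[seg_idx][start:start + length]
        offsets.append(len(data))
    return offsets, bytes(data)


def get_value(offsets: List[int], data: bytes, i: int) -> bytes:
    """Read value i from the contiguous layout as data[offsets[i]:offsets[i + 1]]."""
    return data[offsets[i]:offsets[i + 1]]


# Two independent segments holding three strings, not stored in logical order.
segments = [b"helloarrow", b"xxvarchar"]
entries = [(0, 0, 5), (1, 2, 7), (0, 5, 5)]

offsets, data = to_contiguous(segments, entries)
print(offsets)
print(data)
print(get_value(offsets, data, 1))
```

In the contiguous form, the end offset of value i is the start offset of value i + 1, so no separate length is stored. The segmented form needs a segment reference per value, which is what prevents it from mapping onto a single data buffer without a copy.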