Hi Korn,

Thanks a lot for your comments.

Your comments make sense to me. Allowing non-consecutive
memory segments would break some good design choices of Arrow.
However, there are widespread user requirements for non-consecutive memory
segments. I am wondering how we can help such users. What advice can we
give them?

Memory copy/move can be a solution, but is there a better solution?
Is there a third alternative? Can we virtualize the non-consecutive memory
segments into a consecutive one? (Although performance overhead is
unavoidable.)
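
To make the copy option concrete, here is a minimal sketch in Python. It is
not Arrow code: the names to_consecutive, segments, and entries are made up
for illustration, and it assumes each value is described by
(segment index, start offset, length). It copies the scattered values into
one consecutive data buffer plus n + 1 offsets, which is roughly the shape
a consecutive varchar layout needs:

```python
from typing import List, Tuple

# Hypothetical description of one value in the source system:
# (segment index, start offset within that segment, length in bytes).
Entry = Tuple[int, int, int]


def to_consecutive(segments: List[bytes], entries: List[Entry]):
    """Copy scattered values into one data buffer plus n + 1 offsets."""
    data = bytearray()
    offsets = [0]  # offsets[i]..offsets[i + 1] delimits value i
    for seg, start, length in entries:
        data += segments[seg][start:start + length]
        offsets.append(len(data))
    return bytes(data), offsets


# Two non-consecutive segments holding three values (the last one empty).
segments = [b"helloxx", b"arrow"]
entries = [(0, 0, 5), (1, 0, 5), (0, 5, 0)]
data, offsets = to_consecutive(segments, entries)
```

Every byte is copied once here, which is exactly the overhead I would like
to avoid if a better alternative exists.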

What do you think? Let's brainstorm.

Best,
Liya Fan


On Thu, Jul 11, 2019 at 8:05 PM Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Liya,
>
> I'm quite -1 on this type, as Arrow is about efficient columnar structures.
> We have also opened the standard to matrix-like types, but have always kept
> the constraint of consecutive memory. Adding types where memory is no
> longer consecutive but spread across the heap would make the scope of the
> project much wider (it seems that we would then just turn into a general
> serialization framework).
>
> One of the ideas of a common standard is that some need to make
> compromises. I think in this case it is a necessary compromise not to allow
> all kinds of string representations.
>
> Uwe
>
> On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> > Hi all,
> >
> >
> > We are thinking of providing varchar/varbinary vectors with a different
> > memory layout which exists in a wide range of systems. The memory layout
> > is different from that of VarCharVector in the following ways:
> >
> >
> >    1. Instead of storing (start offset, end offset), the new layout stores
> >       (start offset, length).
> >    2. The content of varchars may not be in a consecutive memory region.
> >       Instead, it can be at arbitrary memory addresses.
> >
> >
> > Due to these differences in memory layout, converting data between
> > existing systems and VarCharVectors incurs performance overhead.
> >
> > Difference 1 above seems insignificant, while difference 2 is difficult
> > to overcome. However, the scenario of difference 2 is prevalent in
> > practice: for example, we store strings in a series of memory segments.
> > Whenever a segment is full, we request a new one. However, these memory
> > segments may not be consecutive, because other processes/threads are also
> > requesting/releasing memory segments in the meantime.
> >
> > So we are wondering if it is possible to support such a memory layout in
> > Arrow. I think there are more systems that are trying to adopt Arrow
> > but are hindered by this difficulty.
> >
> > Would you please give your valuable feedback?
> >
> >
> > Best,
> >
> > Liya Fan
> >
>
