Per my comments there, the introduction of field buffers was added as part
of the fieldvector addition when we have vectors that weren't field level.
This meant that getbuffers and getfieldbuffers were at different levels at
hierarchy (getbuffers being more general). I believe we no longer have the
case of a vector that is not a field. As such, collapsing fieldvector and
valuevector makes a lot of sense.

In the thread I tried to say that we should capture the use cases and then
consolidate to support those use cases. I think there are definitely two
use cases (with sub in one of them) that I'm aware of:

- get me buffers for writing to a io (socket or disk)
  - transfer the buffers
  - keep the buffers in vector (double reference counts)
- get me the buffers so I can do some kind of low level memory operations
(cheap as possible please)

I actually think we've also failed to complete the other work that we
talked about wrt removing readerIndex and writerIndex from ArrowBuf. They
aren't well maintained by any of the vectors and thus are mostly broken
(and create a bunch of this confusion).

So I'd actually be inclined to fix the readerIndex/writerIndex issue before
we mess with the buffer methods. Removing that whole concept will collapse
the use cases, I believe. Then we only have transfer versus not (which is
what the getBuffers() method already provides).


On Mon, Aug 10, 2020 at 1:44 AM Ji Liu <tianc...@apache.org> wrote:

> Hi Micah, I am afraid it's not a reasonable solution.
> 1. The status is that getFieldBuffers has right order buffer and was used
> in IPC, getBuffers was not used in IPC.
> 2. The purpose of this PR is to use getBuffers in IPC instead, and making
> changes in getFieldBuffers dose not seem to help this problem since it will
> break IPC format by using getBuffers.
>
> Micah Kornfield <emkornfi...@gmail.com> 于2020年8月8日周六 上午11:50写道:
>
> >  Thinking about this some more, I think maybe we should also potentially
> > try to deprecate hold off on any changes to getFieldBuffers.  It should
> > likely follow the same sort of pattern for getBuffers (create a new
> method
> > that doesn't set reader/writer indices and deprecate getFieldBuffers).
> But
> > I think that can be handled in a separate PR?
> >
> > Anybody else have thoughts?
> >
> > -Micah
> >
> > On Tue, Aug 4, 2020 at 11:24 PM Ji Liu <tianc...@apache.org> wrote:
> >
> > > hi liya,
> > > Thanks for your careful review, it is a typo, the order of getBuffers
> is
> > > wrong.
> > >
> > > Fan Liya <liya.fa...@gmail.com> 于2020年8月5日周三 下午2:14写道:
> > >
> > > > Hi Ji,
> > > >
> > > > IMO, for the correct order, the validity buffer should precede the
> > offset
> > > > buffer (e.g. this is the order used by BaseVariableWidthVector &
> > > > BaseLargeVariableWidthVector).
> > > > In ListVector#getBuffers, the offset buffer precedes the validity
> > buffer,
> > > > so I am a little confused why you say the order of
> > ListVector#getBuffers
> > > is
> > > > right?
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > On Wed, Aug 5, 2020 at 12:32 PM Micah Kornfield <
> emkornfi...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > FWIW, I lack historical context on how these methods evolved, so
> I'd
> > > > > appreciate insight from anyone who has worked on the java codebase
> > for
> > > a
> > > > > longer period of time.  The current situation seems less then
> ideal.
> > > > >
> > > > > On Tue, Aug 4, 2020 at 12:55 AM Ji Liu <tianc...@apache.org>
> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > >
> > > > > > When I worked on ARROW-7539[1], I met some problems and not sure
> > > what's
> > > > > the
> > > > > > proper way to solve it.
> > > > > >
> > > > > >
> > > > > > This issue was about to avoid set reader/writer indices in
> > > > > > FieldVector#getFieldBuffers according to the following reasons:
> > > > > >
> > > > > > i. getBuffers set reader/writer indices and it's right for the
> > > purpose
> > > > of
> > > > > > sending the data over the wire
> > > > > >
> > > > > > ii. getFieldBuffers is distinct from getBuffers, it should be for
> > > > getting
> > > > > > access to underlying data for higher-performance algorithms
> > > > > >
> > > > > >
> > > > > > Currently in VectorUnloader, we used getFieldBuffers to create
> > > > > > ArrowRecordBatch that's why we keep writer/reader indices in
> > > > > > getFieldBuffers
> > > > > > , we should use getBuffers instead. But during the change, we
> found
> > > > > another
> > > > > > problem:
> > > > > >
> > > > > > The order of validity and offset buffers are not in the same
> order
> > in
> > > > > > ListVector(getBuffers's order is right), changing the API in
> > > > > VectorUnloader
> > > > > > creates problems with serialization/deserialization resulting in
> > test
> > > > > > failures in Dremio which would break backward compatibility with
> > > > existing
> > > > > > serialised files.
> > > > > >
> > > > > >
> > > > > > Micah gives a solution but seems doesn't reach consistent in the
> PR
> > > > > > thread[2
> > > > > > ]:
> > > > > >
> > > > > >    1. Remove setReaderWriterIndeces in getFieldBuffers
> > > > > >    2. Deprecate getBuffers
> > > > > >    3. Introduce a new getIpcBuffers which is unambiguously used
> for
> > > > > writing
> > > > > >    record batches (i.e. in VectorUnloader).
> > > > > >    4. Update documentation where it makes sense based on all this
> > > > > >    conversation.
> > > > > >
> > > > > >
> > > > > > More details and discussions can be seen from the PR and hope to
> > get
> > > > more
> > > > > > feedback.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Ji Liu
> > > > > >
> > > > > >
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/ARROW-7539
> > > > > >
> > > > > > [2] https://github.com/apache/arrow/pull/6156
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to