Thanks Wes, do you have any comment on the following from the ZDNet story I
linked?

"But the missing piece is streaming, where the velocity of incoming data
poses a special challenge. There are some early experiments to populate
Arrow nodes in microbatches from Kafka. And, as the edge gets smarter
(especially as machine learning is applied), it will also make sense for
Arrow to emerge in a small footprint version, and with it, harvesting some
of the work around transport for feeding filtered or aggregated data up to
the cloud."

Specifically, do you view Arrow as a data structure that bridges the batch
and event processing worlds?

I am concerned that if the distinction between size and capacity lives only
in side-car data, someone could rather easily change the Arrow internals
spec in the future such that incremental population (with pre-allocation)
is no longer possible.  By encoding this distinction in RecordBatch we are
saying to the future: “Don’t assume this won’t be incrementally populated!
Don’t assume this hasn’t over-allocated something because the actual data
did not match the expected data!”
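To make the distinction concrete, here is a stdlib-only Python sketch of
what I mean by size versus capacity in a pre-allocated batch. This is purely
conceptual; none of these names are Arrow APIs, and a plain list stands in
for Arrow's buffers:

```python
# Conceptual sketch of a pre-allocated "RecordBatch" that distinguishes
# capacity ("theoretical max length") from size ("utilized length").
# NOT the Arrow API; all names here are hypothetical.

class PreallocatedBatch:
    def __init__(self, capacity):
        self.capacity = capacity          # pre-allocated row count
        self.size = 0                     # rows populated so far
        self.values = [None] * capacity   # pre-allocated storage

    def append(self, value):
        if self.size >= self.capacity:
            raise MemoryError("batch full; allocate a new batch")
        self.values[self.size] = value
        self.size += 1                    # publish the new row last

    def view(self):
        # A reader restricts itself to the utilized prefix, which is
        # immutable once published, so this view never changes under it.
        return self.values[:self.size]

batch = PreallocatedBatch(capacity=100)
batch.append(1.5)
batch.append(2.5)
assert batch.view() == [1.5, 2.5]
assert (batch.capacity, batch.size) == (100, 2)
```

A reader that honors `size` is effectively performing the "Slice" operation,
but the distinction travels with the batch rather than in side-car metadata.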

-John

On Mon, May 13, 2019 at 8:07 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hi John,
>
> Sorry, there's a number of fairly long e-mails in this thread; I'm
> having a hard time following all of the details.
>
> I suspect the most parsimonious thing would be to have some "sidecar"
> metadata that tracks the state of your writes into pre-allocated Arrow
> blocks so that readers know to call "Slice" on the blocks to obtain
> only the written-so-far portion. I'm not likely to be in favor of
> making changes to the binary protocol for this use case; if others
> have opinions I'll let them speak for themselves.
>
> - Wes
>
> On Mon, May 13, 2019 at 7:50 AM John Muehlhausen <j...@jgm.org> wrote:
> >
> > Any thoughts on a RecordBatch distinguishing size from capacity? (To
> borrow
> > std::vector terminology)
> >
> > Thanks,
> > John
> >
> > On Thu, May 9, 2019 at 2:46 PM John Muehlhausen <j...@jgm.org> wrote:
> >
> > > Wes et al, I think my core proposal is that Message.fbs:RecordBatch
> split
> > > the "length" parameter into "theoretical max length" and "utilized
> length"
> > > (perhaps not those exact names).
> > >
> > > "theoretical max length" is the same as "length" now ... /// ...The arrays in the batch should all have this
> > >
> > > "utilized length" is the number of rows (starting from the first one)
> > > that actually contain interesting data... the rest do not.
> > >
> > > The reason we can have a RecordBatch where these numbers are not the
> same
> > > is that the RecordBatch space was preallocated (for performance
> reasons)
> > > and the number of rows that actually "fit" depends on how correct the
> > > preallocation was.  In any case, it gives the user control of this
> > > space/time tradeoff... wasted space in order to save time in record
> batch
> > > construction.  The fact that some space will usually be wasted when
> there
> > > are variable-length columns (barring extreme luck) with this batch
> > > construction paradigm explains the word "theoretical" above.  This also
> > > gives us the ability to look at a partially constructed batch that is
> still
> > > being constructed, given appropriate user-supplied concurrency control.
> > >
> > > I am not an expert in all of the Arrow variable-length data types, but
> I
> > > think this works if they are all similar to variable-length strings
> where
> > > we advance through "blob storage" by setting the indexes into that
> storage
> > > for the current and next row in order to indicate that we have
> > > incrementally consumed more blob storage.  (Conceptually this storage
> is
> > > "unallocated" after the pre-allocation and before rows are populated.)
> > >
> > > At a high level I am seeking to shore up the format for event ingress
> into
> > > real-time analytics that have some look-back window.  If I'm not
> mistaken I
> > > think this is the subject of the last multi-sentence paragraph here?:
> > > https://zd.net/2H0LlBY
> > >
> > > Currently we have a less-efficient paradigm where "microbatches" (e.g.
> of
> > > length 1 for minimal latency) have to spin the CPU periodically in
> order to
> > > be combined into buffers where we get the columnar layout benefit.
> With
> > > pre-allocation we can deal with microbatches (a partially populated
> larger
> > > RecordBatch) and immediately have the columnar layout benefits for the
> > > populated section with no additional computation.
> > >
> > > For example, consider an event processing system that calculates a
> "moving
> > > average" as events roll in.  While this is somewhat contrived, let's
> assume
> > > that the moving average window is 1000 periods and our pre-allocation
> > > ("theoretical max length") of RecordBatch elements is 100.  The
> algorithm
> > > would be something like this, for a list of RecordBatch buffers in
> memory:
> > >
> > > initialization():
> > >   set up configuration of expected variable length storage
> requirements,
> > > e.g. the template RecordBatch mentioned below
> > >
> > > onIncomingEvent(event):
> > >   obtain lock /// cf. swoopIn() below
> > >   if the last RecordBatch's utilized length has reached its theoretical max length, or the variable-length components of "event" will not fit in the remaining blob storage:
> > >     create a new RecordBatch pre-allocation of max utilized length 100
> and
> > > with blob preallocation that is max(expected, event .. in case the
> single
> > > event is larger than the expectation for 100 events)
> > >        (note: in the expected case this can be very fast as it is a
> > > malloc() and a memcpy() from a template!)
> > >     set current RecordBatch to this newly created one
> > >   add event to current RecordBatch (for the non-calculated fields)
> > >   increment utilized length of current RecordBatch
> > >   calculate the calculated fields (in this case, moving average) by
> > > looking back at previous rows in this and previous RecordBatch objects
> > >   free() any RecordBatch objects that are now before the lookback
> window
> > >
> > > swoopIn(): /// somebody wants to chart the lookback window
> > >   obtain lock
> > >   visit all of the relevant data in the RecordBatches to construct the
> > > chart /// notice that the last RecordBatch may not yet be "as full as
> > > possible"
> > >
> > > The above analysis (minus the free()) could apply to the IPC file
> format
> > > and the lock could be a file lock and the swoopIn() could be a separate
> > > process.  In the case of the file format, while the file is locked, a
> new
> > > RecordBatch would overwrite the previous file Footer and a new Footer
> would
> > > be written.  In order to be able to delete or archive old data multiple
> > > files could be strung together in a logical series.
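The onIncomingEvent()/swoopIn() pseudocode above can be sketched in
stdlib-only Python. This is purely conceptual: a plain list stands in for a
pre-allocated RecordBatch, the blob-storage fit check is omitted, and the
moving-average calculation itself is left out to keep the batch-management
pattern visible:

```python
# Sketch of the pre-allocated batch-chain pattern described above.
# Conceptual only; real Arrow pre-allocation would involve buffers
# and a template RecordBatch.
import threading
from collections import deque

BATCH_CAPACITY = 100   # "theoretical max length" per batch
LOOKBACK = 1000        # moving-average window in rows

lock = threading.Lock()
batches = deque()      # oldest ... newest pre-allocated "batches"

def total_rows():
    return sum(len(b) for b in batches)

def on_incoming_event(value):
    with lock:
        if not batches or len(batches[-1]) >= BATCH_CAPACITY:
            batches.append([])        # "malloc + memcpy from template"
        batches[-1].append(value)     # increment utilized length
        # free() batches that fall entirely before the lookback window
        while total_rows() - len(batches[0]) >= LOOKBACK:
            batches.popleft()

def swoop_in():
    # a reader charts the lookback window under the same lock;
    # the last batch may not yet be "as full as possible"
    with lock:
        return [v for b in batches for v in b][-LOOKBACK:]

for i in range(250):
    on_incoming_event(float(i))
window = swoop_in()
assert window[-1] == 249.0
assert len(batches) == 3   # 250 events in capacity-100 batches: 100+100+50
```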
> > >
> > > -John
> > >
> > > On Tue, May 7, 2019 at 2:39 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> > >
> > >> On Tue, May 7, 2019 at 12:26 PM John Muehlhausen <j...@jgm.org> wrote:
> > >> >
> > >> > Wes, are we saying that `pa.ipc.open_file(...).read_pandas()`
> already
> > >> reads
> > >> > the future Feather format? If not, how will the future format
> differ?  I
> > >> > will work on my access pattern with this format instead of the
> current
> > >> > feather format.  Sorry I was not clear on that earlier.
> > >> >
> > >>
> > >> Yes, under the hood those will use the same zero-copy binary protocol
> > >> code paths to read the file.
> > >>
> > >> > Micah, thank you!
> > >> >
> > >> > On Tue, May 7, 2019 at 11:44 AM Micah Kornfield <
> emkornfi...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Hi John,
> > >> > > To give a specific pointer [1] describes how the streaming
> protocol is
> > >> > > stored to a file.
> > >> > >
> > >> > > [1] https://arrow.apache.org/docs/format/IPC.html#file-format
> > >> > >
> > >> > > On Tue, May 7, 2019 at 9:40 AM Wes McKinney <wesmck...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > > > hi John,
> > >> > > >
> > >> > > > As soon as the R folks can install the Arrow R package
> consistently,
> > >> > > > the intent is to replace the Feather internals with the plain
> Arrow
> > >> > > > IPC protocol where we have much better platform support all
> around.
> > >> > > >
> > >> > > > If you'd like to experiment with creating an API for
> pre-allocating
> > >> > > > fixed-size Arrow protocol blocks and then mutating the data and
> > >> > > > metadata on disk in-place, please be our guest. We don't have
> the
> > >> > > > tools developed yet to do this for you
> > >> > > >
> > >> > > > - Wes
> > >> > > >
> > >> > > > On Tue, May 7, 2019 at 11:25 AM John Muehlhausen <j...@jgm.org>
> > >> wrote:
> > >> > > > >
> > >> > > > > Thanks Wes:
> > >> > > > >
> > >> > > > > "the current Feather format is deprecated" ... yes, but there
> > >> will be a
> > >> > > > > future file format that replaces it, correct?  And my
> discussion
> > >> of
> > >> > > > > immutable "portions" of Arrow buffers, rather than
> immutability
> > >> of the
> > >> > > > > entire buffer, applies to IPC as well, right?  I am only
> > >> championing
> > >> > > the
> > >> > > > > idea that this future file format have the convenience that
> > >> certain
> > >> > > > > preallocated rows can be ignored based on a metadata setting.
> > >> > > > >
> > >> > > > > "I recommend batching your writes" ... this is extremely
> > >> inefficient
> > >> > > and
> > >> > > > > adds unacceptable latency, relative to the proposed
> solution.  Do
> > >> you
> > >> > > > > disagree?  Either I have a batch length of 1 to minimize
> latency,
> > >> which
> > >> > > > > eliminates columnar advantages on the read side, or else I add
> > >> latency.
> > >> > > > > Neither works, and it seems that a viable alternative is
> within
> > >> sight?
> > >> > > > >
> > >> > > > > If you don't agree that there is a core issue and opportunity
> > >> here, I'm
> > >> > > > not
> > >> > > > > sure how to better make my case....
> > >> > > > >
> > >> > > > > -John
> > >> > > > >
> > >> > > > > On Tue, May 7, 2019 at 11:02 AM Wes McKinney <
> wesmck...@gmail.com
> > >> >
> > >> > > > wrote:
> > >> > > > >
> > >> > > > > > hi John,
> > >> > > > > >
> > >> > > > > > On Tue, May 7, 2019 at 10:53 AM John Muehlhausen <
> j...@jgm.org>
> > >> > > wrote:
> > >> > > > > > >
> > >> > > > > > > Wes et al, I completed a preliminary study of populating a
> > >> Feather
> > >> > > > file
> > >> > > > > > > incrementally.  Some notes and questions:
> > >> > > > > > >
> > >> > > > > > > I wrote the following dataframe to a feather file:
> > >> > > > > > >             a    b
> > >> > > > > > > 0  0123456789  0.0
> > >> > > > > > > 1  0123456789  NaN
> > >> > > > > > > 2  0123456789  NaN
> > >> > > > > > > 3  0123456789  NaN
> > >> > > > > > > 4        None  NaN
> > >> > > > > > >
> > >> > > > > > > In re-writing the flatbuffers metadata (flatc -p doesn't
> > >> > > > > > > support --gen-mutable! yuck! C++ to the rescue...), it
> seems
> > >> that
> > >> > > > > > > read_feather is not affected by NumRows?  It seems to be
> > >> driven
> > >> > > > entirely
> > >> > > > > > by
> > >> > > > > > > the per-column Length values?
> > >> > > > > > >
> > >> > > > > > > Also, it seems as if one of the primary usages of
> NullCount
> > >> is to
> > >> > > > > > determine
> > >> > > > > > > whether or not a bitfield is present?  In the
> initialization
> > >> data
> > >> > > > above I
> > >> > > > > > > was careful to have a null value in each column in order
> to
> > >> > > generate
> > >> > > > a
> > >> > > > > > > bitfield.
> > >> > > > > >
> > >> > > > > > Per my prior e-mails, the current Feather format is
> deprecated,
> > >> so
> > >> > > I'm
> > >> > > > > > only willing to engage on a discussion of the official Arrow
> > >> binary
> > >> > > > > > protocol that we use for IPC (memory mapping) and RPC
> (Flight).
> > >> > > > > >
> > >> > > > > > >
> > >> > > > > > > I then wiped the bitfields in the file and set all of the
> > >> string
> > >> > > > indices
> > >> > > > > > to
> > >> > > > > > > one past the end of the blob buffer (all strings empty):
> > >> > > > > > >       a   b
> > >> > > > > > > 0  None NaN
> > >> > > > > > > 1  None NaN
> > >> > > > > > > 2  None NaN
> > >> > > > > > > 3  None NaN
> > >> > > > > > > 4  None NaN
> > >> > > > > > >
> > >> > > > > > > I then set the first record to some data by consuming
> some of
> > >> the
> > >> > > > string
> > >> > > > > > > blob and row 0 and 1 indices, also setting the double:
> > >> > > > > > >
> > >> > > > > > >                a    b
> > >> > > > > > > 0  Hello, world!  5.0
> > >> > > > > > > 1           None  NaN
> > >> > > > > > > 2           None  NaN
> > >> > > > > > > 3           None  NaN
> > >> > > > > > > 4           None  NaN
> > >> > > > > > >
> > >> > > > > > > As mentioned above, NumRows seems to be ignored.  I tried
> > >> adjusting
> > >> > > > each
> > >> > > > > > > column Length to mask off higher rows and it worked for 4
> > >> (hide
> > >> > > last
> > >> > > > row)
> > >> > > > > > > but not for less than 4.  I take this to be due to math
> used
> > >> to
> > >> > > > figure
> > >> > > > > > out
> > >> > > > > > > where the buffers are relative to one another since there
> is
> > >> only
> > >> > > one
> > >> > > > > > > metadata offset for all of: the (optional) bitset, index
> > >> column and
> > >> > > > (if
> > >> > > > > > > string) blobs.
> > >> > > > > > >
> > >> > > > > > > Populating subsequent rows would proceed in a similar way
> > >> until all
> > >> > > > of
> > >> > > > > > the
> > >> > > > > > > blob storage has been consumed, which may come before the
> > >> > > > pre-allocated
> > >> > > > > > > rows have been consumed.
> > >> > > > > > >
> > >> > > > > > > So what does this mean for my desire to incrementally
> write
> > >> these
> > >> > > > > > > (potentially memory-mapped) pre-allocated files and/or
> Arrow
> > >> > > buffers
> > >> > > > in
> > >> > > > > > > memory?  Some thoughts:
> > >> > > > > > >
> > >> > > > > > > - If a single value (such as NumRows) were consulted to
> > >> determine
> > >> > > the
> > >> > > > > > table
> > >> > > > > > > row dimension then updating this single value would serve
> to
> > >> tell a
> > >> > > > > > reader
> > >> > > > > > > which rows are relevant.  Assuming this value is updated
> > >> after all
> > >> > > > other
> > >> > > > > > > mutations are complete, and assuming that mutations are
> only
> > >> > > appends
> > >> > > > > > > (addition of rows), then concurrency control involves only
> > >> ensuring
> > >> > > > that
> > >> > > > > > > this value is not examined while it is being written.
> > >> > > > > > >
> > >> > > > > > > - NullCount presents a concurrency problem if someone
> reads
> > >> the
> > >> > > file
> > >> > > > > > after
> > >> > > > > > > this field has been updated, but before NumRows has
> exposed
> > >> the new
> > >> > > > > > record
> > >> > > > > > > (or vice versa).  The idea previously mentioned that there
> > >> will
> > >> > > > "likely
> > >> > > > > > > [be] more statistics in the future" feels like it might be
> > >> scope
> > >> > > > creep to
> > >> > > > > > > me?  This is a data representation, not a calculation
> > >> framework?
> > >> > > If
> > >> > > > > > > NullCount had its genesis in the optional nature of the
> > >> bitfield, I
> > >> > > > would
> > >> > > > > > > suggest that perhaps NullCount can be dropped in favor of
> > >> always
> > >> > > > > > supplying
> > >> > > > > > > the bitfield, which in any event is already contemplated
> by
> > >> the
> > >> > > spec:
> > >> > > > > > > "Implementations may choose to always allocate one anyway
> as a
> > >> > > > matter of
> > >> > > > > > > convenience."  If the concern is space savings, Arrow is
> > >> already an
> > >> > > > > > > extremely uncompressed format.  (Compression is something
> I
> > >> would
> > >> > > > also
> > >> > > > > > > consider to be scope creep as regards Feather...
> compressed
> > >> > > > filesystems
> > >> > > > > > can
> > >> > > > > > > be employed and there are other compressed dataframe
> formats.)
> > >> > > > However,
> > >> > > > > > if
> > >> > > > > > > protecting the 4 bytes required to update NumRows turns
> out
> > >> to be
> > >> > > no
> > >> > > > > > easier
> > >> > > > > > > than protecting all of the statistical bytes as well as
> part
> > >> of the
> > >> > > > same
> > >> > > > > > > "critical section" (locks: yuck!!) then statistics pose no
> > >> issue.
> > >> > > I
> > >> > > > > > have a
> > >> > > > > > > feeling that the availability of an atomic write of 4
> bytes
> > >> will
> > >> > > > depend
> > >> > > > > > on
> > >> > > > > > > the storage mechanism... memory vs memory map vs write()
> etc.
> > >> > > > > > >
> > >> > > > > > > - The elephant in the room appears to be the presumptive
> > >> binary
> > >> > > > yes/no on
> > >> > > > > > > mutability of Arrow buffers.  Perhaps the thought is that
> > >> certain
> > >> > > > batch
> > >> > > > > > > processes will be wrecked if anyone anywhere is mutating
> > >> buffers in
> > >> > > > any
> > >> > > > > > > way.  But keep in mind I am not proposing general
> mutability,
> > >> only
> > >> > > > > > > appending of new data.  *A huge amount of batch processing
> > >> that
> > >> > > will
> > >> > > > take
> > >> > > > > > > place with Arrow is on time-series data (whether
> financial or
> > >> > > > otherwise).
> > >> > > > > > > It is only natural that architects will want the minimal
> > >> impedance
> > >> > > > > > mismatch
> > >> > > > > > > when it comes time to grow those time series as the events
> > >> occur
> > >> > > > going
> > >> > > > > > > forward.*  So rather than say that I want "mutable" Arrow
> > >> buffers,
> > >> > > I
> > >> > > > > > would
> > >> > > > > > > pitch this as a call for "immutable populated areas" of
> Arrow
> > >> > > buffers
> > >> > > > > > > combined with the concept that the populated area can
> grow up
> > >> to
> > >> > > > whatever
> > >> > > > > > > was preallocated.  This will not affect anyone who has
> > >> "memoized" a
> > >> > > > > > > dimension and wants to continue to consider the
> then-current
> > >> data
> > >> > > as
> > >> > > > > > > immutable... it will be immutable and will always be
> immutable
> > >> > > > according
> > >> > > > > > to
> > >> > > > > > > that then-current dimension.
> > >> > > > > > >
> > >> > > > > > > Thanks in advance for considering this feedback!  I
> absolutely
> > >> > > > require
> > >> > > > > > > efficient row-wise growth of an Arrow-like buffer to deal
> > >> with time
> > >> > > > > > series
> > >> > > > > > > data in near real time.  I believe that preallocation is
> (by
> > >> far)
> > >> > > the
> > >> > > > > > most
> > >> > > > > > > efficient way to accomplish this.  I hope to be able to
> use
> > >> Arrow!
> > >> > > > If I
> > >> > > > > > > cannot use Arrow then I will be using a home-grown Arrow
> that
> > >> is
> > >> > > > > > identical
> > >> > > > > > > except for this feature, which would be very sad!  Even if
> > >> Arrow
> > >> > > > itself
> > >> > > > > > > could be used in this manner today, I would be hesitant to
> > >> use it
> > >> > > if
> > >> > > > the
> > >> > > > > > > use-case was not protected on a go-forward basis.
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > > I recommend batching your writes and using the Arrow binary
> > >> streaming
> > >> > > > > > protocol so you are only appending to a file rather than
> > >> mutating
> > >> > > > > > previously-written bytes. This use case is well defined and
> > >> supported
> > >> > > > > > in the software already.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > >
> > >> > >
> > >>
> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#streaming-format
> > >> > > > > >
> > >> > > > > > - Wes
> > >> > > > > >
> > >> > > > > > > Of course, I am completely open to alternative ideas and
> > >> > > approaches!
> > >> > > > > > >
> > >> > > > > > > -John
> > >> > > > > > >
> > >> > > > > > > On Mon, May 6, 2019 at 11:39 AM Wes McKinney <
> > >> wesmck...@gmail.com>
> > >> > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > hi John -- again, I would caution you against using
> Feather
> > >> files
> > >> > > > for
> > >> > > > > > > > issues of longevity -- the internal memory layout of
> those
> > >> files
> > >> > > > is a
> > >> > > > > > > > "dead man walking" so to speak.
> > >> > > > > > > >
> > >> > > > > > > > I would advise against forking the project, IMHO that
> is a
> > >> dark
> > >> > > > path
> > >> > > > > > > > that leads nowhere good. We have a large community here
> and
> > >> we
> > >> > > > accept
> > >> > > > > > > > pull requests -- I think the challenge is going to be
> > >> defining
> > >> > > the
> > >> > > > use
> > >> > > > > > > > case to suitable clarity that a general purpose solution
> > >> can be
> > >> > > > > > > > developed.
> > >> > > > > > > >
> > >> > > > > > > > - Wes
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > On Mon, May 6, 2019 at 11:16 AM John Muehlhausen <
> > >> j...@jgm.org>
> > >> > > > wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > François, Wes,
> > >> > > > > > > > >
> > >> > > > > > > > > Thanks for the feedback.  I think the most practical
> > >> thing for
> > >> > > > me to
> > >> > > > > > do
> > >> > > > > > > > is
> > >> > > > > > > > > 1- write a Feather file that is structured to
> > >> pre-allocate the
> > >> > > > space
> > >> > > > > > I
> > >> > > > > > > > need
> > >> > > > > > > > > (e.g. initial variable-length strings are of average
> size)
> > >> > > > > > > > > 2- come up with code to monkey around with the values
> > >> contained
> > >> > > > in
> > >> > > > > > the
> > >> > > > > > > > > vectors so that before and after each manipulation the
> > >> file is
> > >> > > > valid
> > >> > > > > > as I
> > >> > > > > > > > > walk the rows ... this is a writer that uses memory
> > >> mapping
> > >> > > > > > > > > 3- check back in here once that works, assuming that
> it
> > >> does,
> > >> > > to
> > >> > > > see
> > >> > > > > > if
> > >> > > > > > > > we
> > >> > > > > > > > > can bless certain mutation paths
> > >> > > > > > > > > 4- if we can't bless certain mutation paths, fork the
> > >> project
> > >> > > for
> > >> > > > > > those
> > >> > > > > > > > who
> > >> > > > > > > > > care more about stream processing?  That would not
> seem
> > >> to be
> > >> > > > ideal
> > >> > > > > > as I
> > >> > > > > > > > > think mutation in row-order across the data set is
> > >> relatively
> > >> > > low
> > >> > > > > > impact
> > >> > > > > > > > on
> > >> > > > > > > > > the overall design?
> > >> > > > > > > > >
> > >> > > > > > > > > Thanks again for engaging the topic!
> > >> > > > > > > > > -John
> > >> > > > > > > > >
> > >> > > > > > > > > On Mon, May 6, 2019 at 10:26 AM Francois
> Saint-Jacques <
> > >> > > > > > > > > fsaintjacq...@gmail.com> wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > > Hello John,
> > >> > > > > > > > > >
> > >> > > > > > > > > > Arrow is not yet suited for partial writes. The
> > >> specification
> > >> > > > only
> > >> > > > > > > > > > talks about fully frozen/immutable objects, you're
> in
> > >> > > > > > implementation
> > >> > > > > > > > > > defined territory here. For example, the C++ library
> > >> assumes
> > >> > > > the
> > >> > > > > > Array
> > >> > > > > > object is immutable; it memoizes the null count, and
> > >> likely
> > >> > > more
> > >> > > > > > > > > > statistics in the future.
> > >> > > > > > > > > >
> > >> > > > > > > > > > If you want to use pre-allocated buffers and array,
> you
> > >> can
> > >> > > > use the
> > >> > > > > > > > > > column validity bitmap for this purpose, e.g. set
> all
> > >> null by
> > >> > > > > > default
> > >> > > > > > > > > > and flip once the row is written. It suffers from
> > >> concurrency
> > >> > > > > > issues
> > >> > > > > > > > > > (+ invalidation issues as pointed) when dealing with
> > >> multiple
> > >> > > > > > columns.
> > >> > > > > > > > > > You'll have to use a barrier of some kind, e.g. a
> > >> per-batch
> > >> > > > global
> > >> > > > > > > > > > atomic (if append-only), or dedicated column(s) à-la
> > >> MVCC.
> > >> > > But
> > >> > > > > > then,
> > >> > > > > > > > > > the reader needs to be aware of this and compute a
> mask
> > >> each
> > >> > > > time
> > >> > > > > > it
> > >> > > > > > > > > > needs to query the partial batch.
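The set-all-null / flip-on-write idea described above can be sketched with a
plain bytearray standing in for an Arrow validity bitmap (Arrow bitmaps are
LSB bit-ordered). Conceptual only; not the pyarrow API:

```python
# All validity bits start at 0 (every row null); writing a row fills the
# pre-allocated value buffer first, then flips the bit to publish the row.

capacity = 16
validity = bytearray((capacity + 7) // 8)   # all bits 0 => all rows null
values = [0.0] * capacity                    # pre-allocated value buffer

def write_row(i, v):
    values[i] = v
    validity[i // 8] |= 1 << (i % 8)         # flip bit last: row now visible

def is_valid(i):
    return bool(validity[i // 8] & (1 << (i % 8)))

write_row(0, 3.14)
assert is_valid(0) and not is_valid(1)
assert values[0] == 3.14
```

As noted above, this only publishes one column at a time; multi-column rows
still need a barrier or MVCC-style column to appear atomically.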
> > >> > > > > > > > > >
> > >> > > > > > > > > > This is a common columnar database problem, see [1]
> for
> > >> a
> > >> > > > recent
> > >> > > > > > paper
> > >> > > > > > > > > > on the subject. The usual technique is to store the
> > >> recent
> > >> > > data
> > >> > > > > > > > > > row-wise, and transform it in column-wise when a
> > >> threshold is
> > >> > > > met
> > >> > > > > > akin
> > >> > > > > > > > > > to a compaction phase. There was a somewhat related
> > >> thread
> > >> > > [2]
> > >> > > > > > lately
> > >> > > > > > > > > > about streaming vs batching. In the end, I think
> your
> > >> > > solution
> > >> > > > > > will be
> > >> > > > > > > > > > very application specific.
> > >> > > > > > > > > >
> > >> > > > > > > > > > François
> > >> > > > > > > > > >
> > >> > > > > > > > > > [1]
> > >> > > https://db.in.tum.de/downloads/publications/datablocks.pdf
> > >> > > > > > > > > > [2]
> > >> > > > > > > > > >
> > >> > > > > > > >
> > >> > > > > >
> > >> > > >
> > >> > >
> > >>
> https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E
> > >> > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > > > On Mon, May 6, 2019 at 10:39 AM John Muehlhausen <
> > >> > > j...@jgm.org>
> > >> > > > > > wrote:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Wes,
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > I’m not afraid of writing my own C++ code to deal
> > >> with all
> > >> > > of
> > >> > > > > > this
> > >> > > > > > > > on the
> > >> > > > > > > > > > > writer side.  I just need a way to “append”
> > >> (incrementally
> > >> > > > > > populate)
> > >> > > > > > > > e.g.
> > >> > > > > > > > > > > feather files so that a person using e.g. pyarrow
> > >> doesn’t
> > >> > > > suffer
> > >> > > > > > some
> > >> > > > > > > > > > > catastrophic failure... and “on the side” I tell
> them
> > >> which
> > >> > > > rows
> > >> > > > > > are
> > >> > > > > > > > junk
> > >> > > > > > > > > > > and deal with any concurrency issues that can’t be
> > >> solved
> > >> > > in
> > >> > > > the
> > >> > > > > > > > arena of
> > >> > > > > > > > > > > atomicity and ordering of ops.  For now I care
> about
> > >> basic
> > >> > > > types
> > >> > > > > > but
> > >> > > > > > > > > > > including variable-width strings.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > For event-processing, I think Arrow has to have
> the
> > >> concept
> > >> > > > of a
> > >> > > > > > > > > > partially
> > >> > > > > > > > > > > full record set.  Some alternatives are:
> > >> > > > > > > > > > > - have a batch size of one, thus littering the
> > >> landscape
> > >> > > with
> > >> > > > > > > > trivially
> > >> > > > > > > > > > > small Arrow buffers
> > >> > > > > > > > > > > - artificially increase latency with a batch size
> > >> larger
> > >> > > than
> > >> > > > > > one,
> > >> > > > > > > > but
> > >> > > > > > > > > > not
> > >> > > > > > > > > > > processing any data until a batch is complete
> > >> > > > > > > > > > > - continuously re-write the (entire!) “main”
> buffer as
> > >> > > > batches of
> > >> > > > > > > > length
> > >> > > > > > > > > > 1
> > >> > > > > > > > > > > roll in
> > >> > > > > > > > > > > - instead of one main buffer, several, and at some
> > >> > > threshold
> > >> > > > > > combine
> > >> > > > > > > > the
> > >> > > > > > > > > > > last N length-1 batches into a length N buffer ...
> > >> still an
> > >> > > > > > > > inefficiency
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Consider the case of QAbstractTableModel as the
> > >> underlying
> > >> > > > data
> > >> > > > > > for a
> > >> > > > > > > > > > table
> > >> > > > > > > > > > > or a chart.  This visualization shows all of the
> data
> > >> for
> > >> > > the
> > >> > > > > > recent
> > >> > > > > > > > past
> > >> > > > > > > > > > > as well as events rolling in.  If this model
> > >> interface is
> > >> > > > > > > > implemented as
> > >> > > > > > > > > > a
> > >> > > > > > > > > > > view onto “many thousands” of individual event
> > >> buffers then
> > >> > > > we
> > >> > > > > > gain
> > >> > > > > > > > > > nothing
> > >> > > > > > > > > > > from columnar layout.  (Suppose there are tons of
> > >> columns
> > >> > > and
> > >> > > > > > most of
> > >> > > > > > > > > > them
> > >> > > > > > > > > > > are scrolled out of the view.). Likewise we cannot
> > >> re-write
> > >> > > > the
> > >> > > > > > > > entire
> > >> > > > > > > > > > > model on each event... time complexity blows up.
> > >> What we
> > >> > > > want
> > >> > > > > > is to
> > >> > > > > > > > > > have a
> > >> > > > > > > > > > > large pre-allocated chunk and just change
> rowCount()
> > >> as
> > >> > > data
> > >> > > > is
> > >> > > > > > > > > > “appended.”
> > >> > > > > > > > > > >  Sure, we may run out of space and have another
> and
> > >> another
> > >> > > > > > chunk for
> > >> > > > > > > > > > > future row ranges, but a handful of chunks chained
> > >> together
> > >> > > > is
> > >> > > > > > better
> > >> > > > > > > > > > than
> > >> > > > > > > > > > > as many chunks as there were events!
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > And again, having a batch size >1 and delaying the
> > >> data
> > >> > > > until a
> > >> > > > > > > > batch is
> > >> > > > > > > > > > > full is a non-starter.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > I am really hoping to see partially-filled
> buffers as
> > >> > > > something
> > >> > > > > > we
> > >> > > > > > > > keep
> > >> > > > > > > > > > our
> > >> > > > > > > > > > > finger on moving forward!  Or else, what am I
> missing?
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > -John
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > On Mon, May 6, 2019 at 8:24 AM Wes McKinney <
> > >> > > > wesmck...@gmail.com
> > >> > > > > > >
> > >> > > > > > > > wrote:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > > hi John,
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > In C++ the builder classes don't yet support
> > >> writing into
> > >> > > > > > > > preallocated
> > >> > > > > > > > > > > > memory. It would be tricky for applications to
> > >> determine
> > >> > > a
> > >> > > > > > priori
> > >> > > > > > > > > > > > which segments of memory to pass to the
> builder. It
> > >> seems
> > >> > > > only
> > >> > > > > > > > > > > > feasible for primitive / fixed-size types so my
> > >> guess
> > >> > > > would be
> > >> > > > > > > > that a
> > >> > > > > > > > > > > > separate set of interfaces would need to be
> > >> developed for
> > >> > > > this
> > >> > > > > > > > task.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > - Wes
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <
> > >> > > > > > jacq...@apache.org>
> > >> > > > > > > > > > wrote:
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > This is more of a question of implementation
> > >> > > > > > > > > > > > > versus specification. An arrow buffer is
> > >> > > > > > > > > > > > > generally built and then sealed. In different
> > >> > > > > > > > > > > > > languages, this building process works
> > >> > > > > > > > > > > > > differently (a concern of the language rather
> > >> > > > > > > > > > > > > than the memory specification). We don't
> > >> > > > > > > > > > > > > currently allow a half built vector to be moved
> > >> > > > > > > > > > > > > to another language and then be further built.
> > >> > > > > > > > > > > > > So the question is really more concrete: what
> > >> > > > > > > > > > > > > language are you looking at and what is the
> > >> > > > > > > > > > > > > specific pattern you're trying to undertake for
> > >> > > > > > > > > > > > > building.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > If you're trying to go across independent
> > >> > > > > > > > > > > > > processes (whether the same process restarted
> > >> > > > > > > > > > > > > or two separate processes active
> > >> > > > > > > > > > > > > simultaneously) you'll need to build up your
> > >> > > > > > > > > > > > > own data structures to help with this.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen
> > >> > > > > > > > > > > > > <j...@jgm.org> wrote:
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Hello,
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Glad to learn of this project— good work!
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > If I allocate a single chunk of memory and
> > >> > > > > > > > > > > > > > start building Arrow format within it, does
> > >> > > > > > > > > > > > > > this chunk save any state regarding my
> > >> > > > > > > > > > > > > > progress?
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > For example, suppose I allocate a column for
> > >> > > > > > > > > > > > > > floating point (fixed width) and a column
> > >> > > > > > > > > > > > > > for string (variable width). Suppose I start
> > >> > > > > > > > > > > > > > building the floating point column at offset
> > >> > > > > > > > > > > > > > X into my single buffer, and the string
> > >> > > > > > > > > > > > > > “pointer” column at offset Y into the same
> > >> > > > > > > > > > > > > > single buffer, and the string data elements
> > >> > > > > > > > > > > > > > at offset Z.
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > I write one floating point number and one
> > >> > > > > > > > > > > > > > string, then go away. When I come back to
> > >> > > > > > > > > > > > > > this buffer to append another value, does
> > >> > > > > > > > > > > > > > the buffer itself know where I would begin?
> > >> > > > > > > > > > > > > > I.e. is there a differentiation in the
> > >> > > > > > > > > > > > > > column (or blob) data itself between the
> > >> > > > > > > > > > > > > > available space and the used space?
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Suppose I write a lot of large variable
> > >> > > > > > > > > > > > > > width strings and “run out” of space for
> > >> > > > > > > > > > > > > > them before running out of space for
> > >> > > > > > > > > > > > > > floating point numbers or string pointers.
> > >> > > > > > > > > > > > > > (I guessed badly when doing the original
> > >> > > > > > > > > > > > > > allocation.) I consider this to be Ok since
> > >> > > > > > > > > > > > > > I can always “copy” the data to “compress
> > >> > > > > > > > > > > > > > out” the unused fp/pointer buckets... the
> > >> > > > > > > > > > > > > > choice is up to me.
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > The above applied to a (feather?) file is
> > >> > > > > > > > > > > > > > how I anticipate appending data to disk...
> > >> > > > > > > > > > > > > > pre-allocate a mem-mapped file and gradually
> > >> > > > > > > > > > > > > > fill it up. The efficiency of file
> > >> > > > > > > > > > > > > > utilization will depend on my projections
> > >> > > > > > > > > > > > > > regarding variable-width data types, but as
> > >> > > > > > > > > > > > > > I said above, I can always re-write the file
> > >> > > > > > > > > > > > > > if/when this bothers me.
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Is this the recommended and supported
> > >> > > > > > > > > > > > > > approach for incremental appends? I’m really
> > >> > > > > > > > > > > > > > hoping to use Arrow instead of rolling my
> > >> > > > > > > > > > > > > > own, but functionality like this is
> > >> > > > > > > > > > > > > > absolutely key! Hoping not to use a side-car
> > >> > > > > > > > > > > > > > file (or memory chunk) to store “append
> > >> > > > > > > > > > > > > > progress” information.
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > I am brand new to this project so please
> > >> > > > > > > > > > > > > > forgive me if I have overlooked something
> > >> > > > > > > > > > > > > > obvious. And again, looks like great work so
> > >> > > > > > > > > > > > > > far!
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Thanks!
> > >> > > > > > > > > > > > > > -John
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > >
> > >> > > > > >
> > >> > > >
> > >> > >
> > >>
> > >
>
>
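[Editor's note: the single-buffer layout John describes above can be sketched concretely. The snippet below is a hypothetical pure-Python illustration, not Arrow code: the buffer size and the offsets X, Y, Z are arbitrary placeholders, and the `length` counter is precisely the side-car "append progress" state under discussion, since nothing in the buffer bytes themselves records how far the writer got. The offsets column follows the Arrow convention that `offset[i]..offset[i+1]` delimits string `i`.]

```python
import struct

# One preallocated buffer holding a float64 column at X, an int32 string
# offsets column at Y, and the string bytes at Z. All positions/sizes are
# illustrative placeholders, not anything specified by Arrow.
BUF = bytearray(4096)
X, Y, Z = 0, 1024, 2048       # floats, offsets, string data regions
Z_END = 4096
length = 0                    # rows written so far ("size", not capacity)
used_str = 0                  # string bytes consumed so far

def append(fp_value, s):
    """Append one (float, string) row; fail if string space runs out."""
    global length, used_str
    data = s.encode("utf-8")
    if Z + used_str + len(data) > Z_END:
        # John's "guessed badly" case: compact or re-write the file.
        raise MemoryError("out of string space")
    struct.pack_into("<d", BUF, X + 8 * length, fp_value)
    if length == 0:
        struct.pack_into("<i", BUF, Y, 0)   # initial offset
    BUF[Z + used_str: Z + used_str + len(data)] = data
    used_str += len(data)
    struct.pack_into("<i", BUF, Y + 4 * (length + 1), used_str)
    length += 1

def row(i):
    """Read back row i using only the buffer contents plus `length`."""
    fp, = struct.unpack_from("<d", BUF, X + 8 * i)
    lo, = struct.unpack_from("<i", BUF, Y + 4 * i)
    hi, = struct.unpack_from("<i", BUF, Y + 4 * (i + 1))
    return fp, BUF[Z + lo: Z + hi].decode("utf-8")

append(1.5, "hello")
append(2.5, "world")
```

A reader that receives only `BUF` cannot distinguish the two written rows from the zeroed remainder, which is why the thread converges on side-car metadata plus a Slice of the written-so-far portion.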
