Re: Stored state of incremental writes to fixed size Arrow buffer?

Wes McKinney Mon, 13 May 2019 06:07:42 -0700

hi John,

Sorry, there are a number of fairly long e-mails in this thread; I'm
having a hard time following all of the details.


I suspect the most parsimonious thing would be to have some "sidecar"
metadata that tracks the state of your writes into pre-allocated Arrow
blocks so that readers know to call "Slice" on the blocks to obtain
only the written-so-far portion. I'm not likely to be in favor of
making changes to the binary protocol for this use case; if others
have opinions I'll let them speak for themselves.

- Wes
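
A minimal sketch of the sidecar idea described above, in plain Python rather than Arrow itself. The block is allocated once at full capacity, a counter kept outside the data records how many rows have been written, and readers take a view of only that many rows. The names `PreallocatedColumn`, `rows_written` and `written` are hypothetical; with pyarrow, the reader-side step would correspond to calling `slice(0, n)` on the record batch.

```python
from array import array


class PreallocatedColumn:
    """Fixed-capacity float64 column plus sidecar state (hypothetical helper)."""

    def __init__(self, capacity):
        # Allocate the full block up front; unwritten slots hold 0.0.
        self.values = array("d", [0.0] * capacity)
        # Sidecar metadata: tracked outside the data block itself.
        self.rows_written = 0

    def append(self, value):
        if self.rows_written == len(self.values):
            raise BufferError("preallocated block is full")
        self.values[self.rows_written] = value
        # Publish the new row only after its data is in place.
        self.rows_written += 1

    def written(self):
        # Reader side: a zero-copy view of the written-so-far portion.
        return memoryview(self.values)[: self.rows_written]


col = PreallocatedColumn(capacity=100)
for x in (1.5, 2.5, 4.0):
    col.append(x)
visible = col.written()
```

Here `visible` exposes three rows while the underlying block still holds 100 slots. Any concurrency control around `rows_written` is left to the application.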

On Mon, May 13, 2019 at 7:50 AM John Muehlhausen <[email protected]> wrote:
>
> Any thoughts on a RecordBatch distinguishing size from capacity? (To borrow
> std::vector terminology)
>
> Thanks,
> John
>
> On Thu, May 9, 2019 at 2:46 PM John Muehlhausen <[email protected]> wrote:
>
> > Wes et al, I think my core proposal is that Message.fbs:RecordBatch split
> > the "length" parameter into "theoretical max length" and "utilized length"
> > (perhaps not those exact names).
> >
> > "theoretical max length" is the same as "length" now ... /// ...The arrays
> > in the batch should all have this
> >
> > "utilized length" is the number of rows (starting from the first one)
> > that actually contain interesting data... the rest do not.
> >
> > The reason we can have a RecordBatch where these numbers are not the same
> > is that the RecordBatch space was preallocated (for performance reasons)
> > and the number of rows that actually "fit" depends on how correct the
> > preallocation was.  In any case, it gives the user control of this
> > space/time tradeoff... wasted space in order to save time in record batch
> > construction.  The fact that some space will usually be wasted when there
> > are variable-length columns (barring extreme luck) with this batch
> > construction paradigm explains the word "theoretical" above.  This also
> > gives us the ability to look at a partially constructed batch that is still
> > being constructed, given appropriate user-supplied concurrency control.
> >
> > I am not an expert in all of the Arrow variable-length data types, but I
> > think this works if they are all similar to variable-length strings where
> > we advance through "blob storage" by setting the indexes into that storage
> > for the current and next row in order to indicate that we have
> > incrementally consumed more blob storage.  (Conceptually this storage is
> > "unallocated" after the pre-allocation and before rows are populated.)
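
The offset-advancing scheme in the paragraph above can be sketched in plain Python, assuming an Arrow-style string layout of one offsets buffer (one more entry than rows) and one blob buffer. `PreallocatedStrings` and its methods are hypothetical names, not Arrow APIs, and `array("i")` stands in for an int32 buffer (4 bytes on typical platforms).

```python
from array import array


class PreallocatedStrings:
    """String column: integer offsets plus a byte blob (hypothetical helper)."""

    def __init__(self, max_rows, blob_capacity):
        # offsets[i]..offsets[i + 1] delimits the bytes of row i.
        self.offsets = array("i", [0] * (max_rows + 1))
        self.blob = bytearray(blob_capacity)
        self.max_rows = max_rows
        self.length = 0  # utilized length: rows populated so far

    def fits(self, text):
        data = text.encode("utf-8")
        used = self.offsets[self.length]
        return self.length < self.max_rows and used + len(data) <= len(self.blob)

    def append(self, text):
        if not self.fits(text):
            raise BufferError("no room left in preallocated storage")
        data = text.encode("utf-8")
        start = self.offsets[self.length]
        self.blob[start:start + len(data)] = data
        # Setting the next offset is what "consumes" blob storage.
        self.offsets[self.length + 1] = start + len(data)
        self.length += 1

    def get(self, row):
        if not 0 <= row < self.length:
            raise IndexError(row)
        start, end = self.offsets[row], self.offsets[row + 1]
        return bytes(self.blob[start:end]).decode("utf-8")


col = PreallocatedStrings(max_rows=4, blob_capacity=16)
col.append("Hello, world!")  # 13 bytes
col.append("abc")            # 3 bytes; the blob is now exhausted
full = not col.fits("x")
```

In this example the blob fills up after two rows even though four rows were preallocated, which is the "theoretical" part of the max length.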
> >
> > At a high level I am seeking to shore up the format for event ingress into
> > real-time analytics that have some look-back window.  If I'm not mistaken I
> > think this is the subject of the last multi-sentence paragraph here?:
> > https://zd.net/2H0LlBY
> >
> > Currently we have a less-efficient paradigm where "microbatches" (e.g. of
> > length 1 for minimal latency) have to spin the CPU periodically in order to
> > be combined into buffers where we get the columnar layout benefit.  With
> > pre-allocation we can deal with microbatches (a partially populated larger
> > RecordBatch) and immediately have the columnar layout benefits for the
> > populated section with no additional computation.
> >
> > For example, consider an event processing system that calculates a "moving
> > average" as events roll in.  While this is somewhat contrived, let's assume
> > that the moving average window is 1000 periods and our pre-allocation
> > ("theoretical max length") of RecordBatch elements is 100.  The algorithm
> > would be something like this, for a list of RecordBatch buffers in memory:
> >
> > initialization():
> >   set up configuration of expected variable length storage requirements,
> > e.g. the template RecordBatch mentioned below
> >
> > onIncomingEvent(event):
> >   obtain lock /// cf. swoopIn() below
> >   if utilized length of last RecordBatch is not less than its theoretical
> > max length, or variable-length components of "event" will not fit in
> > remaining blob storage:
> >     create a new RecordBatch pre-allocation of theoretical max length 100
> > and with blob preallocation that is max(expected, event) ... in case the
> > single event is larger than the expectation for 100 events
> >        (note: in the expected case this can be very fast as it is a
> > malloc() and a memcpy() from a template!)
> >     set current RecordBatch to this newly created one
> >   add event to current RecordBatch (for the non-calculated fields)
> >   increment utilized length of current RecordBatch
> >   calculate the calculated fields (in this case, moving average) by
> > looking back at previous rows in this and previous RecordBatch objects
> >   free() any RecordBatch objects that are now before the lookback window
> >
> > swoopIn(): /// somebody wants to chart the lookback window
> >   obtain lock
> >   visit all of the relevant data in the RecordBatches to construct the
> > chart /// notice that the last RecordBatch may not yet be "as full as
> > possible"
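
A rough Python rendering of the pseudocode above, under simplifying assumptions: only fixed-width columns are modelled (so the blob-storage check is omitted), a `threading.Lock` stands in for the lock, and dropping a batch from the list plays the role of free(). `Batch`, `lookback`, `on_incoming_event` and `swoop_in` are hypothetical names, and the moving average is recomputed naively for clarity.

```python
from array import array
from threading import Lock

CAPACITY = 100  # "theoretical max length" of each batch
WINDOW = 1000   # moving-average look-back, in rows


class Batch:
    def __init__(self):
        # Preallocate every column at full capacity.
        self.price = array("d", [0.0] * CAPACITY)
        self.moving_avg = array("d", [0.0] * CAPACITY)
        self.utilized = 0  # "utilized length"


batches = [Batch()]
lock = Lock()


def lookback():
    """Last WINDOW populated prices, oldest first. Caller holds the lock."""
    rows = []
    for b in batches:
        rows.extend(b.price[: b.utilized])
    return rows[-WINDOW:]


def on_incoming_event(price):
    with lock:
        current = batches[-1]
        if current.utilized >= CAPACITY:
            # Current batch is full: start a new preallocated one.
            current = Batch()
            batches.append(current)
        i = current.utilized
        current.price[i] = price
        current.utilized += 1
        window = lookback()
        current.moving_avg[i] = sum(window) / len(window)
        # Drop batches that lie entirely before the look-back window.
        while sum(b.utilized for b in batches[1:]) >= WINDOW:
            batches.pop(0)


def swoop_in():
    with lock:
        # The last batch may be only partially populated.
        return lookback()


for k in range(1, 251):
    on_incoming_event(float(k))
```

After these 250 events there are three batches, the last of which is half full, and `swoop_in()` returns all 250 populated rows.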
> >
> > The above analysis (minus the free()) could apply to the IPC file format
> > and the lock could be a file lock and the swoopIn() could be a separate
> > process.  In the case of the file format, while the file is locked, a new
> > RecordBatch would overwrite the previous file Footer and a new Footer would
> > be written.  In order to be able to delete or archive old data multiple
> > files could be strung together in a logical series.
> >
> > -John
> >
> > On Tue, May 7, 2019 at 2:39 PM Wes McKinney <[email protected]> wrote:
> >
> >> On Tue, May 7, 2019 at 12:26 PM John Muehlhausen <[email protected]> wrote:
> >> >
> >> > Wes, are we saying that `pa.ipc.open_file(...).read_pandas()` already
> >> reads
> >> > the future Feather format? If not, how will the future format differ?  I
> >> > will work on my access pattern with this format instead of the current
> >> > feather format.  Sorry I was not clear on that earlier.
> >> >
> >>
> >> Yes, under the hood those will use the same zero-copy binary protocol
> >> code paths to read the file.
> >>
> >> > Micah, thank you!
> >> >
> >> > On Tue, May 7, 2019 at 11:44 AM Micah Kornfield <[email protected]>
> >> > wrote:
> >> >
> >> > > Hi John,
> >> > > To give a specific pointer [1] describes how the streaming protocol is
> >> > > stored to a file.
> >> > >
> >> > > [1] https://arrow.apache.org/docs/format/IPC.html#file-format
> >> > >
> >> > > On Tue, May 7, 2019 at 9:40 AM Wes McKinney <[email protected]>
> >> wrote:
> >> > >
> >> > > > hi John,
> >> > > >
> >> > > > As soon as the R folks can install the Arrow R package consistently,
> >> > > > the intent is to replace the Feather internals with the plain Arrow
> >> > > > IPC protocol where we have much better platform support all around.
> >> > > >
> >> > > > If you'd like to experiment with creating an API for pre-allocating
> >> > > > fixed-size Arrow protocol blocks and then mutating the data and
> >> > > > metadata on disk in-place, please be our guest. We don't have the
> >> > > > tools developed yet to do this for you
> >> > > >
> >> > > > - Wes
> >> > > >
> >> > > > On Tue, May 7, 2019 at 11:25 AM John Muehlhausen <[email protected]>
> >> wrote:
> >> > > > >
> >> > > > > Thanks Wes:
> >> > > > >
> >> > > > > "the current Feather format is deprecated" ... yes, but there
> >> will be a
> >> > > > > future file format that replaces it, correct?  And my discussion
> >> of
> >> > > > > immutable "portions" of Arrow buffers, rather than immutability
> >> of the
> >> > > > > entire buffer, applies to IPC as well, right?  I am only
> >> championing
> >> > > the
> >> > > > > idea that this future file format have the convenience that
> >> certain
> >> > > > > preallocated rows can be ignored based on a metadata setting.
> >> > > > >
> >> > > > > "I recommend batching your writes" ... this is extremely
> >> inefficient
> >> > > and
> >> > > > > adds unacceptable latency, relative to the proposed solution.  Do
> >> you
> >> > > > > disagree?  Either I have a batch length of 1 to minimize latency,
> >> which
> >> > > > > eliminates columnar advantages on the read side, or else I add
> >> latency.
> >> > > > > Neither works, and it seems that a viable alternative is within
> >> sight?
> >> > > > >
> >> > > > > If you don't agree that there is a core issue and opportunity
> >> here, I'm
> >> > > > not
> >> > > > > sure how to better make my case....
> >> > > > >
> >> > > > > -John
> >> > > > >
> >> > > > > On Tue, May 7, 2019 at 11:02 AM Wes McKinney <[email protected]
> >> >
> >> > > > wrote:
> >> > > > >
> >> > > > > > hi John,
> >> > > > > >
> >> > > > > > On Tue, May 7, 2019 at 10:53 AM John Muehlhausen <[email protected]>
> >> > > wrote:
> >> > > > > > >
> >> > > > > > > Wes et al, I completed a preliminary study of populating a
> >> Feather
> >> > > > file
> >> > > > > > > incrementally.  Some notes and questions:
> >> > > > > > >
> >> > > > > > > I wrote the following dataframe to a feather file:
> >> > > > > > >             a    b
> >> > > > > > > 0  0123456789  0.0
> >> > > > > > > 1  0123456789  NaN
> >> > > > > > > 2  0123456789  NaN
> >> > > > > > > 3  0123456789  NaN
> >> > > > > > > 4        None  NaN
> >> > > > > > >
> >> > > > > > > In re-writing the flatbuffers metadata (flatc -p doesn't
> >> > > > > > > support --gen-mutable! yuck! C++ to the rescue...), it seems
> >> that
> >> > > > > > > read_feather is not affected by NumRows?  It seems to be
> >> driven
> >> > > > entirely
> >> > > > > > by
> >> > > > > > > the per-column Length values?
> >> > > > > > >
> >> > > > > > > Also, it seems as if one of the primary usages of NullCount
> >> is to
> >> > > > > > determine
> >> > > > > > > whether or not a bitfield is present?  In the initialization
> >> data
> >> > > > above I
> >> > > > > > > was careful to have a null value in each column in order to
> >> > > generate
> >> > > > a
> >> > > > > > > bitfield.
> >> > > > > >
> >> > > > > > Per my prior e-mails, the current Feather format is deprecated,
> >> so
> >> > > I'm
> >> > > > > > only willing to engage on a discussion of the official Arrow
> >> binary
> >> > > > > > protocol that we use for IPC (memory mapping) and RPC (Flight).
> >> > > > > >
> >> > > > > > >
> >> > > > > > > I then wiped the bitfields in the file and set all of the
> >> string
> >> > > > indices
> >> > > > > > to
> >> > > > > > > one past the end of the blob buffer (all strings empty):
> >> > > > > > >       a   b
> >> > > > > > > 0  None NaN
> >> > > > > > > 1  None NaN
> >> > > > > > > 2  None NaN
> >> > > > > > > 3  None NaN
> >> > > > > > > 4  None NaN
> >> > > > > > >
> >> > > > > > > I then set the first record to some data by consuming some of
> >> the
> >> > > > string
> >> > > > > > > blob and row 0 and 1 indices, also setting the double:
> >> > > > > > >
> >> > > > > > >                a    b
> >> > > > > > > 0  Hello, world!  5.0
> >> > > > > > > 1           None  NaN
> >> > > > > > > 2           None  NaN
> >> > > > > > > 3           None  NaN
> >> > > > > > > 4           None  NaN
> >> > > > > > >
> >> > > > > > > As mentioned above, NumRows seems to be ignored.  I tried
> >> adjusting
> >> > > > each
> >> > > > > > > column Length to mask off higher rows and it worked for 4
> >> (hide
> >> > > last
> >> > > > row)
> >> > > > > > > but not for less than 4.  I take this to be due to math used
> >> to
> >> > > > figure
> >> > > > > > out
> >> > > > > > > where the buffers are relative to one another since there is
> >> only
> >> > > one
> >> > > > > > > metadata offset for all of: the (optional) bitset, index
> >> column and
> >> > > > (if
> >> > > > > > > string) blobs.
> >> > > > > > >
> >> > > > > > > Populating subsequent rows would proceed in a similar way
> >> until all
> >> > > > of
> >> > > > > > the
> >> > > > > > > blob storage has been consumed, which may come before the
> >> > > > pre-allocated
> >> > > > > > > rows have been consumed.
> >> > > > > > >
> >> > > > > > > So what does this mean for my desire to incrementally write
> >> these
> >> > > > > > > (potentially memory-mapped) pre-allocated files and/or Arrow
> >> > > buffers
> >> > > > in
> >> > > > > > > memory?  Some thoughts:
> >> > > > > > >
> >> > > > > > > - If a single value (such as NumRows) were consulted to
> >> determine
> >> > > the
> >> > > > > > table
> >> > > > > > > row dimension then updating this single value would serve to
> >> tell a
> >> > > > > > reader
> >> > > > > > > which rows are relevant.  Assuming this value is updated
> >> after all
> >> > > > other
> >> > > > > > > mutations are complete, and assuming that mutations are only
> >> > > appends
> >> > > > > > > (addition of rows), then concurrency control involves only
> >> ensuring
> >> > > > that
> >> > > > > > > this value is not examined while it is being written.
> >> > > > > > >
> >> > > > > > > - NullCount presents a concurrency problem if someone reads
> >> the
> >> > > file
> >> > > > > > after
> >> > > > > > > this field has been updated, but before NumRows has exposed
> >> the new
> >> > > > > > record
> >> > > > > > > (or vice versa).  The idea previously mentioned that there
> >> will
> >> > > > "likely
> >> > > > > > > [be] more statistics in the future" feels like it might be
> >> scope
> >> > > > creep to
> >> > > > > > > me?  This is a data representation, not a calculation
> >> framework?
> >> > > If
> >> > > > > > > NullCount had its genesis in the optional nature of the
> >> bitfield, I
> >> > > > would
> >> > > > > > > suggest that perhaps NullCount can be dropped in favor of
> >> always
> >> > > > > > supplying
> >> > > > > > > the bitfield, which in any event is already contemplated by
> >> the
> >> > > spec:
> >> > > > > > > "Implementations may choose to always allocate one anyway as a
> >> > > > matter of
> >> > > > > > > convenience."  If the concern is space savings, Arrow is
> >> already an
> >> > > > > > > extremely uncompressed format.  (Compression is something I
> >> would
> >> > > > also
> >> > > > > > > consider to be scope creep as regards Feather... compressed
> >> > > > filesystems
> >> > > > > > can
> >> > > > > > > be employed and there are other compressed dataframe formats.)
> >> > > > However,
> >> > > > > > if
> >> > > > > > > protecting the 4 bytes required to update NumRows turns out
> >> to be
> >> > > no
> >> > > > > > easier
> >> > > > > > > than protecting all of the statistical bytes as well as part
> >> of the
> >> > > > same
> >> > > > > > > "critical section" (locks: yuck!!) then statistics pose no
> >> issue.
> >> > > I
> >> > > > > > have a
> >> > > > > > > feeling that the availability of an atomic write of 4 bytes
> >> will
> >> > > > depend
> >> > > > > > on
> >> > > > > > > the storage mechanism... memory vs memory map vs write() etc.
> >> > > > > > >
> >> > > > > > > - The elephant in the room appears to be the presumptive
> >> binary
> >> > > > yes/no on
> >> > > > > > > mutability of Arrow buffers.  Perhaps the thought is that
> >> certain
> >> > > > batch
> >> > > > > > > processes will be wrecked if anyone anywhere is mutating
> >> buffers in
> >> > > > any
> >> > > > > > > way.  But keep in mind I am not proposing general mutability,
> >> only
> >> > > > > > > appending of new data.  *A huge amount of batch processing
> >> that
> >> > > will
> >> > > > take
> >> > > > > > > place with Arrow is on time-series data (whether financial or
> >> > > > otherwise).
> >> > > > > > > It is only natural that architects will want the minimal
> >> impedance
> >> > > > > > mismatch
> >> > > > > > > when it comes time to grow those time series as the events
> >> occur
> >> > > > going
> >> > > > > > > forward.*  So rather than say that I want "mutable" Arrow
> >> buffers,
> >> > > I
> >> > > > > > would
> >> > > > > > > pitch this as a call for "immutable populated areas" of Arrow
> >> > > buffers
> >> > > > > > > combined with the concept that the populated area can grow up
> >> to
> >> > > > whatever
> >> > > > > > > was preallocated.  This will not affect anyone who has
> >> "memoized" a
> >> > > > > > > dimension and wants to continue to consider the then-current
> >> data
> >> > > as
> >> > > > > > > immutable... it will be immutable and will always be immutable
> >> > > > according
> >> > > > > > to
> >> > > > > > > that then-current dimension.
> >> > > > > > >
> >> > > > > > > Thanks in advance for considering this feedback!  I absolutely
> >> > > > require
> >> > > > > > > efficient row-wise growth of an Arrow-like buffer to deal
> >> with time
> >> > > > > > series
> >> > > > > > > data in near real time.  I believe that preallocation is (by
> >> far)
> >> > > the
> >> > > > > > most
> >> > > > > > > efficient way to accomplish this.  I hope to be able to use
> >> Arrow!
> >> > > > If I
> >> > > > > > > cannot use Arrow then I will be using a home-grown Arrow that
> >> is
> >> > > > > > identical
> >> > > > > > > except for this feature, which would be very sad!  Even if
> >> Arrow
> >> > > > itself
> >> > > > > > > could be used in this manner today, I would be hesitant to
> >> use it
> >> > > if
> >> > > > the
> >> > > > > > > use-case was not protected on a go-forward basis.
> >> > > > > > >
> >> > > > > >
> >> > > > > > I recommend batching your writes and using the Arrow binary
> >> streaming
> >> > > > > > protocol so you are only appending to a file rather than
> >> mutating
> >> > > > > > previously-written bytes. This use case is well defined and
> >> supported
> >> > > > > > in the software already.
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > >
> >> > >
> >> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#streaming-format
> >> > > > > >
> >> > > > > > - Wes
> >> > > > > >
> >> > > > > > > Of course, I am completely open to alternative ideas and
> >> > > approaches!
> >> > > > > > >
> >> > > > > > > -John
> >> > > > > > >
> >> > > > > > > On Mon, May 6, 2019 at 11:39 AM Wes McKinney <
> >> [email protected]>
> >> > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > hi John -- again, I would caution you against using Feather
> >> files
> >> > > > for
> >> > > > > > > > issues of longevity -- the internal memory layout of those
> >> files
> >> > > > is a
> >> > > > > > > > "dead man walking" so to speak.
> >> > > > > > > >
> >> > > > > > > > I would advise against forking the project, IMHO that is a
> >> dark
> >> > > > path
> >> > > > > > > > that leads nowhere good. We have a large community here and
> >> we
> >> > > > accept
> >> > > > > > > > pull requests -- I think the challenge is going to be
> >> defining
> >> > > the
> >> > > > use
> >> > > > > > > > case to suitable clarity that a general purpose solution
> >> can be
> >> > > > > > > > developed.
> >> > > > > > > >
> >> > > > > > > > - Wes
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Mon, May 6, 2019 at 11:16 AM John Muehlhausen <
> >> [email protected]>
> >> > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > François, Wes,
> >> > > > > > > > >
> >> > > > > > > > > Thanks for the feedback.  I think the most practical
> >> thing for
> >> > > > me to
> >> > > > > > do
> >> > > > > > > > is
> >> > > > > > > > > 1- write a Feather file that is structured to
> >> pre-allocate the
> >> > > > space
> >> > > > > > I
> >> > > > > > > > need
> >> > > > > > > > > (e.g. initial variable-length strings are of average size)
> >> > > > > > > > > 2- come up with code to monkey around with the values
> >> contained
> >> > > > in
> >> > > > > > the
> >> > > > > > > > > vectors so that before and after each manipulation the
> >> file is
> >> > > > valid
> >> > > > > > as I
> >> > > > > > > > > walk the rows ... this is a writer that uses memory
> >> mapping
> >> > > > > > > > > 3- check back in here once that works, assuming that it
> >> does,
> >> > > to
> >> > > > see
> >> > > > > > if
> >> > > > > > > > we
> >> > > > > > > > > can bless certain mutation paths
> >> > > > > > > > > 4- if we can't bless certain mutation paths, fork the
> >> project
> >> > > for
> >> > > > > > those
> >> > > > > > > > who
> >> > > > > > > > > care more about stream processing?  That would not seem
> >> to be
> >> > > > ideal
> >> > > > > > as I
> >> > > > > > > > > think mutation in row-order across the data set is
> >> relatively
> >> > > low
> >> > > > > > impact
> >> > > > > > > > on
> >> > > > > > > > > the overall design?
> >> > > > > > > > >
> >> > > > > > > > > Thanks again for engaging the topic!
> >> > > > > > > > > -John
> >> > > > > > > > >
> >> > > > > > > > > On Mon, May 6, 2019 at 10:26 AM Francois Saint-Jacques <
> >> > > > > > > > > [email protected]> wrote:
> >> > > > > > > > >
> >> > > > > > > > > > Hello John,
> >> > > > > > > > > >
> >> > > > > > > > > > Arrow is not yet suited for partial writes. The
> >> specification
> >> > > > only
> >> > > > > > > > > > talks about fully frozen/immutable objects, you're in
> >> > > > > > implementation
> >> > > > > > > > > > defined territory here. For example, the C++ library
> >> assumes
> >> > > > the
> >> > > > > > Array
> >> > > > > > > > > > object is immutable; it memoizes the null count, and
> >> likely
> >> > > more
> >> > > > > > > > > > statistics in the future.
> >> > > > > > > > > >
> >> > > > > > > > > > If you want to use pre-allocated buffers and array, you
> >> can
> >> > > > use the
> >> > > > > > > > > > column validity bitmap for this purpose, e.g. set all
> >> null by
> >> > > > > > default
> >> > > > > > > > > > and flip once the row is written. It suffers from
> >> concurrency
> >> > > > > > issues
> >> > > > > > > > > > (+ invalidation issues as pointed out) when dealing with
> >> multiple
> >> > > > > > columns.
> >> > > > > > > > > > You'll have to use a barrier of some kind, e.g. a
> >> per-batch
> >> > > > global
> >> > > > > > > > > > atomic (if append-only), or dedicated column(s) à-la
> >> MVCC.
> >> > > But
> >> > > > > > then,
> >> > > > > > > > > > the reader needs to be aware of this and compute a mask
> >> each
> >> > > > time
> >> > > > > > it
> >> > > > > > > > > > needs to query the partial batch.
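
A small sketch of the validity-bitmap approach described above, in plain Python. It assumes Arrow's bitmap convention (least-significant bit first, a set bit meaning "valid") and uses hypothetical names `ValidityBitmap` and `visible_rows`. Note that under this scheme a reader cannot tell a genuine null from a row that has not been written yet.

```python
class ValidityBitmap:
    """Validity bitmap: bit i set means row i is valid (LSB-first)."""

    def __init__(self, capacity):
        # All bits start at 0, so every preallocated row reads as null.
        self.bits = bytearray((capacity + 7) // 8)
        self.capacity = capacity

    def mark_written(self, row):
        self.bits[row // 8] |= 1 << (row % 8)

    def is_valid(self, row):
        return bool(self.bits[row // 8] & (1 << (row % 8)))


def visible_rows(bitmaps):
    """Rows valid in every column: the mask a reader has to compute."""
    capacity = min(b.capacity for b in bitmaps)
    return [r for r in range(capacity) if all(b.is_valid(r) for b in bitmaps)]


a, b = ValidityBitmap(10), ValidityBitmap(10)
for row in (0, 1, 2):
    a.mark_written(row)
for row in (0, 1):  # column b lags: row 2 is only half written
    b.mark_written(row)
rows = visible_rows([a, b])
```

Row 2 is excluded from `rows` because only one of its two columns has been flipped, which illustrates the multi-column issue: without a barrier, the reader has to recompute this mask on every query.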
> >> > > > > > > > > >
> >> > > > > > > > > > This is a common columnar database problem, see [1] for
> >> a
> >> > > > recent
> >> > > > > > paper
> >> > > > > > > > > > on the subject. The usual technique is to store the
> >> recent
> >> > > data
> >> > > > > > > > > > row-wise, and transform it into column-wise when a
> >> threshold is
> >> > > > met
> >> > > > > > akin
> >> > > > > > > > > > to a compaction phase. There was a somewhat related
> >> thread
> >> > > [2]
> >> > > > > > lately
> >> > > > > > > > > > about streaming vs batching. In the end, I think your
> >> > > solution
> >> > > > > > will be
> >> > > > > > > > > > very application specific.
> >> > > > > > > > > >
> >> > > > > > > > > > François
> >> > > > > > > > > >
> >> > > > > > > > > > [1]
> >> > > https://db.in.tum.de/downloads/publications/datablocks.pdf
> >> > > > > > > > > > [2]
> >> > > > > > > > > >
> >> > > > > > > >
> >> > > > > >
> >> > > >
> >> > >
> >> https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > On Mon, May 6, 2019 at 10:39 AM John Muehlhausen <
> >> > > [email protected]>
> >> > > > > > wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > Wes,
> >> > > > > > > > > > >
> >> > > > > > > > > > > I’m not afraid of writing my own C++ code to deal
> >> with all
> >> > > of
> >> > > > > > this
> >> > > > > > > > on the
> >> > > > > > > > > > > writer side.  I just need a way to “append”
> >> (incrementally
> >> > > > > > populate)
> >> > > > > > > > e.g.
> >> > > > > > > > > > > feather files so that a person using e.g. pyarrow
> >> doesn’t
> >> > > > suffer
> >> > > > > > some
> >> > > > > > > > > > > catastrophic failure... and “on the side” I tell them
> >> which
> >> > > > rows
> >> > > > > > are
> >> > > > > > > > junk
> >> > > > > > > > > > > and deal with any concurrency issues that can’t be
> >> solved
> >> > > in
> >> > > > the
> >> > > > > > > > arena of
> >> > > > > > > > > > > atomicity and ordering of ops.  For now I care about
> >> basic
> >> > > > types
> >> > > > > > but
> >> > > > > > > > > > > including variable-width strings.
> >> > > > > > > > > > >
> >> > > > > > > > > > > For event-processing, I think Arrow has to have the
> >> concept
> >> > > > of a
> >> > > > > > > > > > partially
> >> > > > > > > > > > > full record set.  Some alternatives are:
> >> > > > > > > > > > > - have a batch size of one, thus littering the
> >> landscape
> >> > > with
> >> > > > > > > > trivially
> >> > > > > > > > > > > small Arrow buffers
> >> > > > > > > > > > > - artificially increase latency with a batch size
> >> larger
> >> > > than
> >> > > > > > one,
> >> > > > > > > > but
> >> > > > > > > > > > not
> >> > > > > > > > > > > processing any data until a batch is complete
> >> > > > > > > > > > > - continuously re-write the (entire!) “main” buffer as
> >> > > > batches of
> >> > > > > > > > length
> >> > > > > > > > > > 1
> >> > > > > > > > > > > roll in
> >> > > > > > > > > > > - instead of one main buffer, several, and at some
> >> > > threshold
> >> > > > > > combine
> >> > > > > > > > the
> >> > > > > > > > > > > last N length-1 batches into a length N buffer ...
> >> still an
> >> > > > > > > > inefficiency
> >> > > > > > > > > > >
> >> > > > > > > > > > > Consider the case of QAbstractTableModel as the
> >> underlying
> >> > > > data
> >> > > > > > for a
> >> > > > > > > > > > table
> >> > > > > > > > > > > or a chart.  This visualization shows all of the data
> >> for
> >> > > the
> >> > > > > > recent
> >> > > > > > > > past
> >> > > > > > > > > > > as well as events rolling in.  If this model
> >> interface is
> >> > > > > > > > implemented as
> >> > > > > > > > > > a
> >> > > > > > > > > > > view onto “many thousands” of individual event
> >> buffers then
> >> > > > we
> >> > > > > > gain
> >> > > > > > > > > > nothing
> >> > > > > > > > > > > from columnar layout.  (Suppose there are tons of
> >> columns
> >> > > and
> >> > > > > > most of
> >> > > > > > > > > > them
> >> > > > > > > > > > > are scrolled out of the view.) Likewise we cannot
> >> re-write
> >> > > > the
> >> > > > > > > > entire
> >> > > > > > > > > > > model on each event... time complexity blows up.
> >> What we
> >> > > > want
> >> > > > > > is to
> >> > > > > > > > > > have a
> >> > > > > > > > > > > large pre-allocated chunk and just change rowCount()
> >> as
> >> > > data
> >> > > > is
> >> > > > > > > > > > “appended.”
> >> > > > > > > > > > >  Sure, we may run out of space and have another and
> >> another
> >> > > > > > chunk for
> >> > > > > > > > > > > future row ranges, but a handful of chunks chained
> >> together
> >> > > > is
> >> > > > > > better
> >> > > > > > > > > > than
> >> > > > > > > > > > > as many chunks as there were events!
> >> > > > > > > > > > >
> >> > > > > > > > > > > And again, having a batch size >1 and delaying the
> >> data
> >> > > > until a
> >> > > > > > > > batch is
> >> > > > > > > > > > > full is a non-starter.
> >> > > > > > > > > > >
> >> > > > > > > > > > > I am really hoping to see partially-filled buffers as
> >> > > > something
> >> > > > > > we
> >> > > > > > > > keep
> >> > > > > > > > > > our
> >> > > > > > > > > > > finger on moving forward!  Or else, what am I missing?
> >> > > > > > > > > > >
> >> > > > > > > > > > > -John
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Mon, May 6, 2019 at 8:24 AM Wes McKinney <
> >> > > > [email protected]
> >> > > > > > >
> >> > > > > > > > wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > hi John,
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > In C++ the builder classes don't yet support
> >> writing into
> >> > > > > > > > preallocated
> >> > > > > > > > > > > > memory. It would be tricky for applications to
> >> determine
> >> > > a
> >> > > > > > priori
> >> > > > > > > > > > > > which segments of memory to pass to the builder. It
> >> seems
> >> > > > only
> >> > > > > > > > > > > > feasible for primitive / fixed-size types so my
> >> guess
> >> > > > would be
> >> > > > > > > > that a
> >> > > > > > > > > > > > separate set of interfaces would need to be
> >> developed for
> >> > > > this
> >> > > > > > > > task.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > - Wes
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <
> >> > > > > > [email protected]>
> >> > > > > > > > > > wrote:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > This is more of a question of implementation
> >> versus
> >> > > > > > > > specification. An
> >> > > > > > > > > > > > arrow
> >> > > > > > > > > > > > > buffer is generally built and then sealed. In
> >> different
> >> > > > > > > > languages,
> >> > > > > > > > > > this
> >> > > > > > > > > > > > > building process works differently (a concern of
> >> the
> >> > > > language
> >> > > > > > > > rather
> >> > > > > > > > > > than
> >> > > > > > > > > > > > > the memory specification). We don't currently
> >> allow a
> >> > > > half
> >> > > > > > built
> >> > > > > > > > > > vector
> >> > > > > > > > > > > > to
> >> > > > > > > > > > > > > be moved to another language and then be further
> >> built.
> >> > > > So
> >> > > > > > the
> >> > > > > > > > > > question
> >> > > > > > > > > > > > is
> >> > > > > > > > > > > > > really more concrete: what language are you
> >> looking at
> >> > > > and
> >> > > > > > what
> >> > > > > > > > is
> >> > > > > > > > > > the
> >> > > > > > > > > > > > > specific pattern you're trying to undertake for
> >> > > building.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > If you're trying to go across independent
> >> processes
> >> > > > (whether
> >> > > > > > the
> >> > > > > > > > same
> >> > > > > > > > > > > > > process restarted or two separate processes active
> >> > > > > > > > simultaneously)
> >> > > > > > > > > > you'll
> >> > > > > > > > > > > > > need to build up your own data structures to help
> >> with
> >> > > > this.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <
> >> > > > [email protected]
> >> > > > > > >
> >> > > > > > > > wrote:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Hello,
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Glad to learn of this project— good work!
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > If I allocate a single chunk of memory and start
> >> > > > building
> >> > > > > > Arrow
> >> > > > > > > > > > format
> >> > > > > > > > > > > > > > within it, does this chunk save any state
> >> regarding
> >> > > my
> >> > > > > > > > progress?
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > For example, suppose I allocate a column for
> >> floating
> >> > > > point
> >> > > > > > > > (fixed
> >> > > > > > > > > > > > width)
> >> > > > > > > > > > > > > > and a column for string (variable width).
> >> Suppose I
> >> > > > start
> >> > > > > > > > > > building the
> >> > > > > > > > > > > > > > floating point column at offset X into my single
> >> > > > buffer,
> >> > > > > > and
> >> > > > > > > > the
> >> > > > > > > > > > string
> >> > > > > > > > > > > > > > “pointer” column at offset Y into the same
> >> single
> >> > > > buffer,
> >> > > > > > and
> >> > > > > > > > the
> >> > > > > > > > > > > > string
> >> > > > > > > > > > > > > > data elements at offset Z.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > I write one floating point number and one
> >> string,
> >> > > then
> >> > > > go
> >> > > > > > away.
> >> > > > > > > > > > When I
> >> > > > > > > > > > > > > > come back to this buffer to append another
> >> value,
> >> > > does
> >> > > > the
> >> > > > > > > > buffer
> >> > > > > > > > > > > > itself
> >> > > > > > > > > > > > > > know where I would begin?  I.e. is there a
> >> > > > differentiation
> >> > > > > > in
> >> > > > > > > > the
> >> > > > > > > > > > > > column
> >> > > > > > > > > > > > > > (or blob) data itself between the available
> >> space and
> >> > > > the
> >> > > > > > used
> >> > > > > > > > > > space?
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Suppose I write a lot of large variable width
> >> strings
> >> > > > and
> >> > > > > > “run
> >> > > > > > > > > > out” of
> >> > > > > > > > > > > > > > space for them before running out of space for
> >> > > floating
> >> > > > > > point
> >> > > > > > > > > > numbers
> >> > > > > > > > > > > > or
> >> > > > > > > > > > > > > > string pointers.  (I guessed badly when doing
> >> the
> >> > > > original
> >> > > > > > > > > > > > allocation.) I
> >> > > > > > > > > > > > > > consider this to be Ok since I can always
> >> “copy” the
> >> > > > data
> >> > > > > > to
> >> > > > > > > > > > “compress
> >> > > > > > > > > > > > out”
> >> > > > > > > > > > > > > > the unused fp/pointer buckets... the choice is
> >> up to
> >> > > > me.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > The above applied to a (feather?) file is how I
> >> > > > anticipate
> >> > > > > > > > > > appending
> >> > > > > > > > > > > > data
> >> > > > > > > > > > > > > > to disk... pre-allocate a mem-mapped file and
> >> > > gradually
> >> > > > > > fill
> >> > > > > > > > it up.
> >> > > > > > > > > > > > The
> >> > > > > > > > > > > > > > efficiency of file utilization will depend on my
> >> > > > > > projections
> >> > > > > > > > > > regarding
> >> > > > > > > > > > > > > > variable-width data types, but as I said above,
> >> I can
> >> > > > > > always
> >> > > > > > > > > > re-write
> >> > > > > > > > > > > > the
> >> > > > > > > > > > > > > > file if/when this bothers me.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Is this the recommended and supported approach
> >> for
> >> > > > > > incremental
> >> > > > > > > > > > appends?
> >> > > > > > > > > > > > > > I’m really hoping to use Arrow instead of
> >> rolling my
> >> > > > own,
> >> > > > > > but
> >> > > > > > > > > > > > functionality
> >> > > > > > > > > > > > > > like this is absolutely key!  Hoping not to use
> >> a
> >> > > > side-car
> >> > > > > > > > file (or
> >> > > > > > > > > > > > memory
> >> > > > > > > > > > > > > > chunk) to store “append progress” information.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > I am brand new to this project so please
> >> forgive me
> >> > > if
> >> > > > I
> >> > > > > > have
> >> > > > > > > > > > > > overlooked
> >> > > > > > > > > > > > > > something obvious.  And again, looks like great
> >> work
> >> > > so
> >> > > > > > far!
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Thanks!
> >> > > > > > > > > > > > > > -John
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > >
> >> > > > > >
> >> > > >
> >> > >
> >>
> >
