On Mon, May 13, 2019 at 10:28 AM John Muehlhausen <j...@jgm.org> wrote:
>
> ``perhaps the right way forward is to start by gathering a
> number of interested parties and start designing a proposal''
>
> YES!  How do we go about this?
>

I'd recommend writing a proposal document (using Google Docs or
whatever tool you prefer) that lays out the use cases to motivate the
work and your proposals for solving them -- be as clear and concise
as possible. You can circulate the document for
comment in a [DISCUSS] thread on this mailing list. Be aware that the
process may take weeks or months since people have to find the time to
read the document and comment.

> ``There are some early experiments to populate Arrow nodes in microbatches
> from Kafka'' (cf link in thread)
>
> Who did this?

I'm not aware of any work in open source to do this, but I know a
party that did do this in proprietary form. I will contact them
offline and see if they are interested in getting involved in the
discussion.

Thanks

>
> -John
>
> On Mon, May 13, 2019 at 9:39 AM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Hi John,
> >
> > We are strongly committed to backwards compatibility in the Arrow format
> > specification.  You should not fear any compatibility-breaking changes
> > in the future.  People sometimes express uncertainty because we have not
> > reached 1.0 yet, but that's because we have not yet implemented all the
> > data types we want to be in that spec.
> >
> > As for the general goal of making Arrow more suitable for event
> > processing, perhaps the right way forward is to start by gathering a
> > number of interested parties and start designing a proposal (which may
> > or may not include spec additions).
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 13/05/2019 à 15:38, John Muehlhausen a écrit :
> > > Micah, yes, it all works at the moment.  How have we staked out that it
> > > will always work in the future as people continue to work on the spec?
> > That
> > > is my concern.
> > >
> > > Also, it would be extremely useful if someone opening a file had my nil
> > > rows hidden from them without needing to analyze the app-specific
> > side-car
> > > data.
> > >
> > > I believe that something like my solution is how everyone will do
> > efficient
> > > event processing with Arrow, so I believe it is worth a broader
> > discussion.
> > >
> > > On Mon, May 13, 2019 at 8:30 AM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > >> Hi John,
> > >> To expand on this, I don't think there is anything preventing you in the
> > >> current spec from over-provisioning the underlying buffers.  So you can
> > >> effectively split "capacity" from "length" by subtracting the size of
> > the
> > >> buffer from the amount of space taken by the rows indicated in the
> > batch.
> > >> For variable-width types you would have to reference the last value in
> > >> the offset buffer to determine used capacity.
> > >>
> > >>   When appending, if you run out of memory in a particular buffer, you
> > >> don't increment the count on the batch and simply append to the next one.
> > >>
> > >> This is restating parts of the thread, but I don't think the C++ code
> > >> base has any facility for this directly, and if you want to be
> > >> parsimonious with memory you would have to rewrite batches at some point.
> > >>
> > >> Apologies if I missed something as Wes said this is a long thread.
> > >>
> > >>
> > >> Thanks,
> > >> Micah
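Micah's capacity-vs-length idea can be sketched in plain Python. The sketch models Arrow's variable-width layout (an over-provisioned data buffer plus an offsets array, where the last populated offset marks consumed blob storage), but the class and method names are illustrative, not part of any Arrow library:

```python
from array import array

class PreallocatedStringColumn:
    """Sketch of a variable-width column with capacity split from length.

    Mirrors the Arrow layout idea: offsets[i]..offsets[i+1] index into a
    data buffer. The buffer is over-provisioned up front; the last populated
    offset tells us how much of it is actually used.
    """
    def __init__(self, max_rows, data_capacity):
        self.data = bytearray(data_capacity)       # over-provisioned blob storage
        self.offsets = array('i', [0] * (max_rows + 1))
        self.length = 0                            # rows published so far
        self.max_rows = max_rows

    def used_bytes(self):
        # last offset for the populated rows = consumed blob storage
        return self.offsets[self.length]

    def remaining_bytes(self):
        return len(self.data) - self.used_bytes()

    def try_append(self, value):
        """Append if it fits; otherwise the caller starts a new batch."""
        if self.length >= self.max_rows or len(value) > self.remaining_bytes():
            return False
        start = self.used_bytes()
        self.data[start:start + len(value)] = value
        self.offsets[self.length + 1] = start + len(value)
        self.length += 1                           # publish the row last
        return True

col = PreallocatedStringColumn(max_rows=4, data_capacity=16)
assert col.try_append(b"hello")
assert col.try_append(b"world")
assert col.used_bytes() == 10
assert col.remaining_bytes() == 6
assert not col.try_append(b"too-long-to-fit")     # would overflow blob storage
```

When a `try_append` fails, the writer would leave the current batch's count as-is and start filling the next preallocated batch, exactly as described above.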
> > >>
> > >> On Mon, May 13, 2019 at 6:07 AM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > >>
> > >>> hi John,
> > >>>
> > >>> Sorry, there's a number of fairly long e-mails in this thread; I'm
> > >>> having a hard time following all of the details.
> > >>>
> > >>> I suspect the most parsimonious thing would be to have some "sidecar"
> > >>> metadata that tracks the state of your writes into pre-allocated Arrow
> > >>> blocks so that readers know to call "Slice" on the blocks to obtain
> > >>> only the written-so-far portion. I'm not likely to be in favor of
> > >>> making changes to the binary protocol for this use case; if others
> > >>> have opinions I'll let them speak for themselves.
> > >>>
> > >>> - Wes
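Wes's sidecar suggestion can be illustrated generically. The sketch below (pure Python, hypothetical names) keeps a written-so-far count next to a preallocated block; a reader consults only the sidecar and takes the valid prefix, which is what calling Slice on a preallocated Arrow block would achieve:

```python
class SidecarTracked:
    """Preallocated block plus 'sidecar' metadata tracking rows written.

    Readers never trust the raw block directly; they consult the sidecar
    count and view only the written-so-far prefix.
    """
    def __init__(self, capacity):
        self.block = [None] * capacity       # stands in for a preallocated Arrow block
        self.sidecar = {"rows_written": 0}   # the "sidecar" metadata

    def append(self, row):
        n = self.sidecar["rows_written"]
        if n == len(self.block):
            raise IndexError("block full; writer should start a new block")
        self.block[n] = row
        self.sidecar["rows_written"] = n + 1  # publish after the write

    def view(self):
        # reader-side "slice": only the valid prefix
        return self.block[: self.sidecar["rows_written"]]

b = SidecarTracked(capacity=100)
b.append({"px": 1.0})
b.append({"px": 2.0})
assert len(b.view()) == 2
assert b.view()[-1] == {"px": 2.0}
```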
> > >>>
> > >>> On Mon, May 13, 2019 at 7:50 AM John Muehlhausen <j...@jgm.org> wrote:
> > >>>>
> > >>>> Any thoughts on a RecordBatch distinguishing size from capacity? (To
> > >>> borrow
> > >>>> std::vector terminology)
> > >>>>
> > >>>> Thanks,
> > >>>> John
> > >>>>
> > >>>> On Thu, May 9, 2019 at 2:46 PM John Muehlhausen <j...@jgm.org> wrote:
> > >>>>
> > >>>>> Wes et al, I think my core proposal is that Message.fbs:RecordBatch
> > >>> split
> > >>>>> the "length" parameter into "theoretical max length" and "utilized
> > >>> length"
> > >>>>> (perhaps not those exact names).
> > >>>>>
> > >>>>> "theoretical max length is the same as "length" now ... /// ...The
> > >>> arrays
> > >>>>> in the batch should all have this
> > >>>>>
> > >>>>> "utilized length" are the number of rows (starting from the first
> > >> one)
> > >>>>> that actually contain interesting data... the rest do not.
> > >>>>>
> > >>>>> The reason we can have a RecordBatch where these numbers are not the
> > >>> same
> > >>>>> is that the RecordBatch space was preallocated (for performance
> > >>> reasons)
> > >>>>> and the number of rows that actually "fit" depends on how correct the
> > >>>>> preallocation was.  In any case, it gives the user control of this
> > >>>>> space/time tradeoff... wasted space in order to save time in record
> > >>> batch
> > >>>>> construction.  The fact that some space will usually be wasted when
> > >>> there
> > >>>>> are variable-length columns (barring extreme luck) with this batch
> > >>>>> construction paradigm explains the word "theoretical" above.  This
> > >> also
> > >>>>> gives us the ability to look at a partially constructed batch that is
> > >>> still
> > >>>>> being constructed, given appropriate user-supplied concurrency
> > >> control.
> > >>>>>
> > >>>>> I am not an expert in all of the Arrow variable-length data types,
> > >> but
> > >>> I
> > >>>>> think this works if they are all similar to variable-length strings
> > >>> where
> > >>>>> we advance through "blob storage" by setting the indexes into that
> > >>> storage
> > >>>>> for the current and next row in order to indicate that we have
> > >>>>> incrementally consumed more blob storage.  (Conceptually this storage
> > >>> is
> > >>>>> "unallocated" after the pre-allocation and before rows are
> > >> populated.)
> > >>>>>
> > >>>>> At a high level I am seeking to shore up the format for event ingress
> > >>> into
> > >>>>> real-time analytics that have some look-back window.  If I'm not
> > >>> mistaken I
> > >>>>> think this is the subject of the last multi-sentence paragraph here?:
> > >>>>> https://zd.net/2H0LlBY
> > >>>>>
> > >>>>> Currently we have a less-efficient paradigm where "microbatches"
> > >> (e.g.
> > >>> of
> > >>>>> length 1 for minimal latency) have to spin the CPU periodically in
> > >>> order to
> > >>>>> be combined into buffers where we get the columnar layout benefit.
> > >>> With
> > >>>>> pre-allocation we can deal with microbatches (a partially populated
> > >>> larger
> > >>>>> RecordBatch) and immediately have the columnar layout benefits for
> > >> the
> > >>>>> populated section with no additional computation.
> > >>>>>
> > >>>>> For example, consider an event processing system that calculates a
> > >>> "moving
> > >>>>> average" as events roll in.  While this is somewhat contrived lets
> > >>> assume
> > >>>>> that the moving average window is 1000 periods and our pre-allocation
> > >>>>> ("theoretical max length") of RecordBatch elements is 100.  The
> > >>> algorithm
> > >>>>> would be something like this, for a list of RecordBatch buffers in
> > >>> memory:
> > >>>>>
> > >>>>> initialization():
> > >>>>>   set up configuration of expected variable length storage
> > >>> requirements,
> > >>>>> e.g. the template RecordBatch mentioned below
> > >>>>>
> > >>>>> onIncomingEvent(event):
> > >>>>>   obtain lock /// cf. swoopIn() below
> > >>>>>   if last RecordBatch utilized length is not less than theoretical max
> > >>>>> length or variable-length components of "event" will not fit in
> > >>> remaining
> > >>>>> blob storage:
> > >>>>>     create a new RecordBatch pre-allocation of max utilized length
> > >> 100
> > >>> and
> > >>>>> with blob preallocation that is max(expected, event .. in case the
> > >>> single
> > >>>>> event is larger than the expectation for 100 events)
> > >>>>>        (note: in the expected case this can be very fast as it is a
> > >>>>> malloc() and a memcpy() from a template!)
> > >>>>>     set current RecordBatch to this newly created one
> > >>>>>   add event to current RecordBatch (for the non-calculated fields)
> > >>>>>   increment utilized length of current RecordBatch
> > >>>>>   calculate the calculated fields (in this case, moving average) by
> > >>>>> looking back at previous rows in this and previous RecordBatch
> > >> objects
> > >>>>>   free() any RecordBatch objects that are now before the lookback
> > >>> window
> > >>>>>
> > >>>>> swoopIn(): /// somebody wants to chart the lookback window
> > >>>>>   obtain lock
> > >>>>>   visit all of the relevant data in the RecordBatches to construct
> > >> the
> > >>>>> chart /// notice that the last RecordBatch may not yet be "as full as
> > >>>>> possible"
> > >>>>>
> > >>>>> The above analysis (minus the free()) could apply to the IPC file
> > >>> format
> > >>>>> and the lock could be a file lock and the swoopIn() could be a
> > >> separate
> > >>>>> process.  In the case of the file format, while the file is locked, a
> > >>> new
> > >>>>> RecordBatch would overwrite the previous file Footer and a new Footer
> > >>> would
> > >>>>> be written.  In order to be able to delete or archive old data
> > >> multiple
> > >>>>> files could be strung together in a logical series.
> > >>>>>
> > >>>>> -John
> > >>>>>
> > >>>>> On Tue, May 7, 2019 at 2:39 PM Wes McKinney <wesmck...@gmail.com>
> > >>> wrote:
> > >>>>>
> > >>>>>> On Tue, May 7, 2019 at 12:26 PM John Muehlhausen <j...@jgm.org>
> > >> wrote:
> > >>>>>>>
> > >>>>>>> Wes, are we saying that `pa.ipc.open_file(...).read_pandas()`
> > >>> already
> > >>>>>> reads
> > >>>>>>> the future Feather format? If not, how will the future format
> > >>> differ?  I
> > >>>>>>> will work on my access pattern with this format instead of the
> > >>> current
> > >>>>>>> feather format.  Sorry I was not clear on that earlier.
> > >>>>>>>
> > >>>>>>
> > >>>>>> Yes, under the hood those will use the same zero-copy binary
> > >> protocol
> > >>>>>> code paths to read the file.
> > >>>>>>
> > >>>>>>> Micah, thank you!
> > >>>>>>>
> > >>>>>>> On Tue, May 7, 2019 at 11:44 AM Micah Kornfield <
> > >>> emkornfi...@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hi John,
> > >>>>>>>> To give a specific pointer [1] describes how the streaming
> > >>> protocol is
> > >>>>>>>> stored to a file.
> > >>>>>>>>
> > >>>>>>>> [1] https://arrow.apache.org/docs/format/IPC.html#file-format
> > >>>>>>>>
> > >>>>>>>> On Tue, May 7, 2019 at 9:40 AM Wes McKinney <
> > >> wesmck...@gmail.com>
> > >>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> hi John,
> > >>>>>>>>>
> > >>>>>>>>> As soon as the R folks can install the Arrow R package
> > >>> consistently,
> > >>>>>>>>> the intent is to replace the Feather internals with the plain
> > >>> Arrow
> > >>>>>>>>> IPC protocol where we have much better platform support all
> > >>> around.
> > >>>>>>>>>
> > >>>>>>>>> If you'd like to experiment with creating an API for
> > >>> pre-allocating
> > >>>>>>>>> fixed-size Arrow protocol blocks and then mutating the data
> > >> and
> > >>>>>>>>> metadata on disk in-place, please be our guest. We don't have
> > >>> the
> > >>>>>>>>> tools developed yet to do this for you.
> > >>>>>>>>>
> > >>>>>>>>> - Wes
> > >>>>>>>>>
> > >>>>>>>>> On Tue, May 7, 2019 at 11:25 AM John Muehlhausen <j...@jgm.org
> > >>>
> > >>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks Wes:
> > >>>>>>>>>>
> > >>>>>>>>>> "the current Feather format is deprecated" ... yes, but
> > >> there
> > >>>>>> will be a
> > >>>>>>>>>> future file format that replaces it, correct?  And my
> > >>> discussion
> > >>>>>> of
> > >>>>>>>>>> immutable "portions" of Arrow buffers, rather than
> > >>> immutability
> > >>>>>> of the
> > >>>>>>>>>> entire buffer, applies to IPC as well, right?  I am only
> > >>>>>> championing
> > >>>>>>>> the
> > >>>>>>>>>> idea that this future file format have the convenience that
> > >>>>>> certain
> > >>>>>>>>>> preallocated rows can be ignored based on a metadata
> > >> setting.
> > >>>>>>>>>>
> > >>>>>>>>>> "I recommend batching your writes" ... this is extremely
> > >>>>>> inefficient
> > >>>>>>>> and
> > >>>>>>>>>> adds unacceptable latency, relative to the proposed
> > >>> solution.  Do
> > >>>>>> you
> > >>>>>>>>>> disagree?  Either I have a batch length of 1 to minimize
> > >>> latency,
> > >>>>>> which
> > >>>>>>>>>> eliminates columnar advantages on the read side, or else I
> > >> add
> > >>>>>> latency.
> > >>>>>>>>>> Neither works, and it seems that a viable alternative is
> > >>> within
> > >>>>>> sight?
> > >>>>>>>>>>
> > >>>>>>>>>> If you don't agree that there is a core issue and
> > >> opportunity
> > >>>>>> here, I'm
> > >>>>>>>>> not
> > >>>>>>>>>> sure how to better make my case....
> > >>>>>>>>>>
> > >>>>>>>>>> -John
> > >>>>>>>>>>
> > >>>>>>>>>> On Tue, May 7, 2019 at 11:02 AM Wes McKinney <
> > >>> wesmck...@gmail.com
> > >>>>>>>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> hi John,
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Tue, May 7, 2019 at 10:53 AM John Muehlhausen <
> > >>> j...@jgm.org>
> > >>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Wes et al, I completed a preliminary study of
> > >> populating a
> > >>>>>> Feather
> > >>>>>>>>> file
> > >>>>>>>>>>>> incrementally.  Some notes and questions:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I wrote the following dataframe to a feather file:
> > >>>>>>>>>>>>             a    b
> > >>>>>>>>>>>> 0  0123456789  0.0
> > >>>>>>>>>>>> 1  0123456789  NaN
> > >>>>>>>>>>>> 2  0123456789  NaN
> > >>>>>>>>>>>> 3  0123456789  NaN
> > >>>>>>>>>>>> 4        None  NaN
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> In re-writing the flatbuffers metadata (flatc -p doesn't
> > >>>>>>>>>>>> support --gen-mutable! yuck! C++ to the rescue...), it
> > >>> seems
> > >>>>>> that
> > >>>>>>>>>>>> read_feather is not affected by NumRows?  It seems to be
> > >>>>>> driven
> > >>>>>>>>> entirely
> > >>>>>>>>>>> by
> > >>>>>>>>>>>> the per-column Length values?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Also, it seems as if one of the primary usages of
> > >>> NullCount
> > >>>>>> is to
> > >>>>>>>>>>> determine
> > >>>>>>>>>>>> whether or not a bitfield is present?  In the
> > >>> initialization
> > >>>>>> data
> > >>>>>>>>> above I
> > >>>>>>>>>>>> was careful to have a null value in each column in order
> > >>> to
> > >>>>>>>> generate
> > >>>>>>>>> a
> > >>>>>>>>>>>> bitfield.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Per my prior e-mails, the current Feather format is
> > >>> deprecated,
> > >>>>>> so
> > >>>>>>>> I'm
> > >>>>>>>>>>> only willing to engage on a discussion of the official
> > >> Arrow
> > >>>>>> binary
> > >>>>>>>>>>> protocol that we use for IPC (memory mapping) and RPC
> > >>> (Flight).
> > >>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I then wiped the bitfields in the file and set all of
> > >> the
> > >>>>>> string
> > >>>>>>>>> indices
> > >>>>>>>>>>> to
> > >>>>>>>>>>>> one past the end of the blob buffer (all strings empty):
> > >>>>>>>>>>>>       a   b
> > >>>>>>>>>>>> 0  None NaN
> > >>>>>>>>>>>> 1  None NaN
> > >>>>>>>>>>>> 2  None NaN
> > >>>>>>>>>>>> 3  None NaN
> > >>>>>>>>>>>> 4  None NaN
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I then set the first record to some data by consuming
> > >>> some of
> > >>>>>> the
> > >>>>>>>>> string
> > >>>>>>>>>>>> blob and row 0 and 1 indices, also setting the double:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>                a    b
> > >>>>>>>>>>>> 0  Hello, world!  5.0
> > >>>>>>>>>>>> 1           None  NaN
> > >>>>>>>>>>>> 2           None  NaN
> > >>>>>>>>>>>> 3           None  NaN
> > >>>>>>>>>>>> 4           None  NaN
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> As mentioned above, NumRows seems to be ignored.  I
> > >> tried
> > >>>>>> adjusting
> > >>>>>>>>> each
> > >>>>>>>>>>>> column Length to mask off higher rows and it worked for
> > >> 4
> > >>>>>> (hide
> > >>>>>>>> last
> > >>>>>>>>> row)
> > >>>>>>>>>>>> but not for less than 4.  I take this to be due to math
> > >>> used
> > >>>>>> to
> > >>>>>>>>> figure
> > >>>>>>>>>>> out
> > >>>>>>>>>>>> where the buffers are relative to one another since
> > >> there
> > >>> is
> > >>>>>> only
> > >>>>>>>> one
> > >>>>>>>>>>>> metadata offset for all of: the (optional) bitset, index
> > >>>>>> column and
> > >>>>>>>>> (if
> > >>>>>>>>>>>> string) blobs.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Populating subsequent rows would proceed in a similar
> > >> way
> > >>>>>> until all
> > >>>>>>>>> of
> > >>>>>>>>>>> the
> > >>>>>>>>>>>> blob storage has been consumed, which may come before
> > >> the
> > >>>>>>>>> pre-allocated
> > >>>>>>>>>>>> rows have been consumed.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> So what does this mean for my desire to incrementally
> > >>> write
> > >>>>>> these
> > >>>>>>>>>>>> (potentially memory-mapped) pre-allocated files and/or
> > >>> Arrow
> > >>>>>>>> buffers
> > >>>>>>>>> in
> > >>>>>>>>>>>> memory?  Some thoughts:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - If a single value (such as NumRows) were consulted to
> > >>>>>> determine
> > >>>>>>>> the
> > >>>>>>>>>>> table
> > >>>>>>>>>>>> row dimension then updating this single value would
> > >> serve
> > >>> to
> > >>>>>> tell a
> > >>>>>>>>>>> reader
> > >>>>>>>>>>>> which rows are relevant.  Assuming this value is updated
> > >>>>>> after all
> > >>>>>>>>> other
> > >>>>>>>>>>>> mutations are complete, and assuming that mutations are
> > >>> only
> > >>>>>>>> appends
> > >>>>>>>>>>>> (addition of rows), then concurrency control involves
> > >> only
> > >>>>>> ensuring
> > >>>>>>>>> that
> > >>>>>>>>>>>> this value is not examined while it is being written.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - NullCount presents a concurrency problem if someone
> > >>> reads
> > >>>>>> the
> > >>>>>>>> file
> > >>>>>>>>>>> after
> > >>>>>>>>>>>> this field has been updated, but before NumRows has
> > >>> exposed
> > >>>>>> the new
> > >>>>>>>>>>> record
> > >>>>>>>>>>>> (or vice versa).  The idea previously mentioned that
> > >> there
> > >>>>>> will
> > >>>>>>>>> "likely
> > >>>>>>>>>>>> [be] more statistics in the future" feels like it might
> > >> be
> > >>>>>> scope
> > >>>>>>>>> creep to
> > >>>>>>>>>>>> me?  This is a data representation, not a calculation
> > >>>>>> framework?
> > >>>>>>>> If
> > >>>>>>>>>>>> NullCount had its genesis in the optional nature of the
> > >>>>>> bitfield, I
> > >>>>>>>>> would
> > >>>>>>>>>>>> suggest that perhaps NullCount can be dropped in favor
> > >> of
> > >>>>>> always
> > >>>>>>>>>>> supplying
> > >>>>>>>>>>>> the bitfield, which in any event is already contemplated
> > >>> by
> > >>>>>> the
> > >>>>>>>> spec:
> > >>>>>>>>>>>> "Implementations may choose to always allocate one
> > >> anyway
> > >>> as a
> > >>>>>>>>> matter of
> > >>>>>>>>>>>> convenience."  If the concern is space savings, Arrow is
> > >>>>>> already an
> > >>>>>>>>>>>> extremely uncompressed format.  (Compression is
> > >> something
> > >>> I
> > >>>>>> would
> > >>>>>>>>> also
> > >>>>>>>>>>>> consider to be scope creep as regards Feather...
> > >>> compressed
> > >>>>>>>>> filesystems
> > >>>>>>>>>>> can
> > >>>>>>>>>>>> be employed and there are other compressed dataframe
> > >>> formats.)
> > >>>>>>>>> However,
> > >>>>>>>>>>> if
> > >>>>>>>>>>>> protecting the 4 bytes required to update NumRows turns
> > >>> out
> > >>>>>> to be
> > >>>>>>>> no
> > >>>>>>>>>>> easier
> > >>>>>>>>>>>> than protecting all of the statistical bytes as well as
> > >>> part
> > >>>>>> of the
> > >>>>>>>>> same
> > >>>>>>>>>>>> "critical section" (locks: yuck!!) then statistics pose
> > >> no
> > >>>>>> issue.
> > >>>>>>>> I
> > >>>>>>>>>>> have a
> > >>>>>>>>>>>> feeling that the availability of an atomic write of 4
> > >>> bytes
> > >>>>>> will
> > >>>>>>>>> depend
> > >>>>>>>>>>> on
> > >>>>>>>>>>>> the storage mechanism... memory vs memory map vs write()
> > >>> etc.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - The elephant in the room appears to be the presumptive
> > >>>>>> binary
> > >>>>>>>>> yes/no on
> > >>>>>>>>>>>> mutability of Arrow buffers.  Perhaps the thought is
> > >> that
> > >>>>>> certain
> > >>>>>>>>> batch
> > >>>>>>>>>>>> processes will be wrecked if anyone anywhere is mutating
> > >>>>>> buffers in
> > >>>>>>>>> any
> > >>>>>>>>>>>> way.  But keep in mind I am not proposing general
> > >>> mutability,
> > >>>>>> only
> > >>>>>>>>>>>> appending of new data.  *A huge amount of batch
> > >> processing
> > >>>>>> that
> > >>>>>>>> will
> > >>>>>>>>> take
> > >>>>>>>>>>>> place with Arrow is on time-series data (whether
> > >>> financial or
> > >>>>>>>>> otherwise).
> > >>>>>>>>>>>> It is only natural that architects will want the minimal
> > >>>>>> impedance
> > >>>>>>>>>>> mismatch
> > >>>>>>>>>>>> when it comes time to grow those time series as the
> > >> events
> > >>>>>> occur
> > >>>>>>>>> going
> > >>>>>>>>>>>> forward.*  So rather than say that I want "mutable"
> > >> Arrow
> > >>>>>> buffers,
> > >>>>>>>> I
> > >>>>>>>>>>> would
> > >>>>>>>>>>>> pitch this as a call for "immutable populated areas" of
> > >>> Arrow
> > >>>>>>>> buffers
> > >>>>>>>>>>>> combined with the concept that the populated area can
> > >>> grow up
> > >>>>>> to
> > >>>>>>>>> whatever
> > >>>>>>>>>>>> was preallocated.  This will not affect anyone who has
> > >>>>>> "memoized" a
> > >>>>>>>>>>>> dimension and wants to continue to consider the
> > >>> then-current
> > >>>>>> data
> > >>>>>>>> as
> > >>>>>>>>>>>> immutable... it will be immutable and will always be
> > >>> immutable
> > >>>>>>>>> according
> > >>>>>>>>>>> to
> > >>>>>>>>>>>> that then-current dimension.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks in advance for considering this feedback!  I
> > >>> absolutely
> > >>>>>>>>> require
> > >>>>>>>>>>>> efficient row-wise growth of an Arrow-like buffer to
> > >> deal
> > >>>>>> with time
> > >>>>>>>>>>> series
> > >>>>>>>>>>>> data in near real time.  I believe that preallocation is
> > >>> (by
> > >>>>>> far)
> > >>>>>>>> the
> > >>>>>>>>>>> most
> > >>>>>>>>>>>> efficient way to accomplish this.  I hope to be able to
> > >>> use
> > >>>>>> Arrow!
> > >>>>>>>>> If I
> > >>>>>>>>>>>> cannot use Arrow than I will be using a home-grown Arrow
> > >>> that
> > >>>>>> is
> > >>>>>>>>>>> identical
> > >>>>>>>>>>>> except for this feature, which would be very sad!  Even
> > >> if
> > >>>>>> Arrow
> > >>>>>>>>> itself
> > >>>>>>>>>>>> could be used in this manner today, I would be hesitant
> > >> to
> > >>>>>> use it
> > >>>>>>>> if
> > >>>>>>>>> the
> > >>>>>>>>>>>> use-case was not protected on a go-forward basis.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> I recommend batching your writes and using the Arrow
> > >> binary
> > >>>>>> streaming
> > >>>>>>>>>>> protocol so you are only appending to a file rather than
> > >>>>>> mutating
> > >>>>>>>>>>> previously-written bytes. This use case is well defined
> > >> and
> > >>>>>> supported
> > >>>>>>>>>>> in the software already.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#streaming-format
> > >>>>>>>>>>> - Wes
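The append-only property Wes describes can be modeled with generic length-prefixed framing. This is not the actual Arrow stream wire format (which frames flatbuffers metadata plus body buffers), but it shows why appending batches never rewrites previously written bytes:

```python
import io
import struct

def append_batch(stream, payload):
    """Append one length-prefixed message; earlier bytes are never touched."""
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)

def iter_batches(stream):
    """Reader: walk the frames from the start until EOF."""
    stream.seek(0)
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (n,) = struct.unpack("<I", header)
        yield stream.read(n)

buf = io.BytesIO()
append_batch(buf, b"batch-0")
append_batch(buf, b"batch-1")
assert list(iter_batches(buf)) == [b"batch-0", b"batch-1"]

buf.seek(0, io.SEEK_END)        # writer resumes at the end; no rewriting
append_batch(buf, b"batch-2")
assert list(iter_batches(buf)) == [b"batch-0", b"batch-1", b"batch-2"]
```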
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Of course, I am completely open to alternative ideas and
> > >>>>>>>> approaches!
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> -John
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Mon, May 6, 2019 at 11:39 AM Wes McKinney <
> > >>>>>> wesmck...@gmail.com>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> hi John -- again, I would caution you against using
> > >>> Feather
> > >>>>>> files
> > >>>>>>>>> for
> > >>>>>>>>>>>>> issues of longevity -- the internal memory layout of
> > >>> those
> > >>>>>> files
> > >>>>>>>>> is a
> > >>>>>>>>>>>>> "dead man walking" so to speak.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I would advise against forking the project, IMHO that
> > >>> is a
> > >>>>>> dark
> > >>>>>>>>> path
> > >>>>>>>>>>>>> that leads nowhere good. We have a large community
> > >> here
> > >>> and
> > >>>>>> we
> > >>>>>>>>> accept
> > >>>>>>>>>>>>> pull requests -- I think the challenge is going to be
> > >>>>>> defining
> > >>>>>>>> the
> > >>>>>>>>> use
> > >>>>>>>>>>>>> case to suitable clarity that a general purpose
> > >> solution
> > >>>>>> can be
> > >>>>>>>>>>>>> developed.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Wes
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Mon, May 6, 2019 at 11:16 AM John Muehlhausen <
> > >>>>>> j...@jgm.org>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> François, Wes,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks for the feedback.  I think the most practical
> > >>>>>> thing for
> > >>>>>>>>> me to
> > >>>>>>>>>>> do
> > >>>>>>>>>>>>> is
> > >>>>>>>>>>>>>> 1- write a Feather file that is structured to
> > >>>>>> pre-allocate the
> > >>>>>>>>> space
> > >>>>>>>>>>> I
> > >>>>>>>>>>>>> need
> > >>>>>>>>>>>>>> (e.g. initial variable-length strings are of average
> > >>> size)
> > >>>>>>>>>>>>>> 2- come up with code to monkey around with the
> > >> values
> > >>>>>> contained
> > >>>>>>>>> in
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>> vectors so that before and after each manipulation
> > >> the
> > >>>>>> file is
> > >>>>>>>>> valid
> > >>>>>>>>>>> as I
> > >>>>>>>>>>>>>> walk the rows ... this is a writer that uses memory
> > >>>>>> mapping
> > >>>>>>>>>>>>>> 3- check back in here once that works, assuming that
> > >>> it
> > >>>>>> does,
> > >>>>>>>> to
> > >>>>>>>>> see
> > >>>>>>>>>>> if
> > >>>>>>>>>>>>> we
> > >>>>>>>>>>>>>> can bless certain mutation paths
> > >>>>>>>>>>>>>> 4- if we can't bless certain mutation paths, fork
> > >> the
> > >>>>>> project
> > >>>>>>>> for
> > >>>>>>>>>>> those
> > >>>>>>>>>>>>> who
> > >>>>>>>>>>>>>> care more about stream processing?  That would not
> > >>> seem
> > >>>>>> to be
> > >>>>>>>>> ideal
> > >>>>>>>>>>> as I
> > >>>>>>>>>>>>>> think mutation in row-order across the data set is
> > >>>>>> relatively
> > >>>>>>>> low
> > >>>>>>>>>>> impact
> > >>>>>>>>>>>>> on
> > >>>>>>>>>>>>>> the overall design?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks again for engaging the topic!
> > >>>>>>>>>>>>>> -John
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Mon, May 6, 2019 at 10:26 AM Francois
> > >>> Saint-Jacques <
> > >>>>>>>>>>>>>> fsaintjacq...@gmail.com> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Hello John,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Arrow is not yet suited for partial writes. The
> > >>>>>> specification
> > >>>>>>>>> only
> > >>>>>>>>>>>>>>> talks about fully frozen/immutable objects, you're
> > >>> in
> > >>>>>>>>>>> implementation
> > >>>>>>>>>>>>>>> defined territory here. For example, the C++
> > >> library
> > >>>>>> assumes
> > >>>>>>>>> the
> > >>>>>>>>>>> Array
> > >>>>>>>>>>>>>>> object is immutable; it memoizes the null count, and will
> > >>>>>>>>>>>>>>> likely memoize more statistics in the future.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> If you want to use pre-allocated buffers and
> > >> array,
> > >>> you
> > >>>>>> can
> > >>>>>>>>> use the
> > >>>>>>>>>>>>>>> column validity bitmap for this purpose, e.g. set
> > >>> all
> > >>>>>> null by
> > >>>>>>>>>>> default
> > >>>>>>>>>>>>>>> and flip once the row is written. It suffers from
> > >>>>>> concurrency
> > >>>>>>>>>>> issues
> > >>>>>>>>>>>>>>> (+ invalidation issues as pointed) when dealing
> > >> with
> > >>>>>> multiple
> > >>>>>>>>>>> columns.
> > >>>>>>>>>>>>>>> You'll have to use a barrier of some kind, e.g. a
> > >>>>>> per-batch
> > >>>>>>>>> global
> > >>>>>>>>>>>>>>> atomic (if append-only), or dedicated column(s)
> > >> à-la
> > >>>>>> MVCC.
> > >>>>>>>> But
> > >>>>>>>>>>> then,
> > >>>>>>>>>>>>>>> the reader needs to be aware of this and compute a
> > >>> mask
> > >>>>>> each
> > >>>>>>>>> time
> > >>>>>>>>>>> it
> > >>>>>>>>>>>>>>> needs to query the partial batch.
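François's validity-bitmap technique can be sketched as follows. Arrow does store validity bits least-significant-bit first, which the sketch follows, but the class and names are illustrative rather than any library API:

```python
class ValidityBitmap:
    """Sketch of using the column validity bitmap as a publish flag:
    rows start out null; flipping a bit marks that row readable."""
    def __init__(self, capacity):
        self.bits = bytearray((capacity + 7) // 8)  # all zero = all null

    def publish(self, row):
        self.bits[row // 8] |= 1 << (row % 8)       # flip once the row is written

    def is_valid(self, row):
        return bool(self.bits[row // 8] & (1 << (row % 8)))

capacity = 10
values = [0.0] * capacity                # preallocated column
bitmap = ValidityBitmap(capacity)

values[0] = 3.14                         # write the row first...
bitmap.publish(0)                        # ...flip validity last (the barrier)

assert bitmap.is_valid(0)
assert not bitmap.is_valid(1)
# reader-side mask over the partial batch
mask = [bitmap.is_valid(i) for i in range(capacity)]
assert mask == [True] + [False] * 9
```

As noted above, this only works cleanly for a single column; with several columns a batch-level barrier (or MVCC-style column) is still needed so a reader never sees a row that is half-published.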
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> This is a common columnar database problem, see
> > >> [1]
> > >>> for
> > >>>>>> a
> > >>>>>>>>> recent
> > >>>>>>>>>>> paper
> > >>>>>>>>>>>>>>> on the subject. The usual technique is to store
> > >> the
> > >>>>>> recent
> > >>>>>>>> data
> > >>>>>>>>>>>>>>> row-wise, and transform it in column-wise when a
> > >>>>>> threshold is
> > >>>>>>>>> met
> > >>>>>>>>>>> akin
> > >>>>>>>>>>>>>>> to a compaction phase. There was a somewhat
> > >> related
> > >>>>>> thread
> > >>>>>>>> [2]
> > >>>>>>>>>>> lately
> > >>>>>>>>>>>>>>> about streaming vs batching. In the end, I think
> > >>> your
> > >>>>>>>> solution
> > >>>>>>>>>>> will be
> > >>>>>>>>>>>>>>> very application specific.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> François
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> [1] https://db.in.tum.de/downloads/publications/datablocks.pdf
> > >>>>>>>>>>>>>>> [2] https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Mon, May 6, 2019 at 10:39 AM John Muehlhausen <
> > >>>>>>>> j...@jgm.org>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Wes,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I’m not afraid of writing my own C++ code to deal with all of
> > >>>>>>>>>>>>>>>> this on the writer side.  I just need a way to “append”
> > >>>>>>>>>>>>>>>> (incrementally populate) e.g. feather files so that a person
> > >>>>>>>>>>>>>>>> using e.g. pyarrow doesn’t suffer some catastrophic failure...
> > >>>>>>>>>>>>>>>> and “on the side” I tell them which rows are junk and deal with
> > >>>>>>>>>>>>>>>> any concurrency issues that can’t be solved in the arena of
> > >>>>>>>>>>>>>>>> atomicity and ordering of ops.  For now I care about basic types
> > >>>>>>>>>>>>>>>> but including variable-width strings.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> For event-processing, I think Arrow has to have the concept of a
> > >>>>>>>>>>>>>>>> partially full record set.  Some alternatives are:
> > >>>>>>>>>>>>>>>> - have a batch size of one, thus littering the landscape with
> > >>>>>>>>>>>>>>>> trivially small Arrow buffers
> > >>>>>>>>>>>>>>>> - artificially increase latency with a batch size larger than
> > >>>>>>>>>>>>>>>> one, but not processing any data until a batch is complete
> > >>>>>>>>>>>>>>>> - continuously re-write the (entire!) “main” buffer as batches
> > >>>>>>>>>>>>>>>> of length 1 roll in
> > >>>>>>>>>>>>>>>> - instead of one main buffer, several, and at some threshold
> > >>>>>>>>>>>>>>>> combine the last N length-1 batches into a length N buffer ...
> > >>>>>>>>>>>>>>>> still an inefficiency
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Consider the case of QAbstractTableModel as the underlying data
> > >>>>>>>>>>>>>>>> for a table or a chart.  This visualization shows all of the
> > >>>>>>>>>>>>>>>> data for the recent past as well as events rolling in.  If this
> > >>>>>>>>>>>>>>>> model interface is implemented as a view onto “many thousands”
> > >>>>>>>>>>>>>>>> of individual event buffers then we gain nothing from columnar
> > >>>>>>>>>>>>>>>> layout.  (Suppose there are tons of columns and most of them are
> > >>>>>>>>>>>>>>>> scrolled out of the view.)  Likewise we cannot re-write the
> > >>>>>>>>>>>>>>>> entire model on each event... time complexity blows up.  What we
> > >>>>>>>>>>>>>>>> want is to have a large pre-allocated chunk and just change
> > >>>>>>>>>>>>>>>> rowCount() as data is “appended.”  Sure, we may run out of space
> > >>>>>>>>>>>>>>>> and have another and another chunk for future row ranges, but a
> > >>>>>>>>>>>>>>>> handful of chunks chained together is better than as many chunks
> > >>>>>>>>>>>>>>>> as there were events!
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> And again, having a batch size >1 and delaying the data until a
> > >>>>>>>>>>>>>>>> batch is full is a non-starter.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I am really hoping to see partially-filled buffers as something
> > >>>>>>>>>>>>>>>> we keep our finger on moving forward!  Or else, what am I
> > >>>>>>>>>>>>>>>> missing?
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> -John
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Mon, May 6, 2019 at 8:24 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> hi John,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> In C++ the builder classes don't yet support writing into
> > >>>>>>>>>>>>>>>>> preallocated memory. It would be tricky for applications to
> > >>>>>>>>>>>>>>>>> determine a priori which segments of memory to pass to the
> > >>>>>>>>>>>>>>>>> builder. It seems only feasible for primitive / fixed-size
> > >>>>>>>>>>>>>>>>> types so my guess would be that a separate set of interfaces
> > >>>>>>>>>>>>>>>>> would need to be developed for this task.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> - Wes
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <jacq...@apache.org> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> This is more of a question of implementation versus
> > >>>>>>>>>>>>>>>>>> specification. An arrow buffer is generally built and then
> > >>>>>>>>>>>>>>>>>> sealed. In different languages, this building process works
> > >>>>>>>>>>>>>>>>>> differently (a concern of the language rather than the memory
> > >>>>>>>>>>>>>>>>>> specification). We don't currently allow a half-built vector
> > >>>>>>>>>>>>>>>>>> to be moved to another language and then be further built. So
> > >>>>>>>>>>>>>>>>>> the question is really more concrete: what language are you
> > >>>>>>>>>>>>>>>>>> looking at and what is the specific pattern you're trying to
> > >>>>>>>>>>>>>>>>>> undertake for building?
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> If you're trying to go across independent processes (whether
> > >>>>>>>>>>>>>>>>>> the same process restarted or two separate processes active
> > >>>>>>>>>>>>>>>>>> simultaneously) you'll need to build up your own data
> > >>>>>>>>>>>>>>>>>> structures to help with this.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <j...@jgm.org> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Hello,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Glad to learn of this project— good work!
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> If I allocate a single chunk of memory and start building
> > >>>>>>>>>>>>>>>>>>> Arrow format within it, does this chunk save any state
> > >>>>>>>>>>>>>>>>>>> regarding my progress?
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> For example, suppose I allocate a column for floating point
> > >>>>>>>>>>>>>>>>>>> (fixed width) and a column for string (variable width).
> > >>>>>>>>>>>>>>>>>>> Suppose I start building the floating point column at offset
> > >>>>>>>>>>>>>>>>>>> X into my single buffer, and the string “pointer” column at
> > >>>>>>>>>>>>>>>>>>> offset Y into the same single buffer, and the string data
> > >>>>>>>>>>>>>>>>>>> elements at offset Z.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> I write one floating point number and one string, then go
> > >>>>>>>>>>>>>>>>>>> away.  When I come back to this buffer to append another
> > >>>>>>>>>>>>>>>>>>> value, does the buffer itself know where I would begin?
> > >>>>>>>>>>>>>>>>>>> I.e. is there a differentiation in the column (or blob) data
> > >>>>>>>>>>>>>>>>>>> itself between the available space and the used space?
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Suppose I write a lot of large variable width strings and
> > >>>>>>>>>>>>>>>>>>> “run out” of space for them before running out of space for
> > >>>>>>>>>>>>>>>>>>> floating point numbers or string pointers.  (I guessed badly
> > >>>>>>>>>>>>>>>>>>> when doing the original allocation.)  I consider this to be
> > >>>>>>>>>>>>>>>>>>> Ok since I can always “copy” the data to “compress out” the
> > >>>>>>>>>>>>>>>>>>> unused fp/pointer buckets... the choice is up to me.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> The above applied to a (feather?) file is how I anticipate
> > >>>>>>>>>>>>>>>>>>> appending data to disk... pre-allocate a mem-mapped file and
> > >>>>>>>>>>>>>>>>>>> gradually fill it up.  The efficiency of file utilization
> > >>>>>>>>>>>>>>>>>>> will depend on my projections regarding variable-width data
> > >>>>>>>>>>>>>>>>>>> types, but as I said above, I can always re-write the file
> > >>>>>>>>>>>>>>>>>>> if/when this bothers me.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Is this the recommended and supported approach for
> > >>>>>>>>>>>>>>>>>>> incremental appends?  I’m really hoping to use Arrow instead
> > >>>>>>>>>>>>>>>>>>> of rolling my own, but functionality like this is absolutely
> > >>>>>>>>>>>>>>>>>>> key!  Hoping not to use a side-car file (or memory chunk) to
> > >>>>>>>>>>>>>>>>>>> store “append progress” information.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> I am brand new to this project so please forgive me if I
> > >>>>>>>>>>>>>>>>>>> have overlooked something obvious.  And again, looks like
> > >>>>>>>>>>>>>>>>>>> great work so far!
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Thanks!
> > >>>>>>>>>>>>>>>>>>> -John

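Jacques's point about building "your own data structures" for cross-process appends can be illustrated with a plain mem-mapped file, in the spirit of John's pre-allocated feather idea. Everything here is an assumption for illustration (the file path, capacity, and the out-of-band `n_valid` counter); the Arrow format itself records none of this progress state:

```python
import os
import tempfile
import numpy as np

# Hypothetical single-column layout: pre-allocate a fixed-width region on
# disk and fill it incrementally.
path = os.path.join(tempfile.mkdtemp(), "prices.col")
CAPACITY = 1024
writer = np.memmap(path, dtype=np.float64, mode="w+", shape=(CAPACITY,))

# Append progress lives outside the data, e.g. in a header word or shared
# memory that both writer and reader agree on.
n_valid = 0
for x in (3.0, 4.0, 5.0):
    writer[n_valid] = x
    n_valid += 1
writer.flush()

# A reader told n_valid can map the same region zero-copy and ignore the
# unused tail of the pre-allocated file.
reader = np.memmap(path, dtype=np.float64, mode="r", shape=(CAPACITY,))
valid = reader[:n_valid]
```

Variable-width columns are the hard part, as the thread notes: offsets and string data grow at different rates, so a bad initial split between the regions forces a rewrite.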