On Mon, May 13, 2019 at 10:28 AM John Muehlhausen <j...@jgm.org> wrote:
>
> ``perhaps the right way forward is to start by gathering a
> number of interested parties and start designing a proposal''
>
> YES! How do we go about this?
>
I'd recommend writing a proposal document (using Google Docs or whatever
tool you prefer) that lays out the use cases to motivate the work and
proposals for solving them -- I would recommend being as clear and
concise as possible. You can circulate the document for comment in a
[DISCUSS] thread on this mailing list. Be aware that the process may
take weeks or months, since people have to find the time to read the
document and comment.

> ``There are some early experiments to populate Arrow nodes in microbatches
> from Kafka'' (cf link in thread)
>
> Who did this?

I'm not aware of any work in open source to do this, but I know a party
that did do this in proprietary form. I will contact them offline and
see if they are interested in getting involved in the discussion. Thanks

>
> -John
>
> On Mon, May 13, 2019 at 9:39 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> > Hi John,
> >
> > We are strongly committed to backwards compatibility in the Arrow format
> > specification. You should not fear any compatibility-breaking changes
> > in the future. People sometimes express uncertainty because we have not
> > reached 1.0 yet, but that's because we have not yet implemented all the
> > data types we want to be in that spec.
> >
> > As for the general goal of making Arrow more suitable for event
> > processing, perhaps the right way forward is to start by gathering a
> > number of interested parties and start designing a proposal (which may
> > or may not include spec additions).
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 13/05/2019 at 15:38, John Muehlhausen wrote:
> > > Micah, yes, it all works at the moment. How have we staked out that it
> > > will always work in the future as people continue to work on the spec?
> > > That is my concern.
> > >
> > > Also, it would be extremely useful if someone opening a file had my nil
> > > rows hidden from them without needing to analyze the app-specific
> > > side-car data.
> > >
> > > I believe that something like my solution is how everyone will do
> > > efficient event processing with Arrow, so I believe it is worth a
> > > broader discussion.
> > >
> > > On Mon, May 13, 2019 at 8:30 AM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > >> Hi John,
> > >> To expand on this, I don't think there is anything preventing you in
> > >> the current spec from over-provisioning the underlying buffers. So you
> > >> can effectively split "capacity" from "length" by subtracting the size
> > >> of the buffer from the amount of space taken by the rows indicated in
> > >> the batch. For variable width types you would have to reference the
> > >> last value in the offset buffer to determine used capacity.
> > >>
> > >> When appending, if you run out of memory in a particular buffer, you
> > >> don't increment the count on the batch and simply append to the next
> > >> one.
> > >>
> > >> This is restating parts of the thread, but I don't think the C++ code
> > >> base has any facility for this directly, and if you want to be
> > >> parsimonious with memory you would have to rewrite batches at some
> > >> point.
> > >>
> > >> Apologies if I missed something; as Wes said, this is a long thread.
> > >>
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >> On Mon, May 13, 2019 at 6:07 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > >>
> > >>> hi John,
> > >>>
> > >>> Sorry, there's a number of fairly long e-mails in this thread; I'm
> > >>> having a hard time following all of the details.
> > >>>
> > >>> I suspect the most parsimonious thing would be to have some "sidecar"
> > >>> metadata that tracks the state of your writes into pre-allocated Arrow
> > >>> blocks so that readers know to call "Slice" on the blocks to obtain
> > >>> only the written-so-far portion. I'm not likely to be in favor of
> > >>> making changes to the binary protocol for this use case; if others
> > >>> have opinions I'll let them speak for themselves.
> > >>> > > >>> - Wes > > >>> > > >>> On Mon, May 13, 2019 at 7:50 AM John Muehlhausen <j...@jgm.org> wrote: > > >>>> > > >>>> Any thoughts on a RecordBatch distinguishing size from capacity? (To > > >>> borrow > > >>>> std::vector terminology) > > >>>> > > >>>> Thanks, > > >>>> John > > >>>> > > >>>> On Thu, May 9, 2019 at 2:46 PM John Muehlhausen <j...@jgm.org> wrote: > > >>>> > > >>>>> Wes et al, I think my core proposal is that Message.fbs:RecordBatch > > >>> split > > >>>>> the "length" parameter into "theoretical max length" and "utilized > > >>> length" > > >>>>> (perhaps not those exact names). > > >>>>> > > >>>>> "theoretical max length is the same as "length" now ... /// ...The > > >>> arrays > > >>>>> in the batch should all have this > > >>>>> > > >>>>> "utilized length" are the number of rows (starting from the first > > >> one) > > >>>>> that actually contain interesting data... the rest do not. > > >>>>> > > >>>>> The reason we can have a RecordBatch where these numbers are not the > > >>> same > > >>>>> is that the RecordBatch space was preallocated (for performance > > >>> reasons) > > >>>>> and the number of rows that actually "fit" depends on how correct the > > >>>>> preallocation was. In any case, it gives the user control of this > > >>>>> space/time tradeoff... wasted space in order to save time in record > > >>> batch > > >>>>> construction. The fact that some space will usually be wasted when > > >>> there > > >>>>> are variable-length columns (barring extreme luck) with this batch > > >>>>> construction paradigm explains the word "theoretical" above. This > > >> also > > >>>>> gives us the ability to look at a partially constructed batch that is > > >>> still > > >>>>> being constructed, given appropriate user-supplied concurrency > > >> control. 
> > >>>>> > > >>>>> I am not an expert in all of the Arrow variable-length data types, > > >> but > > >>> I > > >>>>> think this works if they are all similar to variable-length strings > > >>> where > > >>>>> we advance through "blob storage" by setting the indexes into that > > >>> storage > > >>>>> for the current and next row in order to indicate that we have > > >>>>> incrementally consumed more blob storage. (Conceptually this storage > > >>> is > > >>>>> "unallocated" after the pre-allocation and before rows are > > >> populated.) > > >>>>> > > >>>>> At a high level I am seeking to shore up the format for event ingress > > >>> into > > >>>>> real-time analytics that have some look-back window. If I'm not > > >>> mistaken I > > >>>>> think this is the subject of the last multi-sentence paragraph here?: > > >>>>> https://zd.net/2H0LlBY > > >>>>> > > >>>>> Currently we have a less-efficient paradigm where "microbatches" > > >> (e.g. > > >>> of > > >>>>> length 1 for minimal latency) have to spin the CPU periodically in > > >>> order to > > >>>>> be combined into buffers where we get the columnar layout benefit. > > >>> With > > >>>>> pre-allocation we can deal with microbatches (a partially populated > > >>> larger > > >>>>> RecordBatch) and immediately have the columnar layout benefits for > > >> the > > >>>>> populated section with no additional computation. > > >>>>> > > >>>>> For example, consider an event processing system that calculates a > > >>> "moving > > >>>>> average" as events roll in. While this is somewhat contrived lets > > >>> assume > > >>>>> that the moving average window is 1000 periods and our pre-allocation > > >>>>> ("theoretical max length") of RecordBatch elements is 100. The > > >>> algorithm > > >>>>> would be something like this, for a list of RecordBatch buffers in > > >>> memory: > > >>>>> > > >>>>> initialization(): > > >>>>> set up configuration of expected variable length storage > > >>> requirements, > > >>>>> e.g. 
the template RecordBatch mentioned below > > >>>>> > > >>>>> onIncomingEvent(event): > > >>>>> obtain lock /// cf. swoopIn() below > > >>>>> if last RecordBatch theoretical max length is not less than > > >> utilized > > >>>>> length or variable-length components of "event" will not fit in > > >>> remaining > > >>>>> blob storage: > > >>>>> create a new RecordBatch pre-allocation of max utilized length > > >> 100 > > >>> and > > >>>>> with blob preallocation that is max(expected, event .. in case the > > >>> single > > >>>>> event is larger than the expectation for 100 events) > > >>>>> (note: in the expected case this can be very fast as it is a > > >>>>> malloc() and a memcpy() from a template!) > > >>>>> set current RecordBatch to this newly created one > > >>>>> add event to current RecordBatch (for the non-calculated fields) > > >>>>> increment utilized length of current RecordBatch > > >>>>> calculate the calculated fields (in this case, moving average) by > > >>>>> looking back at previous rows in this and previous RecordBatch > > >> objects > > >>>>> free() any RecordBatch objects that are now before the lookback > > >>> window > > >>>>> > > >>>>> swoopIn(): /// somebody wants to chart the lookback window > > >>>>> obtain lock > > >>>>> visit all of the relevant data in the RecordBatches to construct > > >> the > > >>>>> chart /// notice that the last RecordBatch may not yet be "as full as > > >>>>> possible" > > >>>>> > > >>>>> The above analysis (minus the free()) could apply to the IPC file > > >>> format > > >>>>> and the lock could be a file lock and the swoopIn() could be a > > >> separate > > >>>>> process. In the case of the file format, while the file is locked, a > > >>> new > > >>>>> RecordBatch would overwrite the previous file Footer and a new Footer > > >>> would > > >>>>> be written. In order to be able to delete or archive old data > > >> multiple > > >>>>> files could be strung together in a logical series. 
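The append loop sketched above (minus the free() of expired chunks and the locking) can be written out in plain Python. `Chunk` stands in for a preallocated RecordBatch, and the toy capacities and window size are illustrative assumptions:

```python
CHUNK_CAPACITY = 4   # stand-in for the 100-row preallocation in the text
WINDOW = 6           # lookback window, in rows

class Chunk:
    """A preallocated "RecordBatch": fixed capacity, growing utilized length."""
    def __init__(self, capacity):
        self.values = [None] * capacity  # preallocated storage
        self.used = 0                    # "utilized length"
        self.capacity = capacity         # "theoretical max length"

chunks = []

def on_incoming_event(value):
    # Roll to a fresh preallocated chunk when the current one is full.
    if not chunks or chunks[-1].used == chunks[-1].capacity:
        chunks.append(Chunk(CHUNK_CAPACITY))
    c = chunks[-1]
    c.values[c.used] = value
    c.used += 1                          # expose the row to readers
    # Moving average over the last WINDOW populated rows, across chunks.
    tail = [v for ch in chunks for v in ch.values[:ch.used]][-WINDOW:]
    return sum(tail) / len(tail)

avgs = [on_incoming_event(x) for x in range(1, 10)]
```

Nine events at capacity 4 leave three chunks, the last only partially utilized, which is exactly the state a swoopIn() reader would observe mid-stream.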
> > >>>>> > > >>>>> -John > > >>>>> > > >>>>> On Tue, May 7, 2019 at 2:39 PM Wes McKinney <wesmck...@gmail.com> > > >>> wrote: > > >>>>> > > >>>>>> On Tue, May 7, 2019 at 12:26 PM John Muehlhausen <j...@jgm.org> > > >> wrote: > > >>>>>>> > > >>>>>>> Wes, are we saying that `pa.ipc.open_file(...).read_pandas()` > > >>> already > > >>>>>> reads > > >>>>>>> the future Feather format? If not, how will the future format > > >>> differ? I > > >>>>>>> will work on my access pattern with this format instead of the > > >>> current > > >>>>>>> feather format. Sorry I was not clear on that earlier. > > >>>>>>> > > >>>>>> > > >>>>>> Yes, under the hood those will use the same zero-copy binary > > >> protocol > > >>>>>> code paths to read the file. > > >>>>>> > > >>>>>>> Micah, thank you! > > >>>>>>> > > >>>>>>> On Tue, May 7, 2019 at 11:44 AM Micah Kornfield < > > >>> emkornfi...@gmail.com> > > >>>>>>> wrote: > > >>>>>>> > > >>>>>>>> Hi John, > > >>>>>>>> To give a specific pointer [1] describes how the streaming > > >>> protocol is > > >>>>>>>> stored to a file. > > >>>>>>>> > > >>>>>>>> [1] https://arrow.apache.org/docs/format/IPC.html#file-format > > >>>>>>>> > > >>>>>>>> On Tue, May 7, 2019 at 9:40 AM Wes McKinney < > > >> wesmck...@gmail.com> > > >>>>>> wrote: > > >>>>>>>> > > >>>>>>>>> hi John, > > >>>>>>>>> > > >>>>>>>>> As soon as the R folks can install the Arrow R package > > >>> consistently, > > >>>>>>>>> the intent is to replace the Feather internals with the plain > > >>> Arrow > > >>>>>>>>> IPC protocol where we have much better platform support all > > >>> around. > > >>>>>>>>> > > >>>>>>>>> If you'd like to experiment with creating an API for > > >>> pre-allocating > > >>>>>>>>> fixed-size Arrow protocol blocks and then mutating the data > > >> and > > >>>>>>>>> metadata on disk in-place, please be our guest. 
We don't have > > >>> the > > >>>>>>>>> tools developed yet to do this for you > > >>>>>>>>> > > >>>>>>>>> - Wes > > >>>>>>>>> > > >>>>>>>>> On Tue, May 7, 2019 at 11:25 AM John Muehlhausen <j...@jgm.org > > >>> > > >>>>>> wrote: > > >>>>>>>>>> > > >>>>>>>>>> Thanks Wes: > > >>>>>>>>>> > > >>>>>>>>>> "the current Feather format is deprecated" ... yes, but > > >> there > > >>>>>> will be a > > >>>>>>>>>> future file format that replaces it, correct? And my > > >>> discussion > > >>>>>> of > > >>>>>>>>>> immutable "portions" of Arrow buffers, rather than > > >>> immutability > > >>>>>> of the > > >>>>>>>>>> entire buffer, applies to IPC as well, right? I am only > > >>>>>> championing > > >>>>>>>> the > > >>>>>>>>>> idea that this future file format have the convenience that > > >>>>>> certain > > >>>>>>>>>> preallocated rows can be ignored based on a metadata > > >> setting. > > >>>>>>>>>> > > >>>>>>>>>> "I recommend batching your writes" ... this is extremely > > >>>>>> inefficient > > >>>>>>>> and > > >>>>>>>>>> adds unacceptable latency, relative to the proposed > > >>> solution. Do > > >>>>>> you > > >>>>>>>>>> disagree? Either I have a batch length of 1 to minimize > > >>> latency, > > >>>>>> which > > >>>>>>>>>> eliminates columnar advantages on the read side, or else I > > >> add > > >>>>>> latency. > > >>>>>>>>>> Neither works, and it seems that a viable alternative is > > >>> within > > >>>>>> sight? > > >>>>>>>>>> > > >>>>>>>>>> If you don't agree that there is a core issue and > > >> opportunity > > >>>>>> here, I'm > > >>>>>>>>> not > > >>>>>>>>>> sure how to better make my case.... 
> > >>>>>>>>>> > > >>>>>>>>>> -John > > >>>>>>>>>> > > >>>>>>>>>> On Tue, May 7, 2019 at 11:02 AM Wes McKinney < > > >>> wesmck...@gmail.com > > >>>>>>> > > >>>>>>>>> wrote: > > >>>>>>>>>> > > >>>>>>>>>>> hi John, > > >>>>>>>>>>> > > >>>>>>>>>>> On Tue, May 7, 2019 at 10:53 AM John Muehlhausen < > > >>> j...@jgm.org> > > >>>>>>>> wrote: > > >>>>>>>>>>>> > > >>>>>>>>>>>> Wes et al, I completed a preliminary study of > > >> populating a > > >>>>>> Feather > > >>>>>>>>> file > > >>>>>>>>>>>> incrementally. Some notes and questions: > > >>>>>>>>>>>> > > >>>>>>>>>>>> I wrote the following dataframe to a feather file: > > >>>>>>>>>>>> a b > > >>>>>>>>>>>> 0 0123456789 0.0 > > >>>>>>>>>>>> 1 0123456789 NaN > > >>>>>>>>>>>> 2 0123456789 NaN > > >>>>>>>>>>>> 3 0123456789 NaN > > >>>>>>>>>>>> 4 None NaN > > >>>>>>>>>>>> > > >>>>>>>>>>>> In re-writing the flatbuffers metadata (flatc -p doesn't > > >>>>>>>>>>>> support --gen-mutable! yuck! C++ to the rescue...), it > > >>> seems > > >>>>>> that > > >>>>>>>>>>>> read_feather is not affected by NumRows? It seems to be > > >>>>>> driven > > >>>>>>>>> entirely > > >>>>>>>>>>> by > > >>>>>>>>>>>> the per-column Length values? > > >>>>>>>>>>>> > > >>>>>>>>>>>> Also, it seems as if one of the primary usages of > > >>> NullCount > > >>>>>> is to > > >>>>>>>>>>> determine > > >>>>>>>>>>>> whether or not a bitfield is present? In the > > >>> initialization > > >>>>>> data > > >>>>>>>>> above I > > >>>>>>>>>>>> was careful to have a null value in each column in order > > >>> to > > >>>>>>>> generate > > >>>>>>>>> a > > >>>>>>>>>>>> bitfield. > > >>>>>>>>>>> > > >>>>>>>>>>> Per my prior e-mails, the current Feather format is > > >>> deprecated, > > >>>>>> so > > >>>>>>>> I'm > > >>>>>>>>>>> only willing to engage on a discussion of the official > > >> Arrow > > >>>>>> binary > > >>>>>>>>>>> protocol that we use for IPC (memory mapping) and RPC > > >>> (Flight). 
> > >>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> I then wiped the bitfields in the file and set all of > > >> the > > >>>>>> string > > >>>>>>>>> indices > > >>>>>>>>>>> to > > >>>>>>>>>>>> one past the end of the blob buffer (all strings empty): > > >>>>>>>>>>>> a b > > >>>>>>>>>>>> 0 None NaN > > >>>>>>>>>>>> 1 None NaN > > >>>>>>>>>>>> 2 None NaN > > >>>>>>>>>>>> 3 None NaN > > >>>>>>>>>>>> 4 None NaN > > >>>>>>>>>>>> > > >>>>>>>>>>>> I then set the first record to some data by consuming > > >>> some of > > >>>>>> the > > >>>>>>>>> string > > >>>>>>>>>>>> blob and row 0 and 1 indices, also setting the double: > > >>>>>>>>>>>> > > >>>>>>>>>>>> a b > > >>>>>>>>>>>> 0 Hello, world! 5.0 > > >>>>>>>>>>>> 1 None NaN > > >>>>>>>>>>>> 2 None NaN > > >>>>>>>>>>>> 3 None NaN > > >>>>>>>>>>>> 4 None NaN > > >>>>>>>>>>>> > > >>>>>>>>>>>> As mentioned above, NumRows seems to be ignored. I > > >> tried > > >>>>>> adjusting > > >>>>>>>>> each > > >>>>>>>>>>>> column Length to mask off higher rows and it worked for > > >> 4 > > >>>>>> (hide > > >>>>>>>> last > > >>>>>>>>> row) > > >>>>>>>>>>>> but not for less than 4. I take this to be due to math > > >>> used > > >>>>>> to > > >>>>>>>>> figure > > >>>>>>>>>>> out > > >>>>>>>>>>>> where the buffers are relative to one another since > > >> there > > >>> is > > >>>>>> only > > >>>>>>>> one > > >>>>>>>>>>>> metadata offset for all of: the (optional) bitset, index > > >>>>>> column and > > >>>>>>>>> (if > > >>>>>>>>>>>> string) blobs. > > >>>>>>>>>>>> > > >>>>>>>>>>>> Populating subsequent rows would proceed in a similar > > >> way > > >>>>>> until all > > >>>>>>>>> of > > >>>>>>>>>>> the > > >>>>>>>>>>>> blob storage has been consumed, which may come before > > >> the > > >>>>>>>>> pre-allocated > > >>>>>>>>>>>> rows have been consumed. 
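The blob-consumption scheme described here can be sketched in plain Python. Arrow stores variable-length strings as a data (blob) buffer plus an offsets buffer with one more entry than there are rows; the toy buffer sizes below are illustrative assumptions:

```python
BLOB_CAPACITY = 16
blob = bytearray(BLOB_CAPACITY)   # preallocated blob storage
offsets = [0] * 6                 # 5 preallocated rows -> 6 offsets
utilized = 0

def append_string(s):
    """Consume blob storage for one row; returns False when it won't fit."""
    global utilized
    data = s.encode()
    start = offsets[utilized]
    if start + len(data) > BLOB_CAPACITY or utilized + 1 >= len(offsets):
        return False              # caller rolls to a new preallocation
    blob[start:start + len(data)] = data
    offsets[utilized + 1] = start + len(data)
    utilized += 1
    return True

append_string("Hello,")
append_string("world!")
row0 = bytes(blob[offsets[0]:offsets[1]]).decode()
```

A too-large value fails the capacity check, which is the "blob storage consumed before the preallocated rows" case above.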
> > >>>>>>>>>>>> > > >>>>>>>>>>>> So what does this mean for my desire to incrementally > > >>> write > > >>>>>> these > > >>>>>>>>>>>> (potentially memory-mapped) pre-allocated files and/or > > >>> Arrow > > >>>>>>>> buffers > > >>>>>>>>> in > > >>>>>>>>>>>> memory? Some thoughts: > > >>>>>>>>>>>> > > >>>>>>>>>>>> - If a single value (such as NumRows) were consulted to > > >>>>>> determine > > >>>>>>>> the > > >>>>>>>>>>> table > > >>>>>>>>>>>> row dimension then updating this single value would > > >> serve > > >>> to > > >>>>>> tell a > > >>>>>>>>>>> reader > > >>>>>>>>>>>> which rows are relevant. Assuming this value is updated > > >>>>>> after all > > >>>>>>>>> other > > >>>>>>>>>>>> mutations are complete, and assuming that mutations are > > >>> only > > >>>>>>>> appends > > >>>>>>>>>>>> (addition of rows), then concurrency control involves > > >> only > > >>>>>> ensuring > > >>>>>>>>> that > > >>>>>>>>>>>> this value is not examined while it is being written. > > >>>>>>>>>>>> > > >>>>>>>>>>>> - NullCount presents a concurrency problem if someone > > >>> reads > > >>>>>> the > > >>>>>>>> file > > >>>>>>>>>>> after > > >>>>>>>>>>>> this field has been updated, but before NumRows has > > >>> exposed > > >>>>>> the new > > >>>>>>>>>>> record > > >>>>>>>>>>>> (or vice versa). The idea previously mentioned that > > >> there > > >>>>>> will > > >>>>>>>>> "likely > > >>>>>>>>>>>> [be] more statistics in the future" feels like it might > > >> be > > >>>>>> scope > > >>>>>>>>> creep to > > >>>>>>>>>>>> me? This is a data representation, not a calculation > > >>>>>> framework? 
> > >>>>>>>>>>>> If NullCount had its genesis in the optional nature of the
> > >>>>>>>>>>>> bitfield, I would suggest that perhaps NullCount can be
> > >>>>>>>>>>>> dropped in favor of always supplying the bitfield, which in
> > >>>>>>>>>>>> any event is already contemplated by the spec:
> > >>>>>>>>>>>> "Implementations may choose to always allocate one anyway
> > >>>>>>>>>>>> as a matter of convenience." If the concern is space
> > >>>>>>>>>>>> savings, Arrow is already an extremely uncompressed format.
> > >>>>>>>>>>>> (Compression is something I would also consider to be scope
> > >>>>>>>>>>>> creep as regards Feather... compressed filesystems can be
> > >>>>>>>>>>>> employed and there are other compressed dataframe formats.)
> > >>>>>>>>>>>> However, if protecting the 4 bytes required to update
> > >>>>>>>>>>>> NumRows turns out to be no easier than protecting all of
> > >>>>>>>>>>>> the statistical bytes as well, as part of the same
> > >>>>>>>>>>>> "critical section" (locks: yuck!!), then statistics pose no
> > >>>>>>>>>>>> issue. I have a feeling that the availability of an atomic
> > >>>>>>>>>>>> write of 4 bytes will depend on the storage mechanism...
> > >>>>>>>>>>>> memory vs memory map vs write() etc.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - The elephant in the room appears to be the presumptive
> > >>>>>>>>>>>> binary yes/no on mutability of Arrow buffers. Perhaps the
> > >>>>>>>>>>>> thought is that certain batch processes will be wrecked if
> > >>>>>>>>>>>> anyone anywhere is mutating buffers in any way.
But keep in mind I am not proposing general > > >>> mutability, > > >>>>>> only > > >>>>>>>>>>>> appending of new data. *A huge amount of batch > > >> processing > > >>>>>> that > > >>>>>>>> will > > >>>>>>>>> take > > >>>>>>>>>>>> place with Arrow is on time-series data (whether > > >>> financial or > > >>>>>>>>> otherwise). > > >>>>>>>>>>>> It is only natural that architects will want the minimal > > >>>>>> impedance > > >>>>>>>>>>> mismatch > > >>>>>>>>>>>> when it comes time to grow those time series as the > > >> events > > >>>>>> occur > > >>>>>>>>> going > > >>>>>>>>>>>> forward.* So rather than say that I want "mutable" > > >> Arrow > > >>>>>> buffers, > > >>>>>>>> I > > >>>>>>>>>>> would > > >>>>>>>>>>>> pitch this as a call for "immutable populated areas" of > > >>> Arrow > > >>>>>>>> buffers > > >>>>>>>>>>>> combined with the concept that the populated area can > > >>> grow up > > >>>>>> to > > >>>>>>>>> whatever > > >>>>>>>>>>>> was preallocated. This will not affect anyone who has > > >>>>>> "memoized" a > > >>>>>>>>>>>> dimension and wants to continue to consider the > > >>> then-current > > >>>>>> data > > >>>>>>>> as > > >>>>>>>>>>>> immutable... it will be immutable and will always be > > >>> immutable > > >>>>>>>>> according > > >>>>>>>>>>> to > > >>>>>>>>>>>> that then-current dimension. > > >>>>>>>>>>>> > > >>>>>>>>>>>> Thanks in advance for considering this feedback! I > > >>> absolutely > > >>>>>>>>> require > > >>>>>>>>>>>> efficient row-wise growth of an Arrow-like buffer to > > >> deal > > >>>>>> with time > > >>>>>>>>>>> series > > >>>>>>>>>>>> data in near real time. I believe that preallocation is > > >>> (by > > >>>>>> far) > > >>>>>>>> the > > >>>>>>>>>>> most > > >>>>>>>>>>>> efficient way to accomplish this. I hope to be able to > > >>> use > > >>>>>> Arrow! 
> > >>>>>>>>>>>> If I cannot use Arrow then I will be using a home-grown
> > >>>>>>>>>>>> Arrow that is identical except for this feature, which
> > >>>>>>>>>>>> would be very sad! Even if Arrow itself could be used in
> > >>>>>>>>>>>> this manner today, I would be hesitant to use it if the
> > >>>>>>>>>>>> use-case was not protected on a go-forward basis.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> I recommend batching your writes and using the Arrow binary
> > >>>>>>>>>>> streaming protocol so you are only appending to a file
> > >>>>>>>>>>> rather than mutating previously-written bytes. This use case
> > >>>>>>>>>>> is well defined and supported in the software already.
> > >>>>>>>>>>>
> > >>>>>>>>>>> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#streaming-format
> > >>>>>>>>>>>
> > >>>>>>>>>>> - Wes
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Of course, I am completely open to alternative ideas and
> > >>>>>>>>>>>> approaches!
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> -John
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Mon, May 6, 2019 at 11:39 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> hi John -- again, I would caution you against using
> > >>>>>>>>>>>>> Feather files for issues of longevity -- the internal
> > >>>>>>>>>>>>> memory layout of those files is a "dead man walking" so to
> > >>>>>>>>>>>>> speak.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I would advise against forking the project, IMHO that is a
> > >>>>>>>>>>>>> dark path that leads nowhere good.
We have a large community > > >> here > > >>> and > > >>>>>> we > > >>>>>>>>> accept > > >>>>>>>>>>>>> pull requests -- I think the challenge is going to be > > >>>>>> defining > > >>>>>>>> the > > >>>>>>>>> use > > >>>>>>>>>>>>> case to suitable clarity that a general purpose > > >> solution > > >>>>>> can be > > >>>>>>>>>>>>> developed. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> - Wes > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> On Mon, May 6, 2019 at 11:16 AM John Muehlhausen < > > >>>>>> j...@jgm.org> > > >>>>>>>>> wrote: > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> François, Wes, > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Thanks for the feedback. I think the most practical > > >>>>>> thing for > > >>>>>>>>> me to > > >>>>>>>>>>> do > > >>>>>>>>>>>>> is > > >>>>>>>>>>>>>> 1- write a Feather file that is structured to > > >>>>>> pre-allocate the > > >>>>>>>>> space > > >>>>>>>>>>> I > > >>>>>>>>>>>>> need > > >>>>>>>>>>>>>> (e.g. initial variable-length strings are of average > > >>> size) > > >>>>>>>>>>>>>> 2- come up with code to monkey around with the > > >> values > > >>>>>> contained > > >>>>>>>>> in > > >>>>>>>>>>> the > > >>>>>>>>>>>>>> vectors so that before and after each manipulation > > >> the > > >>>>>> file is > > >>>>>>>>> valid > > >>>>>>>>>>> as I > > >>>>>>>>>>>>>> walk the rows ... this is a writer that uses memory > > >>>>>> mapping > > >>>>>>>>>>>>>> 3- check back in here once that works, assuming that > > >>> it > > >>>>>> does, > > >>>>>>>> to > > >>>>>>>>> see > > >>>>>>>>>>> if > > >>>>>>>>>>>>> we > > >>>>>>>>>>>>>> can bless certain mutation paths > > >>>>>>>>>>>>>> 4- if we can't bless certain mutation paths, fork > > >> the > > >>>>>> project > > >>>>>>>> for > > >>>>>>>>>>> those > > >>>>>>>>>>>>> who > > >>>>>>>>>>>>>> care more about stream processing? 
> > >>>>>>>>>>>>>> That would not seem to be ideal, as I think mutation in
> > >>>>>>>>>>>>>> row-order across the data set is relatively low impact on
> > >>>>>>>>>>>>>> the overall design?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks again for engaging the topic!
> > >>>>>>>>>>>>>> -John
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Mon, May 6, 2019 at 10:26 AM Francois Saint-Jacques <
> > >>>>>>>>>>>>>> fsaintjacq...@gmail.com> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Hello John,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Arrow is not yet suited for partial writes. The
> > >>>>>>>>>>>>>>> specification only talks about fully frozen/immutable
> > >>>>>>>>>>>>>>> objects; you're in implementation-defined territory
> > >>>>>>>>>>>>>>> here. For example, the C++ library assumes the Array
> > >>>>>>>>>>>>>>> object is immutable; it memoizes the null count, and
> > >>>>>>>>>>>>>>> likely more statistics in the future.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> If you want to use pre-allocated buffers and arrays, you
> > >>>>>>>>>>>>>>> can use the column validity bitmap for this purpose,
> > >>>>>>>>>>>>>>> e.g. set all null by default and flip once the row is
> > >>>>>>>>>>>>>>> written. It suffers from concurrency issues (+
> > >>>>>>>>>>>>>>> invalidation issues as pointed out) when dealing with
> > >>>>>>>>>>>>>>> multiple columns. You'll have to use a barrier of some
> > >>>>>>>>>>>>>>> kind, e.g. a per-batch global atomic (if append-only),
> > >>>>>>>>>>>>>>> or dedicated column(s) à-la MVCC.
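François's suggestion of publishing rows by flipping validity bits can be sketched in plain Python; the bitmap below follows Arrow's LSB-first layout, and the capacity and helper names are illustrative:

```python
CAPACITY = 10
validity = bytearray((CAPACITY + 7) // 8)  # Arrow-style LSB-first bitmap
values = [0.0] * CAPACITY                  # preallocated value buffer

def publish_row(i, v):
    values[i] = v                          # write the data first...
    validity[i // 8] |= 1 << (i % 8)       # ...then flip the bit to publish

def is_valid(i):
    return bool(validity[i // 8] & (1 << (i % 8)))

publish_row(0, 3.14)
```

As François notes, this only orders the write within one column; publishing a multi-column row atomically still needs a barrier of some kind.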
> > >>>>>>>> But > > >>>>>>>>>>> then, > > >>>>>>>>>>>>>>> the reader needs to be aware of this and compute a > > >>> mask > > >>>>>> each > > >>>>>>>>> time > > >>>>>>>>>>> it > > >>>>>>>>>>>>>>> needs to query the partial batch. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> This is a common columnar database problem, see > > >> [1] > > >>> for > > >>>>>> a > > >>>>>>>>> recent > > >>>>>>>>>>> paper > > >>>>>>>>>>>>>>> on the subject. The usual technique is to store > > >> the > > >>>>>> recent > > >>>>>>>> data > > >>>>>>>>>>>>>>> row-wise, and transform it in column-wise when a > > >>>>>> threshold is > > >>>>>>>>> met > > >>>>>>>>>>> akin > > >>>>>>>>>>>>>>> to a compaction phase. There was a somewhat > > >> related > > >>>>>> thread > > >>>>>>>> [2] > > >>>>>>>>>>> lately > > >>>>>>>>>>>>>>> about streaming vs batching. In the end, I think > > >>> your > > >>>>>>>> solution > > >>>>>>>>>>> will be > > >>>>>>>>>>>>>>> very application specific. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> François > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> [1] > > >>>>>>>> https://db.in.tum.de/downloads/publications/datablocks.pdf > > >>>>>>>>>>>>>>> [2] > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>> > > >>> > > >> > > https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> On Mon, May 6, 2019 at 10:39 AM John Muehlhausen < > > >>>>>>>> j...@jgm.org> > > >>>>>>>>>>> wrote: > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> Wes, > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> I’m not afraid of writing my own C++ code to > > >> deal > > >>>>>> with all > > >>>>>>>> of > > >>>>>>>>>>> this > > >>>>>>>>>>>>> on the > > >>>>>>>>>>>>>>>> writer side. 
I just need a way to “append” > > >>>>>> (incrementally > > >>>>>>>>>>> populate) > > >>>>>>>>>>>>> e.g. > > >>>>>>>>>>>>>>>> feather files so that a person using e.g. > > >> pyarrow > > >>>>>> doesn’t > > >>>>>>>>> suffer > > >>>>>>>>>>> some > > >>>>>>>>>>>>>>>> catastrophic failure... and “on the side” I tell > > >>> them > > >>>>>> which > > >>>>>>>>> rows > > >>>>>>>>>>> are > > >>>>>>>>>>>>> junk > > >>>>>>>>>>>>>>>> and deal with any concurrency issues that can’t > > >> be > > >>>>>> solved > > >>>>>>>> in > > >>>>>>>>> the > > >>>>>>>>>>>>> arena of > > >>>>>>>>>>>>>>>> atomicity and ordering of ops. For now I care > > >>> about > > >>>>>> basic > > >>>>>>>>> types > > >>>>>>>>>>> but > > >>>>>>>>>>>>>>>> including variable-width strings. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> For event-processing, I think Arrow has to have > > >>> the > > >>>>>> concept > > >>>>>>>>> of a > > >>>>>>>>>>>>>>> partially > > >>>>>>>>>>>>>>>> full record set. Some alternatives are: > > >>>>>>>>>>>>>>>> - have a batch size of one, thus littering the > > >>>>>> landscape > > >>>>>>>> with > > >>>>>>>>>>>>> trivially > > >>>>>>>>>>>>>>>> small Arrow buffers > > >>>>>>>>>>>>>>>> - artificially increase latency with a batch > > >> size > > >>>>>> larger > > >>>>>>>> than > > >>>>>>>>>>> one, > > >>>>>>>>>>>>> but > > >>>>>>>>>>>>>>> not > > >>>>>>>>>>>>>>>> processing any data until a batch is complete > > >>>>>>>>>>>>>>>> - continuously re-write the (entire!) “main” > > >>> buffer as > > >>>>>>>>> batches of > > >>>>>>>>>>>>> length > > >>>>>>>>>>>>>>> 1 > > >>>>>>>>>>>>>>>> roll in > > >>>>>>>>>>>>>>>> - instead of one main buffer, several, and at > > >> some > > >>>>>>>> threshold > > >>>>>>>>>>> combine > > >>>>>>>>>>>>> the > > >>>>>>>>>>>>>>>> last N length-1 batches into a length N buffer > > >> ... 
> > >>>>>> still an > > >>>>>>>>>>>>> inefficiency > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> Consider the case of QAbstractTableModel as the > > >>>>>> underlying > > >>>>>>>>> data > > >>>>>>>>>>> for a > > >>>>>>>>>>>>>>> table > > >>>>>>>>>>>>>>>> or a chart. This visualization shows all of the > > >>> data > > >>>>>> for > > >>>>>>>> the > > >>>>>>>>>>> recent > > >>>>>>>>>>>>> past > > >>>>>>>>>>>>>>>> as well as events rolling in. If this model > > >>>>>> interface is > > >>>>>>>>>>>>> implemented as > > >>>>>>>>>>>>>>> a > > >>>>>>>>>>>>>>>> view onto “many thousands” of individual event > > >>>>>> buffers then > > >>>>>>>>> we > > >>>>>>>>>>> gain > > >>>>>>>>>>>>>>> nothing > > >>>>>>>>>>>>>>>> from columnar layout. (Suppose there are tons > > >> of > > >>>>>> columns > > >>>>>>>> and > > >>>>>>>>>>> most of > > >>>>>>>>>>>>>>> them > > >>>>>>>>>>>>>>>> are scrolled out of the view.). Likewise we > > >> cannot > > >>>>>> re-write > > >>>>>>>>> the > > >>>>>>>>>>>>> entire > > >>>>>>>>>>>>>>>> model on each event... time complexity blows up. > > >>>>>> What we > > >>>>>>>>> want > > >>>>>>>>>>> is to > > >>>>>>>>>>>>>>> have a > > >>>>>>>>>>>>>>>> large pre-allocated chunk and just change > > >>> rowCount() > > >>>>>> as > > >>>>>>>> data > > >>>>>>>>> is > > >>>>>>>>>>>>>>> “appended.” > > >>>>>>>>>>>>>>>> Sure, we may run out of space and have another > > >>> and > > >>>>>> another > > >>>>>>>>>>> chunk for > > >>>>>>>>>>>>>>>> future row ranges, but a handful of chunks > > >> chained > > >>>>>> together > > >>>>>>>>> is > > >>>>>>>>>>> better > > >>>>>>>>>>>>>>> than > > >>>>>>>>>>>>>>>> as many chunks as there were events! > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> And again, having a batch size >1 and delaying > > >> the > > >>>>>> data > > >>>>>>>>> until a > > >>>>>>>>>>>>> batch is > > >>>>>>>>>>>>>>>> full is a non-starter. 
>>>> I am really hoping to see partially-filled buffers as something we
>>>> keep our finger on moving forward! Or else, what am I missing?
>>>>
>>>> -John
>>>>
>>>> On Mon, May 6, 2019 at 8:24 AM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>
>>>>> hi John,
>>>>>
>>>>> In C++ the builder classes don't yet support writing into
>>>>> preallocated memory. It would be tricky for applications to determine
>>>>> a priori which segments of memory to pass to the builder. It seems
>>>>> only feasible for primitive / fixed-size types, so my guess would be
>>>>> that a separate set of interfaces would need to be developed for this
>>>>> task.
>>>>>
>>>>> - Wes
>>>>>
>>>>> On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <jacq...@apache.org> wrote:
>>>>>>
>>>>>> This is more of a question of implementation versus specification.
>>>>>> An Arrow buffer is generally built and then sealed. In different
>>>>>> languages, this building process works differently (a concern of the
>>>>>> language rather than of the memory specification). We don't
>>>>>> currently allow a half-built vector to be moved to another language
>>>>>> and then be further built. So the question is really more concrete:
>>>>>> what language are you looking at, and what is the specific pattern
>>>>>> you're trying to undertake for building?
>>>>>>
>>>>>> If you're trying to go across independent processes (whether the
>>>>>> same process restarted, or two separate processes active
>>>>>> simultaneously), you'll need to build up your own data structures to
>>>>>> help with this.
>>>>>>
>>>>>> On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <j...@jgm.org> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Glad to learn of this project — good work!
>>>>>>>
>>>>>>> If I allocate a single chunk of memory and start building Arrow
>>>>>>> format within it, does this chunk save any state regarding my
>>>>>>> progress?
>>>>>>>
>>>>>>> For example, suppose I allocate a column for floating point (fixed
>>>>>>> width) and a column for strings (variable width).
>>>>>>> Suppose I start building the floating point column at offset X
>>>>>>> into my single buffer, the string “pointer” column at offset Y into
>>>>>>> the same single buffer, and the string data elements at offset Z.
>>>>>>>
>>>>>>> I write one floating point number and one string, then go away.
>>>>>>> When I come back to this buffer to append another value, does the
>>>>>>> buffer itself know where I would begin? I.e., is there a
>>>>>>> differentiation in the column (or blob) data itself between the
>>>>>>> available space and the used space?
>>>>>>>
>>>>>>> Suppose I write a lot of large variable-width strings and “run out”
>>>>>>> of space for them before running out of space for floating point
>>>>>>> numbers or string pointers. (I guessed badly when doing the
>>>>>>> original allocation.) I consider this to be OK, since I can always
>>>>>>> “copy” the data to “compress out” the unused fp/pointer buckets...
>>>>>>> the choice is up to me.
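[Editor's note: the single-chunk layout described above (float64 values at offset X, int32 string offsets at offset Y, string bytes at offset Z) can be sketched in plain Python. Everything here is a hypothetical illustration, not an Arrow API; the offsets-buffer convention mirrors Arrow's variable-width layout. The sketch also makes the answer to the question visible: the buffer itself stores no cursor, so `nrows` and `str_end` have to be tracked on the side.]

```python
import struct

BUF_SIZE = 4096
X, Y, Z = 0, 1024, 2048    # region offsets guessed up front, per the email
buf = bytearray(BUF_SIZE)  # the single pre-allocated chunk

nrows = 0    # side-car state: rows written so far
str_end = 0  # side-car state: string bytes used so far
struct.pack_into("i", buf, Y, 0)  # Arrow-style: offsets[0] = 0

def append(fp_value, s):
    """Append one (float, string) row into the pre-allocated regions."""
    global nrows, str_end
    data = s.encode("utf-8")
    if Z + str_end + len(data) > BUF_SIZE:
        raise MemoryError("guessed badly; re-write/compact into a new chunk")
    struct.pack_into("d", buf, X + 8 * nrows, fp_value)       # value column
    buf[Z + str_end : Z + str_end + len(data)] = data         # string bytes
    str_end += len(data)
    struct.pack_into("i", buf, Y + 4 * (nrows + 1), str_end)  # offsets column
    nrows += 1

append(3.14, "hello")
append(2.72, "world!")

# Reading row 1 back requires knowing nrows: nothing in `buf` itself
# distinguishes the used slots from the zeroed free space.
lo = struct.unpack_from("i", buf, Y + 4)[0]
hi = struct.unpack_from("i", buf, Y + 8)[0]
print(bytes(buf[Z + lo : Z + hi]).decode())  # world!
```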
>>>>>>> The above, applied to a (feather?) file, is how I anticipate
>>>>>>> appending data to disk... pre-allocate a mem-mapped file and
>>>>>>> gradually fill it up. The efficiency of file utilization will
>>>>>>> depend on my projections regarding variable-width data types, but
>>>>>>> as I said above, I can always re-write the file if/when this
>>>>>>> bothers me.
>>>>>>>
>>>>>>> Is this the recommended and supported approach for incremental
>>>>>>> appends? I’m really hoping to use Arrow instead of rolling my own,
>>>>>>> but functionality like this is absolutely key! Hoping not to use a
>>>>>>> side-car file (or memory chunk) to store “append progress”
>>>>>>> information.
>>>>>>>
>>>>>>> I am brand new to this project, so please forgive me if I have
>>>>>>> overlooked something obvious. And again, looks like great work so
>>>>>>> far!
>>>>>>>
>>>>>>> Thanks!
>>>>>>> -John
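[Editor's note: the pre-allocate-and-fill mem-mapped file approach described above can be sketched with the standard library alone. This illustrates the pattern, not the Feather/Arrow file format; the 8-byte row-count header at the start of the file is an invented convention, which is exactly the kind of “append progress” side-car state the thread is asking Arrow to standardize.]

```python
import mmap
import os
import struct
import tempfile

path = os.path.join(tempfile.mkdtemp(), "chunk.bin")
CAPACITY = 8 + 8 * 1024  # invented layout: row-count header + 1024 float64 slots

with open(path, "wb") as f:
    f.truncate(CAPACITY)  # pre-allocate the whole file up front

# Writer: fill the file incrementally, publishing the row count as it grows.
with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), CAPACITY)
    for i, v in enumerate((1.0, 2.0, 3.0)):
        struct.pack_into("d", mm, 8 + 8 * i, v)  # write the value slot
        struct.pack_into("q", mm, 0, i + 1)      # then publish the new count
    mm.flush()
    mm.close()

# Reader: map the same file and trust only the published row count,
# ignoring the pre-allocated (zeroed) tail.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    n = struct.unpack_from("q", mm, 0)[0]
    rows = [struct.unpack_from("d", mm, 8 + 8 * i)[0] for i in range(n)]
    mm.close()
print(n, rows)  # 3 [1.0, 2.0, 3.0]
```

One design note: writing the count after the values (rather than before) means a reader that maps the file mid-append never sees a row it can't fully read, which is the ordering-of-ops concern raised earlier in the thread.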