Re: storing per record batch metadata in arrow IPC file

Yue Ni Wed, 06 Apr 2022 06:42:35 -0700

Hi Weston,

> The C++ implementation does not expose this today that I can tell. So if
> you want to use this then some C++ changes will be needed.  There is
> already a JIRA ticket for this at [2].
Thanks for pointing this out. It seems the ticket I logged above,
ARROW-16131, duplicates ARROW-6940, which you mentioned. After checking the
source code for this part, I gave it a try and submitted a PR:
https://github.com/apache/arrow/pull/12812


> On the other hand, if you already know what subset of batches you
> are interested in, then I could maybe see some advantage in storing
> the metadatas separately but only if the metadata is quite large.
This is indeed the case I am investigating. I plan to use an external
index to pick the subset of batches to query. By memory mapping the IPC
file, I can randomly access just those record batches and then use the
metadata in each batch for further filtering or to provide extra info.

> If the metadata is relatively small (KBs) then I still think you'd be
> better off storing it all in the footer in most cases (or there wouldn't
> be much difference)
I am still not sure whether it is worth putting the info in each batch's
metadata. I ran into this issue while experimenting to see if it helps. In
my case, I might create an IPC file with several thousand record batches,
and each batch may carry up to roughly 100 bytes of metadata. If I put the
info for all batches into the footer's metadata, then depending on the shape
of the data, the footer may hold 100 KB or more of metadata on an unhappy
path, which I think could be wasteful when only a few batches are read. But
you are correct that this may not matter much in many cases. I am still
experimenting, and thanks so much for the detailed guidance.


On Wed, Apr 6, 2022 at 2:41 PM Weston Pace <weston.p...@gmail.com> wrote:

> Correct, the "ground truth" so to speak for these things is probably
> the flatbuffers files[1] (Message.fbs, Schema.fbs, and File.fbs in
> this case). There is a per-message custom metadata field that could be
> used as you describe.  The C++ implementation does not expose this
> today that I can tell.  So if you want to use this then some C++
> changes will be needed.  There is already a JIRA ticket for this at
> [2].
>
> > the metadata may
> > vary from batch to batch in an IPC file, and I can filter these batches
> > quickly simply using metadata without looking into data in the arrays.
>
> > There is a similar effort I can find on the web [1], but it stores all the
> > record batches metadata in the IPC file footer's schema. I think the footer
> > will be fully loaded for every access, which will introduce some
> > unnecessary IO if only a few of the record batches are read each time.
>
> I'm not sure the two above statements work together well.  If you want
> to use the metadata to determine which batches to read then you will
> need to read the metadata for every single batch.  So it doesn't make
> sense to spread this information throughout the file.
>
> On the other hand, if you already know what subset of batches you are
> interested in, then I could maybe see some advantage in storing the
> metadatas separately but only if the metadata is quite large.  If the
> metadata is relatively small (KBs) then I still think you'd be better
> off storing it all in the footer in most cases (or there wouldn't be
> much difference).
>
> If you're doing streaming processing of the entire file then it
> probably doesn't matter much either way.
>
> So there might be some potential here but I wouldn't say it is a sure
> thing.
>
> [1] https://github.com/apache/arrow/tree/master/format
> [2] https://issues.apache.org/jira/browse/ARROW-6940
>
> On Tue, Apr 5, 2022 at 7:26 PM Yue Ni <niyue....@gmail.com> wrote:
> >
> > Hi Aldrin,
> >
> > Thanks for the pointers. I checked out the C++ source code of this part,
> > and I think currently record batch specific metadata is not written into
> > the IPC file probably due to a bug in the code. I logged a bug to track
> > this issue (https://issues.apache.org/jira/browse/ARROW-16131), thanks so
> > much for the help.
> >
> > On Wed, Apr 6, 2022 at 12:58 AM Aldrin <akmon...@ucsc.edu.invalid> wrote:
> >
> > > Hm, I didn't think it was possible, but it looks like there may be some
> > > things you can try?
> > >
> > > My understanding was that you create a writer for an IPC stream or file and
> > > you pass a schema on construction which is used as "the schema" for the IPC
> > > stream/file. So, RecordBatches written using that writer should/need to
> > > match the given schema. This doesn't check the metadata, I don't think, but
> > > it only writes an "IPC payload" if the equality check passes.
> > >
> > > That being said, I did some checking, and some things seem like it's more
> > > flexible now (but I could be wrong). I'm not sure what the dictionary
> > > deltas are (maybe it's for dictionary arrays rather than metadata), but
> > > the "emit_dictionary_deltas" IpcOption may be relevant [1]. Otherwise, the
> > > `WriteRecordBatch` function appears to take a metadata length [2] and the
> > > `WriteRecordBatchStream` function [3] seems to only check that a vector of
> > > RecordBatches have matching schemas. Also, the `WritePayload` function
> > > (from a RecordBatchWriter via MakeFileWriter) seems to be relevant for how
> > > to write metadata that can be leveraged for a seek-based interface [4].
> > >
> > > But, ultimately, I am not sure these things are exposed at a higher level
> > > (e.g. pyarrow), even though they're available for use. They're also not
> > > exposed via the feather interface, as far as I know.
> > >
> > > [1]: https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc15IpcWriteOptions22emit_dictionary_deltasE
> > > [2]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L644
> > > [3]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L665
> > > [4]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L1253
> > >
> > > Aldrin Montana
> > > Computer Science PhD Student
> > > UC Santa Cruz
> > >
> > >
> > > On Tue, Apr 5, 2022 at 1:55 AM Yue Ni <niyue....@gmail.com> wrote:
> > >
> > > > Hi there,
> > > >
> > > > I am investigating analyzing time series data using apache arrow. I would
> > > > like to store some record batch specific metadata, for example, some
> > > > statistics/tags about data in a particular record batch. More specifically,
> > > > I may use a single record batch to store metric samples for a certain time
> > > > range, and would like to store the min/max time and some dimensional data
> > > > like `host` and `aws_region` as metadata for a particular record batch so
> > > > that when loading multiple record batches from IPC file, the metadata may
> > > > vary from batch to batch in an IPC file, and I can filter these batches
> > > > quickly simply using metadata without looking into data in the arrays. And
> > > > I would like to know if it is possible to store such per record batch
> > > > metadata in an arrow IPC file.
> > > >
> > > > There is a similar effort I can find on the web [1], but it stores all the
> > > > record batches metadata in the IPC file footer's schema. I think the footer
> > > > will be fully loaded for every access, which will introduce some
> > > > unnecessary IO if only a few of the record batches are read each time.
> > > >
> > > > I read some docs/source code [2] [3], and if my understanding is correct,
> > > > it is technically possible to store different metadata in different record
> > > > batches since in the streaming format, each message has a `custom_metadata`
> > > > associated with it. But I don't find any API (at least in pyarrow) allowing
> > > > me to do this. APIs like `pyarrow.record_batch` does allow users to specify
> > > > metadata when constructing a record batch, but it doesn't seem to be used
> > > > if `RecordBatchFileWriter` has a schema provided (which of course doesn't
> > > > have such record batch specific metadata).
> > > >
> > > > I haven't looked into the lower level C++ API yet, and it seems the
> > > > assumption is that all the batches in the IPC file should share the same
> > > > schema, but do we allow them to have different metadata if the schema
> > > > (field names and their types) is the same? If we don't allow such usage
> > > > currently, do you think it is a valid use case to support this kind of
> > > > usage? Thanks.
> > > >
> > > > [1] https://github.com/heterodb/pg-strom/wiki/806%3A-Apache-Arrow-Min-Max-Statistics-Hint
> > > > [2] https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc
> > > > [3] https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_table.py
> > > >
> > >
>
