Hi Aldrin,

Thanks for the pointers. I checked the relevant C++ source code, and it looks like record-batch-specific metadata is currently not written into the IPC file, probably due to a bug in the code. I logged a JIRA to track this issue (https://issues.apache.org/jira/browse/ARROW-16131). Thanks so much for the help.
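For reference, here is a minimal pyarrow sketch of the behavior I am describing (written against pyarrow 7.0 and meant as an illustration rather than an exact reproduction): metadata passed to `pyarrow.record_batch` ends up on the batch's schema, but after writing through a `RecordBatchFileWriter` and reading the file back, the per-batch metadata is gone.

```python
import pyarrow as pa

# Two batches with identical field names/types but different schema-level
# metadata (e.g. per-batch min/max timestamps).
batch1 = pa.record_batch([pa.array([1, 2, 3])], names=["ts"],
                         metadata={"min_ts": "1", "max_ts": "3"})
batch2 = pa.record_batch([pa.array([4, 5, 6])], names=["ts"],
                         metadata={"min_ts": "4", "max_ts": "6"})
print(batch1.schema.metadata)  # {b'min_ts': b'1', b'max_ts': b'3'}

# The writer is constructed with a single schema, which carries no per-batch
# metadata; the writer's schema check appears to ignore metadata, so both
# batches are accepted.
sink = pa.BufferOutputStream()
with pa.ipc.new_file(sink, pa.schema([("ts", pa.int64())])) as writer:
    writer.write_batch(batch1)
    writer.write_batch(batch2)

# Reading back: the metadata attached at construction time is not recovered.
reader = pa.ipc.open_file(sink.getvalue())
for i in range(reader.num_record_batches):
    print(i, reader.get_batch(i).schema.metadata)  # None for both batches
```

Only the writer's schema (and its file-level metadata, if any) survives the round trip, which is what the JIRA above is about.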
On Wed, Apr 6, 2022 at 12:58 AM Aldrin <akmon...@ucsc.edu.invalid> wrote:

> Hm, I didn't think it was possible, but it looks like there may be some things you can try?
>
> My understanding was that you create a writer for an IPC stream or file and pass a schema on construction, which is used as "the schema" for the IPC stream/file. So, RecordBatches written using that writer should/need to match the given schema. This doesn't check the metadata, I don't think, and it only writes an "IPC payload" if the equality check passes.
>
> That being said, I did some checking, and it seems like things are more flexible now (but I could be wrong). I'm not sure what the dictionary deltas are (maybe they're for dictionary arrays rather than metadata), but the "emit_dictionary_deltas" IPC option may be relevant [1]. Otherwise, the `WriteRecordBatch` function appears to take a metadata length [2], and the `WriteRecordBatchStream` function [3] seems to only check that a vector of RecordBatches have matching schemas. Also, the `WritePayload` function (from a RecordBatchWriter via MakeFileWriter) seems relevant to how metadata is written so that it can be leveraged for a seek-based interface [4].
>
> But, ultimately, I am not sure these things are exposed at a higher level (e.g. pyarrow), even though they're available for use. They're also not exposed via the Feather interface, as far as I know.
>
> [1]: https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc15IpcWriteOptions22emit_dictionary_deltasE
> [2]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L644
> [3]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L665
> [4]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L1253
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
> On Tue, Apr 5, 2022 at 1:55 AM Yue Ni <niyue....@gmail.com> wrote:
>
> > Hi there,
> >
> > I am investigating analyzing time series data using Apache Arrow, and I would like to store some record-batch-specific metadata, for example, statistics/tags about the data in a particular record batch. More specifically, I may use a single record batch to store metric samples for a certain time range, and would like to store the min/max time and some dimensional data like `host` and `aws_region` as metadata for that record batch. The metadata may then vary from batch to batch within an IPC file, and when loading multiple record batches I could filter them quickly using the metadata alone, without looking into the data in the arrays. I would like to know whether it is possible to store such per-record-batch metadata in an Arrow IPC file.
> >
> > There is a similar effort I found on the web [1], but it stores all the record batches' metadata in the IPC file footer's schema. I think the footer will be fully loaded on every access, which introduces some unnecessary IO if only a few of the record batches are read each time.
> >
> > I read some docs/source code [2] [3], and if my understanding is correct, it is technically possible to store different metadata in different record batches, since in the streaming format each message has a `custom_metadata` associated with it. But I don't find any API (at least in pyarrow) allowing me to do this.
> > APIs like `pyarrow.record_batch` do allow users to specify metadata when constructing a record batch, but that metadata doesn't seem to be used, since `RecordBatchFileWriter` is given a schema at construction (which of course doesn't carry such record-batch-specific metadata).
> >
> > I haven't looked into the lower-level C++ API yet, and it seems the assumption is that all the batches in an IPC file should share the same schema. But do we allow them to have different metadata if the schema (field names and their types) is the same? If we don't allow such usage currently, do you think it is a valid use case to support? Thanks.
> >
> > [1] https://github.com/heterodb/pg-strom/wiki/806%3A-Apache-Arrow-Min-Max-Statistics-Hint
> > [2] https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc
> > [3] https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_table.py
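PS: For anyone who finds this thread later, the footer-based approach from [1] can be approximated in pyarrow today by putting per-batch statistics into the file-level schema metadata, keyed by batch index, at the cost of always loading all of the statistics along with the footer. A rough, untested sketch (the `batch_stats` key and the JSON encoding are just illustrative choices, not an Arrow convention):

```python
import json
import pyarrow as pa

batches = [
    pa.record_batch([pa.array([1, 2, 3])], names=["ts"]),
    pa.record_batch([pa.array([40, 50, 60])], names=["ts"]),
]

# Collect per-batch min/max and stash them in the *file-level* schema metadata,
# which does round trip through the IPC file format.
stats = {}
for i, b in enumerate(batches):
    ts = b.column(0).to_pylist()
    stats[str(i)] = {"min_ts": min(ts), "max_ts": max(ts)}
schema = batches[0].schema.with_metadata({"batch_stats": json.dumps(stats)})

sink = pa.BufferOutputStream()
with pa.ipc.new_file(sink, schema) as writer:
    for b in batches:
        writer.write_batch(b)

# Read side: the statistics arrive with the footer/schema before any batch is
# read, and get_batch(i) gives random access to only the batches that match.
reader = pa.ipc.open_file(sink.getvalue())
loaded = json.loads(reader.schema.metadata[b"batch_stats"])
wanted = [int(i) for i, s in loaded.items() if s["max_ts"] >= 40]
selected = [reader.get_batch(i) for i in wanted]
print(wanted, selected[0].num_rows)  # [1] 3
```

Per-message `custom_metadata`, as discussed above, would avoid pulling every batch's statistics in with the footer, which is the part that doesn't seem reachable from pyarrow today.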