Hi there,

I am investigating Apache Arrow for analyzing time series data. I would
like to store record-batch-specific metadata, for example statistics or
tags describing the data in a particular record batch. More specifically,
I may use a single record batch to store metric samples for a certain time
range, and I would like to store the min/max time and some dimensional data
such as `host` and `aws_region` as metadata for that batch. The metadata
would vary from batch to batch within an IPC file, so that when loading
multiple record batches I can filter them quickly using the metadata alone,
without looking at the data in the arrays. Is it possible to store such
per-record-batch metadata in an Arrow IPC file?

I found a similar effort on the web [1], but it stores the metadata for
all record batches in the schema in the IPC file footer. I think the footer
is fully loaded on every access, which introduces unnecessary IO when only
a few of the record batches are read each time.

I read some docs/source code [2] [3], and if my understanding is correct,
it is technically possible to store different metadata in different record
batches, since in the streaming format each message has a `custom_metadata`
field associated with it. But I can't find any API (at least in pyarrow)
that allows me to do this. APIs like `pyarrow.record_batch` do allow users
to specify metadata when constructing a record batch, but that metadata
doesn't seem to be used if `RecordBatchFileWriter` is given a schema (which
of course doesn't contain such record-batch-specific metadata).

I haven't looked into the lower-level C++ API yet. The assumption seems to
be that all batches in an IPC file share the same schema, but are they
allowed to have different metadata if the schema (field names and their
types) is the same? If this is not currently allowed, do you think it is a
valid use case to support? Thanks.

[1]
https://github.com/heterodb/pg-strom/wiki/806%3A-Apache-Arrow-Min-Max-Statistics-Hint
[2]
https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc
[3]
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_table.py
