Hi there, I am investigating using Apache Arrow to analyze time series data, and I would like to store metadata that is specific to each record batch, for example statistics or tags describing the data in that batch. More specifically, I may use a single record batch to store the metric samples for a certain time range, and I would like to store the min/max time and some dimensional data such as `host` and `aws_region` as metadata on that batch. The metadata would then vary from batch to batch within an IPC file, and when loading multiple record batches I could filter them quickly using the metadata alone, without looking at the data in the arrays. Is it possible to store such per-record-batch metadata in an Arrow IPC file?
I found a similar effort on the web [1], but it stores the metadata for all record batches in the schema held in the IPC file footer. I think the footer is fully loaded on every access, which introduces unnecessary IO if only a few of the record batches are read each time. I read some docs and source code [2] [3], and if my understanding is correct, it is technically possible to store different metadata in different record batches, since in the streaming format each message has a `custom_metadata` field associated with it. However, I could not find any API (at least in pyarrow) that allows me to do this. APIs like `pyarrow.record_batch` do allow users to specify metadata when constructing a record batch, but that metadata does not seem to be used when `RecordBatchFileWriter` is given a schema (which of course does not contain such record-batch-specific metadata).

I haven't looked into the lower-level C++ API yet. The assumption seems to be that all batches in an IPC file share the same schema, but are they allowed to have different metadata as long as the schema (field names and their types) is the same? If this is not allowed currently, do you think it is a valid use case to support?

Thanks.

[1] https://github.com/heterodb/pg-strom/wiki/806%3A-Apache-Arrow-Min-Max-Statistics-Hint
[2] https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc
[3] https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_table.py