niyue commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r845757121
##########
cpp/src/arrow/ipc/writer.cc:
##########
@@ -263,6 +263,14 @@ class RecordBatchSerializer {
out_->body_length = offset - buffer_start_offset_;
DCHECK(bit_util::IsMultipleOf8(out_->body_length));
+ // copy given record batch's schema metadata to the serializer for
serialization
+ auto const &metadata = batch.schema()->metadata();
Review Comment:
I think there are two cases.
1) in one case, a user simply re-uses the overall schema that happens to
have some metadata, and it is less likely this user wants to duplicate the
schema metadata in all batches and discard them silently is okay (and
duplicating it all record batches may be undesirable and problematic):
```python
schema = pa.schema(
[
("values", pa.int64()),
],
metadata={"foo": "bar"},
)
writer = pa.RecordBatchFileWriter(
ipc_file, schema
)
for i in range(num_batches):
batch = pa.record_batch(
[int_array],
schema=schema # <=== re-use the overall schema having metadata
)
writer.write_batch(batch)
```
2) in the other case, a user provides metadata explicitly when creating a
record batch, and discarding the metadata silently may not be desirable in this
case:
```python
schema = pa.schema(
[
("values", pa.int64()),
],
metadata={"foo": "bar"},
)
writer = pa.RecordBatchFileWriter(
ipc_file, schema
)
for i in range(num_batches):
batch = pa.record_batch(
[int_array],
names=["values"],
metadata={"batch_id": str(i)}, # <=== pass a metadata explicitly
)
writer.write_batch(batch)
```
The underlying implementation is difficult to tell these two cases.
Some current API, like `pyarrow.record_batch`, already allows users to
specify metadata explicitly. If we provide an overloaded function asking users
to provide metadata, it seems we make the `metadata` in these APIs useless (or
only useful when the record batch is in memory). The code may look like:
```python
batch = pa.record_batch(
[int_array],
names=["values"] # <== `metadata` passed here will never be used by
`write_batch` so users won't pass this parameter any more
)
writer.write_batch(batch, metadata={"batch_id": str(i)})
```
Maybe a boolean flag indicating if the metadata in record batch should be
serialized?
```python
batch = pa.record_batch(
[int_array],
names=["values"],
metadata={"batch_id": str(i)}
)
writer.write_batch(batch, serializing_metadata=True)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]