niyue commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r845757121


##########
cpp/src/arrow/ipc/writer.cc:
##########
@@ -263,6 +263,14 @@ class RecordBatchSerializer {
     out_->body_length = offset - buffer_start_offset_;
     DCHECK(bit_util::IsMultipleOf8(out_->body_length));
 
+    // copy given record batch's schema metadata to the serializer for serialization
+    auto const &metadata = batch.schema()->metadata();

Review Comment:
   I think there are two cases.
   
   1) In one case, a user simply re-uses the overall schema, which happens to have some metadata. It is unlikely this user wants the schema metadata duplicated into every batch, so discarding it silently is okay (duplicating it across all record batches may even be undesirable and problematic):
   ```python
       schema = pa.schema(
           [
               ("values", pa.int64()),
           ],
           metadata={"foo": "bar"},
       )
       writer = pa.RecordBatchFileWriter(
           ipc_file, schema
       )
       for i in range(num_batches):
           batch = pa.record_batch(
               [int_array],
            schema=schema # <=== re-use the overall schema, which has metadata
           )
           writer.write_batch(batch)
   ```
   
   2) In the other case, a user provides metadata explicitly when creating a record batch, and silently discarding that metadata may not be desirable:
   ```python
       schema = pa.schema(
           [
               ("values", pa.int64()),
           ],
           metadata={"foo": "bar"},
       )
       writer = pa.RecordBatchFileWriter(
           ipc_file, schema
       )
       for i in range(num_batches):
           batch = pa.record_batch(
               [int_array],
               names=["values"],
               metadata={"batch_id": str(i)}, # <=== pass a metadata explicitly
           )
           writer.write_batch(batch)
   ```
   It is difficult for the underlying implementation to tell these two cases apart.
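
   For example, here is a minimal sketch built from the two snippets above (`int_array` is assumed to be an `int64` array, as in those examples): from the serializer's point of view, both batches simply expose metadata via `batch.schema()->metadata()`, with nothing indicating how it got there:
   ```python
   import pyarrow as pa

   int_array = pa.array([1, 2, 3], type=pa.int64())

   # case 1: metadata arrives by re-using the overall schema
   batch_reused = pa.record_batch(
       [int_array],
       schema=pa.schema([("values", pa.int64())], metadata={"foo": "bar"}),
   )

   # case 2: metadata is attached explicitly for this one batch
   batch_explicit = pa.record_batch(
       [int_array], names=["values"], metadata={"batch_id": "0"}
   )

   # Either way, the writer only sees metadata hanging off the batch's
   # schema; nothing distinguishes "inherited" from "per-batch" metadata.
   print(batch_reused.schema.metadata)    # {b'foo': b'bar'}
   print(batch_explicit.schema.metadata)  # {b'batch_id': b'0'}
   ```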
   
   Some current APIs, like `pyarrow.record_batch`, already allow users to specify metadata explicitly. If we provide an overloaded `write_batch` that asks users to supply metadata at write time, we would effectively make the `metadata` parameter in those APIs useless (or only useful while the record batch is in memory). The code may look like:
   ```python
   batch = pa.record_batch(
       [int_array],
       names=["values"] # <== `metadata` passed here will never be used by 
`write_batch` so users won't pass this parameter any more
   )
   writer.write_batch(batch, metadata={"batch_id": str(i)})
   ```
   
   Maybe a boolean flag indicating whether the metadata in the record batch should be serialized?
   ```python
   batch = pa.record_batch(
       [int_array],
       names=["values"],
       metadata={"batch_id": str(i)}
   )
   writer.write_batch(batch, serializing_metadata=True)
   ```
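
   Under this proposal, case 1 could keep today's behavior by defaulting the flag to `False`, while case 2 users opt in per call (`serializing_metadata` is only an illustrative name here, not an existing parameter):
   ```python
   # case 1: re-used schema metadata is not duplicated into each batch message
   writer.write_batch(batch_reused)

   # case 2: the user explicitly asks for this batch's metadata to be written
   writer.write_batch(batch_explicit, serializing_metadata=True)
   ```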


