[GitHub] [arrow] niyue commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

GitBox Fri, 08 Apr 2022 17:27:18 -0700


niyue commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r846549698



##########
cpp/src/arrow/ipc/writer.cc:
##########
@@ -263,6 +263,14 @@ class RecordBatchSerializer {
     out_->body_length = offset - buffer_start_offset_;
     DCHECK(bit_util::IsMultipleOf8(out_->body_length));
 
+    // copy given record batch's schema metadata to the serializer for 
serialization
+    auto const &metadata = batch.schema()->metadata();

Review Comment:
   @westonpace 
   >  Let's call it serialize_metadata to keep the tenses consistent
   Sure. I am not a native English speaker, and feel free to let me know if 
there is better naming for this parameter.
   
   @westonpace @lidavidm 
   I will give it a try but I am not completely sure what kind of API 
compatibility is required. Could you give me some suggestions on what the 
overloaded function signatures are desirable?
   
   Currently in the C++ API, we have:
   ```C++
   class ARROW_EXPORT RecordBatchWriter {
    public:
     virtual ~RecordBatchWriter();
   
     /// \brief Write a record batch to the stream
     ///
     /// \param[in] batch the record batch to write to the stream
     /// \return Status
     virtual Status WriteRecordBatch(const RecordBatch& batch) = 0;
   
     /// \brief Write possibly-chunked table by creating sequence of record 
batches
     /// \param[in] table table to write
     /// \return Status
     Status WriteTable(const Table& table);
   
     /// \brief Write Table with a particular chunksize
     /// \param[in] table table to write
     /// \param[in] max_chunksize maximum length of table chunks. To indicate
     /// that no maximum should be enforced, pass -1.
     /// \return Status
     virtual Status WriteTable(const Table& table, int64_t max_chunksize);
   ```
   
   Do we need to keep all existing APIs unchanged or is it allowed to introduce 
some compatible API changes?
   Some options I can think of:
   1) option 1, keep all existing APIs unchanged and add overloaded APIs only
   ```C++
   virtual Status WriteRecordBatch(const RecordBatch& batch) = 0;
   virtual Status WriteRecordBatch(const RecordBatch& batch, bool 
serialize_metadata); // new overload API
   Status WriteTable(const Table& table);
   virtual Status WriteTable(const Table& table, int64_t max_chunksize);
   virtual Status WriteTable(const Table& table, int64_t max_chunksize, bool 
serialize_metadata); // new overload API
   ```
   
   2) option 2, add a default parameter to existing APIs
   ```C++
   virtual Status WriteRecordBatch(const RecordBatch& batch, bool 
serialize_metadatda=false) = 0; // this will require modification to all sub 
classes
   Status WriteTable(const Table& table);
   virtual Status WriteTable(const Table& table, int64_t max_chunksize, bool 
serialize_metadata=false);
   ```
   
   I assume we expect option 1, is it correct?
   
   Additionally, there may be some higher level binding libraries 
(Python/Ruby/etc) using these APIs, and they may have to be changed to take 
advantage of these new APIs, what is our strategy for managing them for such a 
change? Do I need to change them all in this PR (I think I can try changing 
Python at least but I am not sure if I am familiar with all the binding 
languages to make such a change) or should I separate the changes into several 
PRs? Thanks.
   



##########
cpp/src/arrow/ipc/writer.cc:
##########
@@ -263,6 +263,14 @@ class RecordBatchSerializer {
     out_->body_length = offset - buffer_start_offset_;
     DCHECK(bit_util::IsMultipleOf8(out_->body_length));
 
+    // copy given record batch's schema metadata to the serializer for 
serialization
+    auto const &metadata = batch.schema()->metadata();

Review Comment:
   @westonpace 
   >  Let's call it serialize_metadata to keep the tenses consistent
   
   Sure. I am not a native English speaker, and feel free to let me know if 
there is better naming for this parameter.
   
   @westonpace @lidavidm 
   I will give it a try but I am not completely sure what kind of API 
compatibility is required. Could you give me some suggestions on what the 
overloaded function signatures are desirable?
   
   Currently in the C++ API, we have:
   ```C++
   class ARROW_EXPORT RecordBatchWriter {
    public:
     virtual ~RecordBatchWriter();
   
     /// \brief Write a record batch to the stream
     ///
     /// \param[in] batch the record batch to write to the stream
     /// \return Status
     virtual Status WriteRecordBatch(const RecordBatch& batch) = 0;
   
     /// \brief Write possibly-chunked table by creating sequence of record 
batches
     /// \param[in] table table to write
     /// \return Status
     Status WriteTable(const Table& table);
   
     /// \brief Write Table with a particular chunksize
     /// \param[in] table table to write
     /// \param[in] max_chunksize maximum length of table chunks. To indicate
     /// that no maximum should be enforced, pass -1.
     /// \return Status
     virtual Status WriteTable(const Table& table, int64_t max_chunksize);
   ```
   
   Do we need to keep all existing APIs unchanged or is it allowed to introduce 
some compatible API changes?
   Some options I can think of:
   1) option 1, keep all existing APIs unchanged and add overloaded APIs only
   ```C++
   virtual Status WriteRecordBatch(const RecordBatch& batch) = 0;
   virtual Status WriteRecordBatch(const RecordBatch& batch, bool 
serialize_metadata); // new overload API
   Status WriteTable(const Table& table);
   virtual Status WriteTable(const Table& table, int64_t max_chunksize);
   virtual Status WriteTable(const Table& table, int64_t max_chunksize, bool 
serialize_metadata); // new overload API
   ```
   
   2) option 2, add a default parameter to existing APIs
   ```C++
   virtual Status WriteRecordBatch(const RecordBatch& batch, bool 
serialize_metadatda=false) = 0; // this will require modification to all sub 
classes
   Status WriteTable(const Table& table);
   virtual Status WriteTable(const Table& table, int64_t max_chunksize, bool 
serialize_metadata=false);
   ```
   
   I assume we expect option 1, is it correct?
   
   Additionally, there may be some higher level binding libraries 
(Python/Ruby/etc) using these APIs, and they may have to be changed to take 
advantage of these new APIs, what is our strategy for managing them for such a 
change? Do I need to change them all in this PR (I think I can try changing 
Python at least but I am not sure if I am familiar with all the binding 
languages to make such a change) or should I separate the changes into several 
PRs? Thanks.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] niyue commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Reply via email to