benibus commented on code in PR #15194:
URL: https://github.com/apache/arrow/pull/15194#discussion_r1088205031


##########
cpp/src/arrow/ipc/writer.cc:
##########
@@ -176,19 +176,43 @@ class RecordBatchSerializer {
                                    field_nodes_, buffer_meta_, options_, &out_->metadata);
   }
 
+  bool ShouldCompress(int64_t uncompressed_size, int64_t compressed_size) const {
+    auto max_compressed_size = static_cast<int64_t>(
+        std::floor((1.0 - options_.min_space_savings) * uncompressed_size));
+    return compressed_size <= max_compressed_size;
+  }
+
   Status CompressBuffer(const Buffer& buffer, util::Codec* codec,
                         std::shared_ptr<Buffer>* out) {
-    // Convert buffer to uncompressed-length-prefixed compressed buffer
+    // Convert buffer to uncompressed-length-prefixed buffer. The actual body may or may
+    // not be compressed, depending on user-preference and projected size reduction.
     int64_t maximum_length = codec->MaxCompressedLen(buffer.size(), buffer.data());
-    ARROW_ASSIGN_OR_RAISE(auto result, AllocateBuffer(maximum_length + sizeof(int64_t)));
+    int64_t prefixed_length = buffer.size();
 
-    int64_t actual_length;
-    ARROW_ASSIGN_OR_RAISE(actual_length,
+    ARROW_ASSIGN_OR_RAISE(auto result,
+                          AllocateResizableBuffer(maximum_length + sizeof(int64_t)));
+    ARROW_ASSIGN_OR_RAISE(auto actual_length,
                           codec->Compress(buffer.size(), buffer.data(), maximum_length,
                                           result->mutable_data() + sizeof(int64_t)));
+    // FIXME: Not the most sophisticated way to handle this. Ideally, you'd want to avoid
+    // pre-compressing the entire buffer via some kind of sampling method. As the feature
+    // gains adoption, this may become a worthwhile optimization.
+    if (!ShouldCompress(buffer.size(), actual_length)) {
+      if (buffer.size() < actual_length || buffer.size() > maximum_length) {

Review Comment:
   It was strictly to zero the excess padding without manually using `memset` 
in a separate path. TBH, I'm not _entirely_ sure if zeroing is necessary in 
this context, but I erred on the side of caution since the original allocation 
would've been pre-initialized.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
