sfc-gh-nthimmegowda commented on code in PR #14147:
URL: https://github.com/apache/arrow/pull/14147#discussion_r974954780


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2355,6 +2355,81 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_DECODER
+
+class RleBooleanDecoder : public DecoderImpl, virtual public BooleanDecoder {
+ public:
+  explicit RleBooleanDecoder(const ColumnDescriptor* descr)
+      : DecoderImpl(descr, Encoding::RLE) {}
+
+  void SetData(int num_values, const uint8_t* data, int len) override {
+    num_values_ = num_values;
+    uint32_t num_bytes = 0;
+
+    if (len < 4) {
+      throw ParquetException("Received invalid length : " + std::to_string(len) +
+                             " (corrupt data page?)");
+    }
+    // Load the first 4 bytes in little-endian, which indicates the length
+    num_bytes =
+        ::arrow::bit_util::ToLittleEndian(::arrow::util::SafeLoadAs<uint32_t>(data));
+    if (num_bytes < 0 || num_bytes > static_cast<uint32_t>(len - 4)) {
+      throw ParquetException("Received invalid number of bytes : " +
+                             std::to_string(num_bytes) + " (corrupt data page?)");
+    }
+
+    auto decoder_data = data + 4;
+    decoder_ = std::make_shared<::arrow::util::RleDecoder>(decoder_data, num_bytes,
+                                                           /*bit_width=*/1);
+  }
+
+  int Decode(bool* buffer, int max_values) override {
+    max_values = std::min(max_values, num_values_);
+    int val = 0;
+
+    for (int i = 0; i < max_values; ++i) {
+      if (!decoder_->Get(&val)) {
+        throw ParquetException("Unable to parse bits for position (0 based) : " +
+                               std::to_string(i) + " (corrupt data page?)");
+      }
+      if (val) {
+        buffer[i] = true;
+      } else {
+        buffer[i] = false;
+      }
+    }
+    num_values_ -= max_values;
+    return max_values;
+  }
+
+  int Decode(uint8_t* buffer, int max_values) override {
+    max_values = std::min(max_values, num_values_);
+    if (decoder_->GetBatch(buffer, max_values) != max_values) {

Review Comment:
   Yes, the test uses this method. 
   
   No, this does not write each value as 1 byte. It writes 1-bit values, the same way `PlainBooleanDecoder::Decode(uint8_t* buffer)` does. That decoder calls `GetBatch` with a bit width of 1:
   `bit_reader_->GetBatch(1, buffer, max_values)`.
   
   `RleBooleanDecoder` makes the same call internally:
   https://github.com/apache/arrow/blob/8daa7a4ed5629c0020dadf7325a6b523bdfc62e9/cpp/src/arrow/util/rle_encoding.h#L320
   Here the bit width is 1, as set during initialization:
   ```
       decoder_ = std::make_shared<::arrow::util::RleDecoder>(decoder_data, num_bytes,
                                                              /*bit_width=*/1);
   ```
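
As a rough illustration of what a bit width of 1 means on the input side, here is a standalone sketch. It is not Arrow's `BitReader` or `RleDecoder`: `ReadOneBitValues` is a hypothetical helper, and it assumes a plain bit-packed buffer with the least-significant bit first, ignoring the run headers that the RLE/bit-packing hybrid format adds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Arrow): reads `num_values` values of
// bit width 1 from a bit-packed buffer, least-significant bit first.
std::vector<bool> ReadOneBitValues(const uint8_t* data, std::size_t num_bytes,
                                   std::size_t num_values) {
  // Each encoded value occupies exactly one bit, so the buffer must hold
  // at least `num_values` bits.
  assert(num_values <= num_bytes * 8);
  std::vector<bool> out;
  out.reserve(num_values);
  for (std::size_t i = 0; i < num_values; ++i) {
    const uint8_t byte = data[i / 8];               // byte containing bit i
    out.push_back(((byte >> (i % 8)) & 1u) != 0);   // extract bit i
  }
  return out;
}
```

With this layout, nine boolean values fit in two input bytes, since each encoded value takes a single bit.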



