sfc-gh-nthimmegowda commented on code in PR #14147: URL: https://github.com/apache/arrow/pull/14147#discussion_r974763276
##########
cpp/src/parquet/encoding.cc:
##########
@@ -2355,6 +2355,80 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_DECODER
+
+class RleBooleanDecoder : public DecoderImpl, virtual public BooleanDecoder {
+ public:
+  explicit RleBooleanDecoder(const ColumnDescriptor* descr)
+      : DecoderImpl(descr, Encoding::RLE) {}
+
+  void SetData(int num_values, const uint8_t* data, int len) override {
+    num_values_ = num_values;
+    uint32_t num_bytes = 0;
+
+    if (len < 4) {
+      throw ParquetException("Received invalid length : " + std::to_string(len) +
+                             " (corrupt data page?)");
+    }
+    // Load the first 4 bytes in little-endian, which indicates the length
+    num_bytes =
+        ::arrow::bit_util::ToLittleEndian(::arrow::util::SafeLoadAs<uint32_t>(data));
+    if (num_bytes < 0 || num_bytes > (uint32_t)(len - 4)) {
+      throw ParquetException("Received invalid number of bytes : " +
+                             std::to_string(num_bytes) + " (corrupt data page?)");
+    }
+
+    const uint8_t* decoder_data = data + 4;
+    decoder_ = std::make_shared<::arrow::util::RleDecoder>(decoder_data, num_bytes,
+                                                           /*bit_width=*/1);
+  }
+
+  int Decode(bool* buffer, int max_values) override {
+    max_values = std::min(max_values, num_values_);
+    int val = 0;

Review Comment:
Changing this to `bool`, or switching to `decoder_->GetBatch()`, fails at compile time. The type check rejects it with:

```
bit_stream_utils.h:415:41: error: no matching function for call to ‘FromLittleEndian(bool&)’
  415 |     *v = arrow::bit_util::FromLittleEndian(*v);
arrow/util/type_traits.h: In substitution of ‘template<class T, class ... Args> using EnableIfIsOneOf = typename std::enable_if<arrow::internal::IsOneOf<T, Args ...>::value, T>::type [with T = bool; Args = {long int, long unsigned int, int, unsigned int, short int, short unsigned int, unsigned char, signed char, float, double}]’:
```

Code path: **RleDecoder::GetBatch(T*, int) -> RleDecoder::NextCounts() -> BitReader::GetAligned(int, T*)**

`BitReader::GetAligned()` calls `bit_util::FromLittleEndian` at line 415. However, `FromLittleEndian` accepts only the integer types int8 through int64, float, and double, and here `T` is `bool`.

All of the accepted types are byte-aligned, with widths that are whole multiples of 8 bits (8, 16, 32, ...). I am not sure how a boolean fits into this structure. Endianness would not matter for a boolean, but I worry that adding `bool` might change the computation because of the byte-boundary alignment.
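To make the failure mode concrete, here is a minimal standalone sketch of the same kind of constraint. It is not Arrow code: `IsOneOf` and `EnableIfIsOneOf` are simplified stand-ins written only to mirror the names in the error message, and `FromLittleEndianSketch`, `CanByteSwap`, and `CopyAsBool` are hypothetical helpers invented for illustration. It assumes C++17 and shows that an overload restricted to a fixed type list drops out of overload resolution for `bool`, and that one possible workaround is to decode into an integer type and convert afterwards, similar in spirit to the `int val` staging in the diff.

```cpp
#include <cstdint>
#include <type_traits>
#include <utility>

// Simplified stand-ins for the traits named in the compiler error.
// Assumption: these only imitate the shape of the real Arrow traits.
template <typename T, typename... Args>
struct IsOneOf : std::disjunction<std::is_same<T, Args>...> {};

template <typename T, typename... Args>
using EnableIfIsOneOf =
    typename std::enable_if<IsOneOf<T, Args...>::value, T>::type;

// Hypothetical byte-order helper. The body is the identity (what a
// little-endian host would do); only the type constraint matters here.
template <typename T>
EnableIfIsOneOf<T, int64_t, uint64_t, int32_t, uint32_t, int16_t, uint16_t,
                uint8_t, int8_t, float, double>
FromLittleEndianSketch(T value) {
  return value;
}

// Detection idiom: true only if FromLittleEndianSketch(T) is callable.
template <typename T, typename = void>
struct CanByteSwap : std::false_type {};

template <typename T>
struct CanByteSwap<
    T, std::void_t<decltype(FromLittleEndianSketch(std::declval<T>()))>>
    : std::true_type {};

static_assert(CanByteSwap<int32_t>::value, "int32_t is in the accepted list");
static_assert(CanByteSwap<double>::value, "double is in the accepted list");
static_assert(!CanByteSwap<bool>::value, "bool is not in the accepted list");

// Hypothetical workaround: decode into a supported integer type first,
// then narrow each value to bool.
inline int CopyAsBool(const int32_t* decoded, int count, bool* out) {
  for (int i = 0; i < count; ++i) {
    out[i] = decoded[i] != 0;
  }
  return count;
}
```

The alternative would be to add `bool` to the accepted type list, but that changes a shared utility, which is where the alignment concern above comes in.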