[GitHub] [arrow] mapleFU commented on issue #15173: [Parquet][C++] ByteStreamSplitDecoder broken in presence of nulls

GitBox Thu, 12 Jan 2023 20:02:34 -0800


mapleFU commented on issue #15173:
URL: https://github.com/apache/arrow/issues/15173#issuecomment-1381287613


   ```c++
     // The total number of values stored in the data page. This is the maximum 
of
     // the number of encoded definition levels or encoded values. For
     // non-repeated, required columns, this is equal to the number of encoded
     // values. For repeated or optional values, there may be fewer data values
     // than levels, and this tells you how many encoded levels there are in 
that
     // case.
     int64_t num_buffered_values_;
   ```
   
   Seems it's only used as "values upper bounds", some encoding mark value 
number in it's header, like `DELTA_BINARY_PACKED`, but `PLAIN` or 
`BYTE_STREAM_SPLIT` would not use it.
   
   In most of case it's ok, because in `TypedColumnReaderImpl`, the reading 
always be like:
   
   ```c++
   template <typename DType>
   int64_t TypedColumnReaderImpl<DType>::ReadBatch(int64_t batch_size, int16_t* 
def_levels,
                                                   int16_t* rep_levels, T* 
values,
                                                   int64_t* values_read) {
     // ..
     ReadLevels(batch_size, def_levels, rep_levels, &num_def_levels, 
&values_to_read);
   
     *values_read = this->ReadValues(values_to_read, values);
     int64_t total_values = std::max<int64_t>(num_def_levels, *values_read);
     int64_t expected_values =
         std::min(batch_size, this->num_buffered_values_ - 
this->num_decoded_values_);
     if (total_values == 0 && expected_values > 0) {
       std::stringstream ss;
       ss << "Read 0 values, expected " << expected_values;
       ParquetException::EofException(ss.str());
     }
     this->ConsumeBufferedValues(total_values);
   
     return total_values;
   }
   ```
   
   And code here:
   
   ```c++
   template <typename DType>
   void ByteStreamSplitDecoder<DType>::SetData(int num_values, const uint8_t* 
data,
                                               int len) {
     DecoderImpl::SetData(num_values, data, len);
     if (num_values * static_cast<int64_t>(sizeof(T)) > len) {
       throw ParquetException("Data size too small for number of values 
(corrupted file?)");
     }
     num_values_in_buffer_ = num_values;
   }
   ```
   
   The checking is added by @pitrou in this patch: 
https://github.com/apache/arrow/commit/faf9bc011cb452e9def7dcd63dd3a0f6ec823534 
. And I don't know the context for it. At least it help us find that problem.
   
   And `stride` here would be wrong:
   
   ```c++
   template <typename T>
   void inline ByteStreamSplitDecode(const uint8_t* data, int64_t num_values, 
int64_t stride,
                                     T* out) {
   #if defined(ARROW_HAVE_SIMD_SPLIT)
     return ByteStreamSplitDecodeSimd(data, num_values, stride, out);
   #else
     return ByteStreamSplitDecodeScalar(data, num_values, stride, out);
   #endif
   }
   ```
   
   Because:
   
   ```c++
   template <typename DType>
   int ByteStreamSplitDecoder<DType>::Decode(T* buffer, int max_values) {
     const int values_to_decode = std::min(num_values_, max_values);
     const int num_decoded_previously = num_values_in_buffer_ - num_values_;
     const uint8_t* data = data_ + num_decoded_previously;
   
     ::arrow::util::internal::ByteStreamSplitDecode<T>(data, values_to_decode,
                                                       num_values_in_buffer_, 
buffer);
     num_values_ -= values_to_decode;
     len_ -= sizeof(T) * values_to_decode;
     return values_to_decode;
   }
   ```
   
   Here it use `num_values_in_buffer_`?
   
   I would try to reproduce this later. I write a DecodeSpace, but didn't 
reproduce that bug. My current code is:
   
   ```c++
     void CheckRoundtripSpaced(const uint8_t* valid_bits,
                               int64_t valid_bits_offset) override {
       std::cout << "CheckRoundtripSpaced for BYTE_STREAM_SPLIT called" << 
std::endl;
       auto encoder =
           MakeTypedEncoder<Type>(Encoding::BYTE_STREAM_SPLIT, false, 
descr_.get());
       auto decoder = MakeTypedDecoder<Type>(Encoding::BYTE_STREAM_SPLIT, 
descr_.get());
       int null_count = 0;
       for (auto i = 0; i < num_values_; i++) {
         if (!bit_util::GetBit(valid_bits, valid_bits_offset + i)) {
           null_count++;
         }
       }
   
       encoder->PutSpaced(draws_, num_values_, valid_bits, valid_bits_offset);
       encode_buffer_ = encoder->FlushValues();
       decoder->SetData(num_values_ - null_count, encode_buffer_->data(),
                        static_cast<int>(encode_buffer_->size()));
       auto values_decoded = decoder->DecodeSpaced(decode_buf_, num_values_, 
null_count,
                                                   valid_bits, 
valid_bits_offset);
       ASSERT_EQ(num_values_, values_decoded);
       ASSERT_NO_FATAL_FAILURE(VerifyResultsSpaced<c_type>(
           decode_buf_, draws_, num_values_, valid_bits, valid_bits_offset));
     }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] mapleFU commented on issue #15173: [Parquet][C++] ByteStreamSplitDecoder broken in presence of nulls

Reply via email to