Hello Michal,

Thanks for reporting this issue.
When you say "saving that amount of data into parquet file failed", can you describe the symptoms? If it crashes or throws an exception, can you share a stack trace?

Also, can you share the code used for writing the Parquet file?

Regards

Antoine.


On 28/08/2025 at 08:06, Michał Puczyński wrote:
Hi,
I ran into an issue when storing a huge number of doubles, namely
arrow::float64.
I've tried to put 300 million entries and saving that amount of data into
parquet file failed.
The cause is a bug in the calculations within the bit buffer
implementation.
I am using Visual Studio 2022 and building for the Windows x64 platform,
and unfortunately the calculations overflow the control variables
declared as "int".

Here is a patch against Arrow version 21 that solves the issue; please include it:

///PATCH-START
  cpp/src/arrow/util/bit_stream_utils_internal.h | 14 +++++++-------
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/cpp/src/arrow/util/bit_stream_utils_internal.h b/cpp/src/arrow/util/bit_stream_utils_internal.h
index 9d67c278bc..47ab1e59d1 100644
--- a/cpp/src/arrow/util/bit_stream_utils_internal.h
+++ b/cpp/src/arrow/util/bit_stream_utils_internal.h
@@ -98,14 +98,14 @@ class BitWriter {

   private:
    uint8_t* buffer_;
-  int max_bytes_;
+  uint32_t max_bytes_;

    /// Bit-packed values are initially written to this variable before being memcpy'd to
    /// buffer_. This is faster than writing values byte by byte directly to buffer_.
    uint64_t buffered_values_;

-  int byte_offset_;  // Offset in buffer_
-  int bit_offset_;   // Offset in buffered_values_
+  uint32_t byte_offset_;  // Offset in buffer_
+  unsigned bit_offset_;   // Offset in buffered_values_
  };

  namespace detail {
@@ -196,14 +196,14 @@ class BitReader {

   private:
    const uint8_t* buffer_;
-  int max_bytes_;
+  uint32_t max_bytes_;

    /// Bytes are memcpy'd from buffer_ and values are read from this variable. This is
    /// faster than reading values byte by byte directly from buffer_.
    uint64_t buffered_values_;

-  int byte_offset_;  // Offset in buffer_
-  int bit_offset_;   // Offset in buffered_values_
+  uint32_t byte_offset_;  // Offset in buffer_
+  unsigned bit_offset_;   // Offset in buffered_values_
  };

  inline bool BitWriter::PutValue(uint64_t v, int num_bits) {
@@ -212,7 +212,7 @@ inline bool BitWriter::PutValue(uint64_t v, int num_bits) {
      ARROW_DCHECK_EQ(v >> num_bits, 0) << "v = " << v << ", num_bits = " << num_bits;
    }

-  if (ARROW_PREDICT_FALSE(byte_offset_ * 8 + bit_offset_ + num_bits > max_bytes_ * 8))
+  if (ARROW_PREDICT_FALSE(byte_offset_ * 8i64 + bit_offset_ + num_bits > max_bytes_ * 8i64))
      return false;

    buffered_values_ |= v << bit_offset_;
///PATCH-END
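
For readers following the thread, here is a minimal standalone sketch of the
arithmetic the patch is concerned with. It is not Arrow code: the function
names FitsInBuffer and BitCountOverflowsInt32 are hypothetical, and the sketch
assumes a 32-bit "int", as on Windows x64. With all-int operands,
"byte_offset_ * 8" stops being representable once the byte offset reaches
2^28 (256 MiB), and signed overflow is undefined behaviour in C++. Note also
that the "i64" literal suffix used in the patch is specific to MSVC; the sketch
widens through int64_t parameters instead, which is portable.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical stand-in for the capacity check in BitWriter::PutValue.
// All operands are int64_t, so the multiplication by 8 is done in 64 bits
// and the bit count stays exact for multi-gigabyte buffers.
inline bool FitsInBuffer(int64_t byte_offset, int64_t bit_offset,
                         int64_t num_bits, int64_t max_bytes) {
  return byte_offset * 8 + bit_offset + num_bits <= max_bytes * 8;
}

// Returns true when `bytes * 8` cannot be represented in a 32-bit int,
// i.e. when the original all-int expression would overflow.
inline bool BitCountOverflowsInt32(int64_t bytes) {
  return bytes * 8 > std::numeric_limits<int32_t>::max();
}
```

For example, BitCountOverflowsInt32(268435455) is false while
BitCountOverflowsInt32(268435456) is true, which marks the 256 MiB boundary
at which the unpatched comparison can no longer be trusted.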
