Hi,

I ran into an issue when storing a very large number of doubles (arrow::float64). I tried to write 300 million entries, and saving that amount of data to a Parquet file failed. The cause is a bug in the calculations inside the bit buffer implementation: the offset and size members are declared as "int", and the bit-count arithmetic overflows once the values get large enough. I am building with Visual Studio 2022 for the Windows x64 platform, where "int" is 32 bits wide.
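To make the arithmetic concrete, here is a minimal standalone sketch. It is not Arrow code: the helper names BitCountOverflowsInt32 and WouldExceedBuffer are hypothetical and exist only for this illustration. It assumes a 32-bit "int" and shows that multiplying a byte count by 8 stops fitting once the count reaches 2^28 (268,435,456) bytes, and that the same bounds check behaves correctly when every operand is widened to 64 bits first.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helpers for illustration only; none of these names exist in Arrow.

// True if byte_count * 8 cannot be represented in a 32-bit signed int.
// The multiplication is done in 64 bits so the check itself cannot overflow.
constexpr bool BitCountOverflowsInt32(int64_t byte_count) {
  return byte_count * 8 > std::numeric_limits<int32_t>::max();
}

// Bounds check of the same shape as the condition in BitWriter::PutValue,
// but with all operands already 64-bit. Returns true when the write would
// NOT fit, matching the sense of the original condition.
inline bool WouldExceedBuffer(int64_t byte_offset, int64_t bit_offset,
                              int64_t num_bits, int64_t max_bytes) {
  assert(byte_offset >= 0 && bit_offset >= 0 && num_bits >= 0 && max_bytes >= 0);
  return byte_offset * 8 + bit_offset + num_bits > max_bytes * 8;
}

// 2^28 - 1 bytes: the bit count (2,147,483,640) still fits in int32_t.
static_assert(!BitCountOverflowsInt32(268435455), "fits below 2^28 bytes");
// 2^28 bytes: the bit count (2,147,483,648) exceeds INT32_MAX.
static_assert(BitCountOverflowsInt32(268435456), "overflows at 2^28 bytes");
// Raw size of 300 million doubles. Whether a single bit buffer ever reaches
// this size is an assumption made for the sake of the example.
static_assert(BitCountOverflowsInt32(300000000LL * 8), "2.4 GB overflows");
```

One portability note on the check itself: the "i64" literal suffix is a Microsoft compiler extension, so on other compilers the widening would need a standard spelling such as static_cast<int64_t>(byte_offset_) * 8.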
Here is a patch against Arrow version 21 that fixes the issue; please consider including it:

///PATCH-START
 cpp/src/arrow/util/bit_stream_utils_internal.h | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/cpp/src/arrow/util/bit_stream_utils_internal.h b/cpp/src/arrow/util/bit_stream_utils_internal.h
index 9d67c278bc..47ab1e59d1 100644
--- a/cpp/src/arrow/util/bit_stream_utils_internal.h
+++ b/cpp/src/arrow/util/bit_stream_utils_internal.h
@@ -98,14 +98,14 @@ class BitWriter {
 
  private:
   uint8_t* buffer_;
-  int max_bytes_;
+  uint32_t max_bytes_;
 
   /// Bit-packed values are initially written to this variable before being memcpy'd to
   /// buffer_. This is faster than writing values byte by byte directly to buffer_.
   uint64_t buffered_values_;
 
-  int byte_offset_;  // Offset in buffer_
-  int bit_offset_;   // Offset in buffered_values_
+  uint32_t byte_offset_;  // Offset in buffer_
+  unsigned bit_offset_;   // Offset in buffered_values_
 };
 
 namespace detail {
@@ -196,14 +196,14 @@ class BitReader {
 
  private:
   const uint8_t* buffer_;
-  int max_bytes_;
+  uint32_t max_bytes_;
 
   /// Bytes are memcpy'd from buffer_ and values are read from this variable. This is
   /// faster than reading values byte by byte directly from buffer_.
   uint64_t buffered_values_;
 
-  int byte_offset_;  // Offset in buffer_
-  int bit_offset_;   // Offset in buffered_values_
+  uint32_t byte_offset_;  // Offset in buffer_
+  unsigned bit_offset_;   // Offset in buffered_values_
 };
 
 inline bool BitWriter::PutValue(uint64_t v, int num_bits) {
@@ -212,7 +212,7 @@ inline bool BitWriter::PutValue(uint64_t v, int num_bits) {
     ARROW_DCHECK_EQ(v >> num_bits, 0) << "v = " << v << ", num_bits = " << num_bits;
   }
 
-  if (ARROW_PREDICT_FALSE(byte_offset_ * 8 + bit_offset_ + num_bits > max_bytes_ * 8))
+  if (ARROW_PREDICT_FALSE(byte_offset_ * 8i64 + bit_offset_ + num_bits > max_bytes_ * 8i64))
     return false;
 
   buffered_values_ |= v << bit_offset_;
///PATCH-END

--
Kind regards,
Michał Puczyński
Senior Software Engineer, XTPL S.A.
xtpl.com <https://xtpl.com/>