Hi,
I ran into an issue when storing a large number of doubles (arrow::float64).
I tried to write 300 million entries, and saving that amount of data to a
Parquet file failed.
The cause is an integer overflow in the bit buffer implementation
(BitWriter and BitReader).
I am using Visual Studio 2022 and building for the Windows x64 platform,
where "int" is 32 bits wide. The offset calculations overflow the control
variables declared as "int".
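
To illustrate the arithmetic, here is a minimal standalone sketch. It is not
Arrow code: FitsInBuffer and BitCountOverflowsInt32 are hypothetical helper
names, and it assumes all 300 million values end up in a single buffer. It
shows that the bit count no longer fits in a 32-bit signed int, and how the
same capacity check behaves once every term is widened to 64 bits.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of Arrow): the capacity check with every
// term widened to 64 bits before the multiplication by 8.
constexpr bool FitsInBuffer(int64_t byte_offset, int64_t bit_offset,
                            int64_t num_bits, int64_t max_bytes) {
  return byte_offset * 8 + bit_offset + num_bits <= max_bytes * 8;
}

// True when `bytes * 8` cannot be represented in a 32-bit signed int,
// i.e. when the same product computed in `int` would overflow.
constexpr bool BitCountOverflowsInt32(int64_t bytes) {
  return bytes * 8 > std::numeric_limits<int32_t>::max();
}

// 300 million doubles, assuming all of them are held in one buffer.
constexpr int64_t kNumValues = 300000000;
constexpr int64_t kNumBytes =
    kNumValues * static_cast<int64_t>(sizeof(double));

// The bit count for this buffer does not fit in a 32-bit signed int.
static_assert(BitCountOverflowsInt32(kNumBytes),
              "bit count exceeds the int32 range");
```

With 32-bit arithmetic the product bytes * 8 already overflows once the byte
count reaches 2^28 (256 MiB), well below the size used here.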

Here is a patch against Arrow version 21 that solves the issue; please
consider including it:

///PATCH-START
 cpp/src/arrow/util/bit_stream_utils_internal.h | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/cpp/src/arrow/util/bit_stream_utils_internal.h b/cpp/src/arrow/util/bit_stream_utils_internal.h
index 9d67c278bc..47ab1e59d1 100644
--- a/cpp/src/arrow/util/bit_stream_utils_internal.h
+++ b/cpp/src/arrow/util/bit_stream_utils_internal.h
@@ -98,14 +98,14 @@ class BitWriter {

  private:
   uint8_t* buffer_;
-  int max_bytes_;
+  uint32_t max_bytes_;

   /// Bit-packed values are initially written to this variable before being memcpy'd to
   /// buffer_. This is faster than writing values byte by byte directly to buffer_.
   uint64_t buffered_values_;

-  int byte_offset_;  // Offset in buffer_
-  int bit_offset_;   // Offset in buffered_values_
+  uint32_t byte_offset_;  // Offset in buffer_
+  unsigned bit_offset_;   // Offset in buffered_values_
 };

 namespace detail {
@@ -196,14 +196,14 @@ class BitReader {

  private:
   const uint8_t* buffer_;
-  int max_bytes_;
+  uint32_t max_bytes_;

   /// Bytes are memcpy'd from buffer_ and values are read from this variable. This is
   /// faster than reading values byte by byte directly from buffer_.
   uint64_t buffered_values_;

-  int byte_offset_;  // Offset in buffer_
-  int bit_offset_;   // Offset in buffered_values_
+  uint32_t byte_offset_;  // Offset in buffer_
+  unsigned bit_offset_;   // Offset in buffered_values_
 };

 inline bool BitWriter::PutValue(uint64_t v, int num_bits) {
@@ -212,7 +212,7 @@ inline bool BitWriter::PutValue(uint64_t v, int num_bits) {
     ARROW_DCHECK_EQ(v >> num_bits, 0) << "v = " << v << ", num_bits = " << num_bits;
   }

-  if (ARROW_PREDICT_FALSE(byte_offset_ * 8 + bit_offset_ + num_bits > max_bytes_ * 8))
+  if (ARROW_PREDICT_FALSE(byte_offset_ * int64_t{8} + bit_offset_ + num_bits > max_bytes_ * int64_t{8}))
     return false;

   buffered_values_ |= v << bit_offset_;
///PATCH-END
--

Kind regards

*Michał Puczyński*

Senior Software Engineer
