garyanaplan commented on issue #349: URL: https://github.com/apache/arrow-rs/issues/349#issuecomment-858568206
Looks like that hard-coded value (256) in the bit-writer is the root cause. When writing, if we try to put > 2048 boolean values, then the writer just "ignores" the writes. This is caused by the fact that bool encoder silently ignores calls to put_value that return false. I have a fix for this which works by extending the size of the BitWriter (in 256 byte) increments and also checks the return of put_value in BoolType::encode() and raises an error if the call fails. `garypennington@Garys-MBP parquet % git diff diff --git a/parquet/src/data_type.rs b/parquet/src/data_type.rs index aa1def3d..f7514d8b 100644 --- a/parquet/src/data_type.rs +++ b/parquet/src/data_type.rs @@ -662,7 +662,9 @@ pub(crate) mod private { bit_writer: &mut BitWriter, ) -> Result<()> { for value in values { - bit_writer.put_value(*value as u64, 1); + if !bit_writer.put_value(*value as u64, 1) { + return Err(ParquetError::EOF("unable to put boolean value".to_string())); + } } Ok(()) } diff --git a/parquet/src/encodings/encoding.rs b/parquet/src/encodings/encoding.rs index d0427381..eb5ebb13 100644 --- a/parquet/src/encodings/encoding.rs +++ b/parquet/src/encodings/encoding.rs @@ -112,6 +112,7 @@ pub struct PlainEncoder<T: DataType> { buffer: ByteBuffer, bit_writer: BitWriter, desc: ColumnDescPtr, + bw_bytes_written: usize, _phantom: PhantomData<T>, } @@ -124,6 +125,7 @@ impl<T: DataType> PlainEncoder<T> { buffer: byte_buffer, bit_writer: BitWriter::new(256), desc, + bw_bytes_written: 0, _phantom: PhantomData, } } @@ -153,7 +155,11 @@ impl<T: DataType> Encoder<T> for PlainEncoder<T> { #[inline] fn put(&mut self, values: &[T::T]) -> Result<()> { + if self.bw_bytes_written + values.len() >= self.bit_writer.capacity() { + self.bit_writer.extend(256); + } T::T::encode(values, &mut self.buffer, &mut self.bit_writer)?; + self.bw_bytes_written += values.len(); Ok(()) } } diff --git a/parquet/src/util/bit_util.rs b/parquet/src/util/bit_util.rs index 8dfb6312..45cfe2b6 100644 --- a/parquet/src/util/bit_util.rs +++ b/parquet/src/util/bit_util.rs @@ -223,6 +223,20 @@ impl BitWriter { } } + /// Extend buffer size + #[inline] + pub fn extend(&mut self, increment: usize) { + self.max_bytes += increment; + let extra = vec![0; increment]; + self.buffer.extend(extra); + } + + /// Report buffer size + #[inline] + pub fn capacity(&mut self) -> usize { + self.max_bytes + } + /// Consumes and returns the current buffer. #[inline] pub fn consume(mut self) -> Vec<u8> { garypennington@Garys-MBP parquet % ` Can anyone comment on this approach? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org