Jefffrey commented on code in PR #9020: URL: https://github.com/apache/arrow-rs/pull/9020#discussion_r2636775085
########## arrow-buffer/src/lib.rs: ########## @@ -16,6 +16,21 @@ // under the License. //! Low-level buffer abstractions for [Apache Arrow Rust](https://docs.rs/arrow) +//! +//! # Byte Storage abstractions +//! - [`MutableBuffer`]: Raw memory buffer that can be mutated and grown +//! - [`Buffer`]: Immutable buffer that is shared across threads +//! +//! # Typed Abstractions +//! +//! There are also several wrappers over [`Buffer`] with methods for +//! easier manipulation: +//! +//! - [`BooleanBuffer`] for Bitmasks (buffer of packed bits) +//! - [`NullBuffer`], Arrow null (validity) bitmaps ([`BooleanBuffer`] with extra utilities) Review Comment: ```suggestion //! - [`BooleanBuffer`]: Bitmasks (buffer of packed bits) //! - [`NullBuffer`]: Arrow null (validity) bitmaps ([`BooleanBuffer`] with extra utilities) ``` Just to make it consistent with the bottom three 😅 ########## arrow-buffer/src/buffer/boolean.rs: ########## @@ -26,17 +26,56 @@ use std::ops::{BitAnd, BitOr, BitXor, Not}; /// A slice-able [`Buffer`] containing bit-packed booleans /// -/// `BooleanBuffer`s can be modified using [`BooleanBufferBuilder`] +/// This structure represents a sequence of boolean values packed into a +/// byte-aligned [`Buffer`]. Both the offset and length are represented in bits. +/// +/// # Layout +/// +/// The values are represented as little endian bit-packed values, where the +/// least significant bit of each byte represents the first boolean value and +/// then proceeding to the most significant bit. +/// +/// For example, the 10 bit bitmask `0b0111001101` has length 10, and is +/// represented using 2 bytes with offset 0 like this: +/// +/// ```text +/// ┌─────────────────────────────────┐ ┌─────────────────────────────────┐ +/// │┌───┬───┬───┬───┬───┬───┬───┬───┐│ │┌───┬───┬───┬───┬───┬───┬───┬───┐│ +/// ││ 1 │ 1 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 ││ ││ ? │ ? │ ? │ ? │ ? │ ? │ 0 │ 1 ││ +/// │└───┴───┴───┴───┴───┴───┴───┴───┘│ │└───┴───┴───┴───┴───┴───┴───┴───┘│ +/// └─────────────────────────────────┘ └─────────────────────────────────┘ +/// 7 Byte 0 0 7 Byte 1 0 bit +/// offset +/// length = 10 bits, offset = 0 +/// ``` +/// +/// The same bitmask with length 10 and offset 3 would be represented like this: +/// ``` +/// ┌─────────────────────────────────┐ ┌─────────────────────────────────┐ +/// │┌───┬───┬───┬───┬───┬───┬───┬───┐│ │┌───┬───┬───┬───┬───┬───┬───┬───┐│ +/// ││ 0 │ 1 │ 1 │ 0 │ 1 │ ? │ ? │ ? ││ ││ ? │ ? │ ? │ 0 │ 1 │ 1 │ 1 │ 0 ││ +/// │└───┴───┴───┴───┴───┴───┴───┴───┘│ │└───┴───┴───┴───┴───┴───┴───┴───┘│ +/// └─────────────────────────────────┘ └─────────────────────────────────┘ +/// 7 Byte 0 0 7 Byte 1 0 bit +/// offset Review Comment: Nice diagrams, though the right-to-left bits with left-to-right bytes threw me for a bit 😅 ########## arrow-buffer/src/buffer/boolean.rs: ########## @@ -26,17 +26,56 @@ use std::ops::{BitAnd, BitOr, BitXor, Not}; /// A slice-able [`Buffer`] containing bit-packed booleans /// -/// `BooleanBuffer`s can be modified using [`BooleanBufferBuilder`] +/// This structure represents a sequence of boolean values packed into a +/// byte-aligned [`Buffer`]. Both the offset and length are represented in bits. +/// +/// # Layout +/// +/// The values are represented as little endian bit-packed values, where the +/// least significant bit of each byte represents the first boolean value and +/// then proceeding to the most significant bit. +/// +/// For example, the 10 bit bitmask `0b0111001101` has length 10, and is +/// represented using 2 bytes with offset 0 like this: +/// +/// ```text +/// ┌─────────────────────────────────┐ ┌─────────────────────────────────┐ +/// │┌───┬───┬───┬───┬───┬───┬───┬───┐│ │┌───┬───┬───┬───┬───┬───┬───┬───┐│ +/// ││ 1 │ 1 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 ││ ││ ? │ ? │ ? │ ? │ ? │ ? │ 0 │ 1 ││ +/// │└───┴───┴───┴───┴───┴───┴───┴───┘│ │└───┴───┴───┴───┴───┴───┴───┴───┘│ +/// └─────────────────────────────────┘ └─────────────────────────────────┘ +/// 7 Byte 0 0 7 Byte 1 0 bit +/// offset +/// length = 10 bits, offset = 0 +/// ``` +/// +/// The same bitmask with length 10 and offset 3 would be represented like this: +/// ``` +/// ┌─────────────────────────────────┐ ┌─────────────────────────────────┐ +/// │┌───┬───┬───┬───┬───┬───┬───┬───┐│ │┌───┬───┬───┬───┬───┬───┬───┬───┐│ +/// ││ 0 │ 1 │ 1 │ 0 │ 1 │ ? │ ? │ ? ││ ││ ? │ ? │ ? │ 0 │ 1 │ 1 │ 1 │ 0 ││ +/// │└───┴───┴───┴───┴───┴───┴───┴───┘│ │└───┴───┴───┴───┴───┴───┴───┴───┘│ +/// └─────────────────────────────────┘ └─────────────────────────────────┘ +/// 7 Byte 0 0 7 Byte 1 0 bit +/// offset +/// length = 10 bits, offset = 3 +/// ``` +/// Note that the bits marked `?` are not part of the (logical) mask and +/// may contain either `0` or `1` /// /// /// # See Also +/// * [`BooleanBufferBuilder`] for building [`BooleanBuffer`] instances /// * [`NullBuffer`] for representing null values in Arrow arrays /// /// [`NullBuffer`]: crate::NullBuffer #[derive(Debug, Clone, Eq)] pub struct BooleanBuffer { + /// Underlying buffer (byte aligned) buffer: Buffer, + /// Offset in bits (not bytes) offset: usize, + /// Length in bits (not bytes) len: usize, Review Comment: Worth renaming the fields themselves to `bit_offset` and `bit_length` for further clarity? ########## arrow-buffer/src/lib.rs: ########## @@ -16,6 +16,21 @@ // under the License. //! Low-level buffer abstractions for [Apache Arrow Rust](https://docs.rs/arrow) +//! +//! # Byte Storage abstractions +//! - [`MutableBuffer`]: Raw memory buffer that can be mutated and grown +//! - [`Buffer`]: Immutable buffer that is shared across threads +//! +//! # Typed Abstractions +//! +//! There are also several wrappers over [`Buffer`] with methods for +//! easier manipulation: +//! +//! - [`BooleanBuffer`] for Bitmasks (buffer of packed bits) +//! - [`NullBuffer`], Arrow null (validity) bitmaps ([`BooleanBuffer`] with extra utilities) +//! - [`ScalarBuffer<T>`]: Typed buffer for primitive types (e.g., `i32`, `f64`) +//! - [`OffsetBuffer<O>`]: Offsets used in variable-length types (e.g., strings, lists) +//! - [`RunEndBuffer<E>`]: Run-lengths used in run-encoded encoded data Review Comment: ```suggestion //! - [`RunEndBuffer<E>`]: Run-ends used in run-encoded encoded data ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
