alamb commented on code in PR #4818:
URL: https://github.com/apache/arrow-rs/pull/4818#discussion_r1327413573
##########
arrow-row/src/variable.rs:
##########
@@ -26,6 +26,12 @@ use arrow_schema::{DataType, SortOptions};
/// The block size of the variable length encoding
pub const BLOCK_SIZE: usize = 32;
+/// The first block is split into `MINI_BLOCK_COUNT` mini-blocks
Review Comment:
Perhaps this would be a good place to add comments about the **why** for
miniblocks.
##########
arrow-row/src/variable.rs:
##########
@@ -128,35 +116,81 @@ pub fn encode_one(out: &mut [u8], val: Option<&[u8]>,
opts: SortOptions) -> usiz
}
}
-/// Returns the number of bytes of encoded data
-fn decoded_len(row: &[u8], options: SortOptions) -> usize {
+#[inline]
+fn encode_blocks<const SIZE: usize>(out: &mut [u8], val: &[u8]) -> usize {
Review Comment:
perhaps we could add a comment here explaining what this writes
(specifically continuation tokens vs BLOCK_CONTINUATION)
##########
arrow-row/src/variable.rs:
##########
@@ -26,6 +26,12 @@ use arrow_schema::{DataType, SortOptions};
/// The block size of the variable length encoding
pub const BLOCK_SIZE: usize = 32;
+/// The first block is split into `MINI_BLOCK_COUNT` mini-blocks
+pub const MINI_BLOCK_COUNT: usize = 4;
Review Comment:
I wonder if using a MINI_BLOCK count of 2 might get most of the memory
savings but have lower CPU overhead :thinking:
##########
arrow-row/src/variable.rs:
##########
@@ -45,7 +51,10 @@ pub fn encoded_len(a: Option<&[u8]>) -> usize {
#[inline]
pub fn padded_length(a: Option<usize>) -> usize {
match a {
- Some(a) => 1 + ceil(a, BLOCK_SIZE) * (BLOCK_SIZE + 1),
+ Some(a) if a <= BLOCK_SIZE => {
+ 1 + ceil(a, MINI_BLOCK_SIZE) * (MINI_BLOCK_SIZE + 1)
+ }
+ Some(a) => MINI_BLOCK_COUNT + ceil(a, BLOCK_SIZE) * (BLOCK_SIZE + 1),
Review Comment:
I don't understand why the padded length gets a `MINI_BLOCK_COUNT ` on it.
Is it because each miniblock contains 1 byte continuation?
Maybe a comment would help
##########
arrow-array/src/types.rs:
##########
@@ -1368,12 +1368,14 @@ pub(crate) mod bytes {
}
impl ByteArrayNativeType for [u8] {
+ #[inline]
Review Comment:
`inline(always)`? -- at some point I thought that saying 'always' was
required to get cross crate inlining to work
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]