alamb commented on code in PR #8762:
URL: https://github.com/apache/arrow-rs/pull/8762#discussion_r2586829463


##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -283,7 +300,7 @@ impl Sbbf {
     /// Write the bloom filter data (header and then bitset) to the output. This doesn't
     /// flush the writer in order to boost performance of bulk writing all blocks. Caller
     /// must remember to flush the writer.
-    pub(crate) fn write<W: Write>(&self, mut writer: W) -> Result<(), ParquetError> {
+    pub fn write<W: Write>(&self, mut writer: W) -> Result<(), ParquetError> {

Review Comment:
   This method writes out the bytes using thrift encoding, which is basically
   ```text
   (header)
   (bitset)
   ```
   
   We already have the code to read an `Sbbf` back from the Thrift encoding here:
   
   
https://github.com/apache/arrow-rs/blob/b65d20767c7510570cff0ab0154a268c29a3f712/parquet/src/bloom_filter/mod.rs#L409-L408
   
   Rather than exposing `new()` and relying on the user to pass in the right bits (and on the bloom filter code not changing), I think we should:
   
   1. Introduce a read method (for example `from_bytes()`) that is the inverse of `write()`, something like:
   
   ```rust
   /// Reads an `Sbbf` from Thrift-encoded bytes
   pub fn from_bytes(bytes: &[u8]) -> Result<Self> {
   ...
   }
   ```
   
   And then a round trip serialization would look like:
   
   ```rust
   let mut serialized = Vec::new();
   original.write(&mut serialized)?;
   // read the serialized bytes back
   let reconstructed = Sbbf::from_bytes(&serialized)?;
   ```
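   
   To make the shape of that round trip concrete, here is a minimal, self-contained sketch. It does not use the real `Sbbf` or Thrift: `ToyFilter`, its 4-byte length header, and the `io::Error` return type are all assumptions made up for illustration. The point is only that `from_bytes()` parses and validates exactly what `write()` emits:
   
   ```rust
   use std::io::{self, Write};

   /// Hypothetical stand-in for `Sbbf`: only the raw bitset bytes.
   #[derive(Debug, PartialEq)]
   struct ToyFilter {
       bitset: Vec<u8>,
   }

   /// Helper for building "invalid data" errors (illustrative only).
   fn invalid(msg: &'static str) -> io::Error {
       io::Error::new(io::ErrorKind::InvalidData, msg)
   }

   impl ToyFilter {
       /// Writes a 4-byte little-endian length header, then the bitset.
       /// Assumption: the real `Sbbf::write` emits a Thrift header instead.
       fn write<W: Write>(&self, mut writer: W) -> io::Result<()> {
           if self.bitset.len() > u32::MAX as usize {
               return Err(invalid("bitset too large for header"));
           }
           let len = self.bitset.len() as u32;
           writer.write_all(&len.to_le_bytes())?;
           writer.write_all(&self.bitset)
       }

       /// Inverse of `write`: parses the header, checks the declared
       /// length against the remaining bytes, then copies the bitset.
       fn from_bytes(bytes: &[u8]) -> io::Result<Self> {
           if bytes.len() < 4 {
               return Err(invalid("truncated header"));
           }
           let (header, body) = bytes.split_at(4);
           let len = u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
           if body.len() != len {
               return Err(invalid("bitset length mismatch"));
           }
           Ok(Self { bitset: body.to_vec() })
       }
   }

   fn main() -> io::Result<()> {
       let original = ToyFilter { bitset: vec![0xAB; 32] };
       let mut serialized = Vec::new();
       original.write(&mut serialized)?;
       // Read the serialized bytes back and compare with the original.
       let reconstructed = ToyFilter::from_bytes(&serialized)?;
       assert_eq!(original, reconstructed);
       Ok(())
   }
   ```
   
   With this shape, callers never construct the filter from raw bits themselves, so the in-memory layout can change without breaking them.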
   



##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -292,6 +309,44 @@ impl Sbbf {
         Ok(())
     }
 
+    /// Returns the raw bitset bytes encoded in little-endian order.
+    pub fn as_slice(&self) -> Vec<u8> {

Review Comment:
   Typically, methods whose names start with `as_` do not copy data, but this one returns a newly allocated `Vec<u8>`.
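   
   For reference, the Rust API Guidelines naming convention is that `as_*` conversions are cheap borrowed views, `to_*` conversions may allocate or copy, and `into_*` conversions consume `self`. A small sketch of the distinction, using a hypothetical `ToyBlocks` type (the names `as_blocks` and `to_le_bytes` are illustrative only, not a proposal for the exact API):
   
   ```rust
   /// Hypothetical stand-in for the filter: a list of 8-word blocks.
   struct ToyBlocks {
       blocks: Vec<[u32; 8]>,
   }

   impl ToyBlocks {
       /// `as_*`: returns a borrowed view, no allocation or copy.
       fn as_blocks(&self) -> &[[u32; 8]] {
           &self.blocks
       }

       /// `to_*`: allocates a new buffer, encoding each word little-endian.
       fn to_le_bytes(&self) -> Vec<u8> {
           self.blocks
               .iter()
               .flatten()
               .flat_map(|word| word.to_le_bytes())
               .collect()
       }
   }

   fn main() {
       let filter = ToyBlocks { blocks: vec![[1; 8], [2; 8]] };
       // Borrowed view: two blocks, nothing copied.
       assert_eq!(filter.as_blocks().len(), 2);
       // Owned copy: 2 blocks * 8 words * 4 bytes.
       let bytes = filter.to_le_bytes();
       assert_eq!(bytes.len(), 64);
   }
   ```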



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.