alamb commented on issue #547:
URL: https://github.com/apache/parquet-format/issues/547#issuecomment-3761888626

   I am not an expert in this area (I think @jimexist originally implemented 
it). Here is the relevant calculation in arrow-rs (it looks the same as the 
C++ and Java implementations to me):
   
https://github.com/apache/arrow-rs/blob/c1333339626430ceb23efc7eff8b6af46d0b4a3b/parquet/src/bloom_filter/mod.rs#L245-L253
   
   ```rust
   // see http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf
   // given fpp = (1 - e^(-k * n / m)) ^ k
   // we have m = - k * n / ln(1 - fpp ^ (1 / k))
   // where k = number of hash functions, m = number of bits, n = number of distinct values
   #[inline]
   fn num_of_bits_from_ndv_fpp(ndv: u64, fpp: f64) -> usize {
       let num_bits = -8.0 * ndv as f64 / (1.0 - fpp.powf(1.0 / 8.0)).ln();
       num_bits as usize
   }
   ```
   
   I found the prose confusing. If the table is supposed to reflect the bits 
per distinct value, I agree that the table doesn't seem to match the calculations.
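   
   As a rough way to check this, the formula in the snippet can be divided by 
n to get bits per distinct value, which depends only on fpp. This is a minimal 
sketch assuming k = 8 as in the arrow-rs code; `bits_per_distinct_value` is a 
hypothetical helper name of mine, not something that exists in arrow-rs.
   
   ```rust
   // Hypothetical helper (not part of arrow-rs): the same formula as
   // num_of_bits_from_ndv_fpp, divided by ndv, i.e. m / n.
   // Assumes k = 8 hash functions, matching the constant in the snippet.
   fn bits_per_distinct_value(fpp: f64) -> f64 {
       const K: f64 = 8.0;
       // m / n = -k / ln(1 - fpp^(1/k))
       -K / (1.0 - fpp.powf(1.0 / K)).ln()
   }
   ```
   
   By my arithmetic, `bits_per_distinct_value(0.01)` comes out at roughly 9.7, 
so that is the number I would compare against the 1% row of the table.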
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

