lyang24 opened a new issue, #9761:
URL: https://github.com/apache/arrow-rs/issues/9761

Hey, wanted to float an idea: add the
[Ribbon filter](https://arxiv.org/pdf/2103.02515) (shipping in RocksDB since
6.15) as a second option next to the existing SBBF in `bloom_filter/`. The
writer picks one, and the reader dispatches on the Thrift algorithm tag.

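To make the "reader dispatches on the tag" part concrete, here is a rough sketch of what that could look like. Everything in it is hypothetical: `FilterAlgorithm`, `MembershipFilter`, `open_filter`, `can_skip_row_group`, and the two stub types are made up for illustration and are not the parquet crate's API, and a Ribbon tag does not exist in the Parquet format today. The stubs always answer "maybe present", which is the conservative answer a filter is allowed to give.

```rust
/// Hypothetical algorithm tag. Only a split-block algorithm is defined in the
/// Parquet format today; `Ribbon` stands in for the proposed addition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FilterAlgorithm {
    SplitBlock,
    Ribbon,
    /// Any tag value this reader does not recognize.
    Unknown(i32),
}

/// Illustrative probe interface shared by both filter kinds.
trait MembershipFilter {
    /// Returns `false` only when the hashed value is definitely absent.
    fn may_contain(&self, hash: u64) -> bool;
    /// Short label, used here only to show which branch was taken.
    fn name(&self) -> &'static str;
}

/// Stand-in for the existing SBBF reader (no real probing logic).
struct SplitBlockStub;
/// Stand-in for a future Ribbon reader (no real probing logic).
struct RibbonStub;

impl MembershipFilter for SplitBlockStub {
    fn may_contain(&self, _hash: u64) -> bool {
        true // conservative: never claims absence
    }
    fn name(&self) -> &'static str {
        "split-block"
    }
}

impl MembershipFilter for RibbonStub {
    fn may_contain(&self, _hash: u64) -> bool {
        true // conservative: never claims absence
    }
    fn name(&self) -> &'static str {
        "ribbon"
    }
}

/// Picks a filter implementation from the tag stored alongside the bitset.
/// Returns `None` for tags the reader does not understand.
fn open_filter(tag: FilterAlgorithm, _bitset: &[u8]) -> Option<Box<dyn MembershipFilter>> {
    match tag {
        FilterAlgorithm::SplitBlock => Some(Box::new(SplitBlockStub)),
        FilterAlgorithm::Ribbon => Some(Box::new(RibbonStub)),
        FilterAlgorithm::Unknown(_) => None,
    }
}

/// A row group may be skipped only when a recognized filter rules the value
/// out. An unrecognized tag means "no pruning", never an error.
fn can_skip_row_group(tag: FilterAlgorithm, bitset: &[u8], hash: u64) -> bool {
    match open_filter(tag, bitset) {
        Some(filter) => !filter.may_contain(hash),
        None => false,
    }
}
```

The point of the `Unknown` arm is that an older reader hitting a newer tag should fall back to scanning the row group rather than failing the read.
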
### Why I think it's worth talking about

SBBF sits around 10 bits/key for 1% FPR. Ribbon gets close to the
information-theoretic floor, at ~6.7 bits/key for the same FPR. That's roughly
a third off the bloom footprint. For a Parquet file with bloom filters on a
handful of columns across a bunch of row groups, that adds up to real bytes in
the footer.

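For anyone who wants to sanity-check those numbers, here is a small back-of-the-envelope sketch. The floor is log2(1/FPR) bits per key, and a classic Bloom filter with an optimal number of hash functions needs that divided by ln 2 (about 1.44× the floor); a blocked filter like SBBF pays a bit more on top, which is consistent with the ~10 bits/key figure above. The function names are made up for this example, and the 10 and 6.7 bits/key inputs are simply the figures quoted in this issue, not measurements.

```rust
/// Information-theoretic minimum bits per key for a given false-positive rate.
fn lower_bound_bits_per_key(fpr: f64) -> f64 {
    (1.0 / fpr).log2()
}

/// Bits per key for a classic (non-blocked) Bloom filter using the optimal
/// number of hash functions: log2(1/fpr) / ln 2.
fn classic_bloom_bits_per_key(fpr: f64) -> f64 {
    lower_bound_bits_per_key(fpr) / std::f64::consts::LN_2
}

/// Fraction of space saved when moving from `baseline` to `candidate`
/// bits per key.
fn relative_savings(baseline: f64, candidate: f64) -> f64 {
    1.0 - candidate / baseline
}

/// Approximate filter size in bytes for one column chunk with `num_keys`
/// distinct values, rounded up to a whole byte.
fn filter_bytes(num_keys: u64, bits_per_key: f64) -> u64 {
    ((num_keys as f64) * bits_per_key / 8.0).ceil() as u64
}
```

Plugging in 1% FPR gives a floor of about 6.64 bits/key and about 9.59 for the classic Bloom formula, and `relative_savings(10.0, 6.7)` comes out at 0.33, which is the "roughly a third" above. For a column chunk with one million distinct values, that is about 1.25 MB at 10 bits/key versus about 0.84 MB at 6.7 bits/key, per column per row group.
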
Where I think it actually shows up:

- Cold opens from S3 / GCS: fewer bytes per `GET`
- Lakehouse setups with tons of small-ish files: the metadata cache holds more
  files
- Anywhere DataFusion's `prune_by_bloom_filters` is doing real work today

What I don't want to oversell:

- Query throughput is ~3× slower per probe in the paper's benchmark. But the
  paper's own limitations section calls out that this is a throughput
  measurement; for uncorrelated single probes (which is what Parquet actually
  does), latency is memory-bound and basically a wash.
- Construction is 6–25× slower per key. That's a real cost on the writer side.
  Probably fine for write-once lake data, probably annoying for high-QPS
  streaming ETL.

