[GitHub] [arrow-rs] Jimexist opened a new issue, #3138: audit and create a document for bloom filter configurations

GitBox Fri, 18 Nov 2022 21:30:25 -0800


Jimexist opened a new issue, #3138:
URL: https://github.com/apache/arrow-rs/issues/3138


   Thank you @Jimexist -- this is very cool. I went through the code 
fairly thoroughly. I had some minor suggestions / comments for documentation 
and code structure but nothing that would block merging.
   
   I think the biggest thing I would like to discuss is "what parameters to 
expose for the writer API". For example, will users of this feature be able 
to set "fpp" (false-positive probability) and "ndv" (number of distinct 
values) reasonably? Knowing the number of distinct values before writing a 
parquet file seems reasonable, but knowing the expected number of distinct 
values for each row group may not be.
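
To make the fpp/ndv trade-off concrete, here is a minimal sketch of how the two parameters determine filter size. It uses the textbook formula for a classic Bloom filter, m = -n * ln(p) / (ln 2)^2, which is an assumption for illustration only: Parquet's split-block Bloom filter is sized differently, and the function name `classic_bloom_bits` is hypothetical rather than part of arrow-rs.

```rust
/// Textbook size estimate (in bits) for a classic Bloom filter holding
/// `ndv` distinct values at false-positive probability `fpp`:
///     m = -n * ln(p) / (ln 2)^2
/// Illustration only: this is NOT the sizing rule used by arrow-rs.
fn classic_bloom_bits(ndv: u64, fpp: f64) -> f64 {
    assert!(fpp > 0.0 && fpp < 1.0, "fpp must be in (0, 1)");
    -(ndv as f64) * fpp.ln() / std::f64::consts::LN_2.powi(2)
}

fn main() {
    // Show how the estimate scales with both parameters.
    for ndv in [1_000u64, 1_000_000] {
        for fpp in [0.05, 0.01] {
            let bits = classic_bloom_bits(ndv, fpp);
            println!(
                "ndv={ndv} fpp={fpp}: {:.0} bits ({:.2} bits per value)",
                bits,
                bits / ndv as f64
            );
        }
    }
}
```

The estimate grows linearly with ndv, so a guess that is off by 10x for a row group changes the filter size by roughly the same factor. That is why the question of whether users can supply a sensible per-row-group ndv matters.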
   
   I did some research of other implementations. Here are the Spark settings: 
https://spark.apache.org/docs/latest/configuration.html
   
   Property Name | Default | Meaning | Since Version
   -- | -- | -- | --
   spark.sql.optimizer.runtime.bloomFilter.creationSideThreshold | 10MB | Size threshold of the bloom filter creation side plan. Estimated size needs to be under this value to try to inject bloom filter. | 3.3.0
   spark.sql.optimizer.runtime.bloomFilter.enabled | false | When true and if one side of a shuffle join has a selective predicate, we attempt to insert a bloom filter in the other side to reduce the amount of shuffle data. | 3.3.0
   spark.sql.optimizer.runtime.bloomFilter.expectedNumItems | 1000000 | The default number of expected items for the runtime bloomfilter | 3.3.0
   spark.sql.optimizer.runtime.bloomFilter.maxNumBits | 67108864 | The max number of bits to use for the runtime bloom filter | 3.3.0
   spark.sql.optimizer.runtime.bloomFilter.maxNumItems | 4000000 | The max allowed number of expected items for the runtime bloom filter | 3.3.0
   spark.sql.optimizer.runtime.bloomFilter.numBits | 8388608 | The default number of bits to use for the runtime bloom filter | 3.3.0
   
   
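   The Arrow C++ writer seems to allow for the fpp setting (note that the 
linked option, `bloom_filter_fpp`, belongs to the ORC adapter's 
`WriteOptions`, not the Parquet writer):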
   
   
https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N5arrow8adapters3orc12WriteOptions16bloom_filter_fppE
   
   ```
   double bloom_filter_fpp = 0.05
   The upper limit of the false-positive rate of the bloom filter, default 0.05.
   ```
   
   Databricks seems to expose the fpp, max_fpp, and number of distinct values:
   
https://docs.databricks.com/sql/language-manual/delta-create-bloomfilter-index.html
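
As a sketch of what exposing these parameters on a writer could look like, the block below defines a hypothetical per-column options type. Everything in it (`BloomFilterOptions`, `with_fpp`, `with_ndv`) is an assumption made for illustration and not an existing arrow-rs API. The 0.05 default simply mirrors the C++ default quoted above, and ndv is left optional because callers may not know it per row group.

```rust
/// Hypothetical per-column Bloom filter options (not an arrow-rs type).
#[derive(Debug, Clone, PartialEq)]
struct BloomFilterOptions {
    /// Target false-positive probability, strictly between 0 and 1.
    fpp: f64,
    /// Expected number of distinct values, if the caller knows it.
    ndv: Option<u64>,
}

impl Default for BloomFilterOptions {
    fn default() -> Self {
        // 0.05 mirrors the C++ default quoted above; an assumption here.
        Self { fpp: 0.05, ndv: None }
    }
}

impl BloomFilterOptions {
    /// Set the false-positive probability, rejecting values outside (0, 1).
    fn with_fpp(mut self, fpp: f64) -> Result<Self, String> {
        if !(fpp > 0.0 && fpp < 1.0) {
            return Err(format!("fpp must be in (0, 1), got {fpp}"));
        }
        self.fpp = fpp;
        Ok(self)
    }

    /// Record the expected number of distinct values.
    fn with_ndv(mut self, ndv: u64) -> Self {
        self.ndv = Some(ndv);
        self
    }
}

fn main() {
    let opts = BloomFilterOptions::default()
        .with_fpp(0.01)
        .expect("0.01 is a valid fpp")
        .with_ndv(1_000_000);
    println!("{opts:?}");
}
```

Keeping ndv as an `Option` lets a writer fall back to a default size when no estimate is given, which is one possible answer to the per-row-group concern raised above.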
   
   _Originally posted by @alamb in 
https://github.com/apache/arrow-rs/pull/3119#pullrequestreview-1186585988_
         


