etseidl commented on code in PR #49377:
URL: https://github.com/apache/arrow/pull/49377#discussion_r2851026880


##########
python/pyarrow/parquet/core.py:
##########
@@ -951,6 +951,30 @@ def _sanitize_table(table, new_schema, flavor):
     are expressed in reference to midnight in the UTC timezone.
     If False (the default), the TIME columns are assumed to be expressed
     in reference to midnight in an unknown, presumably local, timezone.
+bloom_filter_options : dict, default None
+    Create Bloom filters for the columns specified by the provided `dict`.
+
+    Bloom filters can be configured with two parameters: number of distinct values
+    (NDV), and false-positive probability (FPP).
+
+    Bloom filters are most effective for high-cardinality columns. A good default
+    is to set NDV equal to the number of rows. Lower values reduce disk usage but
+    may not be worthwhile for very small NDVs. Increasing NDV (without increasing FPP)
+    increases disk and memory usage.
+
+    Lower FPP values require more disk and memory space. For a fixed NDV, the
+    space requirement grows roughly proportional to log(1/FPP). Recommended
+    values are 0.1, 0.05, or 0.01. Very small values are counterproductive as
+    the bitset may exceed the size of the actual data. Set NDV appropriately
+    to minimize space usage.
+
+    The keys of the `dict` are column paths. For each path, the value can be either:
+
+    - A boolean, with ``True`` indicating that a Bloom filter should be produced with
+      the default values of `NDV=1048576` and `FPP=0.05`.
+    - A dictionary, with keys `ndv` and `fpp`. `ndv` must be a positive integer, and
+      `fpp` must be a float between 0.0 and 1.0. Default values will be used for any

Review Comment:
   Yes. I've reworked the docs to hopefully make this clearer.
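   To illustrate the schema documented in the diff above, here is a sketch of a well-formed `bloom_filter_options` mapping. The `validate_bloom_filter_options` and `approx_bloom_bits` helpers are hypothetical, for illustration only; they are not part of pyarrow. The sizing formula is the classic Bloom filter estimate (linear in NDV, linear in log(1/FPP)), not necessarily what Parquet's split-block filters allocate.
   
   ```python
   import math
   
   DEFAULT_NDV = 1048576  # documented default NDV
   DEFAULT_FPP = 0.05     # documented default FPP
   
   def validate_bloom_filter_options(options):
       """Raise if `options` does not match the documented schema."""
       for column, value in options.items():
           if isinstance(value, bool):
               continue  # True -> Bloom filter with default NDV/FPP
           if isinstance(value, dict):
               ndv = value.get("ndv", DEFAULT_NDV)  # defaults fill missing keys
               fpp = value.get("fpp", DEFAULT_FPP)
               if not isinstance(ndv, int) or ndv <= 0:
                   raise ValueError(f"{column}: ndv must be a positive integer")
               if not 0.0 < fpp < 1.0:
                   raise ValueError(f"{column}: fpp must be between 0.0 and 1.0")
               continue
           raise TypeError(f"{column}: expected bool or dict")
   
   def approx_bloom_bits(ndv, fpp):
       # Classic sizing estimate: bits ~ -ndv * ln(fpp) / (ln 2)^2.
       # Shows the log(1/FPP) growth mentioned in the docstring.
       return math.ceil(-ndv * math.log(fpp) / math.log(2) ** 2)
   
   options = {
       "user_id": True,                              # defaults: NDV=1048576, FPP=0.05
       "session_id": {"ndv": 500_000, "fpp": 0.01},  # explicit tuning
   }
   validate_bloom_filter_options(options)  # no exception: mapping is well-formed
   ```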


