[
https://issues.apache.org/jira/browse/COLLECTIONS-883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18052083#comment-18052083
]
Alex Herbert commented on COLLECTIONS-883:
------------------------------------------
The use of an int for the size of the shape was a design decision. There was a
discussion about this on the mailing list during development, but I could not
find the reference. Disclaimer: I was in favour of using a long, given the
larger filters that could then be handled. However, using an int throughout the
package does make it more efficient in both memory consumption and runtime
speed. Some implementations are not practically possible with a long index, for
example the counting Bloom filter.
For an array-based Bloom filter using a single bit for each index you are
limited, within reason, to a long[] of the maximum array length. This would be
approximately 2^6 * 2^31 = 2^37 bits. This is larger than the shape limit of
~2^31, but only by 64-fold. So moving to a long for the shape would practically
increase capacity by 64x. Given the equation below, this increases the
supported n at the same false-positive probability by 64x. The smallest amount
of memory such a filter would consume is 2^37 bits = 128 Gbit = 16 GB. You need
a lot of RAM to use such a filter.
For reference, the Bloom filter implemented in Guava does allow addressing up
to 2^37 bits, using a long to represent the bit index, so implementations that
support this do exist.
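As an illustration, here is a minimal sketch of such a long-indexed bit array
(hypothetical code, not from this library or Guava):
{code:java}
/**
 * Minimal sketch, not library code: a bit array addressed by a long index,
 * backed by a long[]. Capacity is bounded by the maximum Java array length
 * of ~2^31 longs, i.e. ~2^37 bits.
 */
class LongIndexedBitArray {
    private final long[] words;

    LongIndexedBitArray(long numberOfBits) {
        long wordCount = (numberOfBits + 63) >>> 6; // 64 bits per long, rounded up
        if (numberOfBits < 1 || wordCount > Integer.MAX_VALUE - 8) {
            throw new IllegalArgumentException("Unsupported number of bits: " + numberOfBits);
        }
        words = new long[(int) wordCount];
    }

    void set(long bitIndex) {
        // the high bits select the word; the low 6 bits select the bit within it
        words[(int) (bitIndex >>> 6)] |= 1L << (bitIndex & 63);
    }

    boolean get(long bitIndex) {
        return (words[(int) (bitIndex >>> 6)] & (1L << (bitIndex & 63))) != 0;
    }
}
{code}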
Your example of a filter for half a billion items does not seem unreasonable.
Given the limit on bits above and various p-values, we can compute the maximum
supported number of items:
n = -m * ln(2)^2 / ln(p)
||p-value||m=2^31||m=2^37||
|0.001|149363281|9559249967|
|0.005|194734464|12463005678|
|0.01|224044921|14338874951|
|0.05|344411615|22042343372|
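These figures can be reproduced (to rounding) with a few throwaway lines, not
library code:
{code:java}
// Throwaway check: n = -m * ln(2)^2 / ln(p) for the m and p values above.
public class MaxItems {
    public static void main(String[] args) {
        double ln2Sq = Math.log(2) * Math.log(2);
        for (double p : new double[] {0.001, 0.005, 0.01, 0.05}) {
            for (long m : new long[] {1L << 31, 1L << 37}) {
                long n = (long) (-m * ln2Sq / Math.log(p));
                System.out.printf("p=%s, m=2^%d: n=%d%n", p, Long.numberOfTrailingZeros(m), n);
            }
        }
    }
}
{code}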
If the maximum n is set at a reasonable size of 2^31 (the largest number of
items you could hold in an array), then the smallest supported probability is:
p = exp(-m * ln(2)^2 / n)
||m=2^31||m=2^37||
|0.61850314|4.42468E-14|
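The same quick check for this table:
{code:java}
// Quick check: p = exp(-m * ln(2)^2 / n) with n fixed at 2^31 items.
public class MinProbability {
    public static void main(String[] args) {
        double ln2Sq = Math.log(2) * Math.log(2);
        double n = Math.pow(2, 31);
        System.out.println(Math.exp(-Math.pow(2, 31) * ln2Sq / n)); // ~0.6185
        System.out.println(Math.exp(-Math.pow(2, 37) * ln2Sq / n)); // ~4.42E-14
    }
}
{code}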
So the current implementation does not support very large numbers of items, but
using a long[] array for the bits would make a Bloom filter practical for all
reasonable numbers of items.
I believe there are some reasons we chose an int for the index:
The index used in all the interfaces is an int, which takes half the memory of
a long. This has implications for some of the implementations: they are
currently non-specialised and based around an int. A change to basing them
around a long would double some of the memory requirements when using the
library. The alternative is a doubling of the code, providing two (or more)
implementations of all affected classes that optimise the storage space based
on the maximum index size.
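As a rough illustration of the doubling (the item and hash counts here are
hypothetical, not taken from the library):
{code:java}
// Hypothetical illustration: memory to hold the bit indices generated for
// 1e8 items with k = 17 hash functions each, stored as int vs long.
public class IndexWidth {
    public static void main(String[] args) {
        long indices = 100_000_000L * 17;
        System.out.println(indices * Integer.BYTES / (1 << 30) + " GB as int");  // ~6 GB
        System.out.println(indices * Long.BYTES / (1 << 30) + " GB as long");    // ~12 GB
    }
}
{code}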
Some of the implementations may not be practical when addressing an index this
large. A sparse Bloom filter is currently based on a TreeSet. Although you may
not wish to saturate it, if you did then storing 2^37 longs in a TreeSet would
consume a lot of memory. Assuming 64 bits each for the forward and reverse
pointers from a node, 64 bits for the address of the node, and a Long object
wrapping the stored long, each node is roughly 5 * 64 bits = 40 bytes. Total =
2^37 * 40 bytes = 5120 GB of RAM. This is currently impractical, and
performance would be poor. The library would then have to only partially
support some implementations with regard to a long index, e.g. log a warning
about performance or throw an exception if the shape exceeds n=2^31.
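For clarity, the arithmetic behind that estimate (the 40 bytes per node is the
rough assumption stated above):
{code:java}
// Rough estimate: memory for a saturated sparse filter stored in a TreeSet.
public class SparseFilterMemory {
    public static void main(String[] args) {
        long nodes = 1L << 37;     // one TreeSet node per set bit when saturated
        long bytesPerNode = 5 * 8; // ~5 x 64-bit words per node
        System.out.println(((nodes * bytesPerNode) >> 30) + " GB"); // 5120 GB
    }
}
{code}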
The library has a lot of functionality beyond using an array of longs to store
bits. A quick glance through the code finds a few places where the Shape being
based on an int is assumed. I did not check the implications of a change to
long for the implementation details of the various interfaces, because changing
the API would break binary compatibility. If a change is made it would be for
v5.0 of the library, and it would require some further investigation of the
feasibility.
TL;DR: At best we could add support for a Bloom filter based on a long index
into a long[]. To maintain binary compatibility this would be an entirely new
API, and so it would not work with most of the existing code. The additional
use cases for a filter (sparse filters, counting filters, set operations, index
filtering) are made more efficient in both space and runtime by an int index.
But from a practical point of view, supporting a long index would allow a
filter to hold the maximum number of items you could reasonably handle
elsewhere in memory as an array.
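For reference, the failure in the report below reproduces with just the Shape
factory (the class name here is mine):
{code:java}
import org.apache.commons.collections4.bloomfilter.Shape;

public class ShapeLimitRepro {
    public static void main(String[] args) {
        // 500 million items at p = 0.005 requires ~5.5e9 bits, which exceeds
        // Integer.MAX_VALUE, so the factory rejects the shape.
        Shape shape = Shape.fromNP(500_000_000, 0.005);
    }
}
{code}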
> BloomFilter Shape class limits numberOfBits to int, preventing large-scale
> filters (>2.1B bits)
> -----------------------------------------------------------------------------------------------
>
> Key: COLLECTIONS-883
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-883
> Project: Commons Collections
> Issue Type: Bug
> Components: Bloomfilter
> Affects Versions: 4.5.0
> Reporter: Ayush Sharma
> Priority: Major
> Attachments: code.png, error.png
>
>
> *Problem*
> When creating a Bloom filter for large datasets using Shape.fromNP(n, p), the
> operation fails if the calculated number of bits exceeds Integer.MAX_VALUE
> (~2.1 billion).
> *Error Message*
> Resulting filter has more than 2147483647 bits: 7.569340059E9
> *Environment*
> * Dataset: ~500 million elements
> * False Positive Probability: 0.005
> * Apache Commons Collections version: 4.5.0-M2
> *Root Cause*
> The Shape class stores numberOfBits as an int:
> * Shape.fromNP(int n, double p)
> * Shape.fromKM(int k, int m)
> * Shape.fromNM(int n, int m)
> * Shape.getNumberOfBits() returns int
> For large-scale applications, the required bits can exceed Integer.MAX_VALUE.
> *Calculation*
> m = -n × ln(p) / (ln(2))²
> m = -500,000,000 × ln(0.005) / 0.4805
> m ≈ 5.5+ billion bits
> *Suggested Fix*
> Change numberOfBits from int to long in the Shape class and related methods.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)