[ 
https://issues.apache.org/jira/browse/COLLECTIONS-883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18052083#comment-18052083
 ] 

Alex Herbert commented on COLLECTIONS-883:
------------------------------------------

The use of int for the size of the shape was a design decision. There was a 
discussion about this on the mailing list during development. I could not find 
reference. Disclaimer: I was in favour of using a long given the extra size of 
filter that can be handled. However the use of an int for all the package 
functionality does make it more efficient in memory consumption and runtime 
speed. Some implementations are not practically possible when using a long 
index, for example the counting Bloom filter.

For an array based Bloom filter using single bits for each index you are 
limited within reason to using a long[] of the maximum array length. This would 
be approximately 2^6 * 2^31 = 2^37 bits. This is larger than the shape limit of 
~2^31, but only by 64-fold. So if you move to a long for the shape you can 
practically increase capacity by 64x. Given the equation above this is 
increasing the size of supported n with the same false-positive probability by 
64x. The smallest amount of memory this would consume would be 2^37/2^30 Gbits 
= 16Gb. You need a lot of RAM to use such a filter.

For reference the Bloom filter implemented within Guava does allow addressing 
bits up to 2^37 using a long to represent the bit index. So implementations 
that support this do exist.

Your example of a filter for half a billion items does not seem unreasonable. 
Given the limit of bits above and various p-values we can see the maximum 
support number of items:

n = -m * ln(2)^2 / ln(p)
||p-value||m=2^31||m=2^37||
|0.001|149363281|9559249967|
|0.005|194734464|12463005678|
|0.05|344411615|22042343372|
|0.01|224044921|14338874951|
|0.05|344411615|22042343372|

If the maximum n is set at a reasonable size of 2^31 (an array of items), then 
the smallest supported probability is:

p = exp(-m * ln(2)^2 / n)
||m=2^31||m=2^37||
|0.61850314|4.42468E-14|

So the current implementation does not support a large number of items but use 
of a long[] array for the bits would make a Bloom filter practical for all 
reasonable use cases of number of items.

I believe there are some reasons we choose int for the index:

The Index that is used in all the interfaces is an int and so saves half the 
memory space of using a long. This does have implications for some of the 
implementations. They are currently non-specialised and based around an int. A 
change to basing around a long would double some of the memory requirements 
when using the library. The alternative is a doubling of the code by providing 
two (or more) implementations of all affected classes that optimise the storage 
space based on the maximum index size.

Some of the implementations may not be practical when addressing an index this 
large. A sparse Bloom filter is currently based on a TreeSet. Although you may 
not wish to saturate it, if you did then storing 2^37 longs in a TreeSet would 
consume a lot of memory. Assuming 64 bits for the forward/reverse pointers from 
each node, 64 bits for the address of the node and the stored Long object 
wrapped around a long would be 5*64-bits per node. Total = 2^37 * 40 bytes = 
5120 Gb of RAM. This is currently impractical. Performance would be poor. The 
library would then have to partially support some implementations regarding a 
long index, e.g. log a warning about performance or throw an exception if the 
shape exceeds n=2^31.

The library has a lot of functionality beyond using an array of longs to store 
bits. A quick glance through the code finds a few places where the Shape being 
based on an int is an assumption. I did not check the implications of a change 
to long for the implementation details of various interfaces. This is because 
changing the API would break binary compatibility. If a change is made then it 
would be for v5.0 of the library, and would require some further investigation 
of the feasibility.

TLDR: At best we could add support for a Bloom filter based on using a long 
index into a long[]. To maintain binary compatibility this would be an entirely 
new API and so would not work with most of the existing code. The additional 
use cases (sparse filters, counting filters, set operations, index filtering) 
for a filter are made more efficient in terms of space and runtime using an int 
index. But from a practical point of view supporting a long index would allow 
supporting the maximum reasonable number of items you could handle in memory 
elsewhere as an array.

> BloomFilter Shape class limits numberOfBits to int, preventing large-scale 
> filters (>2.1B bits)
> -----------------------------------------------------------------------------------------------
>
>                 Key: COLLECTIONS-883
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-883
>             Project: Commons Collections
>          Issue Type: Bug
>          Components: Bloomfilter
>    Affects Versions: 4.5.0
>            Reporter: Ayush Sharma
>            Priority: Major
>         Attachments: code.png, error.png
>
>
> *Problem*
> When creating a Bloom filter for large datasets using Shape.fromNP(n, p), the 
> operation fails if the calculated number of bits exceeds Integer.MAX_VALUE 
> (~2.1 billion).
> *Error Message*
> Resulting filter has more than 2147483647 bits: 7.569340059E9
> *Environment*
> * Dataset: ~500 million elements
> * False Positive Probability: 0.005
> * Apache Commons Collections version: 4.5.0-M2
> *Root Cause*
> The Shape class stores numberOfBits as an int:
> * Shape.fromNP(int n, double p)
> * Shape.fromKM(int k, int m)  
> * Shape.fromNM(int n, int m)
> * Shape.getNumberOfBits() returns int
> For large-scale applications, the required bits can exceed Integer.MAX_VALUE.
> *Calculation*
> m = -n × ln(p) / (ln(2))²
> m = -500,000,000 × ln(0.005) / 0.4805
> m ≈ 5.5+ billion bits
> *Suggested Fix*
> Change numberOfBits from int to long in the Shape class and related methods.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to