Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Claude Warren Mon, 31 Aug 2015 02:08:54 -0700

Rob,

You have missed how the search works.

What we are searching for is not quads (or triples) but sets of nodes.
 We store nodes in sets of four (quads).  So each quad is reduced to a
single bloom filter comprising 4 items (15-bytes).  This allows the bloom
filter to replace all the standard indexes.

So now we have N bloom filters (where N is the number of quads in the
graph).  So searching the set becomes the problem.  It turns out that for
most cases performing a linear search is fastest.  There are some other
interesting options like Bloofi (
http://www.usna.edu/Users/cs/adina/research/Bloofi%20_CloudI2013.pdf) that
attempt to store the filters in a b-tree.  However, Bloofi does not work
well above a few thousand entries.

Another solution is to create buckets of the bloom filters and use a bloom
filter to gate them.  In this case you are correct and the number of items
in the filter is the size of the bucket, or more likely the size of the
bucket*4 (4 nodes in the quad).  While that number would be higher than
expected due to duplicate nodes (e.g. same subject on multiple quads), the
impact on the filter would be negligible.

Anyway, the point is that you can remove all the standard indexes by using
bloom filters.  Arranging the resulting filters in a way that provides
rapid searching when the list is very large can be a problem, but for an in
memory version a scan of the list will be fast enough.

Claude

On Mon, Aug 31, 2015 at 9:44 AM, Rob Vesse <[email protected]> wrote:

> On 30/08/2015 00:18, "Claude Warren" <[email protected]> wrote:
>
> >Quickly searching a large collection of bloom filters.
>
> This is O(k) where k is a constant so even with lots of filters isn't this
> still a constant time operation?
>
> Rob
>
>
>
>
>

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Reply via email to