Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Andy Seaborne Mon, 31 Aug 2015 09:45:03 -0700

On 31/08/15 16:46, Claude Warren wrote:

matching depends on which match step you mean.





*Bloom Filter Match*
create a bloom filter for your target ( call it T).  Compare with each
Bloomfilter in the table (call these B).  A match is this case is an "AND"
operation on the 2 bitmaps if
T & B == T then it is a match.

find(G,S,*,*) is all quads in one particular graph that have the samesubject. Even in large datasets, that's 10s of quads, maybe 100, a verysmall %-age of the dataset.

So in step 2, if there are 1000 named graphs of 10,000 triples (=> 10e6quads), is that 10e6 checks? Or can something better be done? It'strying to be comparable to iterating over 100 items in a datastructure.


It would be very interesting to see this done for real.

Also, for this done as Java vs C because (below) it looks to me like RAMfetch dominates and Java, with no object method calls, JITs to C-likeperformance.

*Verification Match (Step 3 above)*
For the records returned by the bloom filter match there may be false
positives (this is the nature of the bloom filter).  So you look at each
record (not bloom filter) and verify that the requested fields (G and S in
the example above) from the returned records actually match the values for
G and S that you are looking for.


*Side Observation*
Paul mentioned building indexes on the fly in Salesforce.  It may be
possible to do something similar.  If you can create a subset of the data
that contains the solutions you are looking for and then index that it may
be faster than using a very large index.  But finding the subset is not
trivial and may be too expensive in and of itself.

Claude

Processing a long sequence of quads is not that processor cachefriendly. The comparison is really, really quick but predicativecache-ahead only goes so far; data still has to be fetched. If theprocessing is "free" (overlaps with a RAM fetch) then the dominant costis the time in RAM fetch (very roughly L2 is x2 faster than main RAM oncommodity processors [1]).

An L1 or L2 cache line is (typically, today) 64 bytes so only a verysmall number of filters are in one line and not reused.

It seems to be to be a lot of cache misses (N/4). There is still a mainmemory fetch going on.

Now, if a disk, even an SSD, is involved, the whole balance of numbersis very different.


        Andy

[1]http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait/- it says x20 between L2 and RAM.


[2] http://eprints.soton.ac.uk/185969/



On Mon, Aug 31, 2015 at 4:35 PM, [email protected] <[email protected]>
wrote:

I apologize if I am being thick here, but I don't understand how one goes
about checking the potential match without some kind of covering resource
against which to do that, something with a full representation of the
graph. Can you elaborate on how to check the validity of the match?

Thank you for taking the time to walk through this!

---
A. Soroka
The University of Virginia Library

On Aug 31, 2015, at 10:04 AM, Claude Warren <[email protected]> wrote:

Step 3 is about removing the false positives from the bloom filter.  It

does not require an index, it requires checking the values to ensure match.

Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Reply via email to