On 31/08/15 16:46, Claude Warren wrote:
matching depends on which match step you mean.




*Bloom Filter Match*
create a bloom filter for your target ( call it T).  Compare with each
Bloomfilter in the table (call these B).  A match is this case is an "AND"
operation on the 2 bitmaps if
T & B == T then it is a match.

find(G,S,*,*) is all quads in one particular graph that have the same subject. Even in large datasets, that's 10s of quads, maybe 100, a very small %-age of the dataset.

So in step 2, if there are 1000 named graphs of 10,000 triples (=> 10e6 quads), is that 10e6 checks? Or can something better be done? It's trying to be comparable to iterating over 100 items in a datastructure.

It would be very interesting to see this done for real.

Also, for this done as Java vs C because (below) it looks to me like RAM fetch dominates and Java, with no object method calls, JITs to C-like performance.

*Verification Match (Step 3 above)*
For the records returned by the bloom filter match there may be false
positives (this is the nature of the bloom filter).  So you look at each
record (not bloom filter) and verify that the requested fields (G and S in
the example above) from the returned records actually match the values for
G and S that you are looking for.


*Side Observation*
Paul mentioned building indexes on the fly in Salesforce.  It may be
possible to do something similar.  If you can create a subset of the data
that contains the solutions you are looking for and then index that it may
be faster than using a very large index.  But finding the subset is not
trivial and may be too expensive in and of itself.

Claude

Processing a long sequence of quads is not that processor cache friendly. The comparison is really, really quick but predicative cache-ahead only goes so far; data still has to be fetched. If the processing is "free" (overlaps with a RAM fetch) then the dominant cost is the time in RAM fetch (very roughly L2 is x2 faster than main RAM on commodity processors [1]).

An L1 or L2 cache line is (typically, today) 64 bytes so only a very small number of filters are in one line and not reused.

It seems to be to be a lot of cache misses (N/4). There is still a main memory fetch going on.

Now, if a disk, even an SSD, is involved, the whole balance of numbers is very different.

        Andy

[1] http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait/ - it says x20 between L2 and RAM.

[2] http://eprints.soton.ac.uk/185969/



On Mon, Aug 31, 2015 at 4:35 PM, [email protected] <[email protected]>
wrote:

I apologize if I am being thick here, but I don't understand how one goes
about checking the potential match without some kind of covering resource
against which to do that, something with a full representation of the
graph. Can you elaborate on how to check the validity of the match?

Thank you for taking the time to walk through this!

---
A. Soroka
The University of Virginia Library

On Aug 31, 2015, at 10:04 AM, Claude Warren <[email protected]> wrote:

Step 3 is about removing the false positives from the bloom filter.  It
does not require an index, it requires checking the values to ensure match.





Reply via email to