To find(G,S,*,*) with bloom filters and return an iterator you:

1. construct a bloom filter from G and S
2. scan the list of quads, checking each quad's filter for matches
3. for each quad that matches, verify that it actually has G and S (I have done this with an extended iterator in Jena)

The result is an iterator that returns all (G,S,*,*) quads. Similar tests can be performed for any pattern -- the same code is used. Step 2 is the expensive one, but the bloom filter check is so efficient that it becomes very difficult to perform search operations in less time than it takes to scan the list.
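In rough Java the idea looks something like the sketch below. It is only an illustration: the Quad record, the single 64-bit filter per quad and the helper names are illustrative, not the Jena code, and a real store would keep the per-quad filter precomputed and stored rather than rebuild it during the scan.

    import java.util.Iterator;
    import java.util.List;
    import java.util.Objects;

    // Sketch of the two-step bloom-filter scan; the Quad record, the 64-bit filter
    // and the helper names are illustrative only.
    class BloomQuadScan {

        // Hypothetical quad holder; a null term in a pattern means "wildcard".
        record Quad(Object g, Object s, Object p, Object o) {}

        // Fold the non-null terms into one small (here 64-bit) Bloom filter.
        static long filterOf(Object... terms) {
            long bits = 0L;
            for (Object t : terms) {
                if (t == null) continue;          // wildcards contribute no bits
                int h = t.hashCode();
                bits |= 1L << (h & 63);           // two bit positions derived from the term's hash
                bits |= 1L << ((h >>> 16) & 63);
            }
            return bits;
        }

        // find(G,S,*,*): 1. build the pattern filter, 2. scan and check filters, 3. verify exactly.
        static Iterator<Quad> find(List<Quad> quads, Object g, Object s, Object p, Object o) {
            long pattern = filterOf(g, s, p, o);
            return quads.stream()
                    // step 2: every bit of the pattern filter must be set in the quad's filter
                    .filter(q -> (filterOf(q.g(), q.s(), q.p(), q.o()) & pattern) == pattern)
                    // step 3: weed out false positives by checking the bound terms exactly
                    .filter(q -> (g == null || Objects.equals(g, q.g()))
                              && (s == null || Objects.equals(s, q.s()))
                              && (p == null || Objects.equals(p, q.p()))
                              && (o == null || Objects.equals(o, q.o())))
                    .iterator();
        }
    }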
Claude

On Mon, Aug 31, 2015 at 11:01 AM, Andy Seaborne <[email protected]> wrote:

> On 29/08/15 14:55, Claude Warren wrote:
>
>> Something I have been thinking about....
>>
>> you could replace GSPO, GOPS, SPOG, OSGP, PGSO, OPSG. with a single bloomfilter implementation. It means a 2 step process to find matches but it might be fast enough and reduce the overhead significantly.
>>
>> I did an in-memory and a relational DB based version recently, but it was just a quick POC.
>>
>> Claude
>
> So we're talking about in-memory, where the items are java classes. A quad is 2 slots java overhead + 4 slots for G, S, P, O pointers. That's 48 bytes if the heap is >32G and 24 bytes otherwise (compressed pointers or 32 bit).
>
> For storage, the key test is "contains" to maintain the "set of" semantics. Something to stop index traversal for each insert would be great, but it's still stored, and 1 index rather than up to 6 would be good. (Note that most data is unique quads.)
>
> The important retrieval operation is find(G,S,P,O) where any of those can be a wildcard, returning (ideally as a stream) all matching quads with a prefix. The multiple indexes exist to find based on prefix.
>
> How would that work for, say, find(G,S,*,*) with bloom filters and 1b quads? How does the code go from returning G,S,P1,O1 to the next G,S,P1,O2 without trying every value for the O slot?
>
> For a hash map based hierarchical index G->S->P->O, it's O(1) to find the start of the scan, then datastructure iteration. A hash-based index is not necessarily the best choice [*] but it's a baseline to discuss.
>
> And in memory, will a bloom filter-based system be faster? Because of false positives, isn't a definitive index still needed? If one is kept, not 6, there could be great space gains, but every quad returned is a top-to-bottom traversal of that index (which is now not a range index).
>
> The design should work for 1+ billion in-memory quads - that's the way the world is going.
>
>> So each quad is reduced to a single bloom filter comprising 4 items (15 bytes).
>
> Andy
>
> [*] even in memory, it might be worth allocating internal ids and working in longs like a disk based system because it is more compact - naive hashmaps take a lot of space when storing small items like quads. tradeoffs, tradeoffs, ...
>
>> On Wed, Aug 26, 2015 at 3:27 PM, A. Soroka <[email protected]> wrote:
>>
>>> Hey, folks--
>>>
>>> There hasn't been too much feedback on my proposal for a journaling DatasetGraph:
>>>
>>> https://github.com/ajs6f/jena/tree/JournalingDatasetgraph
>>>
>>> which was and is to be a step towards JENA-624: Develop a new in-memory RDF Dataset implementation. So I'm moving on to look at the real problem: an in-memory DatasetGraph with high concurrency, for use with modern hardware running many, many threads in large core memory.
>>>
>>> I'm beginning to sketch out rough code, and I'd like to run some design decisions past the list to get criticism/advice/horrified warnings/whatever needs to be said.
>>>
>>> 1) All-transactional action: i.e. no non-transactional operation. This is obviously a great thing for simplifying my work, but I hope it won't be out of line with the expected uses for this stuff.
>>>
>>> 2) 6 covering indexes in the forms GSPO, GOPS, SPOG, OSGP, PGSO, OPSG. I figure to play to the strength of in-core-memory operation: raw speed, but obviously this is going to cost space.
>>>
>>> 3) At least for now, all commits succeed.
>>>
>>> 4) The use of persistent datastructures to avoid complex and error-prone fine-grained locking regimes. I'm using http://pcollections.org/ for now, but I am in no way committed to it nor do I claim to have thoroughly vetted it. It's simple but enough to get started, and that's all I need to bring the real design questions into focus.
>>>
>>> 5) Snapshot isolation. Transactions do not see commits that occur during their lifetime. Each works entirely from the state of the DatasetGraph at the start of its life.
>>>
>>> 6) Only as many as one transaction per thread, for now. Transactions are not thread-safe. These are simplifying assumptions that could be relaxed later.
>>>
>>> My current design operates as follows:
>>>
>>> At the start of a transaction, a fresh in-transaction reference is taken atomically from the AtomicReference that points to the index block. As operations are performed in the transaction, that in-transaction reference is progressed (in the sense in which any persistent datastructure is progressed) while the operations are recorded. Upon an abort, the in-transaction reference and the record are just thrown away. Upon a commit, the in-transaction reference is thrown away and the operation record is re-run against the main reference (the one that is copied at the beginning of a transaction). That rerun happens inside an atomic update (hence the use of AtomicReference). This all should avoid the need for explicit locking in Jena and should confine any blocking against the indexes to the actual duration of a commit.
>>>
>>> What do you guys think?
>>>
>>> ---
>>> A. Soroka
>>> The University of Virginia Library

--
I like: Like Like - The likeliest place on the web <http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
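For concreteness, the commit scheme A. Soroka describes above can be sketched in a few lines against pcollections. The class names, the Quad record and the single PSet standing in for the index block are illustrative only, not the JournalingDatasetgraph code.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicReference;
    import java.util.function.UnaryOperator;

    import org.pcollections.HashTreePSet;
    import org.pcollections.PSet;

    // Illustrative sketch of the snapshot-isolation scheme: a persistent set stands in
    // for the index block, and commits replay the operation record atomically.
    class SnapshotQuadStore {

        record Quad(Object g, Object s, Object p, Object o) {}

        // The shared committed state, held behind an AtomicReference.
        private final AtomicReference<PSet<Quad>> committed =
                new AtomicReference<>(HashTreePSet.<Quad>empty());

        // One transaction per thread; it works entirely from the snapshot taken at begin().
        class Transaction {
            private PSet<Quad> snapshot = committed.get();   // fresh in-transaction reference
            private final List<UnaryOperator<PSet<Quad>>> record = new ArrayList<>();

            void add(Quad q) {
                snapshot = snapshot.plus(q);                  // progress the persistent structure...
                record.add(set -> set.plus(q));               // ...and record the operation
            }

            void delete(Quad q) {
                snapshot = snapshot.minus(q);
                record.add(set -> set.minus(q));
            }

            boolean contains(Quad q) {
                return snapshot.contains(q);                  // reads see only the snapshot
            }

            void abort() {
                record.clear();                               // discard the record; the snapshot is abandoned with the transaction
            }

            // Replay the record against the main reference inside a single atomic update.
            void commit() {
                committed.updateAndGet(current -> {
                    PSet<Quad> result = current;
                    for (UnaryOperator<PSet<Quad>> op : record) {
                        result = op.apply(result);
                    }
                    return result;
                });
                record.clear();
            }
        }

        Transaction begin() {
            return new Transaction();
        }
    }

Because each transaction reads only its own snapshot and a commit is a single updateAndGet, blocking against the shared state is confined to the replay inside that atomic update, which matches the stated goal of confining blocking to the actual duration of a commit.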
