I'm still a bit confused as to why you don't regard step 3 as being potentially very expensive. In order to verify a match, we will have to examine an "exact" index, and that (as Andy remarked) is likely to require traversal, or else we throw away all the space gains.
Is this technique a way to pay a lot of time for a lot of space savings? Perhaps it is appropriate as an alternative implementation for very large datasets?

---
A. Soroka
The University of Virginia Library

On Aug 31, 2015, at 6:48 AM, Claude Warren <[email protected]> wrote:

> To evaluate find(G,S,*,*) with bloom filters and return an iterator you:
>
> 1. construct a bloom filter with G and S
> 2. scan the list of quads checking for matches
> 3. for each result that matches, verify that it has G and S (I have done
>    this with an extended iterator in Jena)
>
> The result is an iterator that returns all (G,S,*,*) quads.
>
> Similar tests can be performed for any pattern -- same code used.
>
> Step 2 is the expensive one. But the bloom filter check is so efficient
> that it becomes very difficult to perform search operations in less time
> than it takes to scan the list.
>
> Claude
>
> On Mon, Aug 31, 2015 at 11:01 AM, Andy Seaborne <[email protected]> wrote:
>
>> On 29/08/15 14:55, Claude Warren wrote:
>>
>>> Something I have been thinking about....
>>>
>>> You could replace GSPO, GOPS, SPOG, OSGP, PGSO, OPSG with a single
>>> bloom filter implementation. It means a 2-step process to find matches,
>>> but it might be fast enough and reduce the overhead significantly.
>>>
>>> I did an in-memory and a relational-DB-based version recently, but it
>>> was just a quick POC.
>>>
>>> Claude
>>
>> So we're talking about in-memory, where the items are Java classes. A
>> quad is 2 slots of Java overhead + 4 slots for G, S, P, O pointers.
>> That's 48 bytes if the heap is >32G and 24 bytes otherwise (compressed
>> pointers or 32 bit).
>>
>> For storage, the key test is "contains", to maintain the "set of"
>> semantics. Something to stop index traversal for each insert would be
>> great, but each quad is still stored, and storing it once, not up to 6
>> times, would be good. (Note that most data is unique quads.)
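[Editor's note: Claude's three steps above can be sketched as follows. This is a minimal illustration under stated assumptions -- a hypothetical Quad class, a 64-bit per-quad filter rather than the ~15-byte one discussed, and a made-up hashing scheme -- not Jena's actual code.]

```java
import java.util.ArrayList;
import java.util.List;

// Two-phase lookup: each quad keeps a tiny bloom filter over its four terms;
// a pattern's filter is built from its bound terms only; a linear scan keeps
// quads whose filter contains all the pattern's bits, and an exact check
// removes bloom false positives (Soroka's step-3 concern).
public class BloomQuadScan {

    static final class Quad {
        final String g, s, p, o;
        final long filter;                  // per-quad bloom filter (64 bits here)
        Quad(String g, String s, String p, String o) {
            this.g = g; this.s = s; this.p = p; this.o = o;
            this.filter = bits(g) | bits(s) | bits(p) | bits(o);
        }
    }

    // Set k=2 bits per term, derived from two mixes of the term's hash.
    static long bits(String term) {
        int h = term.hashCode();
        int h2 = Integer.reverse(h) * 0x9E3779B9;
        return (1L << (h & 63)) | (1L << (h2 & 63));
    }

    // Step 1: build the pattern filter from bound slots only (null = wildcard).
    static long patternFilter(String g, String s, String p, String o) {
        long f = 0L;
        if (g != null) f |= bits(g);
        if (s != null) f |= bits(s);
        if (p != null) f |= bits(p);
        if (o != null) f |= bits(o);
        return f;
    }

    // Step 2: scan the list; step 3: verify exactly to discard false positives.
    static List<Quad> find(List<Quad> quads, String g, String s, String p, String o) {
        long pf = patternFilter(g, s, p, o);
        List<Quad> out = new ArrayList<>();
        for (Quad q : quads) {
            if ((q.filter & pf) != pf) continue;       // cheap bloom check
            if (g != null && !g.equals(q.g)) continue; // exact verification
            if (s != null && !s.equals(q.s)) continue;
            if (p != null && !p.equals(q.p)) continue;
            if (o != null && !o.equals(q.o)) continue;
            out.add(q);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Quad> quads = new ArrayList<>();
        quads.add(new Quad("g1", "s1", "p1", "o1"));
        quads.add(new Quad("g1", "s1", "p1", "o2"));
        quads.add(new Quad("g2", "s2", "p1", "o1"));
        // find(G,S,*,*): the same scan code serves every pattern shape.
        System.out.println(find(quads, "g1", "s1", null, null).size()); // 2
    }
}
```

The same scan handles any pattern because an unbound slot simply contributes no bits to the pattern filter; the cost Soroka is asking about is that the exact verification in step 3 must consult the quad itself (or an exact index).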
>>
>> The important retrieval operation is find(G,S,P,O), where any of those
>> can be a wildcard, returning (ideally as a stream) all quads matching
>> the prefix. The multiple indexes exist to find based on prefix.
>>
>> How would that work for, say, find(G,S,*,*) with bloom filters and 1B
>> quads? How does the code go from returning G,S,P1,O1 to the next
>> G,S,P1,O2 without trying every value for the O slot?
>>
>> For a hash-map-based hierarchical index G->S->P->O, it's O(1) to find
>> the start of the scan, then datastructure iteration. A hash-based index
>> is not necessarily the best choice [*] but it's a baseline to discuss.
>>
>> And in memory, will a bloom-filter-based system be faster? Because of
>> false positives, isn't a definitive index still needed? If one is kept,
>> not 6, there could be great space gains, but every quad returned is a
>> top-to-bottom traversal of that index (which is now not a range index).
>>
>> The design should work for 1+ billion in-memory quads - that's the way
>> the world is going.
>>
>>> So each quad is reduced to a single bloom filter comprising 4 items
>>> (15 bytes).
>>
>> Andy
>>
>> [*] Even in memory, it might be worth allocating internal ids and
>> working in longs, like a disk-based system, because it is more compact -
>> naive hashmaps take a lot of space when storing small items like quads.
>> Tradeoffs, tradeoffs, ...
>>
>>> On Wed, Aug 26, 2015 at 3:27 PM, A. Soroka <[email protected]> wrote:
>>>
>>>> Hey, folks--
>>>>
>>>> There hasn't been too much feedback on my proposal for a journaling
>>>> DatasetGraph:
>>>>
>>>> https://github.com/ajs6f/jena/tree/JournalingDatasetgraph
>>>>
>>>> which was and is to be a step towards JENA-624: Develop a new
>>>> in-memory RDF Dataset implementation. So I'm moving on to look at the
>>>> real problem: an in-memory DatasetGraph with high concurrency, for use
>>>> with modern hardware running many, many threads in large core memory.
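[Editor's note: Andy's baseline above can be sketched as a nested-map index; a sketch only, with illustrative names rather than Jena's API.]

```java
import java.util.*;

// Hierarchical hash index G -> S -> P -> Set<O>. A find(G,S,*,*) pattern
// does two O(1) map lookups to reach the (G,S) subtree, then pure
// datastructure iteration: the iterator moves from G,S,P1,O1 to G,S,P1,O2
// by walking the leaf O-set, with no per-quad filtering and no false
// positives to verify.
public class HierarchicalIndex {
    private final Map<String, Map<String, Map<String, Set<String>>>> index = new HashMap<>();

    public void add(String g, String s, String p, String o) {
        index.computeIfAbsent(g, k -> new HashMap<>())
             .computeIfAbsent(s, k -> new HashMap<>())
             .computeIfAbsent(p, k -> new HashSet<>())
             .add(o);
    }

    // find(G,S,*,*): two hash lookups, then iterate the subtree.
    public List<String[]> findGS(String g, String s) {
        Map<String, Map<String, Set<String>>> byS = index.getOrDefault(g, Map.of());
        Map<String, Set<String>> byP = byS.getOrDefault(s, Map.of());
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : byP.entrySet())
            for (String o : e.getValue())
                out.add(new String[] { g, s, e.getKey(), o });
        return out;
    }
}
```

Other prefix shapes (find(G,*,*,*), find(*,S,P,*), ...) are why six differently-ordered covering indexes are kept in the first place: each nesting order only serves the patterns that bind a prefix of that order.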
>>>>
>>>> I'm beginning to sketch out rough code, and I'd like to run some
>>>> design decisions past the list to get criticism/advice/horrified
>>>> warnings/whatever needs to be said.
>>>>
>>>> 1) All-transactional action: i.e. no non-transactional operation. This
>>>> is obviously a great thing for simplifying my work, but I hope it
>>>> won't be out of line with the expected uses for this stuff.
>>>>
>>>> 2) 6 covering indexes in the forms GSPO, GOPS, SPOG, OSGP, PGSO, OPSG.
>>>> I figure to play to the strength of in-core-memory operation: raw
>>>> speed, but obviously this is going to cost space.
>>>>
>>>> 3) At least for now, all commits succeed.
>>>>
>>>> 4) The use of persistent datastructures to avoid complex and
>>>> error-prone fine-grained locking regimes. I'm using
>>>> http://pcollections.org/ for now, but I am in no way committed to it,
>>>> nor do I claim to have thoroughly vetted it. It's simple but enough to
>>>> get started, and that's all I need to bring the real design questions
>>>> into focus.
>>>>
>>>> 5) Snapshot isolation. Transactions do not see commits that occur
>>>> during their lifetime. Each works entirely from the state of the
>>>> DatasetGraph at the start of its life.
>>>>
>>>> 6) Only as many as one transaction per thread, for now. Transactions
>>>> are not thread-safe. These are simplifying assumptions that could be
>>>> relaxed later.
>>>>
>>>> My current design operates as follows:
>>>>
>>>> At the start of a transaction, a fresh in-transaction reference is
>>>> taken atomically from the AtomicReference that points to the index
>>>> block. As operations are performed in the transaction, that
>>>> in-transaction reference is progressed (in the sense in which any
>>>> persistent datastructure is progressed) while the operations are
>>>> recorded. Upon an abort, the in-transaction reference and the record
>>>> are just thrown away.
>>>> Upon a commit, the in-transaction reference is thrown away and the
>>>> operation record is re-run against the main reference (the one that is
>>>> copied at the beginning of a transaction). That rerun happens inside
>>>> an atomic update (hence the use of AtomicReference). This all should
>>>> avoid the need for explicit locking in Jena and should confine any
>>>> blocking against the indexes to the actual duration of a commit.
>>>>
>>>> What do you guys think?
>>>>
>>>> ---
>>>> A. Soroka
>>>> The University of Virginia Library
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
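[Editor's note: the snapshot-and-replay commit scheme Soroka describes can be sketched as below. This is a toy under stated assumptions: an immutable java.util Set stands in for a real persistent data structure (his prototype uses http://pcollections.org/), a Set of strings stands in for the six covering indexes, and all names are illustrative, not the actual JENA-624 code.]

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// A transaction snapshots the AtomicReference at start (snapshot isolation:
// it never sees later commits), progresses its own local reference while
// recording each operation, and on commit replays the record against the
// main reference inside a single atomic update. Abort just discards both.
public class SnapshotStore {
    // The "index block": an immutable set standing in for the real indexes.
    private final AtomicReference<Set<String>> main = new AtomicReference<>(Set.of());

    public class Tx {
        private Set<String> local = main.get();       // fresh in-transaction reference
        private final List<Consumer<Set<String>>> record = new ArrayList<>(); // operation log

        public void add(String quad) {
            Set<String> next = new HashSet<>(local);  // a persistent structure would share state here
            next.add(quad);
            local = Set.copyOf(next);                 // "progress" the in-transaction reference
            record.add(set -> set.add(quad));         // record the operation for replay
        }

        public boolean contains(String quad) { return local.contains(quad); }

        public void commit() {
            // Replay the record in one atomic update; no explicit lock is
            // taken, and blocking is confined to the duration of the commit.
            main.updateAndGet(current -> {
                Set<String> next = new HashSet<>(current);
                for (Consumer<Set<String>> op : record) op.accept(next);
                return Set.copyOf(next);
            });
        }

        public void abort() { record.clear(); }       // local reference is simply discarded
    }

    public Tx begin() { return new Tx(); }
    public Set<String> snapshot() { return main.get(); }
}
```

Note the isolation property: a transaction begun before another's commit keeps answering from its start-of-life state, while the commit itself merges the recorded operations into whatever the main reference holds at commit time (which is why "all commits succeed" in this design).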
