I'm still a bit confused as to why you don't regard step 3 as being potentially very expensive. In order to verify a match, we will have to examine an "exact" index, and that (as Andy remarked) is likely to require traversal, or else we throw away all the space gains.
Is this technique a way to pay a lot of time for a lot of space savings? Perhaps it is appropriate as an alternative implementation for very large datasets?

---
A. Soroka
The University of Virginia Library

On Aug 31, 2015, at 6:48 AM, Claude Warren <[email protected]> wrote:

> To evaluate find(G,S,*,*) with bloom filters and return an iterator you:
>
> 1. construct a bloom filter with G and S
> 2. scan the list of quads checking for matches
> 3. for each result that matches, verify that it has G and S (I have done
>    this with an extended iterator in Jena)
>
> The result is an iterator that returns all (G,S,*,*) quads.
>
> Similar tests can be performed for any pattern -- same code used.
>
> Step 2 is the expensive one. But the bloom filter check is so efficient
> that it becomes very difficult to perform search operations in less time
> than it takes to scan the list.
>
> Claude
>
> On Mon, Aug 31, 2015 at 11:01 AM, Andy Seaborne <[email protected]> wrote:
>
>> On 29/08/15 14:55, Claude Warren wrote:
>>
>>> Something I have been thinking about....
>>>
>>> You could replace GSPO, GOPS, SPOG, OSGP, PGSO, OPSG with a single
>>> bloom filter implementation. It means a 2-step process to find matches,
>>> but it might be fast enough and reduce the overhead significantly.
>>>
>>> I did an in-memory and a relational-DB-based version recently, but it
>>> was just a quick POC.
>>>
>>> Claude
>>
>> So we're talking about in-memory, where the items are Java classes. A
>> quad is 2 slots of Java overhead + 4 slots for G, S, P, O pointers.
>> That's 48 bytes if the heap is >32G and 24 bytes otherwise (compressed
>> pointers or 32 bit).
>>
>> For storage, the key test is "contains", to maintain the "set of"
>> semantics. Something to stop index traversal for each insert would be
>> great, but each quad is still stored, and storing it once, not up to 6
>> times, would be good. (Note that most data is unique quads.)
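[Editor's note: Claude's three steps above can be sketched as follows. This is a minimal illustration under stated assumptions -- a hypothetical Quad class, a 64-bit per-quad filter rather than the ~15-byte one discussed, and a made-up hashing scheme -- not Jena's actual code.]

```java
import java.util.ArrayList;
import java.util.List;

// Two-phase lookup: each quad keeps a tiny bloom filter over its four terms;
// a pattern's filter is built from its bound terms only; a linear scan keeps
// quads whose filter contains all the pattern's bits, and an exact check
// removes bloom false positives (Soroka's step-3 concern).
public class BloomQuadScan {

    static final class Quad {
        final String g, s, p, o;
        final long filter;                  // per-quad bloom filter (64 bits here)
        Quad(String g, String s, String p, String o) {
            this.g = g; this.s = s; this.p = p; this.o = o;
            this.filter = bits(g) | bits(s) | bits(p) | bits(o);
        }
    }

    // Set k=2 bits per term, derived from two mixes of the term's hash.
    static long bits(String term) {
        int h = term.hashCode();
        int h2 = Integer.reverse(h) * 0x9E3779B9;
        return (1L << (h & 63)) | (1L << (h2 & 63));
    }

    // Step 1: build the pattern filter from bound slots only (null = wildcard).
    static long patternFilter(String g, String s, String p, String o) {
        long f = 0L;
        if (g != null) f |= bits(g);
        if (s != null) f |= bits(s);
        if (p != null) f |= bits(p);
        if (o != null) f |= bits(o);
        return f;
    }

    // Step 2: scan the list; step 3: verify exactly to discard false positives.
    static List<Quad> find(List<Quad> quads, String g, String s, String p, String o) {
        long pf = patternFilter(g, s, p, o);
        List<Quad> out = new ArrayList<>();
        for (Quad q : quads) {
            if ((q.filter & pf) != pf) continue;       // cheap bloom check
            if (g != null && !g.equals(q.g)) continue; // exact verification
            if (s != null && !s.equals(q.s)) continue;
            if (p != null && !p.equals(q.p)) continue;
            if (o != null && !o.equals(q.o)) continue;
            out.add(q);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Quad> quads = new ArrayList<>();
        quads.add(new Quad("g1", "s1", "p1", "o1"));
        quads.add(new Quad("g1", "s1", "p1", "o2"));
        quads.add(new Quad("g2", "s2", "p1", "o1"));
        // find(G,S,*,*): the same scan code serves every pattern shape.
        System.out.println(find(quads, "g1", "s1", null, null).size()); // 2
    }
}
```

The same scan handles any pattern because an unbound slot simply contributes no bits to the pattern filter; the cost Soroka is asking about is that the exact verification in step 3 must consult the quad itself (or an exact index).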
>>
>> The important retrieval operation is find(G,S,P,O), where any of those
>> can be a wildcard, returning (ideally as a stream) all quads matching
>> the prefix. The multiple indexes exist to find based on prefix.
>>
>> How would that work for, say, find(G,S,*,*) with bloom filters and 1B
>> quads? How does the code go from returning G,S,P1,O1 to the next
>> G,S,P1,O2 without trying every value for the O slot?
>>
>> For a hash-map-based hierarchical index G->S->P->O, it's O(1) to find
>> the start of the scan, then datastructure iteration. A hash-based index
>> is not necessarily the best choice [*] but it's a baseline to discuss.
>>
>> And in memory, will a bloom-filter-based system be faster? Because of
>> false positives, isn't a definitive index still needed? If one is kept,
>> not 6, there could be great space gains, but every quad returned is a
>> top-to-bottom traversal of that index (which is now not a range index).
>>
>> The design should work for 1+ billion in-memory quads - that's the way
>> the world is going.
>>
>>> So each quad is reduced to a single bloom filter comprising 4 items
>>> (15 bytes).
>>
>> Andy
>>
>> [*] Even in memory, it might be worth allocating internal ids and
>> working in longs, like a disk-based system, because it is more compact -
>> naive hashmaps take a lot of space when storing small items like quads.
>> Tradeoffs, tradeoffs, ...
>>
>>> On Wed, Aug 26, 2015 at 3:27 PM, A. Soroka <[email protected]> wrote:
>>>
>>>> Hey, folks--
>>>>
>>>> There hasn't been too much feedback on my proposal for a journaling
>>>> DatasetGraph:
>>>>
>>>> https://github.com/ajs6f/jena/tree/JournalingDatasetgraph
>>>>
>>>> which was and is to be a step towards JENA-624: Develop a new
>>>> in-memory RDF Dataset implementation. So I'm moving on to look at the
>>>> real problem: an in-memory DatasetGraph with high concurrency, for use
>>>> with modern hardware running many, many threads in large core memory.
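[Editor's note: Andy's baseline above can be sketched as a nested-map index; a sketch only, with illustrative names rather than Jena's API.]

```java
import java.util.*;

// Hierarchical hash index G -> S -> P -> Set<O>. A find(G,S,*,*) pattern
// does two O(1) map lookups to reach the (G,S) subtree, then pure
// datastructure iteration: the iterator moves from G,S,P1,O1 to G,S,P1,O2
// by walking the leaf O-set, with no per-quad filtering and no false
// positives to verify.
public class HierarchicalIndex {
    private final Map<String, Map<String, Map<String, Set<String>>>> index = new HashMap<>();

    public void add(String g, String s, String p, String o) {
        index.computeIfAbsent(g, k -> new HashMap<>())
             .computeIfAbsent(s, k -> new HashMap<>())
             .computeIfAbsent(p, k -> new HashSet<>())
             .add(o);
    }

    // find(G,S,*,*): two hash lookups, then iterate the subtree.
    public List<String[]> findGS(String g, String s) {
        Map<String, Map<String, Set<String>>> byS = index.getOrDefault(g, Map.of());
        Map<String, Set<String>> byP = byS.getOrDefault(s, Map.of());
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : byP.entrySet())
            for (String o : e.getValue())
                out.add(new String[] { g, s, e.getKey(), o });
        return out;
    }
}
```

Other prefix shapes (find(G,*,*,*), find(*,S,P,*), ...) are why six differently-ordered covering indexes are kept in the first place: each nesting order only serves the patterns that bind a prefix of that order.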
>>>>
>>>> I'm beginning to sketch out rough code, and I'd like to run some
>>>> design decisions past the list to get criticism/advice/horrified
>>>> warnings/whatever needs to be said.
>>>>
>>>> 1) All-transactional action: i.e. no non-transactional operation. This
>>>> is obviously a great thing for simplifying my work, but I hope it
>>>> won't be out of line with the expected uses for this stuff.
>>>>
>>>> 2) 6 covering indexes in the forms GSPO, GOPS, SPOG, OSGP, PGSO, OPSG.
>>>> I figure to play to the strength of in-core-memory operation: raw
>>>> speed, but obviously this is going to cost space.
>>>>
>>>> 3) At least for now, all commits succeed.
>>>>
>>>> 4) The use of persistent datastructures to avoid complex and
>>>> error-prone fine-grained locking regimes. I'm using
>>>> http://pcollections.org/ for now, but I am in no way committed to it,
>>>> nor do I claim to have thoroughly vetted it. It's simple but enough to
>>>> get started, and that's all I need to bring the real design questions
>>>> into focus.
>>>>
>>>> 5) Snapshot isolation. Transactions do not see commits that occur
>>>> during their lifetime. Each works entirely from the state of the
>>>> DatasetGraph at the start of its life.
>>>>
>>>> 6) Only as many as one transaction per thread, for now. Transactions
>>>> are not thread-safe. These are simplifying assumptions that could be
>>>> relaxed later.
>>>>
>>>> My current design operates as follows:
>>>>
>>>> At the start of a transaction, a fresh in-transaction reference is
>>>> taken atomically from the AtomicReference that points to the index
>>>> block. As operations are performed in the transaction, that
>>>> in-transaction reference is progressed (in the sense in which any
>>>> persistent datastructure is progressed) while the operations are
>>>> recorded. Upon an abort, the in-transaction reference and the record
>>>> are just thrown away.
>>>> Upon a commit, the in-transaction reference is thrown away and the
>>>> operation record is re-run against the main reference (the one that is
>>>> copied at the beginning of a transaction). That rerun happens inside
>>>> an atomic update (hence the use of AtomicReference). This all should
>>>> avoid the need for explicit locking in Jena and should confine any
>>>> blocking against the indexes to the actual duration of a commit.
>>>>
>>>> What do you guys think?
>>>>
>>>> ---
>>>> A. Soroka
>>>> The University of Virginia Library
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
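[Editor's note: the snapshot-and-replay commit scheme Soroka describes can be sketched as below. This is a toy under stated assumptions: an immutable java.util Set stands in for a real persistent data structure (his prototype uses http://pcollections.org/), a Set of strings stands in for the six covering indexes, and all names are illustrative, not the actual JENA-624 code.]

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// A transaction snapshots the AtomicReference at start (snapshot isolation:
// it never sees later commits), progresses its own local reference while
// recording each operation, and on commit replays the record against the
// main reference inside a single atomic update. Abort just discards both.
public class SnapshotStore {
    // The "index block": an immutable set standing in for the real indexes.
    private final AtomicReference<Set<String>> main = new AtomicReference<>(Set.of());

    public class Tx {
        private Set<String> local = main.get();       // fresh in-transaction reference
        private final List<Consumer<Set<String>>> record = new ArrayList<>(); // operation log

        public void add(String quad) {
            Set<String> next = new HashSet<>(local);  // a persistent structure would share state here
            next.add(quad);
            local = Set.copyOf(next);                 // "progress" the in-transaction reference
            record.add(set -> set.add(quad));         // record the operation for replay
        }

        public boolean contains(String quad) { return local.contains(quad); }

        public void commit() {
            // Replay the record in one atomic update; no explicit lock is
            // taken, and blocking is confined to the duration of the commit.
            main.updateAndGet(current -> {
                Set<String> next = new HashSet<>(current);
                for (Consumer<Set<String>> op : record) op.accept(next);
                return Set.copyOf(next);
            });
        }

        public void abort() { record.clear(); }       // local reference is simply discarded
    }

    public Tx begin() { return new Tx(); }
    public Set<String> snapshot() { return main.get(); }
}
```

Note the isolation property: a transaction begun before another's commit keeps answering from its start-of-life state, while the commit itself merges the recorded operations into whatever the main reference holds at commit time (which is why "all commits succeed" in this design).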
