> Probably a lot of codesign is possible if the index type is fixed,
> leading to simpler and better-performing code, but it would be nice to
> have a pluggable approach so that you can pick the structures that
> work for your workload.

This is certainly very true. I am trying to develop some good
abstractions to this end. So far the only sticking point is
DatasetGraph::listGraphNodes. For the moment, I have added a
listGraphNodes method to the index abstraction (alongside a find method
that returns Iterator<Quad>, to support DatasetGraph::find and
::findNG), because a map-based approach gives O(1) access to the start
of iteration, and I do want to be able to take advantage of that
performance. But I will continue to look for a better approach.
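For concreteness, here is the rough shape of the abstraction I am
experimenting with. This is a sketch only -- none of these names or
signatures are settled, and nothing below is committed code:

import java.util.Iterator;

import org.apache.jena.graph.Node;
import org.apache.jena.sparql.core.Quad;

/**
 * Provisional shape for a pluggable index: implementations might be
 * map-based, bloom-filter-based, etc.
 */
interface QuadIndex {

    // Mutators keep the index in step with the dataset.
    void add(Quad q);
    void delete(Quad q);

    /**
     * All quads matching a pattern; any slot may be Node.ANY.
     * Backs DatasetGraph::find and ::findNG.
     */
    Iterator<Quad> find(Node g, Node s, Node p, Node o);

    /**
     * The sticking point: a map-based implementation can answer this
     * in O(1)-into-iteration, so for now it lives on the abstraction
     * rather than being synthesized from find().
     */
    Iterator<Node> listGraphNodes();
}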
---
A. Soroka
The University of Virginia Library

On Aug 31, 2015, at 11:20 AM, Paul Houle <[email protected]> wrote:

> The thing about bloom filters is that they are (i) highly dense,
> because they are bit-packed, and (ii) accessed in sequential order, so
> that modern processors can fetch ahead and hide RAM latency.
>
> Most kinds of indexes, on the other hand, involve pointer chasing over
> an area that does not fit in the cache, so your brand-new processor is
> performing like a Pentium II. Even if the O() part is better for some
> other algorithm, the constant in front is awesome for the bloom
> filter.
>
> Probably a lot of codesign is possible if the index type is fixed,
> leading to simpler and better-performing code, but it would be nice to
> have a pluggable approach so that you can pick the structures that
> work for your workload. Also, from a research perspective, we could
> get at real questions.
>
> There are a lot of patents from around 2000 on systems that profile
> the queries and automatically build indexes. For instance,
> Salesforce.com has a patent, and they also have an implementation
> where the "master copy" of data is stored in an Oracle table that
> looks a lot like a triple store, and then it automatically creates
> materialized views and functional indexes in Oracle to support client
> workloads.
>
> (I remember writing a query against one of the larger Salesforce orgs
> and being frustrated when it timed out, only to run the same query two
> minutes later and have it work perfectly.)
>
> Patents do expire, and it won't be too long before this technology
> could make it into open-source tools.
>
> On Mon, Aug 31, 2015 at 10:04 AM, Claude Warren <[email protected]> wrote:
>
>> Step 3 is relatively expensive if there are a lot of quads with the G
>> and S values (e.g. *,G,*,S) but few with G,S,*,*.
>>
>> Step 3 is about removing the false positives from the bloom filter.
>> It does not require an index; it requires checking the values to
>> ensure a match.
>>
>> There is a time-for-space trade-off here, but the time cost is not as
>> much as one would think. Keep in mind that a bloom filter check is
>> very fast. On the other hand, the traversal of a B-tree or other
>> index is much more computationally complex.
>>
>> For example, suppose you index bloom filters under a two-segment
>> compound key, and you need to check every key greater than the one
>> you are looking for. Even then, it is often faster to do the linear
>> search through the data, even when dealing with 1 million entries.
>> There are special cases where the index is faster, but for the most
>> part the linear search is faster.
>>
>> Claude
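To make the cost model here concrete for myself, I sketched the
scan-and-verify loop Claude describes. Everything below -- the hashing,
the 128-bit filter size, the Strings standing in for Nodes -- is
invented for illustration, not taken from his POC:

import java.util.Arrays;
import java.util.List;

/** Toy model: a quad is a String[4]; its filter is a 128-bit mask. */
final class BloomScan {

    static final int BITS = 128;  // 16 bytes/quad, near the ~15 cited below
    static final int HASHES = 3;  // bit positions set per term

    /** Build a filter over the non-wildcard terms (null = wildcard). */
    static long[] filter(String... terms) {
        long[] bits = new long[BITS / 64];
        for (String t : terms) {
            if (t == null) continue;          // wildcards contribute no bits
            for (int i = 0; i < HASHES; i++) {
                int bit = Math.floorMod(t.hashCode() + i * 0x9E3779B9, BITS);
                bits[bit >> 6] |= 1L << (bit & 63);
            }
        }
        return bits;
    }

    /** Cheap check: every bit of the pattern filter set in the quad's. */
    static boolean mightMatch(long[] quad, long[] pattern) {
        for (int i = 0; i < quad.length; i++)
            if ((quad[i] & pattern[i]) != pattern[i]) return false;
        return true;
    }

    public static void main(String[] args) {
        List<String[]> quads = Arrays.asList(
                new String[] { "g1", "s1", "p1", "o1" },
                new String[] { "g1", "s1", "p1", "o2" },
                new String[] { "g2", "s2", "p1", "o1" });

        long[] pattern = filter("g1", "s1", null, null);  // find(G,S,*,*)

        for (String[] q : quads) {
            if (!mightMatch(filter(q[0], q[1], q[2], q[3]), pattern))
                continue;                      // rejected by two word ANDs
            if ("g1".equals(q[0]) && "s1".equals(q[1]))  // step 3: verify
                System.out.println(Arrays.toString(q));
        }
    }
}

In a real implementation the per-quad filters would be precomputed and
stored alongside the quads, so the scan really is just sequential
word-wise AND operations over packed bits.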
>> On Mon, Aug 31, 2015 at 1:53 PM, [email protected] <[email protected]> wrote:
>>
>>> I'm still a bit confused as to why you don't regard step 3 as being
>>> potentially very expensive. In order to verify a match, we will have
>>> to examine an "exact" index, and that (as Andy remarked) is likely
>>> to require traversal, or else we throw away all the space gains.
>>>
>>> Is this technique a way to pay a lot of time for a lot of space
>>> savings? Perhaps it is appropriate for an alternative implementation
>>> for very large datasets?
>>>
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>>
>>> On Aug 31, 2015, at 6:48 AM, Claude Warren <[email protected]> wrote:
>>>
>>>> To do find(G,S,*,*) with bloom filters and return an iterator, you:
>>>>
>>>> 1. construct a bloom filter with G and S
>>>> 2. scan the list of quads, checking for matches
>>>> 3. for each result that matches, verify that it has G and S (I have
>>>> done this with an extended iterator in Jena)
>>>>
>>>> The result is an iterator that returns all (G,S,*,*) quads.
>>>>
>>>> Similar tests can be performed for any pattern -- the same code is
>>>> used.
>>>>
>>>> Step 2 is the expensive one. But the bloom filter check is so
>>>> efficient that it becomes very difficult to perform search
>>>> operations in less time than it takes to scan the list.
>>>>
>>>> Claude
>>>>
>>>> On Mon, Aug 31, 2015 at 11:01 AM, Andy Seaborne <[email protected]> wrote:
>>>>
>>>>> On 29/08/15 14:55, Claude Warren wrote:
>>>>>
>>>>>> Something I have been thinking about....
>>>>>>
>>>>>> You could replace GSPO, GOPS, SPOG, OSGP, PGSO, OPSG with a
>>>>>> single bloom filter implementation. It means a 2-step process to
>>>>>> find matches, but it might be fast enough and reduce the overhead
>>>>>> significantly.
>>>>>>
>>>>>> I did an in-memory and a relational-DB-based version recently,
>>>>>> but it was just a quick POC.
>>>>>>
>>>>>> Claude
>>>>>
>>>>> So we're talking about in-memory, where the items are Java
>>>>> classes. A quad is 2 slots of Java overhead + 4 slots for the G,
>>>>> S, P, O pointers. That's 48 bytes if the heap is >32G and 24 bytes
>>>>> otherwise (compressed pointers or 32 bit).
>>>>>
>>>>> For storage, the key test is "contains", to maintain the "set of"
>>>>> semantics. Something to stop index traversal for each insert would
>>>>> be great, but the quad is still stored, and storing it 1 time
>>>>> rather than up to 6 would be good. (Note that most data is unique
>>>>> quads.)
>>>>>
>>>>> The important retrieval operation is find(G,S,P,O), where any of
>>>>> those can be a wildcard, returning (ideally as a stream) all quads
>>>>> matching the prefix. The multiple indexes exist to support
>>>>> find-by-prefix.
>>>>>
>>>>> How would that work for, say, find(G,S,*,*) with bloom filters and
>>>>> 1B quads? How does the code go from returning G,S,P1,O1 to the
>>>>> next G,S,P1,O2 without trying every value for the O slot?
>>>>>
>>>>> For a hash-map-based hierarchical index G->S->P->O, it's O(1) to
>>>>> find the start of the scan, then datastructure iteration. A
>>>>> hash-based index is not necessarily the best choice [*], but it's
>>>>> a baseline to discuss.
>>>>>
>>>>> And in memory, will a bloom-filter-based system be faster? Because
>>>>> of false positives, isn't a definitive index still needed? If one
>>>>> is kept, not 6, there could be great space gains, but every quad
>>>>> returned costs a top-to-bottom traversal of that index (which is
>>>>> now not a range index).
>>>>>
>>>>> The design should work for 1+ billion in-memory quads -- that's
>>>>> the way the world is going.
>>>>>
>>>>>> So each quad is reduced to a single bloom filter comprising 4
>>>>>> items (15 bytes).
>>>>>>
>>>>>
>>>>> Andy
>>>>>
>>>>> [*] Even in memory, it might be worth allocating internal ids and
>>>>> working in longs, like a disk-based system, because it is more
>>>>> compact -- naive hashmaps take a lot of space when storing small
>>>>> items like quads. Tradeoffs, tradeoffs, ...
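Andy's hash-map baseline is easy to picture. A minimal illustration,
with Strings standing in for Nodes, only the G,S prefix case shown, and
every name here invented for the example:

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** A hierarchical hash index G -> S -> P -> {O}. */
final class NestedIndex {

    final Map<String, Map<String, Map<String, Set<String>>>> gspo =
            new HashMap<>();

    void add(String g, String s, String p, String o) {
        gspo.computeIfAbsent(g, k -> new HashMap<>())
            .computeIfAbsent(s, k -> new HashMap<>())
            .computeIfAbsent(p, k -> new HashSet<>())
            .add(o);                  // the Set supplies the "contains" test
    }

    /** find(G,S,*,*): two O(1) hops to the start, then pure iteration. */
    void findGS(String g, String s) {
        gspo.getOrDefault(g, Collections.emptyMap())
            .getOrDefault(s, Collections.emptyMap())
            .forEach((p, os) -> os.forEach(
                o -> System.out.println(g + " " + s + " " + p + " " + o)));
    }

    public static void main(String[] args) {
        NestedIndex index = new NestedIndex();
        index.add("g1", "s1", "p1", "o1");
        index.add("g1", "s1", "p1", "o2");
        index.add("g2", "s1", "p1", "o1");
        index.findGS("g1", "s1");     // prints exactly the two g1/s1 quads
    }
}

This answers Andy's iteration question for the map case: moving from
G,S,P1,O1 to G,S,P1,O2 is just advancing an iterator, with no per-quad
checks and no false positives. Each wildcard shape wants a different
nesting order, hence the six indexes.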
>>>>>
>>>>>> On Wed, Aug 26, 2015 at 3:27 PM, A. Soroka <[email protected]> wrote:
>>>>>>
>>>>>>> Hey, folks--
>>>>>>>
>>>>>>> There hasn't been too much feedback on my proposal for a
>>>>>>> journaling DatasetGraph:
>>>>>>>
>>>>>>> https://github.com/ajs6f/jena/tree/JournalingDatasetgraph
>>>>>>>
>>>>>>> which was and is to be a step towards JENA-624: Develop a new
>>>>>>> in-memory RDF Dataset implementation. So I'm moving on to look
>>>>>>> at the real problem: an in-memory DatasetGraph with high
>>>>>>> concurrency, for use with modern hardware running many, many
>>>>>>> threads in large core memory.
>>>>>>>
>>>>>>> I'm beginning to sketch out rough code, and I'd like to run some
>>>>>>> design decisions past the list to get criticism/advice/horrified
>>>>>>> warnings/whatever needs to be said.
>>>>>>>
>>>>>>> 1) All-transactional operation, i.e. no non-transactional
>>>>>>> operation. This is obviously a great thing for simplifying my
>>>>>>> work, but I hope it won't be out of line with the expected uses
>>>>>>> for this stuff.
>>>>>>>
>>>>>>> 2) Six covering indexes, in the forms GSPO, GOPS, SPOG, OSGP,
>>>>>>> PGSO, OPSG. I figure to play to the strength of in-core-memory
>>>>>>> operation: raw speed. But obviously this is going to cost space.
>>>>>>>
>>>>>>> 3) At least for now, all commits succeed.
>>>>>>>
>>>>>>> 4) The use of persistent datastructures to avoid complex and
>>>>>>> error-prone fine-grained locking regimes. I'm using
>>>>>>> http://pcollections.org/ for now, but I am in no way committed
>>>>>>> to it, nor do I claim to have thoroughly vetted it. It's simple
>>>>>>> but enough to get started, and that's all I need to bring the
>>>>>>> real design questions into focus.
>>>>>>>
>>>>>>> 5) Snapshot isolation. Transactions do not see commits that
>>>>>>> occur during their lifetimes. Each works entirely from the state
>>>>>>> of the DatasetGraph at the start of its life.
>>>>>>>
>>>>>>> 6) At most one transaction per thread, for now. Transactions are
>>>>>>> not thread-safe. These are simplifying assumptions that could be
>>>>>>> relaxed later.
>>>>>>>
>>>>>>> My current design operates as follows:
>>>>>>>
>>>>>>> At the start of a transaction, a fresh in-transaction reference
>>>>>>> is taken atomically from the AtomicReference that points to the
>>>>>>> index block. As operations are performed in the transaction,
>>>>>>> that in-transaction reference is progressed (in the sense in
>>>>>>> which any persistent datastructure is progressed) while the
>>>>>>> operations are recorded. Upon an abort, the in-transaction
>>>>>>> reference and the record are just thrown away. Upon a commit,
>>>>>>> the in-transaction reference is thrown away and the operation
>>>>>>> record is re-run against the main reference (the one that was
>>>>>>> copied at the beginning of the transaction). That re-run happens
>>>>>>> inside an atomic update (hence the use of AtomicReference). This
>>>>>>> all should avoid the need for explicit locking in Jena and
>>>>>>> should confine any blocking against the indexes to the actual
>>>>>>> duration of a commit.
>>>>>>>
>>>>>>> What do you guys think?
>>>>>>>
>>>>>>> ---
>>>>>>> A. Soroka
>>>>>>> The University of Virginia Library
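Since that design keeps coming up in this thread, here is a
stripped-down model of the commit scheme described above. Strings stand
in for quads, deletes are elided, and only pcollections' PSet is
assumed; it is an illustration of the idea, not the actual code:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

import org.pcollections.HashTreePSet;
import org.pcollections.PSet;

final class SnapshotStore {

    // The "main reference" to the persistent (hence immutable) state.
    private final AtomicReference<PSet<String>> master =
            new AtomicReference<>(HashTreePSet.empty());

    final class Transaction {
        // Snapshot isolation: reads come from the state at txn start.
        private PSet<String> view = master.get();
        private final List<String> record = new ArrayList<>();

        void add(String quad) {
            view = view.plus(quad);  // progresses the persistent structure
            record.add(quad);        // ...while the operation is recorded
        }

        boolean contains(String quad) {
            return view.contains(quad);
        }

        void commit() {
            // Re-run the record against the main reference, atomically.
            // updateAndGet may retry the replay if another commit races
            // in, confining blocking to the duration of the commit.
            master.updateAndGet(current -> {
                PSet<String> next = current;
                for (String quad : record)
                    next = next.plus(quad);
                return next;
            });
        }

        void abort() {
            // Just throw the in-transaction reference and record away.
            view = null;
            record.clear();
        }
    }

    Transaction begin() {
        return new Transaction();
    }
}

One consequence worth noting: because the replay is a pure function of
the record, retrying it under contention is safe, which is what lets
"all commits succeed" (point 3) without explicit locking.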
>>>>
>>>> --
>>>> I like: Like Like - The likeliest place on the web
>>>> <http://like-like.xenei.com>
>>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>
>> --
>> I like: Like Like - The likeliest place on the web
>> <http://like-like.xenei.com>
>> LinkedIn: http://www.linkedin.com/in/claudewarren
>
> --
> Paul Houle
>
> *Applying Schemas for Natural Language Processing, Distributed
> Systems, Classification and Text Mining and Data Lakes*
>
> (607) 539 6254 paul.houle on Skype [email protected]
>
> :BaseKB -- Query Freebase Data With SPARQL
> http://basekb.com/gold/
>
> Legal Entity Identifier Lookup
> https://legalentityidentifier.info/lei/lookup/
>
> Join our Data Lakes group on LinkedIn
> https://www.linkedin.com/grp/home?gid=8267275
