The thing about Bloom filters is that they are (i) highly dense, because
they are bit-packed, and (ii) accessed in sequential order, so that modern
processors can fetch ahead and hide RAM latency.
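
For anyone who hasn't looked inside one, here is a minimal sketch of the
bit-packed layout and the membership check (a toy class, not any
particular library's API):

    import java.util.BitSet;

    // Toy Bloom filter: k hash probes into a single bit-packed array.
    class BloomFilter {
        private final BitSet bits;
        private final int m;   // number of bits
        private final int k;   // number of hash functions

        BloomFilter(int m, int k) {
            this.bits = new BitSet(m);
            this.m = m;
            this.k = k;
        }

        void add(Object item) {
            for (int i = 0; i < k; i++) bits.set(hash(item, i));
        }

        // May report false positives, never false negatives.
        boolean mightContain(Object item) {
            for (int i = 0; i < k; i++)
                if (!bits.get(hash(item, i))) return false;
            return true;
        }

        private int hash(Object item, int seed) {
            return Math.floorMod(item.hashCode() * (31 * seed + 17), m);
        }
    }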

Most kinds of indexes, on the other hand, involve pointer chasing over an
area that does not fit in the cache, so your brand new processor performs
like a Pentium II.  Even if another algorithm has a better O() part, the
constant in front is awesome for the Bloom filter.

Probably a lot of codesign is possible if the index type is fixed, leading
to simpler and better-performing code, but it would be nice to have a
pluggable approach so that you can pick the structures that work for your
workload.  Also, from a research perspective, pluggability would let us ask
real questions about which structures win for which workloads.
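
For instance, pluggability might be no more than an interface along these
lines (a sketch only; QuadIndex is a made-up name, not an existing Jena
SPI, though Quad and Node are Jena's types):

    import java.util.stream.Stream;
    import org.apache.jena.graph.Node;
    import org.apache.jena.sparql.core.Quad;

    // Hypothetical pluggable-index contract: pick the implementation
    // (b-tree, hash, bloom-filter scan, ...) that fits the workload.
    interface QuadIndex {
        void add(Quad quad);
        boolean contains(Quad quad);
        // Any of g, s, p, o may be null, meaning a wildcard.
        Stream<Quad> find(Node g, Node s, Node p, Node o);
    }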

There are a lot of patents from around 2000 on systems that profile
queries and automatically build indexes.  For instance, Salesforce.com has
a patent, and they also have an implementation where the "master copy" of
the data is stored in an Oracle table that looks a lot like a triple
store; it then automatically creates materialized views and functional
indexes in Oracle to support client workloads.

(I remember writing a query against one of the larger Salesforce orgs and
being frustrated when it timed out, only to run the same query two minutes
later and have it work perfectly.)

Patents do expire, and it won't be long before this technology can make
its way into open source tools.

On Mon, Aug 31, 2015 at 10:04 AM, Claude Warren <[email protected]> wrote:

> Step 3 is relatively expensive if there are a lot of quads with the G and S
> values (e.g. *,G,*,S) but few with G,S,*,*.
>
> Step 3 is about removing the false positives from the bloom filter.  It
> does not require an index; it requires checking the values to ensure a match.
>
> There is a time/space trade-off here, but the time cost is not as much as
> one would think.  Keep in mind that a bloom filter check is very fast.  On
> the other hand, the traversal of a b-tree or other index is much more
> computationally complex.
>
> For example, even if you devise a mechanism for indexing bloom filters
> with a 2-segment compound key, where you check every key greater than the
> one you are looking for, it is often faster to do the linear search
> through the data, even when dealing with 1 million entries.  There are
> special cases where the index is faster, but for the most part the linear
> search is faster.
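>
> To make that concrete, here is a sketch of the linear search when each
> entry's filter fits in a single long (sizes and names are illustrative
> only, not code from any of our implementations):
>
>     import java.util.function.IntConsumer;
>
>     // Linear scan over bit-packed per-entry filters held in one
>     // contiguous array -- purely sequential access, so the hardware
>     // prefetcher can stream it.
>     static void scan(long[] filters, long probe, IntConsumer candidate) {
>         for (int i = 0; i < filters.length; i++) {
>             // superset test: every probe bit is set in the stored filter
>             if ((filters[i] & probe) == probe) {
>                 candidate.accept(i);  // still needs exact verification
>             }
>         }
>     }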
>
> Claude
>
> On Mon, Aug 31, 2015 at 1:53 PM, [email protected] <[email protected]>
> wrote:
>
> > I'm still a bit confused as to why you don't regard step 3 as being
> > potentially very expensive. In order to verify a match, we will have to
> > examine an "exact" index, and that (as Andy remarked) is likely to
> > require traversal, or else we throw away all the space gains.
> >
> > Is this technique a way to pay a lot of time for a lot of space savings?
> > Perhaps it is appropriate for an alternative implementation for very
> > large datasets?
> >
> > ---
> > A. Soroka
> > The University of Virginia Library
> >
> > On Aug 31, 2015, at 6:48 AM, Claude Warren <[email protected]> wrote:
> >
> > > To find (G,S,*,*) with bloom filters and return an iterator, you:
> > >
> > >   1. construct a bloom filter with G and S
> > >   2. scan the list of quads checking for matches.
> > >   3. for each result that matches, verify that it has G and S (I have
> > >      done this with an extended iterator in Jena)
> > >
> > > result is an iterator that returns all (G,S,*,*) quads.
> > >
> > > similar tests can be performed for any pattern -- same code used.
> > >
> > > Step 2 is the expensive one.  But the bloom filter check is so
> > > efficient that it becomes very difficult to perform search operations
> > > in less time than it takes to scan the list.
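> > >
> > > As a condensed sketch of the three steps (bloomBits and filterOf are
> > > assumed helpers that hash a node or quad into filter bits; this is
> > > not code from Jena):
> > >
> > >     // Steps 1-3: build the probe filter, scan, drop false positives.
> > >     Iterator<Quad> find(Node g, Node s) {
> > >         long probe = bloomBits(g) | bloomBits(s);            // step 1
> > >         return quads.stream()
> > >                 .filter(q -> (filterOf(q) & probe) == probe) // step 2
> > >                 .filter(q -> q.getGraph().equals(g)          // step 3
> > >                           && q.getSubject().equals(s))
> > >                 .iterator();
> > >     }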
> > >
> > > Claude
> > >
> > > On Mon, Aug 31, 2015 at 11:01 AM, Andy Seaborne <[email protected]>
> > > wrote:
> > >
> > >> On 29/08/15 14:55, Claude Warren wrote:
> > >>
> > >>> Something I have been thinking about....
> > >>>
> > >>> you could replace GSPO, GOPS, SPOG, OSGP, PGSO, OPSG with a single
> > >>> bloomfilter implementation.  It means a 2-step process to find
> > >>> matches but it might be fast enough and reduce the overhead
> > >>> significantly.
> > >>>
> > >>> I did an in-memory and a relational DB based version recently, but
> > >>> it was just a quick POC.
> > >>>
> > >>> Claude
> > >>>
> > >>
> > >> So we're talking about in-memory, where the items are Java classes.
> > >> A quad is 2 slots of Java object overhead + 4 slots for the G, S, P,
> > >> O pointers -- 6 words in all.  That's 48 bytes if the heap is >32G
> > >> and 24 bytes otherwise (compressed pointers or 32 bit).
> > >>
> > >> For storage, the key test is "contains" to maintain the "set of"
> > >> semantics.  Something to stop index traversal for each insert would
> > >> be great, but the quad is still stored, and storing it once rather
> > >> than up to 6 times would be good. (Note that most data is unique
> > >> quads.)
> > >>
> > >> The important retrieval operation is find(G,S,P,O) where any of
> > >> those can be a wildcard, returning (ideally as a stream) all quads
> > >> matching a given prefix.  The multiple indexes exist to find based
> > >> on prefix.
> > >>
> > >> How would that work for, say, find(G,S,*,*) with bloom filters and
> > >> 1B quads?  How does the code go from returning G,S,P1,O1 to the next
> > >> G,S,P1,O2 without trying every value for the O slot?
> > >>
> > >> For a hash map based hierarchical index G->S->P->O, it's O(1) to
> > >> find the start of the scan, then datastructure iteration.  A
> > >> hash-based index is not necessarily the best choice [*] but it's a
> > >> baseline to discuss.
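> > >>
> > >> Roughly this shape, as a baseline sketch (not proposed code):
> > >>
> > >>     // G -> S -> P -> Set<O>: O(1) hash hops to the start of the
> > >>     // matching region, then plain iteration.
> > >>     Map<Node, Map<Node, Map<Node, Set<Node>>>> gspo = new HashMap<>();
> > >>
> > >>     Stream<Quad> find(Node g, Node s) {
> > >>         return gspo.getOrDefault(g, Collections.emptyMap())
> > >>                 .getOrDefault(s, Collections.emptyMap())
> > >>                 .entrySet().stream()
> > >>                 .flatMap(e -> e.getValue().stream()
> > >>                         .map(o -> new Quad(g, s, e.getKey(), o)));
> > >>     }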
> > >>
> > >> And in memory, will a bloom filter-based system be faster?  Because
> > >> of false positives, isn't a definitive index still needed?  If one
> > >> is kept, not 6, there could be great space gains, but every quad
> > >> returned is a top-to-bottom traversal of that index (which is now
> > >> not a range index).
> > >>
> > >> The design should work for 1+ billion in-memory quads - that's the
> > >> way the world is going.
> > >>
> > >>> So each quad is reduced to a
> > >>> single bloom filter comprising 4 items (15-bytes).
> > >>>
> > >>
> > >>        Andy
> > >>
> > >> [*] even in memory, it might be worth allocating internal ids and
> > >> working in longs like a disk based system, because it is more
> > >> compact - naive hashmaps take a lot of space when storing small
> > >> items like quads.  tradeoffs, tradeoffs, ...
> > >>
> > >>
> > >>
> > >>
> > >>> On Wed, Aug 26, 2015 at 3:27 PM, A. Soroka <[email protected]>
> > >>> wrote:
> > >>>
> > >>>> Hey, folks--
> > >>>>
> > >>>> There hasn't been too much feedback on my proposal for a journaling
> > >>>> DatasetGraph:
> > >>>>
> > >>>> https://github.com/ajs6f/jena/tree/JournalingDatasetgraph
> > >>>>
> > >>>> which was and is to be a step towards JENA-624: Develop a new
> > >>>> in-memory RDF Dataset implementation. So I'm moving on to look at
> > >>>> the real problem: an in-memory DatasetGraph with high concurrency,
> > >>>> for use with modern hardware running many, many threads in large
> > >>>> core memory.
> > >>>>
> > >>>> I'm beginning to sketch out rough code, and I'd like to run some
> > >>>> design decisions past the list to get criticism/advice/horrified
> > >>>> warnings/whatever needs to be said.
> > >>>>
> > >>>> 1) All-transactional action: i.e. no non-transactional operation.
> > >>>> This is obviously a great thing for simplifying my work, but I
> > >>>> hope it won't be out of line with the expected uses for this stuff.
> > >>>>
> > >>>> 2) 6 covering indexes in the forms GSPO, GOPS, SPOG, OSGP, PGSO,
> > >>>> OPSG. I figure to play to the strength of in-core-memory
> > >>>> operation: raw speed, but obviously this is going to cost space.
> > >>>>
> > >>>> 3) At least for now, all commits succeed.
> > >>>>
> > >>>> 4) The use of persistent datastructures to avoid complex and
> > >>>> error-prone fine-grained locking regimes. I'm using
> > >>>> http://pcollections.org/ for now, but I am in no way committed to
> > >>>> it, nor do I claim to have thoroughly vetted it. It's simple but
> > >>>> enough to get started, and that's all I need to bring the real
> > >>>> design questions into focus.
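> > >>>>
> > >>>> For anyone who hasn't used it, the pcollections idiom in one tiny
> > >>>> sketch (PMap is its persistent map type):
> > >>>>
> > >>>>     import org.pcollections.HashTreePMap;
> > >>>>     import org.pcollections.PMap;
> > >>>>
> > >>>>     // "Updates" return a new map sharing structure with the old
> > >>>>     // one; readers of an old version are never disturbed.
> > >>>>     PMap<String, Integer> v1 = HashTreePMap.empty();
> > >>>>     PMap<String, Integer> v2 = v1.plus("a", 1);  // v1 unchanged
> > >>>>     PMap<String, Integer> v3 = v2.plus("b", 2);  // v2 unchanged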
> > >>>>
> > >>>> 5) Snapshot isolation. Transactions do not see commits that occur
> > >>>> during their lifetime. Each works entirely from the state of the
> > >>>> DatasetGraph at the start of its life.
> > >>>>
> > >>>> 6) At most one transaction per thread, for now. Transactions are
> > >>>> not thread-safe. These are simplifying assumptions that could be
> > >>>> relaxed later.
> > >>>>
> > >>>> My current design operates as follows:
> > >>>>
> > >>>> At the start of a transaction, a fresh in-transaction reference is
> > >>>> taken atomically from the AtomicReference that points to the index
> > >>>> block. As operations are performed in the transaction, that
> > >>>> in-transaction reference is progressed (in the sense in which any
> > >>>> persistent datastructure is progressed) while the operations are
> > >>>> recorded. Upon an abort, the in-transaction reference and the
> > >>>> record are just thrown away. Upon a commit, the in-transaction
> > >>>> reference is thrown away and the operation record is re-run
> > >>>> against the main reference (the one that was copied at the
> > >>>> beginning of the transaction). That rerun happens inside an atomic
> > >>>> update (hence the use of AtomicReference). This all should avoid
> > >>>> the need for explicit locking in Jena and should confine any
> > >>>> blocking against the indexes to the actual duration of a commit.
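> > >>>>
> > >>>> A condensed sketch of that scheme (String stands in for the real
> > >>>> index block; the recorded ops are pure, so a retried atomic update
> > >>>> can safely replay them):
> > >>>>
> > >>>>     import java.util.ArrayList;
> > >>>>     import java.util.List;
> > >>>>     import java.util.concurrent.atomic.AtomicReference;
> > >>>>     import java.util.function.UnaryOperator;
> > >>>>     import org.pcollections.HashTreePSet;
> > >>>>     import org.pcollections.PSet;
> > >>>>
> > >>>>     class Txn {
> > >>>>         static final AtomicReference<PSet<String>> master =
> > >>>>                 new AtomicReference<PSet<String>>(HashTreePSet.empty());
> > >>>>
> > >>>>         // snapshot taken atomically at transaction start
> > >>>>         private PSet<String> view = master.get();
> > >>>>         private final List<UnaryOperator<PSet<String>>> record =
> > >>>>                 new ArrayList<>();
> > >>>>
> > >>>>         void add(String quad) {
> > >>>>             view = view.plus(quad);        // progress the local view
> > >>>>             record.add(s -> s.plus(quad)); // remember the operation
> > >>>>         }
> > >>>>
> > >>>>         void commit() {  // replay the record in one atomic update
> > >>>>             master.updateAndGet(s -> {
> > >>>>                 for (UnaryOperator<PSet<String>> op : record)
> > >>>>                     s = op.apply(s);
> > >>>>                 return s;
> > >>>>             });
> > >>>>         }
> > >>>>         // abort: just drop 'view' and 'record'.
> > >>>>     }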
> > >>>>
> > >>>> What do you guys think?
> > >>>>
> > >>>>
> > >>>>
> > >>>> ---
> > >>>> A. Soroka
> > >>>> The University of Virginia Library
> > >>>>
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >>
> > >
> > >
> > > --
> > > I like: Like Like - The likeliest place on the web
> > > <http://like-like.xenei.com>
> > > LinkedIn: http://www.linkedin.com/in/claudewarren
> >
> >
>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>



-- 
Paul Houle

*Applying Schemas for Natural Language Processing, Distributed Systems,
Classification and Text Mining and Data Lakes*

(607) 539 6254    paul.houle on Skype   [email protected]

:BaseKB -- Query Freebase Data With SPARQL
http://basekb.com/gold/

Legal Entity Identifier Lookup
https://legalentityidentifier.info/lei/lookup/

Join our Data Lakes group on LinkedIn
https://www.linkedin.com/grp/home?gid=8267275
