Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Claude Warren Sat, 29 Aug 2015 16:18:52 -0700

Well to start with create one bloom filter for each quad.  Using
http://hur.st/bloomfilter?n=4&p=1.0E-6 you can see that 4 items with a 1 on
1 million collision rate requires 15 bytes.  So each bloom filter is
15bytes.


When you go to search you put the known quad values into a bloom filter (so
the SP or SPO or GSP or whatever) and then search for matching filters.
When you find the matches you then have to check that they actually match
as the bloom filter may be a false positive and also because the bloom
filter does not preserve positional information.

There are several issues that need to be resolved:


   1. Quickly searching a large collection of bloom filters.  While I have
   solutions I am not at liberty to say more until IBM (my employers) decides
   I can publish my solutions.  I will tell you that there are a number of
   solutions that do not work in that they are not faster than a linear
   search.  The linear search is a valid solution and fairly efficient.  You
   might use multiple buckets each with a bloom filter of all the nodes
   contained within it, however those bloom filters should  be size
   appropriately for the number of objects they are to contain.  Otherwise
   they will yield massive false positive rates.
   2. There is a possibility of preserving the positional information
   within the bloom filter.  As with the above I have to leave that for now.
   3. Quick calculation of the bloom filter.  There is code in
org.apache.cassandra.utils.BloomFilter
   that has a very fast bloom filter calculation.

I hope this helps.

Claude

On Sat, Aug 29, 2015 at 7:55 PM, [email protected] <[email protected]>
wrote:

> Thanks for the feedback!
>
> I can see how one Bloom filter could be used with an accompanying
> structure to replace one of the indexes, but I don't quite see how one
> could replace all of them-- can you elaborate?
>
> ---
> A. Soroka
> The University of Virginia Library
>
> On Aug 29, 2015, at 9:55 AM, Claude Warren <[email protected]> wrote:
>
> > Something I have been thinking about....
> >
> > you could replace  GSPO, GOPS, SPOG, OSGP, PGSO, OPSG. with a single
> > bloomfilter implementation.  It means a 2 step process to find matches
> but
> > it might be fast enough and reduce the overhead significantly.
> >
> > I did an in-memory and a relational DB based version recently, but it was
> > just a quick POC.
> >
> > Claude
> >
> > On Wed, Aug 26, 2015 at 3:27 PM, A. Soroka <[email protected]> wrote:
> >
> >> Hey, folks--
> >>
> >> There hasn't been too much feedback on my proposal for a journaling
> >> DatasetGraph:
> >>
> >> https://github.com/ajs6f/jena/tree/JournalingDatasetgraph
> >>
> >> which was and is to be a step towards JENA-624: Develop a new in-memory
> >> RDF Dataset implementation. So I'm moving on to look at the real
> problem:
> >> an in-memory  DatasetGraph with high concurrency, for use with modern
> >> hardware running many, many threads in large core memory.
> >>
> >> I'm beginning to sketch out rough code, and I'd like to run some design
> >> decisions past the list to get criticism/advice/horrified
> warnings/whatever
> >> needs to be said.
> >>
> >> 1) All-transactional action: i.e. no non-transactional operation. This
> is
> >> obviously a great thing for simplifying my work, but I hope it won't be
> out
> >> of line with the expected uses for this stuff.
> >>
> >> 2) 6 covering indexes in the forms GSPO, GOPS, SPOG, OSGP, PGSO, OPSG. I
> >> figure to play to the strength of in-core-memory operation: raw speed,
> but
> >> obviously this is going to cost space.
> >>
> >> 3) At least for now, all commits succeed.
> >>
> >> 4) The use of persistent datastructures to avoid complex and error-prone
> >> fine-grained locking regimes. I'm using http://pcollections.org/ for
> now,
> >> but I am in no way committed to it nor do I claim to have thoroughly
> vetted
> >> it. It's simple but enough to get started, and that's all I need to
> bring
> >> the real design questions into focus.
> >>
> >> 5) Snapshot isolation. Transactions do not see commits that occur during
> >> their lifetime. Each works entirely from the state of the DatasetGraph
> at
> >> the start of its life.
> >>
> >> 6) Only as many as one transaction per thread, for now. Transactions are
> >> not thread-safe. These are simplifying assumptions that could be relaxed
> >> later.
> >>
> >> My current design operates as follows:
> >>
> >> At the start of a transaction, a fresh in-transaction reference is taken
> >> atomically from the AtomicReference that points to the index block. As
> >> operations are performed in the transaction, that in-transaction
> reference
> >> is progressed (in the sense in which any persistent datastructure is
> >> progressed) while the operations are recorded. Upon an abort, the
> >> in-transaction reference and the record are just thrown away. Upon a
> >> commit, the in-transaction reference is thrown away and the operation
> >> record is re-run against the main reference (the one that is copied at
> the
> >> beginning of a transaction). That rerun happens inside an atomic update
> >> (hence the use of AtomicReference). This all should avoid the need for
> >> explicit locking in Jena and should confine any blocking against the
> >> indexes to the actual duration of a commit.
> >>
> >> What do you guys think?
> >>
> >>
> >>
> >> ---
> >> A. Soroka
> >> The University of Virginia Library
> >>
> >>
> >
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
>
>


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Reply via email to