Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

[email protected] Sat, 29 Aug 2015 11:56:59 -0700

Thanks for the feedback!

I can see how one Bloom filter could be used with an accompanying structure to 
replace one of the indexes, but I don't quite see how one could replace all of 
them-- can you elaborate?
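
To make the "one Bloom filter plus an accompanying structure" case concrete, here is a minimal sketch of a two-step lookup. It is only one possible reading, not Claude's POC and not Jena API: the class name BloomQuadStore, the 64-bit per-quad signature, the hash scheme, and the use of plain strings for terms are all assumptions made for illustration. Step 1 uses the signatures to discard quads that cannot match; step 2 checks the survivors exactly, because a Bloom filter can yield false positives.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: every quad carries a small Bloom signature built from
// its four terms, each tagged with its position (0=G, 1=S, 2=P, 3=O).
final class BloomQuadStore {
    private static final int HASHES = 3;                        // bits set per term (assumed)
    private final List<String[]> quads = new ArrayList<>();     // the accompanying structure
    private final List<Long> signatures = new ArrayList<>();    // one signature per quad

    // Bits contributed by one position-tagged term, via simple double hashing.
    private static long bits(int position, String term) {
        int h1 = (position + ":" + term).hashCode();
        int h2 = (h1 >>> 16) | 1;
        long sig = 0L;
        for (int i = 0; i < HASHES; i++) {
            sig |= 1L << Math.floorMod(h1 + i * h2, 64);
        }
        return sig;
    }

    // Signature of a quad or of a pattern; null slots (wildcards) add no bits.
    private static long signature(String[] terms) {
        long sig = 0L;
        for (int pos = 0; pos < 4; pos++) {
            if (terms[pos] != null) sig |= bits(pos, terms[pos]);
        }
        return sig;
    }

    void add(String g, String s, String p, String o) {
        String[] quad = {g, s, p, o};
        quads.add(quad);
        signatures.add(signature(quad));
    }

    // Find quads matching a pattern; null means "any term" in that slot.
    List<String[]> find(String g, String s, String p, String o) {
        String[] pattern = {g, s, p, o};
        long want = signature(pattern);
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < quads.size(); i++) {
            // Step 1: if any wanted bit is missing, the quad cannot match.
            if ((signatures.get(i) & want) != want) continue;
            // Step 2: exact check removes Bloom false positives.
            if (matches(pattern, quads.get(i))) out.add(quads.get(i));
        }
        return out;
    }

    private static boolean matches(String[] pattern, String[] quad) {
        for (int pos = 0; pos < 4; pos++) {
            if (pattern[pos] != null && !pattern[pos].equals(quad[pos])) return false;
        }
        return true;
    }
}
```

Note that this sketch scans every signature, so it trades the ordered access of a covering index for space. Whether that trade is acceptable, and how it would extend to replacing all six indexes, is exactly the open question.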


---
A. Soroka
The University of Virginia Library

On Aug 29, 2015, at 9:55 AM, Claude Warren <[email protected]> wrote:

> Something I have been thinking about....
> 
> You could replace GSPO, GOPS, SPOG, OSGP, PGSO, and OPSG with a single
> Bloom filter implementation. It means a two-step process to find matches,
> but it might be fast enough and would reduce the overhead significantly.
> 
> I did an in-memory and a relational DB based version recently, but it was
> just a quick POC.
> 
> Claude
> 
> On Wed, Aug 26, 2015 at 3:27 PM, A. Soroka <[email protected]> wrote:
> 
>> Hey, folks--
>> 
>> There hasn't been too much feedback on my proposal for a journaling
>> DatasetGraph:
>> 
>> https://github.com/ajs6f/jena/tree/JournalingDatasetgraph
>> 
>> which was and is to be a step towards JENA-624: Develop a new in-memory
>> RDF Dataset implementation. So I'm moving on to look at the real problem:
>> an in-memory DatasetGraph with high concurrency, for use with modern
>> hardware running many, many threads in large core memory.
>> 
>> I'm beginning to sketch out rough code, and I'd like to run some design
>> decisions past the list to get criticism/advice/horrified warnings/whatever
>> needs to be said.
>> 
>> 1) All-transactional action: i.e. no non-transactional operation. This is
>> obviously a great thing for simplifying my work, but I hope it won't be out
>> of line with the expected uses for this stuff.
>> 
>> 2) 6 covering indexes in the forms GSPO, GOPS, SPOG, OSGP, PGSO, OPSG. I
>> figure to play to the strength of in-core-memory operation: raw speed, but
>> obviously this is going to cost space.
>> 
>> 3) At least for now, all commits succeed.
>> 
>> 4) The use of persistent data structures to avoid complex and error-prone
>> fine-grained locking regimes. I'm using http://pcollections.org/ for now,
>> but I am in no way committed to it nor do I claim to have thoroughly vetted
>> it. It's simple but enough to get started, and that's all I need to bring
>> the real design questions into focus.
>> 
>> 5) Snapshot isolation. Transactions do not see commits that occur during
>> their lifetime. Each works entirely from the state of the DatasetGraph at
>> the start of its life.
>> 
>> 6) At most one transaction per thread, for now. Transactions are
>> not thread-safe. These are simplifying assumptions that could be relaxed
>> later.
>> 
>> My current design operates as follows:
>> 
>> At the start of a transaction, a fresh in-transaction reference is taken
>> atomically from the AtomicReference that points to the index block. As
>> operations are performed in the transaction, that in-transaction reference
>> is progressed (in the sense in which any persistent data structure is
>> progressed) while the operations are recorded. Upon an abort, the
>> in-transaction reference and the record are just thrown away. Upon a
>> commit, the in-transaction reference is thrown away and the operation
>> record is re-run against the main reference (the one that is copied at the
>> beginning of a transaction). That rerun happens inside an atomic update
>> (hence the use of AtomicReference). This all should avoid the need for
>> explicit locking in Jena and should confine any blocking against the
>> indexes to the actual duration of a commit.
>> 
>> What do you guys think?
>> 
>> 
>> 
>> ---
>> A. Soroka
>> The University of Virginia Library
>> 
>> 
> 
> 
> -- 
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
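
The transaction design described in the quoted proposal (snapshot taken from an AtomicReference, operations recorded, record replayed against the main reference on commit) can be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: the names Store and Txn are hypothetical, quads are plain strings, and an immutable copied Set stands in for a real persistent structure such as those in pcollections, which would share structure between versions instead of copying.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for the dataset: the "main reference" points at an
// immutable set of quads (a single index block, for brevity).
final class Store {
    final AtomicReference<Set<String>> main =
            new AtomicReference<>(Collections.<String>emptySet());

    // A transaction starts from whatever state the main reference holds now.
    Txn begin() { return new Txn(this, main.get()); }
}

final class Txn {
    private final Store store;
    private Set<String> local;                                   // in-transaction reference
    private final List<UnaryOperator<Set<String>>> log = new ArrayList<>();  // operation record

    Txn(Store store, Set<String> snapshot) {
        this.store = store;
        this.local = snapshot;
    }

    // Copy-on-write helpers; a persistent collection would avoid the full copy.
    private static Set<String> with(Set<String> s, String quad) {
        Set<String> copy = new HashSet<>(s);
        copy.add(quad);
        return Collections.unmodifiableSet(copy);
    }

    private static Set<String> without(Set<String> s, String quad) {
        Set<String> copy = new HashSet<>(s);
        copy.remove(quad);
        return Collections.unmodifiableSet(copy);
    }

    void add(String quad)    { apply(s -> with(s, quad)); }
    void delete(String quad) { apply(s -> without(s, quad)); }

    // Progress the in-transaction reference and record the operation.
    private void apply(UnaryOperator<Set<String>> op) {
        local = op.apply(local);
        log.add(op);
    }

    // Reads see the snapshot plus this transaction's own writes only.
    boolean contains(String quad) { return local.contains(quad); }

    // Abort: throw away the in-transaction state and the record.
    void abort() {
        log.clear();
        local = Collections.<String>emptySet();
    }

    // Commit: replay the record against the main reference in one atomic update.
    // updateAndGet may re-apply the function under contention, so the recorded
    // operations must be free of side effects (they are: each returns a new set).
    void commit() {
        store.main.updateAndGet(current -> {
            Set<String> next = current;
            for (UnaryOperator<Set<String>> op : log) next = op.apply(next);
            return next;
        });
    }
}
```

In this sketch every commit succeeds, as in point 3 of the proposal: two concurrent transactions that both add quads will both have their operations replayed onto the main reference, and neither sees the other's writes while it is running.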
