On 26/08/15 15:27, A. Soroka wrote:
Hey, folks--

There hasn't been too much feedback on my proposal for a journaling
DatasetGraph:

https://github.com/ajs6f/jena/tree/JournalingDatasetgraph

which was and is to be a step towards JENA-624: Develop a new
in-memory RDF Dataset implementation. So I'm moving on to look at the
real problem: an in-memory DatasetGraph with high concurrency, for
use with modern hardware running many, many threads in large core
memory.

I'm beginning to sketch out rough code, and I'd like to run some
design decisions past the list to get criticism/advice/horrified
warnings/whatever needs to be said.

1) All-transactional action: i.e. no non-transactional operation.
This is obviously a great thing for simplifying my work, but I hope
it won't be out of line with the expected uses for this stuff.

You could add an auto-commit feature so that any update made outside a transaction has a transaction wrapper applied automatically.
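A minimal sketch of that auto-commit idea, with illustrative names (TxnGraph, autoAdd — none of this is Jena API): an update issued outside a transaction gets a one-shot begin/commit wrapped around it, with abort on failure.

```java
import java.util.ArrayList;
import java.util.List;

class TxnGraph {
    private final List<String> triples = new ArrayList<>();
    private boolean inTransaction = false;

    void begin()  { inTransaction = true; }
    void commit() { inTransaction = false; }
    void abort()  { inTransaction = false; }

    // The raw update: legal only inside a transaction.
    void add(String triple) {
        if (!inTransaction)
            throw new IllegalStateException("update outside a transaction");
        triples.add(triple);
    }

    // Auto-commit entry point: wraps the update when no transaction is active.
    void autoAdd(String triple) {
        if (inTransaction) { add(triple); return; }
        begin();
        try {
            add(triple);
            commit();
        } catch (RuntimeException e) {
            abort();
            throw e;
        }
    }

    int size() { return triples.size(); }
}
```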

2) 6 covering indexes in the forms GSPO, GOPS, SPOG, OSGP, PGSO,
OPSG. I figure to play to the strength of in-core-memory operation:
raw speed, but obviously this is going to cost space.

There are choices :-) Especially in memory, the kind of datastructure can change on the way down, e.g. have a hash map for GS->PO and keep the PO pairs tightly packed (a few for each S) and scan them.

Is that going to be 6 separate pcollections datastructures, or all held in one datastructure (c.f. BerkeleyDB)?
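The GS->PO option mentioned above could look roughly like this (an assumed sketch, not Jena code): a hash map keyed on graph+subject, whose value is a tightly packed list of (predicate, object) pairs that is scanned linearly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GsPoIndex {
    // Key: graph + subject; value: packed (predicate, object) pairs.
    private final Map<String, List<String[]>> index = new HashMap<>();

    void add(String g, String s, String p, String o) {
        index.computeIfAbsent(g + " " + s, k -> new ArrayList<>())
             .add(new String[] { p, o });
    }

    // Find objects for (g, s, p): hash to the bucket, then scan the pairs.
    List<String> find(String g, String s, String p) {
        List<String> out = new ArrayList<>();
        for (String[] po : index.getOrDefault(g + " " + s, List.of())) {
            if (po[0].equals(p)) out.add(po[1]);
        }
        return out;
    }
}
```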


3) At least for now, all commits succeed.

4) The use of persistent datastructures to avoid complex and
error-prone fine-grained locking regimes. I'm using
http://pcollections.org/ for now, but I am in no way committed to it
nor do I claim to have thoroughly vetted it. It's simple but enough
to get started, and that's all I need to bring the real design
questions into focus.

Is a consequence that there is one truly active writer (and many readers)?

If there are multiple writers, then (1) system aborts will always be possible (conflicting updates) and (2) locking on datastructures is necessary ... or timestamps and vector clocks or some such.
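The optimistic variant of point (1) can be sketched with an AtomicReference (illustrative names, a java.util map copy standing in for a persistent structure): a writer whose compareAndSet fails has conflicted with a concurrent commit and takes a system abort (or retries).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

class OptimisticStore {
    private final AtomicReference<Map<String, String>> root =
        new AtomicReference<>(new HashMap<>());

    // Returns true on commit, false on a "system abort" (write conflict).
    boolean writeTxn(String key, String value) {
        Map<String, String> base = root.get();      // snapshot at start
        Map<String, String> updated = new HashMap<>(base);
        updated.put(key, value);                    // persistent-style copy
        // Fails iff another writer committed since we took the snapshot.
        return root.compareAndSet(base, updated);
    }

    Map<String, String> current() { return root.get(); }
}
```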

5) Snapshot isolation. Transactions do not see commits that occur
during their lifetime. Each works entirely from the state of the
DatasetGraph at the start of its life.

But they see their own updates presumably?
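The snapshot-isolation behaviour of point (5), including a transaction seeing its own updates, can be illustrated with a plain map copy standing in for a persistent structure (all names here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

class SnapshotDemo {
    final Map<String, String> shared = new HashMap<>();
    Map<String, String> txnView;

    // Capture the state of the world at the start of the transaction.
    void begin() { txnView = new HashMap<>(shared); }

    // A commit by some other transaction after begin(): invisible to this one.
    void concurrentCommit(String k, String v) { shared.put(k, v); }

    // This transaction's own update: visible to it.
    void txnAdd(String k, String v) { txnView.put(k, v); }
}
```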


6) Only as many as one transaction per thread, for now. Transactions
are not thread-safe. These are simplifying assumptions that could be
relaxed later.

TDB ended up there as well. There is, internally, a transaction object, but it's held in a ThreadLocal and fetched when needed. Otherwise a lot of interfaces need a "transaction" parameter, and it's hard to reuse other code that doesn't pass it through.

I have had a second go at transactions with TDB2. This module is an independent transaction system, unlike TDB1, where it's TDB1-specific.

https://github.com/afs/mantis/tree/master/dboe-transaction

It needs documentation for use on its own, but I have used it in another project to coordinate distributed transactions.


(dboe = database operating environment)

My current design operates as follows:

At the start of a transaction, a fresh in-transaction reference is
taken atomically from the AtomicReference that points to the index
block. As operations are performed in the transaction, that
in-transaction reference is progressed (in the sense in which any
persistent datastructure is progressed) while the operations are
recorded. Upon an abort, the in-transaction reference and the record
are just thrown away. Upon a commit, the in-transaction reference is
thrown away and the operation record is re-run against the main
reference (the one that is copied at the beginning of a transaction).
That rerun happens inside an atomic update (hence the use of
AtomicReference). This all should avoid the need for explicit locking
in Jena and should confine any blocking against the indexes to the
actual duration of a commit.
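The record-and-rerun design described above might be sketched as follows (illustrative names; a java.util map copy stands in for a pcollections persistent map): the transaction progresses an in-transaction reference while recording its operations, commit re-runs the record against the main AtomicReference inside an atomic update, and abort just discards both.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

class RecordingTxn {
    final AtomicReference<Map<String, String>> main;
    Map<String, String> view;   // the in-transaction reference
    final List<UnaryOperator<Map<String, String>>> record = new ArrayList<>();

    RecordingTxn(AtomicReference<Map<String, String>> main) {
        this.main = main;
        this.view = main.get(); // taken atomically at transaction start
    }

    void add(String k, String v) {
        UnaryOperator<Map<String, String>> op = m -> {
            Map<String, String> next = new HashMap<>(m); // "progress" a copy
            next.put(k, v);
            return next;
        };
        view = op.apply(view);  // progress the local view ...
        record.add(op);         // ... and record the operation
    }

    void commit() {             // re-run the record inside an atomic update
        main.updateAndGet(m -> {
            Map<String, String> next = m;
            for (UnaryOperator<Map<String, String>> op : record)
                next = op.apply(next);
            return next;
        });
    }

    void abort() {              // throw both away
        record.clear();
        view = main.get();
    }
}
```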

Looks good. I don't quite understand the need to record and rerun, though: isn't the power of pcollections that there can be old and new roots to the datastructures, so commit is a swap to the new one and abort is forgetting the new one?

TDB2 uses persistent datastructures (which in its case can be persisted on disk). When a transaction starts, it gets the reference to the root of the B+Tree.

When a write happens, the path from the place in the tree for the update up to the root is duplicated. There is a new root: the writer has the new root, all readers have the old root, so they have complete isolation. Readers can't reach newly written blocks.

A difference from pcollections, if I understand it correctly, is that blocks are only duplicated once and then updated in place. Blocks are only ever added to the block file above the write-transaction start water mark, so the test for whether a block may be updated in place is whether it is above or below that water mark.

Preparing the commit means: sync the file, don't write the new root to disk yet, but do write the id of the root block to the transaction journal. Repeat for all indexes, write the commit-end record {*}, sync. {*} is the true commit point. Afterwards, and on journal replay, the new root id is written to disk.

Abort is truncating the file back to the start of the write transaction. Nice feature: the bulk work of the writer is done during the transaction, and the OS can be writing to disk all the time as it sees fit. No bulk bytes going through the journal.

When you call pcollections' PMap::plus, it returns a new PMap reference, so it looks like the same idea.
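The swap-to-new-root idea can be shown with a stdlib stand-in (pcollections itself is not assumed here): an update yields a new root while the old root is untouched, so commit is "adopt the new root" and abort is "forget it".

```java
import java.util.HashMap;
import java.util.Map;

final class PersistentIsh {
    // plus-equivalent: returns a NEW map; the base map is never modified.
    static Map<String, String> plus(Map<String, String> base, String k, String v) {
        Map<String, String> next = new HashMap<>(base);
        next.put(k, v);
        return next;
    }
}
```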


You will probably need a short lock in two places but they are short and not going to affect throughput, only developer sanity.

When a transaction starts, the system needs to set it up, set any thread-local variables, and do any housekeeping. I found that getting a consistent set of all the internal state variables needs a lock. An ultra-minimal system might be able to get away with no lock, but if it has counters or any transaction-tracking datastructures, then there are multiple things to read. Paranoia about overlapping threads suggests needing a short period to take all the readings.

The second place is commit/abort, for the same reason, and also for writing the journal atomically (which you may not need).
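A sketch of that short lock at begin and commit/abort (illustrative names and state; nothing here is Jena or TDB code): a small synchronized section so the root, counters, and tracking state are read as one consistent set rather than as several racy reads.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

class TxnCoordinator {
    private final Object stateLock = new Object();
    private long epoch = 0;
    private int activeTxns = 0;
    private final AtomicReference<Map<String, String>> root =
        new AtomicReference<>(new HashMap<>());

    // All begin-time housekeeping under one brief lock: a consistent
    // (root, epoch, active-txn-count) triple, not three independent reads.
    Object[] begin() {
        synchronized (stateLock) {
            activeTxns++;
            return new Object[] { root.get(), epoch, activeTxns };
        }
    }

    // Commit/abort takes the lock briefly for the same reason (and would
    // write the journal here, if there were one).
    void end() {
        synchronized (stateLock) {
            activeTxns--;
            epoch++;
        }
    }
}
```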

Hope that helps,

        Andy


What do you guys think?



--- A. Soroka The University of Virginia Library

