On 29/08/15 14:55, Claude Warren wrote:
Something I have been thinking about....
you could replace GSPO, GOPS, SPOG, OSGP, PGSO, OPSG with a single
Bloom filter implementation. It means a two-step process to find matches,
but it might be fast enough and would reduce the overhead significantly.
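A minimal sketch of the shape it could take (illustrative names and
sizes, not my actual POC code): hash each quad's four terms into one
small per-quad filter; a pattern hashes only its bound slots; step one
is a bitwise subset test, step two checks the real quad to drop false
positives.

    import java.util.BitSet;

    // Illustration only: one small Bloom filter per quad.
    final class QuadBloom {
        static final int BITS = 120;   // 15 bytes per quad, purely illustrative
        static final int HASHES = 3;   // hash functions per term (an assumption)

        // Hash each non-null term into the filter; null marks a wildcard slot.
        static BitSet encode(String g, String s, String p, String o) {
            BitSet bits = new BitSet(BITS);
            for (String t : new String[] { g, s, p, o }) {
                if (t == null) continue;
                int h1 = t.hashCode();
                int h2 = Integer.reverse(h1) ^ 0x9E3779B9;  // cheap second hash
                for (int i = 0; i < HASHES; i++)
                    bits.set(Math.floorMod(h1 + i * h2, BITS));
            }
            return bits;
        }

        // Step 1 of the two-step match: are all the pattern's bits set in
        // the quad's filter? Step 2 (not shown) compares against the real
        // quad to weed out false positives.
        static boolean mightMatch(BitSet quadBits, BitSet patternBits) {
            BitSet missing = (BitSet) patternBits.clone();
            missing.andNot(quadBits);
            return missing.isEmpty();
        }

        public static void main(String[] args) {
            BitSet quad = encode("g1", "s1", "p1", "o1");
            BitSet pattern = encode("g1", "s1", null, null);  // find(G,S,*,*)
            System.out.println(mightMatch(quad, pattern));    // true
        }
    }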
I did an in-memory and a relational-DB-based version recently, but it was
just a quick POC.
Claude
So we're talking about in-memory, where the items are Java classes. A
quad is 2 slots of Java object overhead + 4 slots for the G, S, P, O
pointers. That's 48 bytes if the heap is >32G, and 24 bytes otherwise
(compressed pointers or 32-bit).
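To make the arithmetic concrete (Quad here is a stand-in, not Jena's
actual class, whose slots would be Node references):

    // Stand-in to make the slot count visible.
    final class Quad {
        final Object g, s, p, o;   // 4 pointer slots
        Quad(Object g, Object s, Object p, Object o) {
            this.g = g; this.s = s; this.p = p; this.o = o;
        }
    }
    // 2 header slots + 4 pointer slots = 6 slots per quad:
    //   8-byte slots (heap > 32G, no compressed oops): 6 * 8 = 48 bytes
    //   4-byte slots (compressed pointers or 32-bit):  6 * 4 = 24 bytes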
For storage, the key test is "contains", to maintain the "set of"
semantics. Something to avoid an index traversal for each insert would be
great, but each quad is still stored, and storing it once rather than up
to 6 times would be good. (Note that most data is unique quads.)
The important retrieval operation is find(G,S,P,O), where any of those
can be a wildcard, returning (ideally as a stream) all quads matching a
given prefix. The multiple indexes exist to support finding by different
prefixes.
How would that work for, say, find(G,S,*,*) with Bloom filters and 1
billion quads? How does the code go from returning G,S,P1,O1 to the next
quad G,S,P1,O2 without trying every value for the O slot?
For a hash-map-based hierarchical index G->S->P->O, it's O(1) to find
the start of the scan, then data-structure iteration from there. A
hash-based index is not necessarily the best choice [*], but it's a
baseline to discuss.
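As a toy version of that baseline (illustrative, String-keyed for
simplicity):

    import java.util.*;
    import java.util.stream.Stream;

    // Hierarchical G -> S -> P -> O index built from nested hash maps.
    final class NestedGSPO {
        private final Map<String, Map<String, Map<String, Set<String>>>> index =
                new HashMap<>();

        void add(String g, String s, String p, String o) {
            index.computeIfAbsent(g, k -> new HashMap<>())
                 .computeIfAbsent(s, k -> new HashMap<>())
                 .computeIfAbsent(p, k -> new HashSet<>())
                 .add(o);                 // Set.add gives "contains" for free
        }

        // find(G,S,*,*): O(1) hash lookups to reach the P map, then iteration.
        Stream<String[]> find(String g, String s) {
            Map<String, Set<String>> po = index
                    .getOrDefault(g, Collections.emptyMap())
                    .getOrDefault(s, Collections.emptyMap());
            return po.entrySet().stream()
                     .flatMap(e -> e.getValue().stream()
                                    .map(o -> new String[] { g, s, e.getKey(), o }));
        }
    }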
And in memory, will a Bloom-filter-based system be faster? Because of
false positives, isn't a definitive index still needed? If one is kept,
rather than 6, there could be great space gains, but then every quad
returned requires a top-to-bottom traversal of that index (which is no
longer a range index).
The design should work for 1+ billion in-memory quads - that's the way
the world is going.
So each quad is reduced to a single Bloom filter comprising 4 items
(15 bytes).
Andy
[*] Even in memory, it might be worth allocating internal ids and
working in longs, like a disk-based system, because it is more compact -
naive hashmaps take a lot of space when storing small items like
quads. Tradeoffs, tradeoffs, ...
On Wed, Aug 26, 2015 at 3:27 PM, A. Soroka <[email protected]> wrote:
Hey, folks--
There hasn't been too much feedback on my proposal for a journaling
DatasetGraph:
https://github.com/ajs6f/jena/tree/JournalingDatasetgraph
which was and is to be a step towards JENA-624: Develop a new in-memory
RDF Dataset implementation. So I'm moving on to look at the real problem:
an in-memory DatasetGraph with high concurrency, for use with modern
hardware running many, many threads in large core memory.
I'm beginning to sketch out rough code, and I'd like to run some design
decisions past the list to get criticism/advice/horrified warnings/whatever
needs to be said.
1) All-transactional action: i.e. no non-transactional operation. This is
obviously a great thing for simplifying my work, but I hope it won't be out
of line with the expected uses for this stuff.
2) 6 covering indexes in the forms GSPO, GOPS, SPOG, OSGP, PGSO, OPSG. I
figure to play to the strength of in-core-memory operation: raw speed, but
obviously this is going to cost space.
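To show why those six orders are enough: each of the 16 bound/wildcard
patterns is a prefix of at least one order, so a find dispatches to
exactly one index. A sketch of the dispatch (illustrative, not code
from my branch):

    // Each case comment names the prefix used in the chosen order.
    final class IndexChooser {
        enum IndexOrder { GSPO, GOPS, SPOG, OSGP, PGSO, OPSG }

        static IndexOrder choose(boolean g, boolean s, boolean p, boolean o) {
            int key = (g ? 8 : 0) | (s ? 4 : 0) | (p ? 2 : 0) | (o ? 1 : 0);
            switch (key) {
                case 0b0000: case 0b1000: case 0b1100:
                case 0b1110: case 0b1111: return IndexOrder.GSPO; // -, G, GS, GSP, GSPO
                case 0b1001: case 0b1011: return IndexOrder.GOPS; // GO, GOP
                case 0b0100: case 0b0110: case 0b0111:
                                          return IndexOrder.SPOG; // S, SP, SPO
                case 0b0001: case 0b0101: case 0b1101:
                                          return IndexOrder.OSGP; // O, OS, OSG
                case 0b0010: case 0b1010: return IndexOrder.PGSO; // P, PG
                case 0b0011:              return IndexOrder.OPSG; // OP
                default: throw new AssertionError("unreachable");
            }
        }
    }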
3) At least for now, all commits succeed.
4) The use of persistent data structures to avoid complex and error-prone
fine-grained locking regimes. I'm using http://pcollections.org/ for now,
but I am in no way committed to it nor do I claim to have thoroughly vetted
it. It's simple but enough to get started, and that's all I need to bring
the real design questions into focus.
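For anyone who hasn't used it, the pcollections idiom is roughly this
(toy example, not code from the branch): every "update" returns a new
map and leaves older versions intact, which is what lets readers go
without locks.

    import org.pcollections.HashTreePMap;
    import org.pcollections.PMap;

    public class PersistentMapDemo {
        public static void main(String[] args) {
            PMap<String, String> v1 = HashTreePMap.empty();
            PMap<String, String> v2 = v1.plus("s1", "o1"); // "update" = new map
            System.out.println(v1.containsKey("s1"));      // false: v1 unchanged
            System.out.println(v2.containsKey("s1"));      // true
        }
    }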
5) Snapshot isolation. Transactions do not see commits that occur during
their lifetime. Each works entirely from the state of the DatasetGraph at
the start of its life.
6) At most one transaction per thread, for now. Transactions are
not thread-safe. These are simplifying assumptions that could be relaxed
later.
My current design operates as follows:
At the start of a transaction, a fresh in-transaction reference is taken
atomically from the AtomicReference that points to the index block. As
operations are performed in the transaction, that in-transaction reference
is progressed (in the sense in which any persistent data structure is
progressed) while the operations are recorded. Upon an abort, the
in-transaction reference and the record are simply thrown away. Upon a
commit, the in-transaction reference is thrown away and the operation
record is re-run against the main reference (the one from which the
in-transaction reference was taken at the start of the transaction). That
rerun happens inside an atomic update (hence the use of AtomicReference).
This all should avoid the need for
explicit locking in Jena and should confine any blocking against the
indexes to the actual duration of a commit.
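In condensed form, the scheme looks something like this (names are
illustrative, not the actual code on the branch; Index stands for the
persistent, immutable index block):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicReference;
    import java.util.function.UnaryOperator;

    final class TxnDataset<Index> {
        private final AtomicReference<Index> committed;

        TxnDataset(Index initial) { committed = new AtomicReference<>(initial); }

        Transaction begin() { return new Transaction(); }

        final class Transaction {
            private Index view = committed.get();  // snapshot taken at begin
            private final List<UnaryOperator<Index>> log = new ArrayList<>();

            // Each operation (e.g. add/delete a quad) progresses the local
            // view and is recorded for replay at commit.
            void apply(UnaryOperator<Index> op) {
                view = op.apply(view);
                log.add(op);
            }

            Index read() { return view; }  // snapshot isolation: outside commits invisible

            void abort() { /* just drop the view and the log */ }

            // Re-run the record against the main reference inside an atomic
            // update; safe to retry on contention because the operations are
            // pure functions over a persistent structure.
            void commit() {
                committed.updateAndGet(base -> {
                    Index result = base;
                    for (UnaryOperator<Index> op : log) result = op.apply(result);
                    return result;
                });
            }
        }
    }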
What do you guys think?
---
A. Soroka
The University of Virginia Library