On 29/08/15 14:55, Claude Warren wrote:
Something I have been thinking about....
you could replace GSPO, GOPS, SPOG, OSGP, PGSO, OPSG with a single
Bloom filter implementation. It means a two-step process to find matches,
but it might be fast enough and would reduce the overhead significantly.
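A minimal sketch of the shape it could take (illustrative names and
sizes, not my actual POC code): hash each quad's four terms into one
small per-quad filter; a pattern hashes only its bound slots; step one
is a bitwise subset test, step two checks the real quad to drop false
positives.

    import java.util.BitSet;

    // Illustration only: one small Bloom filter per quad.
    final class QuadBloom {
        static final int BITS = 120;   // 15 bytes per quad, purely illustrative
        static final int HASHES = 3;   // hash functions per term (an assumption)

        // Hash each non-null term into the filter; null marks a wildcard slot.
        static BitSet encode(String g, String s, String p, String o) {
            BitSet bits = new BitSet(BITS);
            for (String t : new String[] { g, s, p, o }) {
                if (t == null) continue;
                int h1 = t.hashCode();
                int h2 = Integer.reverse(h1) ^ 0x9E3779B9;  // cheap second hash
                for (int i = 0; i < HASHES; i++)
                    bits.set(Math.floorMod(h1 + i * h2, BITS));
            }
            return bits;
        }

        // Step 1 of the two-step match: are all the pattern's bits set in
        // the quad's filter? Step 2 (not shown) compares against the real
        // quad to weed out false positives.
        static boolean mightMatch(BitSet quadBits, BitSet patternBits) {
            BitSet missing = (BitSet) patternBits.clone();
            missing.andNot(quadBits);
            return missing.isEmpty();
        }

        public static void main(String[] args) {
            BitSet quad = encode("g1", "s1", "p1", "o1");
            BitSet pattern = encode("g1", "s1", null, null);  // find(G,S,*,*)
            System.out.println(mightMatch(quad, pattern));    // true
        }
    }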
I did an in-memory and a relational-DB-based version recently, but it was
just a quick POC.
Claude
So we're talking about in-memory, where the items are Java classes. A
quad is 2 slots of Java object overhead + 4 slots for the G, S, P, O
pointers. That's 48 bytes if the heap is >32G, and 24 bytes otherwise
(compressed pointers or 32-bit).
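To make the arithmetic concrete (Quad here is a stand-in, not Jena's
actual class, whose slots would be Node references):

    // Stand-in to make the slot count visible.
    final class Quad {
        final Object g, s, p, o;   // 4 pointer slots
        Quad(Object g, Object s, Object p, Object o) {
            this.g = g; this.s = s; this.p = p; this.o = o;
        }
    }
    // 2 header slots + 4 pointer slots = 6 slots per quad:
    //   8-byte slots (heap > 32G, no compressed oops): 6 * 8 = 48 bytes
    //   4-byte slots (compressed pointers or 32-bit):  6 * 4 = 24 bytes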
For storage, the key test is "contains", to maintain the "set of"
semantics. Something to avoid an index traversal for each insert would be
great, but each quad is still stored, and storing it once rather than up
to 6 times would be good. (Note that most data is unique quads.)
The important retrieval operation is find(G,S,P,O), where any of those
can be a wildcard, returning (ideally as a stream) all quads matching a
given prefix. The multiple indexes exist to support finding by different
prefixes.
How would that work for, say, find(G,S,*,*) with Bloom filters and 1
billion quads? How does the code go from returning G,S,P1,O1 to the next
quad G,S,P1,O2 without trying every value for the O slot?
For a hash-map-based hierarchical index G->S->P->O, it's O(1) to find
the start of the scan, then data-structure iteration from there. A
hash-based index is not necessarily the best choice [*], but it's a
baseline to discuss.
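As a toy version of that baseline (illustrative, String-keyed for
simplicity):

    import java.util.*;
    import java.util.stream.Stream;

    // Hierarchical G -> S -> P -> O index built from nested hash maps.
    final class NestedGSPO {
        private final Map<String, Map<String, Map<String, Set<String>>>> index =
                new HashMap<>();

        void add(String g, String s, String p, String o) {
            index.computeIfAbsent(g, k -> new HashMap<>())
                 .computeIfAbsent(s, k -> new HashMap<>())
                 .computeIfAbsent(p, k -> new HashSet<>())
                 .add(o);                 // Set.add gives "contains" for free
        }

        // find(G,S,*,*): O(1) hash lookups to reach the P map, then iteration.
        Stream<String[]> find(String g, String s) {
            Map<String, Set<String>> po = index
                    .getOrDefault(g, Collections.emptyMap())
                    .getOrDefault(s, Collections.emptyMap());
            return po.entrySet().stream()
                     .flatMap(e -> e.getValue().stream()
                                    .map(o -> new String[] { g, s, e.getKey(), o }));
        }
    }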
And in memory, will a Bloom-filter-based system be faster? Because of
false positives, isn't a definitive index still needed? If one is kept,
rather than 6, there could be great space gains, but then every quad
returned requires a top-to-bottom traversal of that index (which is no
longer a range index).
The design should work for 1+ billion in-memory quads - that's the way
the world is going.
So each quad is reduced to a single Bloom filter comprising 4 items
(15 bytes).
Andy
[*] Even in memory, it might be worth allocating internal ids and
working in longs, like a disk-based system, because it is more compact -
naive hashmaps take a lot of space when storing small items like
quads. Tradeoffs, tradeoffs, ...
On Wed, Aug 26, 2015 at 3:27 PM, A. Soroka <[email protected]> wrote:
Hey, folks--
There hasn't been too much feedback on my proposal for a journaling
DatasetGraph:
https://github.com/ajs6f/jena/tree/JournalingDatasetgraph
which was and is to be a step towards JENA-624: Develop a new in-memory
RDF Dataset implementation. So I'm moving on to look at the real problem:
an in-memory DatasetGraph with high concurrency, for use with modern
hardware running many, many threads in large core memory.
I'm beginning to sketch out rough code, and I'd like to run some design
decisions past the list to get criticism/advice/horrified warnings/whatever
needs to be said.
1) All-transactional action: i.e. no non-transactional operation. This is
obviously a great thing for simplifying my work, but I hope it won't be out
of line with the expected uses for this stuff.
2) 6 covering indexes in the forms GSPO, GOPS, SPOG, OSGP, PGSO, OPSG. I
figure to play to the strength of in-core-memory operation: raw speed, but
obviously this is going to cost space.
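To show why those six orders are enough: each of the 16 bound/wildcard
patterns is a prefix of at least one order, so a find dispatches to
exactly one index. A sketch of the dispatch (illustrative, not code
from my branch):

    // Each case comment names the prefix used in the chosen order.
    final class IndexChooser {
        enum IndexOrder { GSPO, GOPS, SPOG, OSGP, PGSO, OPSG }

        static IndexOrder choose(boolean g, boolean s, boolean p, boolean o) {
            int key = (g ? 8 : 0) | (s ? 4 : 0) | (p ? 2 : 0) | (o ? 1 : 0);
            switch (key) {
                case 0b0000: case 0b1000: case 0b1100:
                case 0b1110: case 0b1111: return IndexOrder.GSPO; // -, G, GS, GSP, GSPO
                case 0b1001: case 0b1011: return IndexOrder.GOPS; // GO, GOP
                case 0b0100: case 0b0110: case 0b0111:
                                          return IndexOrder.SPOG; // S, SP, SPO
                case 0b0001: case 0b0101: case 0b1101:
                                          return IndexOrder.OSGP; // O, OS, OSG
                case 0b0010: case 0b1010: return IndexOrder.PGSO; // P, PG
                case 0b0011:              return IndexOrder.OPSG; // OP
                default: throw new AssertionError("unreachable");
            }
        }
    }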
3) At least for now, all commits succeed.
4) The use of persistent data structures to avoid complex and error-prone
fine-grained locking regimes. I'm using http://pcollections.org/ for now,
but I am in no way committed to it nor do I claim to have thoroughly vetted
it. It's simple but enough to get started, and that's all I need to bring
the real design questions into focus.
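For anyone who hasn't used it, the pcollections idiom is roughly this
(toy example, not code from the branch): every "update" returns a new
map and leaves older versions intact, which is what lets readers go
without locks.

    import org.pcollections.HashTreePMap;
    import org.pcollections.PMap;

    public class PersistentMapDemo {
        public static void main(String[] args) {
            PMap<String, String> v1 = HashTreePMap.empty();
            PMap<String, String> v2 = v1.plus("s1", "o1"); // "update" = new map
            System.out.println(v1.containsKey("s1"));      // false: v1 unchanged
            System.out.println(v2.containsKey("s1"));      // true
        }
    }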
5) Snapshot isolation. Transactions do not see commits that occur during
their lifetime. Each works entirely from the state of the DatasetGraph at
the start of its life.
6) At most one transaction per thread, for now. Transactions are
not thread-safe. These are simplifying assumptions that could be relaxed
later.
My current design operates as follows:
At the start of a transaction, a fresh in-transaction reference is taken
atomically from the AtomicReference that points to the index block. As
operations are performed in the transaction, that in-transaction reference
is progressed (in the sense in which any persistent data structure is
progressed) while the operations are recorded. Upon an abort, the
in-transaction reference and the record are simply thrown away. Upon a
commit, the in-transaction reference is thrown away and the operation
record is re-run against the main reference (the one from which the
in-transaction reference was taken at the start of the transaction). That
rerun happens inside an atomic update (hence the use of AtomicReference).
This all should avoid the need for
explicit locking in Jena and should confine any blocking against the
indexes to the actual duration of a commit.
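In condensed form, the scheme looks something like this (names are
illustrative, not the actual code on the branch; Index stands for the
persistent, immutable index block):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicReference;
    import java.util.function.UnaryOperator;

    final class TxnDataset<Index> {
        private final AtomicReference<Index> committed;

        TxnDataset(Index initial) { committed = new AtomicReference<>(initial); }

        Transaction begin() { return new Transaction(); }

        final class Transaction {
            private Index view = committed.get();  // snapshot taken at begin
            private final List<UnaryOperator<Index>> log = new ArrayList<>();

            // Each operation (e.g. add/delete a quad) progresses the local
            // view and is recorded for replay at commit.
            void apply(UnaryOperator<Index> op) {
                view = op.apply(view);
                log.add(op);
            }

            Index read() { return view; }  // snapshot isolation: outside commits invisible

            void abort() { /* just drop the view and the log */ }

            // Re-run the record against the main reference inside an atomic
            // update; safe to retry on contention because the operations are
            // pure functions over a persistent structure.
            void commit() {
                committed.updateAndGet(base -> {
                    Index result = base;
                    for (UnaryOperator<Index> op : log) result = op.apply(result);
                    return result;
                });
            }
        }
    }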
What do you guys think?
---
A. Soroka
The University of Virginia Library