Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Andy Seaborne Fri, 25 Sep 2015 04:49:41 -0700

On 24/09/15 18:46, A. Soroka wrote:

Thanks for the feedback, Andy! In order of your points:


1) Okay, I will get on those transaction-related problems pronto. None of them 
seem like a big deal to fix, and you’ve already given me the main point (finer 
lifecycle granularity) and pointers to the Mantis material.


Mantis does not have internal documentation so feel free to ask questions.

The dboe-* modules are a rewrite of TDB for MVCC and putting modulestructure in.

And it would solve the large-delete problems that come up on users@.There is a companion "large add problem" where at the moment people haveto resort to the bulk loader.

In TDB2, doing 100m loads into a live Fuseki database is no big deal andit's a write-once update, not the logged two-write of TDB1.


It's usable at the moment if you don't mind the disk space growing.

What is missing is an async compaction process to reduce a live databasewithout needing to go off line.

2) Yeah, I can see that. I’ll switch it to QuadTable for now, pending further 
suggestions. You know, I think I actually called that “QuadIndex” in an earlier 
version!

3) It didn’t even occur to me to manage prefixes (prefixen?). I didn’t realize 
that was a Dataset responsibility, although that certainly makes sense. Well, 
that’s why we don’t make assumptions! {grin} I’ll fix that based on 
DatasetPrefixStorage.


It's not an absolute requirement but it is nice to do it.


4) That sounds great. First step: I’ll get these problems squared away and 
hopefully other people will glance over and find other things to fix.

Thanks again!


        Andy


---
A. Soroka
The University of Virginia Library

On Sep 24, 2015, at 1:05 PM, Andy Seaborne <[email protected]> wrote:

On 23/09/15 17:44, A. Soroka wrote:

Following up on this conversation, I have now have a branch available
at:

https://github.com/ajs6f/jena/tree/jena-624

with a six-way-map-based version of this, advancing from (but not
directly using) the journaling branch already discussed. (Of course I
can separate these if so desired.)

There is an Index type which hopefully is a start towards abstracting
out index behavior, as Paul Houle suggested. There have already been
interesting suggestions made by Andy and Claude about possible
implementations that are more sophisticated than my simple approach,
so I just hope that this branch will get the ball rolling.

Comments/advice/criticism eagerly desired!

--- A. Soroka The University of Virginia Library


Hi - I pulled the code down and had a partial look.

And it looks very good.

(As you probably know by now, having me (test/demo/use/) anywhere near new code 
is very risky but I broke very little ... including TDB.)

0/ Basic read/write of files into the dataset worked as expected. :-)

1/
I tried with existing test harnesses:
(see below for code)

AbstractDatasetGraphTests
  Green line!

AbstractTestTransaction
  Red line.

This has tests for various error conditions.

The begin-begin cases: your code throws JenaException.  There is a 
sub-exception JenaTransactionException and that what the tests look for.

The rest are testing that e.g. begin-abort-commit is an error.  It looks like the 
transaction lifecycle is not being tracked.  It needs finer granularity than 
DatasetGraphInMemory.isInTransaction.  May be as simple as NOT_TXN -> ACTIVE_TXN -> 
FINISHING_TXN (-> NOT_TXN). Just in/out isn't enough for, e.g. 
begin-commit-add_quad->end , begin-commit-commit.  I managed to add after commit.

There are some presumable related things here: A writer that .end() without 
.commit or .abort doesn't indicate an error. It probably should. And be added 
to the AbstractTestTransaction tests.

And plain commit (no begin) isn't caught as wrong.

  Dataset ds = DatasetFactory.create(new DatasetGraphInMemory()) ;
  ds.commit() ;

Ditto .abort.

Stray .end()s are probably reasonable - "when in doubt call end()" - so 
multiple ends() on a transaction, which in effect is end on a non-transaction, is good to 
allow.

(I'll add some tests to AbstractTestTransaction for these cases)

2/
I found the use of the name "Index" a odd.  Usually an index (in database speak) is a 
specific lookup pattern.  S->P->O->G and needing a prefix of that for partials.

Hence, in SQL and NoSQL, having multiple indexes per table, one primary, and 0-N secondary.  The 
use in your code is more like the whole "table" (in individual components are all 
covering solike all RDF subsystems no need to have a distinguished "table"

org.apache.jena.tdb.index.Index
org.apache.jena.tdb.index.RangeIndex

Is this really something like "QuadTable"? "QuadStorage"?

I am encountering a similar split between the storage and provision of the 
interface in TDB2.  There, I want to be able to swap the storage on the fly to 
give parallel storage a compaction option to a running database.  Being 
on-disk, there isn't a GC to manage the multi-version datastructures.

3/
No prefixes on the dataset?  I only got them to work with getDefaultModel etc.

TDB uses DatasetPrefixStorage for managing prefixes and then GraphTDB
Some of DatasetPrefixesTDB could be extracted for common use.

leading to...
4/
We should look for commonality between TDB and InMem and pull out a separate framework.  
That's long term - getting something working and released should not have 
"architectural internal reorg" on the critical path.

    Andy


public class TestDatasetGraphInMem_AFS
             extends AbstractDatasetGraphTests {
    @Override
    protected DatasetGraph emptyDataset() {
        return new DatasetGraphInMemory() ;
    }
}

public class TestDatasetGraphTxn_AFS
             extends AbstractTestTransaction {
    @Override
    protected Dataset create() {
        return DatasetFactory.create(new DatasetGraphInMemory()) ;
    }
}

----------------------

And if all that begin-commit stuff is boring code ... WIP ...

https://github.com/afs/mantis/blob/master/dboe-transaction/src/main/java/org/seaborne/dboe/transaction/Txn.java

    Txn.executeRead(dsg, ()->{
    ... query ...
        }) ;


This includes ThreadTxn - starting a transaction on another thread and 
executing the body at some later date.  Great for isolation testing.

This "Transactional" is a different interface to Jena's (slight change - 
backwards compatible) but the code should work for Jena.

Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Reply via email to