On 24/09/15 18:46, A. Soroka wrote:
Thanks for the feedback, Andy! In order of your points:
1) Okay, I will get on those transaction-related problems pronto. None of them
seem like a big deal to fix, and you’ve already given me the main point (finer
lifecycle granularity) and pointers to the Mantis material.
Mantis does not have internal documentation so feel free to ask questions.
The dboe-* modules are a rewrite of TDB for MVCC and putting module
structure in.
And it would solve the large-delete problems that come up on users@.
There is a companion "large add problem" where at the moment people have
to resort to the bulk loader.
In TDB2, doing 100m loads into a live Fuseki database is no big deal and
it's a write-once update, not the logged two-write of TDB1.
It's usable at the moment if you don't mind the disk space growing.
What is missing is an async compaction process to reduce a live database
without needing to go off line.
2) Yeah, I can see that. I’ll switch it to QuadTable for now, pending further
suggestions. You know, I think I actually called that “QuadIndex” in an earlier
version!
3) It didn’t even occur to me to manage prefixes (prefixen?). I didn’t realize
that was a Dataset responsibility, although that certainly makes sense. Well,
that’s why we don’t make assumptions! {grin} I’ll fix that based on
DatasetPrefixStorage.
It's not an absolute requirement but it is nice to do it.
4) That sounds great. First step: I’ll get these problems squared away and
hopefully other people will glance over and find other things to fix.
Thanks again!
Andy
---
A. Soroka
The University of Virginia Library
On Sep 24, 2015, at 1:05 PM, Andy Seaborne <[email protected]> wrote:
On 23/09/15 17:44, A. Soroka wrote:
Following up on this conversation, I have now have a branch available
at:
https://github.com/ajs6f/jena/tree/jena-624
with a six-way-map-based version of this, advancing from (but not
directly using) the journaling branch already discussed. (Of course I
can separate these if so desired.)
There is an Index type which hopefully is a start towards abstracting
out index behavior, as Paul Houle suggested. There have already been
interesting suggestions made by Andy and Claude about possible
implementations that are more sophisticated than my simple approach,
so I just hope that this branch will get the ball rolling.
Comments/advice/criticism eagerly desired!
--- A. Soroka The University of Virginia Library
Hi - I pulled the code down and had a partial look.
And it looks very good.
(As you probably know by now, having me (test/demo/use/) anywhere near new code
is very risky but I broke very little ... including TDB.)
0/ Basic read/write of files into the dataset worked as expected. :-)
1/
I tried with existing test harnesses:
(see below for code)
AbstractDatasetGraphTests
Green line!
AbstractTestTransaction
Red line.
This has tests for various error conditions.
The begin-begin cases: your code throws JenaException. There is a
sub-exception JenaTransactionException and that what the tests look for.
The rest are testing that e.g. begin-abort-commit is an error. It looks like the
transaction lifecycle is not being tracked. It needs finer granularity than
DatasetGraphInMemory.isInTransaction. May be as simple as NOT_TXN -> ACTIVE_TXN ->
FINISHING_TXN (-> NOT_TXN). Just in/out isn't enough for, e.g.
begin-commit-add_quad->end , begin-commit-commit. I managed to add after commit.
There are some presumable related things here: A writer that .end() without
.commit or .abort doesn't indicate an error. It probably should. And be added
to the AbstractTestTransaction tests.
And plain commit (no begin) isn't caught as wrong.
Dataset ds = DatasetFactory.create(new DatasetGraphInMemory()) ;
ds.commit() ;
Ditto .abort.
Stray .end()s are probably reasonable - "when in doubt call end()" - so
multiple ends() on a transaction, which in effect is end on a non-transaction, is good to
allow.
(I'll add some tests to AbstractTestTransaction for these cases)
2/
I found the use of the name "Index" a odd. Usually an index (in database speak) is a
specific lookup pattern. S->P->O->G and needing a prefix of that for partials.
Hence, in SQL and NoSQL, having multiple indexes per table, one primary, and 0-N secondary. The
use in your code is more like the whole "table" (in individual components are all
covering solike all RDF subsystems no need to have a distinguished "table"
org.apache.jena.tdb.index.Index
org.apache.jena.tdb.index.RangeIndex
Is this really something like "QuadTable"? "QuadStorage"?
I am encountering a similar split between the storage and provision of the
interface in TDB2. There, I want to be able to swap the storage on the fly to
give parallel storage a compaction option to a running database. Being
on-disk, there isn't a GC to manage the multi-version datastructures.
3/
No prefixes on the dataset? I only got them to work with getDefaultModel etc.
TDB uses DatasetPrefixStorage for managing prefixes and then GraphTDB
Some of DatasetPrefixesTDB could be extracted for common use.
leading to...
4/
We should look for commonality between TDB and InMem and pull out a separate framework.
That's long term - getting something working and released should not have
"architectural internal reorg" on the critical path.
Andy
public class TestDatasetGraphInMem_AFS
extends AbstractDatasetGraphTests {
@Override
protected DatasetGraph emptyDataset() {
return new DatasetGraphInMemory() ;
}
}
public class TestDatasetGraphTxn_AFS
extends AbstractTestTransaction {
@Override
protected Dataset create() {
return DatasetFactory.create(new DatasetGraphInMemory()) ;
}
}
----------------------
And if all that begin-commit stuff is boring code ... WIP ...
https://github.com/afs/mantis/blob/master/dboe-transaction/src/main/java/org/seaborne/dboe/transaction/Txn.java
Txn.executeRead(dsg, ()->{
... query ...
}) ;
This includes ThreadTxn - starting a transaction on another thread and
executing the body at some later date. Great for isolation testing.
This "Transactional" is a different interface to Jena's (slight change -
backwards compatible) but the code should work for Jena.