First, to your specific questions:

> 1. Atomicity, consistency, isolation and durability of a transaction on a
> single tdb database: Apart from the limitations on the documentation of TDB
> Transactions and Txn, there are current issues? edge cases detected and not
> yet covered?
I'm not really sure what we mean by "consistency" once we go beyond a single writer. Without a schema, and therefore without any understanding of data dependencies within the database, it's not clear to me how we can automatically understand when a state is consistent. It seems we have to leave that to the applications, for the most part. I'm very interested myself in ways we could "hint" to a triplestore the data dependencies we want it to understand (perhaps something like OWL/ICV), but that's not really a scaling issue.

I've recently been investigating the possibility of lock regions more granular than a whole dataset: https://github.com/apache/jena/pull/204 covers the special case of named graphs as the lock regions. We discussed this about a year ago, when Claude Warren (Jena committer/PMC) made up some designs for discussion:

https://lists.apache.org/thread.html/916eed68e9847c6f4c0330fecff8b6f416a27344f2d995400e834562@1451744303@%3Cdev.jena.apache.org%3E

and there is a _lot_ more to be thought about there.

Jena uses threads as stand-ins for transactions, and there is definitely work to be done to separate those ideas, so that more than one thread can participate in a transaction and so that transactions can be managed independently of threading and low-level concurrency. That would be a pretty major change in the codebase, but Andy has been making some moves that will help set that up, by changing from a single class being transactional to several types together composing a transactional thing.

> 2. Are there currently available strategies to achieve a horizontal-scaled
> tdb database?

I'll let Andy speak to this, but I know of none (and I would very much like to!).

> 3. What do you think of try to implement a horizontal scalability with
> DatasetGraph or something else with, let's say, cockroachdb, voltdb,
> postgresql, etc?

See Claude's reply about Cassandra. Claude's is not the only work with Cassandra for RDF.
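For a sense of what mapping RDF onto a key-value or columnar backend usually involves: such stores typically keep each triple under several term orderings (SPO, POS, OSP), so that every shape of triple pattern has an index that leads with its bound terms. Here is a toy, stdlib-only Java sketch of that idea (all class and method names are mine, not Jena's; a sorted set stands in for the backend's ordered key space):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Toy illustration, NOT Jena code: triples stored under three orderings so
// that any bound-term prefix can be answered by a range scan on one index.
final class TripleKV {
    private final NavigableSet<String> spo = new TreeSet<>();
    private final NavigableSet<String> pos = new TreeSet<>();
    private final NavigableSet<String> osp = new TreeSet<>();

    // Keys are the three terms joined by a separator, in the index's order.
    private static String key(String a, String b, String c) {
        return a + '\0' + b + '\0' + c;
    }

    void add(String s, String p, String o) {
        spo.add(key(s, p, o));
        pos.add(key(p, o, s));
        osp.add(key(o, s, p));
    }

    // Pattern (?s, p, ?o): range-scan the POS index on the prefix "p".
    List<String[]> matchPredicate(String p) {
        List<String[]> out = new ArrayList<>();
        String lo = p + '\0';
        for (String k : pos.tailSet(lo)) {
            if (!k.startsWith(lo)) break;         // past the prefix range
            String[] t = k.split("\0");           // t = {p, o, s}
            out.add(new String[] { t[2], t[0], t[1] }); // back to (s, p, o)
        }
        return out;
    }
}
```

A real backend would replace the TreeSets with the store's own ordered key space (Cassandra clustering columns, a Postgres B-tree, etc.) and would intern node values rather than storing full term strings in every index.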
There is also CumulusRDF: https://github.com/cumulusrdf/cumulusrdf but that does not seem to be a very active project.

> 4. If there are some stress tests available, e.g. I read about a 100M of BSBM
> test, is it included in the src? or may I have a copy of it? ... Or, some
> guidelines, so I can start to create this stress code. Will it be useful to
> you also?

You will definitely want to know about the work Rob Vesse (Jena committer/PMC) has done on this front:

https://github.com/rvesse/sparql-query-bm

Modeling workloads for triplestores is hard in general, because people use them in so many different ways. Also, knowing (say) the maximum number of nodes you could put in a dataset might not help you very much if the query time for that dataset with your queries isn't what you need. That's not to discourage you from working on this problem, just to point out that there is a lot of subtlety even to defining and scoping the problem well. It seems to me that most famous benchmarks for RDF stores take up a particular system of use cases and model that.

Otherwise: I've been thinking about scale-out for Jena for a while, too. In particular, I've been inspired by some of the advanced ideas being worked on in RDFox and TriAD [1], [2], and Andy pointed out this blog post [3] from the folks working on the closed-source product Stardog. In fact, I was about to write some questions to the list (particularly for Andy) about how we might start thinking about working in ARQ to split queries across partitions on different nodes, perhaps using summary graphs to avoid sending BGPs where they aren't going to find results, or even using metadata at the branching nodes of the query tree to do cost accounting and result-cardinality bounding. It seems we could at least get basic partitioning with enough time to work on it (he wrote blithely!). We might use something like Apache Zookeeper to manage the partitions and nodes and help figure out where to send different branches of the query.
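To make the basic-partitioning idea concrete, here is a toy, stdlib-only Java sketch: triples hash-partitioned by subject across N "nodes", with a per-partition predicate summary standing in for a summary graph, so a triple pattern is only sent to partitions that could match it. Every name here (Partition, PartitionedStore, etc.) is invented for illustration; nothing below is Jena or ARQ code:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A bare-bones triple: subject, predicate, object as strings.
final class Triple3 {
    final String s, p, o;
    Triple3(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
}

// One "node" of the cluster: its triples plus a crude summary (the set of
// predicates it holds), used to prune which partitions a pattern visits.
final class Partition {
    final List<Triple3> triples = new ArrayList<>();
    final Set<String> predicateSummary = new HashSet<>();

    void add(Triple3 t) { triples.add(t); predicateSummary.add(t.p); }
    boolean mayMatchPredicate(String p) { return predicateSummary.contains(p); }
}

final class PartitionedStore {
    final Partition[] parts;

    PartitionedStore(int n) {
        parts = new Partition[n];
        for (int i = 0; i < n; i++) parts[i] = new Partition();
    }

    // Route each triple to a partition by hashing its subject.
    void add(Triple3 t) {
        parts[Math.floorMod(t.s.hashCode(), parts.length)].add(t);
    }

    // Pattern (?s, p, ?o): consult only partitions whose summary says the
    // predicate occurs there; the others are skipped entirely.
    List<Triple3> findByPredicate(String p) {
        List<Triple3> out = new ArrayList<>();
        for (Partition part : parts)
            if (part.mayMatchPredicate(p))
                for (Triple3 t : part.triples)
                    if (t.p.equals(p)) out.add(t);
        return out;
    }
}
```

In a real system the summary would be richer (e.g. characteristic sets or Bloom filters over terms), the routing table would live in something like Zookeeper, and each branch of the query tree would carry cardinality estimates to decide where to ship work.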
TriAD and RDFox are using clever ways of letting different paths through the query slip asynchronously against each other, but that seems to me like a bridge too far at first. Just getting a distributed approach basically working and giving correct results would be a great start! :grin:

---
A. Soroka

[1] https://www.cs.ox.ac.uk/ian.horrocks/Publications/download/2016/PoMH16a.pdf
[2] http://adrem.ua.ac.be/~tmartin/Gurajada-Sigmod14.pdf
[3] http://blog.stardog.com/how-to-read-stardog-query-plans/

> On Jan 20, 2017, at 8:38 PM, De Gyves <[email protected]> wrote:
>
> I'd like to participate on the storage portion of Jena, maybe TDB. As I
> have worked many years developing with RBDMS I like to explore new
> horizonts of persistence and graph based ones seem very promising to my
> next projects, so i'd like to use SPARQL and RDF with Jena/TDB and see how
> far I can go.
>
> So I've spent the last two days exploring subjects of the mail archives
> from august 2015 to january of this year the of jena-dev and found some
> interesting threads, as the development of TDB2, the tests of 100m of BSBM
> data, a question of horizontal scaling, and that anything that implements
> DatasetGraph can be used for a triples store. Some readings of jena doc
> include: SPARQL, The RDF API, Txn and TDB transactions.
>
> What I am looking for is to get a clear perspective of some requirements
> which are taken for granted on a traditional RDBMS. These are:
>
> 1. Atomicity, consistency, isolation and durability of a transaction on a
> single tdb database: Apart from the limitations on the documentation of TDB
> Transactions and Txn, there are current issues? edge cases detected and
> not yet covered?
> 2. Are there currently available strategies to achieve a horizontal-scaled
> tdb database?
> 3. What do you think of try to implement a horizontal scalability with
> DatasetGraph or something else with, let's say, cockroachdb, voltdb,
> postgresql, etc?
> 4.
> If there are some stress tests available, e.g. I read about a 100M of
> BSBM test, is it included in the src? or may I have a copy of it? I'd like
> to see what the limits are of the current TDB, and maybe of TDB2: maximum
> size on disk of a dataset, max number of nodes on a dataset, of models or
> graphs on a dataset, the limiting behavior of a typical read/write
> transaction vs. the number of nodes, datasets, etcetera. Or, some
> guidelines, so I can start to create this stress code. Will it be useful to
> you also?
>
> --
> Víctor-Polo de Gyvés Montero.
> +52 (55) 4926 9478 (Cellphone in Mexico city)
> Address: Daniel Delgadillo 7 6A, Agricultura neighborhood, Miguel Hidalgo
> burough
> ZIP: 11360, México City.
>
> http://degyves.googlepages.com
