First, to your specific questions:

> 1. Atomicity, consistency, isolation and durability of a transaction on a
> single tdb database: Apart from the limitations on the documentation of TDB
> Transactions and Txn, there are current issues? edge cases detected and not
> yet covered?
I'm not really sure what we mean by "consistency" once we go beyond a single writer. Without a schema, and therefore without any understanding of data dependencies within the database, it's not clear to me how we can automatically understand when a state is consistent. It seems we have to leave that to the applications, for the most part. I'm very interested myself in ways we could "hint" to a triplestore the data dependencies we want it to understand (perhaps something like OWL/ICV), but that's not really a scaling issue.

I've recently been investigating the possibility of lock regions more granular than a whole dataset: https://github.com/apache/jena/pull/204 covers the special case of named graphs as the lock regions. We discussed this about a year ago, when Claude Warren (Jena committer/PMC) made up some designs for discussion:

https://lists.apache.org/thread.html/916eed68e9847c6f4c0330fecff8b6f416a27344f2d995400e834562@1451744303@%3Cdev.jena.apache.org%3E

and there is a _lot_ more to be thought about there.

Jena uses threads as stand-ins for transactions, and there is definitely work to be done to separate those ideas, so that more than one thread can participate in a transaction and so that transactions can be managed independently of threading and low-level concurrency. That would be a pretty major change in the codebase, but Andy has been making some moves that will help set that up, by changing from a single class being transactional to several types together composing a transactional thing.

> 2. Are there currently available strategies to achieve a horizontal-scaled
> tdb database?

I'll let Andy speak to this, but I know of none (and I would very much like to!).

> 3. What do you think of try to implement a horizontal scalability with
> DatasetGraph or something else with, let's say, cockroachdb, voltdb,
> postgresql, etc?

See Claude's reply about Cassandra. Claude's is not the only work with Cassandra for RDF.
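For a sense of what mapping RDF onto a key-value or columnar backend usually involves: such stores typically keep each triple under several term orderings (SPO, POS, OSP), so that every shape of triple pattern has an index that leads with its bound terms. Here is a toy, stdlib-only Java sketch of that idea (all class and method names are mine, not Jena's; a sorted set stands in for the backend's ordered key space):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Toy illustration, NOT Jena code: triples stored under three orderings so
// that any bound-term prefix can be answered by a range scan on one index.
final class TripleKV {
    private final NavigableSet<String> spo = new TreeSet<>();
    private final NavigableSet<String> pos = new TreeSet<>();
    private final NavigableSet<String> osp = new TreeSet<>();

    // Keys are the three terms joined by a separator, in the index's order.
    private static String key(String a, String b, String c) {
        return a + '\0' + b + '\0' + c;
    }

    void add(String s, String p, String o) {
        spo.add(key(s, p, o));
        pos.add(key(p, o, s));
        osp.add(key(o, s, p));
    }

    // Pattern (?s, p, ?o): range-scan the POS index on the prefix "p".
    List<String[]> matchPredicate(String p) {
        List<String[]> out = new ArrayList<>();
        String lo = p + '\0';
        for (String k : pos.tailSet(lo)) {
            if (!k.startsWith(lo)) break;         // past the prefix range
            String[] t = k.split("\0");           // t = {p, o, s}
            out.add(new String[] { t[2], t[0], t[1] }); // back to (s, p, o)
        }
        return out;
    }
}
```

A real backend would replace the TreeSets with the store's own ordered key space (Cassandra clustering columns, a Postgres B-tree, etc.) and would intern node values rather than storing full term strings in every index.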
There is also CumulusRDF: https://github.com/cumulusrdf/cumulusrdf but that does not seem to be a very active project.

> 4. If there are some stress tests available, e.g. I read about a 100M of BSBM
> test, is it included in the src? or may I have a copy of it? ... Or, some
> guidelines, so I can start to create this stress code. Will it be useful to
> you also?

You will definitely want to know about the work Rob Vesse (Jena committer/PMC) has done on this front:

https://github.com/rvesse/sparql-query-bm

Modeling workloads for triplestores is hard in general, because people use them in so many different ways. Also, knowing (say) the maximum number of nodes you could put in a dataset might not help you very much if the query time for that dataset with your queries isn't what you need. That's not to discourage you from working on this problem, just to point out that there is a lot of subtlety even to defining and scoping the problem well. It seems to me that most famous benchmarks for RDF stores take up a particular system of use cases and model that.

Otherwise: I've been thinking about scale-out for Jena for a while, too. In particular, I've been inspired by some of the advanced ideas being worked on in RDFox and TriAD [1], [2], and Andy pointed out this blog post [3] from the folks working on the closed-source product Stardog. In fact, I was about to write some questions to the list (particularly for Andy) about how we might start thinking about working in ARQ to split queries across partitions on different nodes, perhaps using summary graphs to avoid sending BGPs where they aren't going to find results, or even using metadata at the branching nodes of the query tree to do cost accounting and result-cardinality bounding. It seems we could at least get basic partitioning with enough time to work on it (he wrote blithely!). We might use something like Apache Zookeeper to manage the partitions and nodes and help figure out where to send different branches of the query.
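To make the basic-partitioning idea concrete, here is a toy, stdlib-only Java sketch: triples hash-partitioned by subject across N "nodes", with a per-partition predicate summary standing in for a summary graph, so a triple pattern is only sent to partitions that could match it. Every name here (Partition, PartitionedStore, etc.) is invented for illustration; nothing below is Jena or ARQ code:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A bare-bones triple: subject, predicate, object as strings.
final class Triple3 {
    final String s, p, o;
    Triple3(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
}

// One "node" of the cluster: its triples plus a crude summary (the set of
// predicates it holds), used to prune which partitions a pattern visits.
final class Partition {
    final List<Triple3> triples = new ArrayList<>();
    final Set<String> predicateSummary = new HashSet<>();

    void add(Triple3 t) { triples.add(t); predicateSummary.add(t.p); }
    boolean mayMatchPredicate(String p) { return predicateSummary.contains(p); }
}

final class PartitionedStore {
    final Partition[] parts;

    PartitionedStore(int n) {
        parts = new Partition[n];
        for (int i = 0; i < n; i++) parts[i] = new Partition();
    }

    // Route each triple to a partition by hashing its subject.
    void add(Triple3 t) {
        parts[Math.floorMod(t.s.hashCode(), parts.length)].add(t);
    }

    // Pattern (?s, p, ?o): consult only partitions whose summary says the
    // predicate occurs there; the others are skipped entirely.
    List<Triple3> findByPredicate(String p) {
        List<Triple3> out = new ArrayList<>();
        for (Partition part : parts)
            if (part.mayMatchPredicate(p))
                for (Triple3 t : part.triples)
                    if (t.p.equals(p)) out.add(t);
        return out;
    }
}
```

In a real system the summary would be richer (e.g. characteristic sets or Bloom filters over terms), the routing table would live in something like Zookeeper, and each branch of the query tree would carry cardinality estimates to decide where to ship work.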
TriAD and RDFox are using clever ways of letting different paths through the query slip asynchronously against each other, but that seems to me like a bridge too far at first. Just getting a distributed approach basically working and giving correct results would be a great start! :grin:

---
A. Soroka

[1] https://www.cs.ox.ac.uk/ian.horrocks/Publications/download/2016/PoMH16a.pdf
[2] http://adrem.ua.ac.be/~tmartin/Gurajada-Sigmod14.pdf
[3] http://blog.stardog.com/how-to-read-stardog-query-plans/

> On Jan 20, 2017, at 8:38 PM, De Gyves <[email protected]> wrote:
>
> I'd like to participate on the storage portion of Jena, maybe TDB. As I
> have worked many years developing with RBDMS I like to explore new
> horizonts of persistence and graph based ones seem very promising to my
> next projects, so i'd like to use SPARQL and RDF with Jena/TDB and see how
> far I can go.
>
> So I've spent the last two days exploring subjects of the mail archives
> from august 2015 to january of this year the of jena-dev and found some
> interesting threads, as the development of TDB2, the tests of 100m of BSBM
> data, a question of horizontal scaling, and that anything that implements
> DatasetGraph can be used for a triples store. Some readings of jena doc
> include: SPARQL, The RDF API, Txn and TDB transactions.
>
> What I am looking for is to get a clear perspective of some requirements
> which are taken for granted on a traditional RDBMS. These are:
>
> 1. Atomicity, consistency, isolation and durability of a transaction on a
> single tdb database: Apart from the limitations on the documentation of TDB
> Transactions and Txn, there are current issues? edge cases detected and
> not yet covered?
> 2. Are there currently available strategies to achieve a horizontal-scaled
> tdb database?
> 3. What do you think of try to implement a horizontal scalability with
> DatasetGraph or something else with, let's say, cockroachdb, voltdb,
> postgresql, etc?
> 4.
> If there are some stress tests available, e.g. I read about a 100M of
> BSBM test, is it included in the src? or may I have a copy of it? I'd like
> to see what the limits are of the current TDB, and maybe of TDB2: maximum
> size on disk of a dataset, max number of nodes on a dataset, of models or
> graphs on a dataset, the limiting behavior of a typical read/write
> transaction vs. the number of nodes, datasets, etcetera. Or, some
> guidelines, so I can start to create this stress code. Will it be useful to
> you also?
>
> --
> Víctor-Polo de Gyvés Montero.
> +52 (55) 4926 9478 (Cellphone in Mexico city)
> Address: Daniel Delgadillo 7 6A, Agricultura neighborhood, Miguel Hidalgo
> burough
> ZIP: 11360, México City.
>
> http://degyves.googlepages.com
