Cassandra and TDB2 seem to be the next routes to explore, to learn about their characteristics and best use cases... It seems that I have a lot to learn before I can even think about how to actively contribute to the project...
Thanks for all the answers and links! Seems that I have some clues on what to read from now on. But any other clues, opinions, or guides on these topics will be greatly appreciated.

On Mon, Jan 23, 2017 at 1:02 PM, Andy Seaborne <[email protected]> wrote:

> Hi there - good to hear from you.
>
> I hope all these pointers in the thread are helpful.
>
> On 22/01/17 15:31, A. Soroka wrote:
>
>> First, to your specific questions:
>>
>>> 1. Atomicity, consistency, isolation and durability of a
>>> transaction on a single TDB database: apart from the limitations
>>> noted in the documentation of TDB Transactions and Txn, are there
>>> current issues? Edge cases detected and not yet covered?
>>
>> I'm not really sure what we mean by "consistency" once we go beyond a
>> single writer. Without a schema, and therefore without any
>> understanding of data dependencies within the database, it's not
>> clear to me how we can automatically understand when a state is
>> consistent. It seems we have to leave that to the applications, for
>> the most part. I'm very interested myself in ways we could "hint" to
>> a triplestore the data dependencies we want it to understand (perhaps
>> something like OWL/ICV), but that's not really a scaling issue.
>>
>> I've recently been investigating the possibility of lock regions more
>> granular than a whole dataset:
>>
>> https://github.com/apache/jena/pull/204
>>
>> for the special case of named graphs as the lock regions. We
>> discussed this about a year ago, when Claude Warren (Jena
>> committer/PMC) made up some designs for discussion:
>>
>> https://lists.apache.org/thread.html/916eed68e9847c6f4c0330fecff8b6f416a27344f2d995400e834562@1451744303@%3Cdev.jena.apache.org%3E
>>
>> and there is a _lot_ more to be thought about there.
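(To check my own understanding of the per-named-graph lock regions in that pull request: the heart of it, stripped of everything TDB-specific, seems to be something like the toy sketch below. All class and method names here are mine, not Jena's.)

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy sketch (not Jena code): one read/write lock per named graph, so that
// writers on different graphs do not block each other. All names invented.
public class GraphLocks {
    private final ConcurrentHashMap<String, ReentrantReadWriteLock> locks =
            new ConcurrentHashMap<>();

    private ReentrantReadWriteLock lockFor(String graphName) {
        // computeIfAbsent guarantees a single lock instance per graph name.
        return locks.computeIfAbsent(graphName, g -> new ReentrantReadWriteLock(true));
    }

    public void beginRead(String g)  { lockFor(g).readLock().lock(); }
    public void endRead(String g)    { lockFor(g).readLock().unlock(); }
    public void beginWrite(String g) { lockFor(g).writeLock().lock(); }
    public void endWrite(String g)   { lockFor(g).writeLock().unlock(); }

    // Non-blocking attempt, used below to show the graphs are independent.
    public boolean tryBeginWrite(String g) { return lockFor(g).writeLock().tryLock(); }

    public static void main(String[] args) throws InterruptedException {
        GraphLocks gl = new GraphLocks();
        gl.beginWrite("http://example/g1");
        boolean[] r = new boolean[2];
        Thread other = new Thread(() -> {
            r[0] = gl.tryBeginWrite("http://example/g1"); // same graph: blocked
            r[1] = gl.tryBeginWrite("http://example/g2"); // other graph: free
        });
        other.start();
        other.join();
        System.out.println("g1 writable elsewhere: " + r[0] + ", g2: " + r[1]);
    }
}
```

If that is roughly right, a writer on one graph no longer excludes a writer on another graph, which is exactly what a single dataset-wide lock cannot give you.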
>> Jena uses threads as stand-ins for transactions, and there is
>> definitely work to be done to separate those ideas so that more than
>> one thread can participate in a transaction
>
> Even in TDB1, transactions can have multiple threads, as multiple-reader,
> single-writer. It's only the API that has the thread-transaction linkage.
>
>> and so that transactions
>> can be managed independently of threading and low-level concurrency.
>> That would be a pretty major change in the codebase, but Andy has
>> been making some moves that will help set that up, by changing from a
>> single class being transactional to several types together composing a
>> transactional thing.
>
> In RDF stores, TDB included, the basic storage and query execution are more
> decoupled than in an SQL DBMS. There are improvements to query processing
> that are separate from the storage.
>
> Interesting things to consider are multithreading (Java fork/join) for
> query processing, or Apache Spark.
>
>>> 2. Are there currently available strategies to achieve a
>>> horizontally scaled TDB database?
>>
>> I'll let Andy speak to this, but I know of none (and I would very much
>> like to!).
>
> Currently - no, not for sideways scale, only (nearly ready!) for High
> Availability.
>
>>> 3. What do you think of trying to implement horizontal scalability
>>> with DatasetGraph, or something else with, let's say, CockroachDB,
>>> VoltDB, PostgreSQL, etc.?
>
> I did a Project Voldemort backend once. The issue is the amount of data
> to move between storage and query engine. It needs careful
> pattern-sensitive caching.
>
> Apache Spark looks interesting - there are a few papers around that have
> looked at it, but I think the starting point is a specific problem to
> solve. There are reasonable but different design choices depending on what
> the problem to be addressed is. Otherwise, without a focus, it's hard to
> make choices with confidence.
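(The point about decoupling transactions from threads clicked for me as: make the transaction an explicit object rather than thread-local state. A toy sketch of that idea - in no way Jena's Txn API, all names invented - where counting permits give multiple-reader/single-writer and a handle can be ended on a different thread than the one that began it:)

```java
import java.util.concurrent.Semaphore;

// Toy sketch (not Jena's Txn API): multiple-reader / single-writer where the
// transaction is an explicit object, so the handle can migrate between
// threads instead of living in thread-local state.
public class Mrsw {
    private static final int MAX_READERS = 1024;
    // A fair semaphore: each reader takes 1 permit, a writer takes all of them.
    private final Semaphore permits = new Semaphore(MAX_READERS, true);

    public final class Txn {
        private final int held;
        private Txn(int held) { this.held = held; }
        // Semaphore permits (unlike ReentrantLock ownership) may be released
        // from any thread - that is what decouples the transaction from it.
        public void end() { permits.release(held); }
    }

    public Txn beginRead()  { permits.acquireUninterruptibly(1);           return new Txn(1); }
    public Txn beginWrite() { permits.acquireUninterruptibly(MAX_READERS); return new Txn(MAX_READERS); }

    /** Null if a writer cannot start now (readers or another writer active). */
    public Txn tryBeginWrite() {
        return permits.tryAcquire(MAX_READERS) ? new Txn(MAX_READERS) : null;
    }

    public static void main(String[] args) throws InterruptedException {
        Mrsw db = new Mrsw();
        Txn r1 = db.beginRead();
        Txn r2 = db.beginRead();                  // two readers coexist
        System.out.println("writer while reading: " + (db.tryBeginWrite() != null));
        r1.end();
        Thread other = new Thread(r2::end);       // end r2 on another thread
        other.start();
        other.join();
        System.out.println("writer after reads:   " + (db.tryBeginWrite() != null));
    }
}
```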
> Apache Spark is also easy to work with on a local machine.
>
>> See Claude's reply about Cassandra. Claude's is not the only work
>> with Cassandra for RDF. There is also:
>>
>> https://github.com/cumulusrdf/cumulusrdf
>>
>> but that does not seem to be a very active project.
>
>>> 4. If there are some stress tests available, e.g. I read about a
>>> 100M BSBM test, is it included in the src? Or may I have a copy
>>> of it? ... Or some guidelines, so I can start to create this
>>> stress code. Will it be useful to you also?
>
> BSBM is https://sourceforge.net/projects/bsbmtools/
>
> I've hacked it for my own use to add handling of local databases (i.e.
> same process), not just over remote connections.
>
> https://github.com/afs/BSBM-Local/
>
> Like any benchmark, it emphasises some aspects and not others. It is
> filter-focused.
>
> There are commercial benchmarks from the Linked Data Benchmark Council
> (not just RDF), but they are commercially focused. It's trying to be TPC,
> including fees.
>
> 100m is not a huge database these days ... except for the SP2B benchmark,
> which is all about basic pattern joins. Even 25m is hard for that.
>
> One to avoid is LUBM - even the originators say that it should not be used.
>
>> You will definitely want to know about the work Rob Vesse (Jena
>> committer/PMC) has done on this front:
>>
>> https://github.com/rvesse/sparql-query-bm
>
> This is probably the best starting point - BSBM is not maintained as far
> as I can see, so a BSBM setup for Rob's project would have more life. I
> think (Rob, correct me if I'm wrong) it needs the "random parameters"
> added - or, more practically, generate a set of queries from templates and
> then use sparql-query-bm. BSBMtools does both at once, which makes it
> inflexible - I think separating benchmark execution from benchmark
> generation means splitting the functional roles and allowing each to do
> its part well.
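(The "generate a set of queries from templates, then execute them" split sounds very doable; to make sure I follow, the generation half could be as small as the sketch below. The `%placeholder%` syntax and all names are my invention, not BSBMtools or sparql-query-bm:)

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Toy sketch of the "generate, then execute" split: expand a query template
// with %placeholder% slots into N concrete queries up front, so a separate
// runner (e.g. sparql-query-bm) can replay them. All names are invented.
public class QueryGen {
    public static List<String> instantiate(String template,
                                           Map<String, List<String>> values,
                                           int n, long seed) {
        // An ordered map plus a fixed seed makes the generated set reproducible.
        Random rnd = new Random(seed);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String q = template;
            for (Map.Entry<String, List<String>> e : values.entrySet()) {
                List<String> options = e.getValue();
                q = q.replace("%" + e.getKey() + "%",
                              options.get(rnd.nextInt(options.size())));
            }
            out.add(q);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> values = new LinkedHashMap<>();
        values.put("type", List.of("http://example/Product", "http://example/Offer"));
        values.put("max",  List.of("100", "500", "1000"));
        String template = "SELECT ?x WHERE { ?x a <%type%> ; <http://example/price> ?v ."
                        + " FILTER(?v < %max%) }";
        for (String q : instantiate(template, values, 5, 42L))
            System.out.println(q);
    }
}
```

Writing the generated queries to a file would then keep benchmark generation fully separate from benchmark execution, as Andy suggests.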
>> Modeling workloads for triplestores, in general, is hard, because
>> people use them in so many different ways. Also, knowing (say) the
>> maximum number of nodes you could put in a dataset might not help you
>> very much if the query time for that dataset with your queries isn't
>> what you need. That's not to discourage you from working on this
>> problem, just to point out that there is a lot of subtlety to even
>> defining and scoping the problem well. It seems to me that the most
>> famous benchmarks for RDF stores take up a particular system of use
>> cases and model that.
>>
>> Otherwise: I've been thinking about scale-out for Jena for a while,
>> too. Particularly, I've been inspired by some of the advanced ideas
>> being worked on in RDFox and TriAD [1], [2], and Andy pointed out
>> this [3] blog post from the folks working on the closed-source
>> product Stardog.
>>
>> In fact, I was about to write some questions to the list
>> (particularly Andy) about how we might start thinking about working
>> in ARQ to split queries to partitions in different nodes, perhaps
>> using summary graphs to avoid sending BGPs where they aren't going to
>> find results, or even using metadata at the branching nodes of the
>> query tree to do cost accounting and results cardinality bounding. It
>> seems we could at least get basic partitioning with enough time to
>> work on it (he wrote blithely!). We might use something like Apache
>> Zookeeper to manage the partitions and nodes and help figure out
>> where to send different branches of the query. TriAD and RDFox are
>> using clever ways of letting different paths through the query slip
>> asynchronously against each other, but that seems to me like a bridge
>> too far at first. Just getting a distributed approach basically
>> working and giving correct results would be a great start! :grin:
>>
>> ---
>> A. Soroka
>>
>> [1] https://www.cs.ox.ac.uk/ian.horrocks/Publications/download/2016/PoMH16a.pdf
>> [2] http://adrem.ua.ac.be/~tmartin/Gurajada-Sigmod14.pdf
>> [3] http://blog.stardog.com/how-to-read-stardog-query-plans/
>>
>>> On Jan 20, 2017, at 8:38 PM, De Gyves <[email protected]> wrote:
>>>
>>> I'd like to participate in the storage portion of Jena, maybe TDB.
>>> As I have worked for many years developing with RDBMS, I'd like to
>>> explore new horizons of persistence, and graph-based ones seem very
>>> promising for my next projects, so I'd like to use SPARQL and RDF
>>> with Jena/TDB and see how far I can go.
>>>
>>> So I've spent the last two days exploring subjects in the jena-dev
>>> mail archives from August 2015 to January of this year, and found
>>> some interesting threads, such as the development of TDB2, the
>>> tests of 100m of BSBM data, a question on horizontal scaling, and
>>> that anything that implements DatasetGraph can be used for a
>>> triple store. Some readings of the Jena doc include: SPARQL, the
>>> RDF API, Txn and TDB transactions.
>>>
>>> What I am looking for is to get a clear perspective on some
>>> requirements which are taken for granted on a traditional RDBMS.
>>> These are:
>>>
>>> 1. Atomicity, consistency, isolation and durability of a
>>> transaction on a single TDB database: apart from the limitations
>>> noted in the documentation of TDB Transactions and Txn, are there
>>> current issues? Edge cases detected and not yet covered?
>>> 2. Are there currently available strategies to achieve a
>>> horizontally scaled TDB database?
>>> 3. What do you think of trying to implement horizontal scalability
>>> with DatasetGraph, or something else with, let's say, CockroachDB,
>>> VoltDB, PostgreSQL, etc.?
>>> 4. If there are some stress tests available, e.g. I read about a
>>> 100M BSBM test, is it included in the src? Or may I have a copy of it?
>>> I'd like to see what the limits are of the current TDB, and maybe
>>> of TDB2: maximum size on disk of a dataset, max number of nodes in
>>> a dataset, of models or graphs in a dataset, the limiting behavior
>>> of a typical read/write transaction vs. the number of nodes,
>>> datasets, etcetera. Or some guidelines, so I can start to create
>>> this stress code. Will it be useful to you also?
>>>
>>> --
>>> Víctor-Polo de Gyvés Montero.
>>> +52 (55) 4926 9478 (Cellphone in Mexico City)
>>> Address: Daniel Delgadillo 7 6A, Agricultura neighborhood,
>>> Miguel Hidalgo borough, ZIP: 11360, México City.
>>>
>>> http://degyves.googlepages.com

--
Víctor-Polo de Gyvés Montero.
+52 (55) 4926 9478 (Cellphone in Mexico City)
Address: Daniel Delgadillo 7 6A, Agricultura neighborhood,
Miguel Hidalgo borough, ZIP: 11360, México City.

http://degyves.googlepages.com
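P.S. To check my reading of the partitioning discussion (summary graphs used to avoid sending BGPs where they cannot find results): is the idea roughly the deliberately naive sketch below? Hash triples to partitions by subject, keep a per-partition set of predicates as the "summary", and route a pattern only to partitions whose summary might match. Everything here is invented by me; systems like TriAD and RDFox clearly use much richer summaries and routing.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Naive sketch (all names invented): subject-hash partitioning with a
// per-partition predicate summary used to prune where a pattern is sent.
public class Partitioned {
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    private final List<List<Triple>> partitions = new ArrayList<>();
    private final List<Set<String>> predicateSummary = new ArrayList<>();

    public Partitioned(int n) {
        for (int i = 0; i < n; i++) {
            partitions.add(new ArrayList<>());
            predicateSummary.add(new HashSet<>());
        }
    }

    private int home(String subject) {
        return Math.floorMod(subject.hashCode(), partitions.size());
    }

    public void add(String s, String p, String o) {
        int i = home(s);
        partitions.get(i).add(new Triple(s, p, o));
        predicateSummary.get(i).add(p);   // summary: predicates present per partition
    }

    /** Partitions a pattern (null = wildcard) must be sent to. */
    public List<Integer> route(String s, String p) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < partitions.size(); i++) {
            if (s != null && home(s) != i) continue;  // bound subject: one partition
            if (p != null && !predicateSummary.get(i).contains(p)) continue; // prune
            targets.add(i);
        }
        return targets;
    }
}
```

A bound subject routes to exactly one partition, and a predicate absent from every summary routes nowhere at all - which, if I understand correctly, is the whole point of keeping summaries at the coordinator.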
