Cassandra and TDB2 seem to be the next routes to explore, to learn about their characteristics and best use cases... It seems that I have a lot to learn before I can even think about how to actively contribute to the project...
Thanks for all the answers and links! Seems that I have some clues on what to read from now on. But any other clues, opinions, or guides on these topics will be greatly appreciated.

On Mon, Jan 23, 2017 at 1:02 PM, Andy Seaborne <[email protected]> wrote:

> Hi there - good to hear from you.
>
> I hope all these pointers in the thread are helpful.
>
> On 22/01/17 15:31, A. Soroka wrote:
>
>> First, to your specific questions:
>>
>>> 1. Atomicity, consistency, isolation and durability of a
>>> transaction on a single TDB database: apart from the limitations
>>> noted in the documentation of TDB Transactions and Txn, are there
>>> current issues? Edge cases detected and not yet covered?
>>
>> I'm not really sure what we mean by "consistency" once we go beyond a
>> single writer. Without a schema, and therefore without any
>> understanding of data dependencies within the database, it's not
>> clear to me how we can automatically understand when a state is
>> consistent. It seems we have to leave that to the applications, for
>> the most part. I'm very interested myself in ways we could "hint" to
>> a triplestore the data dependencies we want it to understand (perhaps
>> something like OWL/ICV), but that's not really a scaling issue.
>>
>> I've recently been investigating the possibility of lock regions more
>> granular than a whole dataset:
>>
>> https://github.com/apache/jena/pull/204
>>
>> for the special case of named graphs as the lock regions. We
>> discussed this about a year ago, when Claude Warren (Jena
>> committer/PMC) made up some designs for discussion:
>>
>> https://lists.apache.org/thread.html/916eed68e9847c6f4c0330fecff8b6f416a27344f2d995400e834562@1451744303@%3Cdev.jena.apache.org%3E
>>
>> and there is a _lot_ more to be thought about there.
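(To check my own understanding of the per-named-graph lock regions in that pull request: the heart of it, stripped of everything TDB-specific, seems to be something like the toy sketch below. All class and method names here are mine, not Jena's.)

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy sketch (not Jena code): one read/write lock per named graph, so that
// writers on different graphs do not block each other. All names invented.
public class GraphLocks {
    private final ConcurrentHashMap<String, ReentrantReadWriteLock> locks =
            new ConcurrentHashMap<>();

    private ReentrantReadWriteLock lockFor(String graphName) {
        // computeIfAbsent guarantees a single lock instance per graph name.
        return locks.computeIfAbsent(graphName, g -> new ReentrantReadWriteLock(true));
    }

    public void beginRead(String g)  { lockFor(g).readLock().lock(); }
    public void endRead(String g)    { lockFor(g).readLock().unlock(); }
    public void beginWrite(String g) { lockFor(g).writeLock().lock(); }
    public void endWrite(String g)   { lockFor(g).writeLock().unlock(); }

    // Non-blocking attempt, used below to show the graphs are independent.
    public boolean tryBeginWrite(String g) { return lockFor(g).writeLock().tryLock(); }

    public static void main(String[] args) throws InterruptedException {
        GraphLocks gl = new GraphLocks();
        gl.beginWrite("http://example/g1");
        boolean[] r = new boolean[2];
        Thread other = new Thread(() -> {
            r[0] = gl.tryBeginWrite("http://example/g1"); // same graph: blocked
            r[1] = gl.tryBeginWrite("http://example/g2"); // other graph: free
        });
        other.start();
        other.join();
        System.out.println("g1 writable elsewhere: " + r[0] + ", g2: " + r[1]);
    }
}
```

If that is roughly right, a writer on one graph no longer excludes a writer on another graph, which is exactly what a single dataset-wide lock cannot give you.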
>> Jena uses threads as stand-ins for transactions, and there is
>> definitely work to be done to separate those ideas so that more than
>> one thread can participate in a transaction
>
> Even in TDB1, transactions can have multiple threads, as multiple-reader,
> single-writer. It's only the API that has the thread-transaction linkage.
>
>> and so that transactions
>> can be managed independently of threading and low-level concurrency.
>> That would be a pretty major change in the codebase, but Andy has
>> been making some moves that will help set that up, by changing from a
>> single class being transactional to several types together composing a
>> transactional thing.
>
> In RDF stores, TDB included, the basic storage and query execution are more
> decoupled than in an SQL DBMS. There are improvements to query processing
> that are separate from the storage.
>
> Interesting things to consider are multithreading (Java fork/join) for
> query processing, or Apache Spark.
>
>>> 2. Are there currently available strategies to achieve a
>>> horizontally scaled TDB database?
>>
>> I'll let Andy speak to this, but I know of none (and I would very much
>> like to!).
>
> Currently - no, not for sideways scale, only (nearly ready!) for High
> Availability.
>
>>> 3. What do you think of trying to implement horizontal scalability
>>> with DatasetGraph, or something else with, let's say, CockroachDB,
>>> VoltDB, PostgreSQL, etc.?
>
> I did a Project Voldemort backend once. The issue is the amount of data
> to move between storage and query engine. It needs careful
> pattern-sensitive caching.
>
> Apache Spark looks interesting - there are a few papers around that have
> looked at it, but I think the starting point is a specific problem to
> solve. There are reasonable but different design choices depending on what
> the problem to be addressed is. Otherwise, without a focus, it's hard to
> make choices with confidence.
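(The point about decoupling transactions from threads clicked for me as: make the transaction an explicit object rather than thread-local state. A toy sketch of that idea - in no way Jena's Txn API, all names invented - where counting permits give multiple-reader/single-writer and a handle can be ended on a different thread than the one that began it:)

```java
import java.util.concurrent.Semaphore;

// Toy sketch (not Jena's Txn API): multiple-reader / single-writer where the
// transaction is an explicit object, so the handle can migrate between
// threads instead of living in thread-local state.
public class Mrsw {
    private static final int MAX_READERS = 1024;
    // A fair semaphore: each reader takes 1 permit, a writer takes all of them.
    private final Semaphore permits = new Semaphore(MAX_READERS, true);

    public final class Txn {
        private final int held;
        private Txn(int held) { this.held = held; }
        // Semaphore permits (unlike ReentrantLock ownership) may be released
        // from any thread - that is what decouples the transaction from it.
        public void end() { permits.release(held); }
    }

    public Txn beginRead()  { permits.acquireUninterruptibly(1);           return new Txn(1); }
    public Txn beginWrite() { permits.acquireUninterruptibly(MAX_READERS); return new Txn(MAX_READERS); }

    /** Null if a writer cannot start now (readers or another writer active). */
    public Txn tryBeginWrite() {
        return permits.tryAcquire(MAX_READERS) ? new Txn(MAX_READERS) : null;
    }

    public static void main(String[] args) throws InterruptedException {
        Mrsw db = new Mrsw();
        Txn r1 = db.beginRead();
        Txn r2 = db.beginRead();                  // two readers coexist
        System.out.println("writer while reading: " + (db.tryBeginWrite() != null));
        r1.end();
        Thread other = new Thread(r2::end);       // end r2 on another thread
        other.start();
        other.join();
        System.out.println("writer after reads:   " + (db.tryBeginWrite() != null));
    }
}
```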
> Apache Spark is also easy to work with on a local machine.
>
>> See Claude's reply about Cassandra. Claude's is not the only work
>> with Cassandra for RDF. There is also:
>>
>> https://github.com/cumulusrdf/cumulusrdf
>>
>> but that does not seem to be a very active project.
>
>>> 4. If there are some stress tests available, e.g. I read about a
>>> 100M BSBM test, is it included in the src? Or may I have a copy
>>> of it? ... Or some guidelines, so I can start to create this
>>> stress code. Will it be useful to you also?
>
> BSBM is https://sourceforge.net/projects/bsbmtools/
>
> I've hacked it for my own use to add handling of local databases (i.e.
> same process), not just over remote connections.
>
> https://github.com/afs/BSBM-Local/
>
> Like any benchmark, it emphasises some aspects and not others. It is
> filter-focused.
>
> There are commercial benchmarks from the Linked Data Benchmark Council
> (not just RDF), but they are commercially focused. It's trying to be TPC,
> including fees.
>
> 100m is not a huge database these days ... except for the SP2B benchmark,
> which is all about basic pattern joins. Even 25m is hard for that.
>
> One to avoid is LUBM - even the originators say that it should not be used.
>
>> You will definitely want to know about the work Rob Vesse (Jena
>> committer/PMC) has done on this front:
>>
>> https://github.com/rvesse/sparql-query-bm
>
> This is probably the best starting point - BSBM is not maintained as far
> as I can see, so a BSBM setup for Rob's project would have more life. I
> think (Rob, correct me if I'm wrong) it needs the "random parameters"
> added - or, more practically, generate a set of queries from templates and
> then use sparql-query-bm. BSBMtools does both at once, which makes it
> inflexible - I think separating benchmark execution from benchmark
> generation means splitting the functional roles and allowing each to do
> its part well.
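(The "generate a set of queries from templates, then execute them" split sounds very doable; to make sure I follow, the generation half could be as small as the sketch below. The `%placeholder%` syntax and all names are my invention, not BSBMtools or sparql-query-bm:)

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Toy sketch of the "generate, then execute" split: expand a query template
// with %placeholder% slots into N concrete queries up front, so a separate
// runner (e.g. sparql-query-bm) can replay them. All names are invented.
public class QueryGen {
    public static List<String> instantiate(String template,
                                           Map<String, List<String>> values,
                                           int n, long seed) {
        // An ordered map plus a fixed seed makes the generated set reproducible.
        Random rnd = new Random(seed);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String q = template;
            for (Map.Entry<String, List<String>> e : values.entrySet()) {
                List<String> options = e.getValue();
                q = q.replace("%" + e.getKey() + "%",
                              options.get(rnd.nextInt(options.size())));
            }
            out.add(q);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> values = new LinkedHashMap<>();
        values.put("type", List.of("http://example/Product", "http://example/Offer"));
        values.put("max",  List.of("100", "500", "1000"));
        String template = "SELECT ?x WHERE { ?x a <%type%> ; <http://example/price> ?v ."
                        + " FILTER(?v < %max%) }";
        for (String q : instantiate(template, values, 5, 42L))
            System.out.println(q);
    }
}
```

Writing the generated queries to a file would then keep benchmark generation fully separate from benchmark execution, as Andy suggests.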
>> Modeling workloads for triplestores, in general, is hard, because
>> people use them in so many different ways. Also, knowing (say) the
>> maximum number of nodes you could put in a dataset might not help you
>> very much if the query time for that dataset with your queries isn't
>> what you need. That's not to discourage you from working on this
>> problem, just to point out that there is a lot of subtlety to even
>> defining and scoping the problem well. It seems to me that the most
>> famous benchmarks for RDF stores take up a particular system of use
>> cases and model that.
>>
>> Otherwise: I've been thinking about scale-out for Jena for a while,
>> too. Particularly, I've been inspired by some of the advanced ideas
>> being worked on in RDFox and TriAD [1], [2], and Andy pointed out
>> this [3] blog post from the folks working on the closed-source
>> product Stardog.
>>
>> In fact, I was about to write some questions to the list
>> (particularly Andy) about how we might start thinking about working
>> in ARQ to split queries to partitions in different nodes, perhaps
>> using summary graphs to avoid sending BGPs where they aren't going to
>> find results, or even using metadata at the branching nodes of the
>> query tree to do cost accounting and results cardinality bounding. It
>> seems we could at least get basic partitioning with enough time to
>> work on it (he wrote blithely!). We might use something like Apache
>> Zookeeper to manage the partitions and nodes and help figure out
>> where to send different branches of the query. TriAD and RDFox are
>> using clever ways of letting different paths through the query slip
>> asynchronously against each other, but that seems to me like a bridge
>> too far at first. Just getting a distributed approach basically
>> working and giving correct results would be a great start! :grin:
>>
>> ---
>> A. Soroka
>>
>> [1] https://www.cs.ox.ac.uk/ian.horrocks/Publications/download/2016/PoMH16a.pdf
>> [2] http://adrem.ua.ac.be/~tmartin/Gurajada-Sigmod14.pdf
>> [3] http://blog.stardog.com/how-to-read-stardog-query-plans/
>>
>>> On Jan 20, 2017, at 8:38 PM, De Gyves <[email protected]> wrote:
>>>
>>> I'd like to participate in the storage portion of Jena, maybe TDB.
>>> As I have worked for many years developing with RDBMS, I'd like to
>>> explore new horizons of persistence, and graph-based ones seem very
>>> promising for my next projects, so I'd like to use SPARQL and RDF
>>> with Jena/TDB and see how far I can go.
>>>
>>> So I've spent the last two days exploring subjects in the jena-dev
>>> mail archives from August 2015 to January of this year, and found
>>> some interesting threads, such as the development of TDB2, the
>>> tests of 100m of BSBM data, a question on horizontal scaling, and
>>> that anything that implements DatasetGraph can be used for a
>>> triple store. Some readings of the Jena doc include: SPARQL, the
>>> RDF API, Txn and TDB transactions.
>>>
>>> What I am looking for is to get a clear perspective on some
>>> requirements which are taken for granted on a traditional RDBMS.
>>> These are:
>>>
>>> 1. Atomicity, consistency, isolation and durability of a
>>> transaction on a single TDB database: apart from the limitations
>>> noted in the documentation of TDB Transactions and Txn, are there
>>> current issues? Edge cases detected and not yet covered?
>>> 2. Are there currently available strategies to achieve a
>>> horizontally scaled TDB database?
>>> 3. What do you think of trying to implement horizontal scalability
>>> with DatasetGraph, or something else with, let's say, CockroachDB,
>>> VoltDB, PostgreSQL, etc.?
>>> 4. If there are some stress tests available, e.g. I read about a
>>> 100M BSBM test, is it included in the src? Or may I have a copy of it?
>>> I'd like to see what the limits are of the current TDB, and maybe
>>> of TDB2: maximum size on disk of a dataset, max number of nodes in
>>> a dataset, of models or graphs in a dataset, the limiting behavior
>>> of a typical read/write transaction vs. the number of nodes,
>>> datasets, etcetera. Or some guidelines, so I can start to create
>>> this stress code. Will it be useful to you also?
>>>
>>> --
>>> Víctor-Polo de Gyvés Montero.
>>> +52 (55) 4926 9478 (Cellphone in Mexico City)
>>> Address: Daniel Delgadillo 7 6A, Agricultura neighborhood,
>>> Miguel Hidalgo borough, ZIP: 11360, México City.
>>>
>>> http://degyves.googlepages.com

--
Víctor-Polo de Gyvés Montero.
+52 (55) 4926 9478 (Cellphone in Mexico City)
Address: Daniel Delgadillo 7 6A, Agricultura neighborhood,
Miguel Hidalgo borough, ZIP: 11360, México City.

http://degyves.googlepages.com
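P.S. To check my reading of the partitioning discussion (summary graphs used to avoid sending BGPs where they cannot find results): is the idea roughly the deliberately naive sketch below? Hash triples to partitions by subject, keep a per-partition set of predicates as the "summary", and route a pattern only to partitions whose summary might match. Everything here is invented by me; systems like TriAD and RDFox clearly use much richer summaries and routing.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Naive sketch (all names invented): subject-hash partitioning with a
// per-partition predicate summary used to prune where a pattern is sent.
public class Partitioned {
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    private final List<List<Triple>> partitions = new ArrayList<>();
    private final List<Set<String>> predicateSummary = new ArrayList<>();

    public Partitioned(int n) {
        for (int i = 0; i < n; i++) {
            partitions.add(new ArrayList<>());
            predicateSummary.add(new HashSet<>());
        }
    }

    private int home(String subject) {
        return Math.floorMod(subject.hashCode(), partitions.size());
    }

    public void add(String s, String p, String o) {
        int i = home(s);
        partitions.get(i).add(new Triple(s, p, o));
        predicateSummary.get(i).add(p);   // summary: predicates present per partition
    }

    /** Partitions a pattern (null = wildcard) must be sent to. */
    public List<Integer> route(String s, String p) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < partitions.size(); i++) {
            if (s != null && home(s) != i) continue;  // bound subject: one partition
            if (p != null && !predicateSummary.get(i).contains(p)) continue; // prune
            targets.add(i);
        }
        return targets;
    }
}
```

A bound subject routes to exactly one partition, and a predicate absent from every summary routes nowhere at all - which, if I understand correctly, is the whole point of keeping summaries at the coordinator.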
