Hi there - good to hear from you.
I hope all these pointers in the thread are helpful.
On 22/01/17 15:31, A. Soroka wrote:
First, to your specific questions:
1. Atomicity, consistency, isolation and durability of a
transaction on a single TDB database: apart from the limitations
noted in the documentation of TDB Transactions and Txn, are there
current issues? Edge cases detected and not yet covered?
I'm not really sure what we mean by "consistency" once we go beyond a
single writer. Without a schema and therefore without any
understanding of data dependencies within the database, it's not
clear to me how we can automatically understand when a state is
consistent. It seems we have to leave that to the applications, for
the most part. I'm very interested myself in ways we could "hint" to
a triplestore the data dependencies we want it to understand (perhaps
something like OWL/ICV), but that's not really a scaling issue.
I've recently been investigating the possibility of lock regions more
granular than a whole dataset:
https://github.com/apache/jena/pull/204
for the special case of named graphs as the lock regions. We
discussed this about a year ago when Claude Warren (Jena
committer/PMC) made up some designs for discussion:
https://lists.apache.org/thread.html/916eed68e9847c6f4c0330fecff8b6f416a27344f2d995400e834562@1451744303@%3Cdev.jena.apache.org%3E
and there is a _lot_ more to be thought about there.
Jena uses threads as stand-ins for transactions, and there is
definitely work to be done to separate those ideas so that more than
one thread can participate in a transaction
Even in TDB1, transactions can have multiple threads, as
multiple-reader, single writer. It's only the API that has the
thread-transaction linkage.
and so that transactions
can be managed independently of threading and low-level concurrency.
That would be a pretty major change in the codebase, but Andy has
been making some moves that will help set that up by changing from a
single class being transactional to several types together composing a
transactional thing.
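For reference, a minimal sketch of what the thread-bound style looks
like today, with both the explicit begin/commit/end idiom and the Txn
helper (the database location is just an example path):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.system.Txn;
    import org.apache.jena.tdb.TDBFactory;

    public class TxnExample {
        public static void main(String[] args) {
            // Example location only - any TDB1 dataset directory works.
            Dataset dataset = TDBFactory.createDataset("/tmp/example-tdb");

            // Explicit style: begin/commit/end are tied to the calling thread.
            dataset.begin(ReadWrite.WRITE);
            try {
                Model m = dataset.getDefaultModel();
                m.createResource("http://example/s")
                 .addProperty(m.createProperty("http://example/p"), "o");
                dataset.commit();
            } finally {
                dataset.end();
            }

            // Txn helper: wraps begin/commit/end, still one thread per transaction.
            Txn.executeRead(dataset, () ->
                System.out.println("size = " + dataset.getDefaultModel().size()));
        }
    }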
In RDF stores, TDB included, the basic storage and query execution are
more decoupled than in an SQL DBMS. There are improvements to query
processing that are separate from the storage.
Interesting things to consider are multithreading (Java fork/join) for
query processing or Apache Spark.
2. Are there currently available strategies to achieve a
horizontally-scaled TDB database?
I'll let Andy speak to this, but I know of none (and I would very much
like to!).
Currently - no, not for sideways scale, only (nearly ready!) for High
Availability.
3. What do you think of trying to implement horizontal scalability
with DatasetGraph or something else, with, let's say, CockroachDB,
VoltDB, PostgreSQL, etc.?
I did a Project Voldemort backend once. The issue is the amount of data
to move between storage and query engine. It needs careful
pattern-sensitive caching.
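To make the data-movement point concrete: the narrowest extension point
is a Graph that answers find() calls from external storage - ARQ turns
every basic graph pattern into such calls, so a naive backend ships a
lot of intermediate data across the wire. A minimal, read-only sketch,
in which the "ExternalStore" client is purely hypothetical:

    import java.util.stream.Stream;

    import org.apache.jena.graph.Node;
    import org.apache.jena.graph.Triple;
    import org.apache.jena.graph.impl.GraphBase;
    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.DatasetFactory;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.util.iterator.ExtendedIterator;
    import org.apache.jena.util.iterator.WrappedIterator;

    /** Read-only Graph over some external store ("ExternalStore" is hypothetical). */
    public class ExternalStoreGraph extends GraphBase {

        /** Hypothetical client for the remote/partitioned backend. */
        public interface ExternalStore {
            Stream<Triple> lookup(Node s, Node p, Node o);   // Node.ANY = wildcard
        }

        private final ExternalStore store;

        public ExternalStoreGraph(ExternalStore store) { this.store = store; }

        @Override
        protected ExtendedIterator<Triple> graphBaseFind(Triple pattern) {
            // Every wildcard in the pattern becomes a scan in the backend;
            // this is where the storage <-> query-engine data movement happens.
            Stream<Triple> matches =
                store.lookup(pattern.getSubject(), pattern.getPredicate(), pattern.getObject());
            return WrappedIterator.create(matches.iterator());
        }

        /** Wrap the Graph so ARQ can run SPARQL over it. */
        public static Dataset wrap(ExternalStore store) {
            return DatasetFactory.create(
                ModelFactory.createModelForGraph(new ExternalStoreGraph(store)));
        }
    }

Without pattern-sensitive caching around lookup(), every join step pays
the round-trip cost again.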
Apache Spark looks interesting - there are a few papers around that have
looked at it, but I think the starting point is a specific problem to
solve. There are reasonable but different design choices depending on
what the problem to be addressed is. Otherwise, without a focus, it's
hard to make choices with confidence.
Apache Spark is also easy to work with on a local machine.
See Claude's reply about Cassandra. Claude's is not the only work
with Cassandra for RDF. There is also:
https://github.com/cumulusrdf/cumulusrdf
but that does not seem to be a very active project.
4. If there are some stress tests available, e.g. I read about a
100M BSBM test, is it included in the source? Or may I have a copy
of it? ... Or some guidelines, so I can start to create this
stress code. Will it be useful to you also?
BSBM is https://sourceforge.net/projects/bsbmtools/
I've hacked it for my own use to add handling of local databases (i.e.
same process), not just remote connections.
https://github.com/afs/BSBM-Local/
Like any benchmark, it emphasises some aspects and not others. It is
filter-focused.
There are benchmarks from the Linked Data Benchmark Council (not just
RDF), but they are commercially focused; LDBC is trying to be a TPC,
fees included.
100M is not a huge database these days ... except for the SP2B benchmark,
which is all about basic pattern joins. Even 25M is hard for that.
One to avoid is LUBM - even the originators say that should not be used.
You will definitely want to know about the work Rob Vesse (Jena
committer/PMC) has done on this front:
https://github.com/rvesse/sparql-query-bm
This is probably the best starting point - BSBM is not maintained as far
as I can see so a BSBM setup for Rob's project would have more life. I
think (Rob correct me if I'm wrong) it needs the "random parameters"
added - or more practically, generate a set of queries from templates
and then use sparql-query-bm. BSBMtools does both at once, which makes
it inflexible - I think separating benchmark execution from benchmark
generation means splitting the functional roles and allowing each to do
its part well.
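On generating queries from templates: one possibility (a sketch, not
what BSBMtools itself does) is Jena's ParameterizedSparqlString - each
substitution produces one concrete query, which a runner such as
sparql-query-bm can then replay:

    import org.apache.jena.query.ParameterizedSparqlString;
    import org.apache.jena.query.Query;

    public class TemplateToQueries {
        public static void main(String[] args) {
            // Template with a placeholder variable; values would normally be
            // sampled from the dataset rather than hard-coded as here.
            ParameterizedSparqlString pss = new ParameterizedSparqlString(
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
                "SELECT ?product ?label WHERE {\n" +
                "  ?product a ?type ;\n" +
                "           rdfs:label ?label .\n" +
                "}");

            // One substitution = one concrete query; write each out and point
            // the benchmark runner at the resulting query mix.
            pss.setIri("type", "http://example/ProductType42");   // example value
            Query q = pss.asQuery();
            System.out.println(q);
        }
    }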
Modeling workloads for triplestores, in general, is hard because
people use them in so many different ways. Also knowing (say) the
maximum number of nodes you could put in a dataset might not help you
very much if the query time for that dataset with your queries isn't
what you need. That's not to discourage you from working on this
problem, just to point out that there is a lot of subtlety to even
defining and scoping the problem well. It seems to me that most
famous benchmarks for RDF stores take up a particular system of use
cases and model that.
Otherwise: I've been thinking about scale-out for Jena for a while,
too. Particularly I've been inspired by some of the advanced ideas
being worked on in RDFox and TriAD [1], [2], and Andy pointed out
this [3] blog post from the folks working on the closed-source
product Stardog.
In fact, I was about to write some questions to the list
(particularly Andy) about how we might start thinking about working
in ARQ to split queries to partitions in different nodes, perhaps
using summary graphs to avoid sending BGPs where they aren't going to
find results or even using metadata at the branching nodes of the
query tree to do cost accounting and results cardinality bounding. It
seems we could at least get basic partitioning with enough time to
work on it (he wrote blithely!). We might use something like Apache
Zookeeper to manage the partitions and nodes and help figure out
where to send different branches of the query. TriAD and RDFox are
using clever ways of letting different paths through the query slip
asynchronously against each other, but that seems to me like a bridge
too far at first. Just getting a distributed approach basically
working and giving correct results would be a great start! :grin:
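For what it's worth, ARQ already exposes the compiled algebra, so
enumerating the BGPs a partition-aware router would have to dispatch is
straightforward. A small sketch (no routing or summary-graph logic here,
just the walk over the algebra tree):

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.sparql.algebra.Algebra;
    import org.apache.jena.sparql.algebra.Op;
    import org.apache.jena.sparql.algebra.OpVisitorBase;
    import org.apache.jena.sparql.algebra.OpWalker;
    import org.apache.jena.sparql.algebra.op.OpBGP;

    public class BgpInspector {
        public static void main(String[] args) {
            Query query = QueryFactory.create(
                "SELECT * WHERE { ?s <http://example/p> ?o . ?o <http://example/q> ?v }");
            Op op = Algebra.compile(query);   // SPARQL algebra tree

            // Visit each basic graph pattern; a partition-aware engine could
            // decide per BGP which node(s) to send it to, e.g. by consulting
            // a summary graph or per-partition cardinality estimates.
            OpWalker.walk(op, new OpVisitorBase() {
                @Override
                public void visit(OpBGP opBGP) {
                    System.out.println("BGP: " + opBGP.getPattern());
                }
            });
        }
    }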
--- A. Soroka
[1]
https://www.cs.ox.ac.uk/ian.horrocks/Publications/download/2016/PoMH16a.pdf
[2] http://adrem.ua.ac.be/~tmartin/Gurajada-Sigmod14.pdf
[3] http://blog.stardog.com/how-to-read-stardog-query-plans/
On Jan 20, 2017, at 8:38 PM, De Gyves <[email protected]> wrote:
I'd like to participate in the storage portion of Jena, maybe TDB.
As I have worked many years developing with RDBMSs, I'd like to
explore new horizons of persistence, and graph-based ones seem very
promising for my next projects, so I'd like to use SPARQL and RDF
with Jena/TDB and see how far I can go.
So I've spent the last two days exploring the jena-dev mail
archives from August 2015 to January of this year and found some
interesting threads, such as the development of TDB2, the tests
with 100M of BSBM data, a question about horizontal scaling, and
that anything that implements DatasetGraph can be used as a triple
store. Some readings of the Jena documentation include: SPARQL, the
RDF API, Txn and TDB Transactions.
What I am looking for is to get a clear perspective of some
requirements which are taken for granted on a traditional RDBMS.
These are:
1. Atomicity, consistency, isolation and durability of a transaction
on a single TDB database: apart from the limitations noted in the
documentation of TDB Transactions and Txn, are there current issues?
Edge cases detected and not yet covered?
2. Are there currently available strategies to achieve a
horizontally-scaled TDB database?
3. What do you think of trying to implement horizontal scalability
with DatasetGraph or something else, with, let's say, CockroachDB,
VoltDB, PostgreSQL, etc.?
4. If there are some stress tests available, e.g. I read about a 100M
BSBM test, is it included in the source? Or may I have a copy of it?
I'd like to see what the limits of the current TDB are, and maybe of
TDB2: maximum size on disk of a dataset, maximum number of nodes in a
dataset, of models or graphs in a dataset, the limiting behavior of a
typical read/write transaction vs. the number of nodes, datasets,
etcetera. Or some guidelines, so I can start to create this stress
code. Will it be useful to you also?
--
Víctor-Polo de Gyvés Montero.
+52 (55) 4926 9478 (cellphone in Mexico City)
Address: Daniel Delgadillo 7 6A, Agricultura neighborhood, Miguel
Hidalgo borough, ZIP 11360, México City.
http://degyves.googlepages.com