On 01/01/17 19:28, A. Soroka wrote:
Andy, this is really great to hear about, congratulations! A great beginning to
the new year. A few general questions and then a few questions in-line.
* Is the source in the TDB2 sections of https://github.com/afs/mantis ?
It's all of that GH repo.
What was (in TDB1) one big repo is now several: the "dboe" parts are
technology components that TDB2 pulls together into a database.
The Fuseki integration and tdb2-cmds are transient.
* What kind of trajectory do you expect this project to take this year (e.g.
towards integration as part of the Jena release)?
No specific timescale in mind. Only one thing is missing: space
reclamation. The most important thing now is to beta test it.
* What kind of input are you hoping for immediately? I.e. are you looking for
active development contributors, sites willing to do testing at scale, or just
basic feedback on the code itself and small-scale testing?
Yes, yes and yes.
Having it tried out is the most important of those. One reason for not
rushing to move to the ASF is that, outside it, releases can go out with
fixes for any small, annoying things that are barriers to use; that is a
different tempo to Jena, and hopefully one that will not last very long.
The other area to polish is working smoothly with both TDB1 and TDB2, e.g.
detecting which kind an existing database is and doing the right thing.
* Do you have a sense of how far away from production-readiness this code is?
Is there anything missing that could be supplied to change that timeline?
It's not "all new": it is the TDB1 codebase, reworked, so I'm hoping it
is closer to Jena-ready than a clean-slate component might be.
* Do you expect the ideas about clustering and distribution with which you've
been working to come back into TDB2?
Clustering and distribution are interesting topics with lots of
different angles.
I hope to have the patch stuff [1] out for clustering for high
availability and for publishing changes.
For straight scale, there are different use cases, which lead to
different solutions. Large-scale analytics is not the same as large
knowledge graphs, and they lead to different designs and base technology.
The separation into "dboe" components helps - they are more reusable.
Also, one internal change in TDB2 is to clean up NodeIds to allow them to
be larger: either more bytes long, or using more bits for the node
pointers. In TDB2, they can be 63 bits. Coupled with more compact
indexes, there is also the opportunity to get more RDF triples onto one
machine.
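To make the "63 bits" point concrete, here is a minimal sketch of how a 64-bit id could leave 63 bits for a node pointer. The layout is an assumption for illustration (top bit as an "inline value" flag, the remaining bits as an offset into a node table), and the class and method names are hypothetical, not taken from the TDB2 source.

```java
// Hypothetical layout, for illustration only (not the TDB2 source):
//   bit 63      = 1 if the id carries an inline value, 0 if it is a pointer
//   bits 0..62  = pointer, e.g. an offset into a node table
final class NodeIdSketch {
    static final long INLINE_FLAG = 1L << 63;
    static final long POINTER_MASK = ~INLINE_FLAG; // low 63 bits set

    // A non-negative long already fits in 63 bits, so the flag bit stays 0.
    static long encodePointer(long offset) {
        if (offset < 0) {
            throw new IllegalArgumentException("offset needs more than 63 bits");
        }
        return offset;
    }

    static boolean isPointer(long id) {
        return (id & INLINE_FLAG) == 0;
    }

    static long pointerOf(long id) {
        if (!isPointer(id)) {
            throw new IllegalArgumentException("id is not a pointer");
        }
        return id & POINTER_MASK;
    }

    // Largest offset that can be addressed under this assumed layout.
    static long maxPointer() {
        return POINTER_MASK;
    }
}
```

Under this assumed layout the largest addressable offset is 2^63 - 1, which is simply `Long.MAX_VALUE`.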
So what is needed is a use case to target - real data, real queries, a
real business problem - to focus the choices.
Andy
[1]
https://lists.apache.org/thread.html/79e0fbd41126a1d8d0b2fb3b7b837d0d1d58d568a3583701b366cfcc@%3Cdev.jena.apache.org%3E
---
A. Soroka
The University of Virginia Library
On Jan 1, 2017, at 12:17 PM, Andy Seaborne wrote:
TDB2 is an upgrade of TDB. It provides fully scalable transactions, e.g. loading
hundreds of millions of triples into a live database.
<snipped>
So what are the differences between TDB1 and TDB2?
= Indexes
<details snipped>
Given the common use of persistent/functional structures between TIM and TDB2,
do you expect us eventually to be able to factor out common behavior? Or is the
difference between in-memory and on-disk budgeting too great, or is it just too
soon to say?
= Nodes
The node data is now held in a binary form (using RDF/Thrift).
Does that mean just the same as:
https://jena.apache.org/documentation/io/rdf-binary.html
?
<snipped>
= Transactions
There is a completely new transaction mechanism. It is now a general framework
that can work with multiple components. A TDB2 database is a number of such
components - one per index, and also the node table. It could be enhanced to
provide multiple dataset transactions and work with external indexes. The API
on datasets is unchanged.
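As a rough illustration of "a general framework that can work with multiple components", the sketch below shows one coordinator driving several components through begin, prepare and commit, and aborting all of them if any one refuses. All names here (Component, Coordinator, ValueComponent) are hypothetical and chosen for the example; this is not the dboe API, and it leaves out journalling and recovery.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical component contract: each index or node table would implement it.
interface Component {
    void begin();
    void prepare(); // may throw to veto the commit
    void commit();
    void abort();
}

// Drives every registered component through the same transaction steps.
final class Coordinator {
    private final List<Component> components = new ArrayList<>();

    void add(Component c) {
        components.add(c);
    }

    void begin() {
        for (Component c : components) c.begin();
    }

    // Returns true if all components committed, false if all were aborted.
    boolean commit() {
        try {
            for (Component c : components) c.prepare();
        } catch (RuntimeException e) {
            for (Component c : components) c.abort();
            return false;
        }
        for (Component c : components) c.commit();
        return true;
    }
}

// Toy component holding a single value, with a pending copy while in a transaction.
final class ValueComponent implements Component {
    private final String name;
    private final boolean vetoes;
    private int committed;
    private int pending;

    ValueComponent(String name, boolean vetoes) {
        this.name = name;
        this.vetoes = vetoes;
    }

    void set(int v) { pending = v; }
    int committed() { return committed; }

    @Override public void begin() { pending = committed; }
    @Override public void prepare() {
        if (vetoes) throw new IllegalStateException(name + " cannot commit");
    }
    @Override public void commit() { committed = pending; }
    @Override public void abort() { pending = committed; }
}
```

In this sketch an external index would just be one more `Component` added to the coordinator, which is the kind of extension the paragraph above describes.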
Along the lines of my ask above about persistent structures and TIM and TDB2,
do you expect us eventually to be able to migrate Jena itself to use this new
more-flexible approach? Does the new approach finally separate threads and
transactions?
<snipped>
== Possibilities
Given this design, some features are possible, i.e. could be done but aren't.
"See into the past" - a read-transaction can be started that sees some specific
committed state from the past, not the latest commit. The database does not forget any
committed changes unless storage is reclaimed.
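The idea of reading a specific committed state can be sketched as a store that keeps one immutable snapshot per commit and lets a reader pick a version. This is only an illustration: `VersionedStore` and its methods are hypothetical names, and it copies the whole set on each commit, whereas a persistent structure would share the unchanged parts between versions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: every commit appends an immutable snapshot,
// so any earlier committed state stays readable until it is reclaimed.
final class VersionedStore {
    private final List<Set<String>> commits = new ArrayList<>();

    VersionedStore() {
        commits.add(Collections.<String>emptySet()); // version 0: empty store
    }

    // Commits "latest state plus items" and returns the new version number.
    int commitAdd(String... items) {
        Set<String> next = new HashSet<>(commits.get(commits.size() - 1));
        Collections.addAll(next, items);
        commits.add(Collections.unmodifiableSet(next));
        return commits.size() - 1;
    }

    int latest() {
        return commits.size() - 1;
    }

    // A "read transaction" pinned to a chosen committed version.
    Set<String> readAt(int version) {
        return commits.get(version);
    }
}
```

A reader that calls `readAt` with an older version number keeps seeing that state regardless of later commits.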
TDB2 is still MRSW, not MR+SW?