On 01/01/17 19:28, A. Soroka wrote:
Andy, this is really great to hear about, congratulations! A great beginning to 
the new year. A few general questions and then a few questions in-line.

* Is the source in the TDB2 sections of https://github.com/afs/mantis ?

It's all of that GH repo.

What was one big repo (TDB1) is now several: the "dboe" modules are the technology components that TDB2 pulls together into a database.

The fuseki integration and tdb2-cmds are transient.

* What kind of trajectory do you expect this project to take this year (e.g. 
towards integration as part of the Jena release)?

No specific timescale in mind. Only one thing is missing, the space reclamation. The most important thing now is to beta test it.

* What kind of input are you hoping for immediately? I.e. are you looking for 
active development contributors, sites willing to do testing at scale, or just 
basic feedback on the code itself and small-scale testing?

Yes, yes and yes.

Having it tried out is the most important of those. One reason for not rushing to move to ASF is that it can be released with fixes to any small annoying things that are barriers to use. That is a different tempo to Jena, and hopefully one that will not last very long.

The other area to polish is smooth working with both TDB1 and TDB2, e.g. detecting which kind an existing database is and doing the right thing.

* Do you have a sense of how far away from production-readiness this code is? 
Is there anything missing that could be supplied to change that timeline?

It's not "all new", as it is the TDB1 codebase, reworked, so I'm hoping it is closer to Jena-ready than a clean-slate component might be.

* Do you expect the ideas about clustering and distribution with which you've 
been working to come back into TDB2?

Clustering and distribution are interesting topics with lots of different angles.

I hope to have the patch stuff [1] out for clustering for high availability and for publishing changes.

There are different use cases, which lead to different solutions, for straight scale. Large scale analytics is not the same as large knowledge graphs and they lead to different designs and base technology.

The separation into "dboe" components helps - they are more reusable.

Also, one internal change in TDB2 is to clean up NodeIds to allow them to be larger - either more bytes long, or using more bits for the node pointers. In TDB2, they can be 63 bits. Coupled with more compact indexes, there is also the opportunity to get more RDF triples onto one machine.
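
To make the 63-bit figure concrete, here is a minimal sketch of how a 64-bit id could carry a 63-bit node pointer. It is not taken from the TDB2 source: the class name, the method names and the use of the spare top bit as an "inline value" flag are all assumptions made for illustration.

```java
// Hypothetical layout, for illustration only: not the actual TDB2 NodeId code.
final class NodeIdSketch {
    // Assumption: the top bit flags an inlined value; the low 63 bits are the payload.
    private static final long INLINE_FLAG = 1L << 63;
    private static final long PAYLOAD_MASK = ~INLINE_FLAG;

    /** Encode an offset into the node table; it must fit in 63 bits. */
    static long pointer(long offset) {
        if (offset < 0) {
            throw new IllegalArgumentException("offset does not fit in 63 bits");
        }
        return offset; // flag bit stays clear
    }

    /** Encode a small value directly in the id; it must fit in 63 bits. */
    static long inline(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("value does not fit in 63 bits");
        }
        return INLINE_FLAG | value;
    }

    static boolean isInline(long id) {
        return (id & INLINE_FLAG) != 0;
    }

    /** The 63-bit payload: a node table offset or an inlined value. */
    static long payload(long id) {
        return id & PAYLOAD_MASK;
    }

    public static void main(String[] args) {
        long p = pointer(123_456_789_012L);
        long v = inline(42L);
        System.out.println(isInline(p) + " " + payload(p));
        System.out.println(isInline(v) + " " + payload(v));
    }
}
```

The point of the sketch is only that a 63-bit pointer addresses far more node table space than a narrower one would, while still leaving one bit of a 64-bit word free for other use.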

So what is needed is a use case to target - real data, real queries, real business problem to focus the choices.

    Andy

[1] https://lists.apache.org/thread.html/79e0fbd41126a1d8d0b2fb3b7b837d0d1d58d568a3583701b366cfcc@%3Cdev.jena.apache.org%3E


---
A. Soroka
The University of Virginia Library

On Jan 1, 2017, at 12:17 PM, Andy Seaborne <[email protected]> wrote:

TDB2 is an upgrade of TDB. It provides fully scalable transactions, e.g. loading 
100's of millions of triples into a live database.
<snipped>
So what are the differences between TDB1 and TDB2?

= Indexes
<details snipped>

Given the common use of persistent/functional structures between TIM and TDB2, 
do you expect us eventually to be able to factor out common behavior? Or is the 
difference between in-memory and on-disk budgeting too great, or is it just too 
soon to say?

= Nodes

The node data is now held in a binary form (using RDF/Thrift).

Does that mean just the same as:

https://jena.apache.org/documentation/io/rdf-binary.html

?

<snipped>
= Transactions

There is a completely new transaction mechanism. It is now a general framework 
that can work with multiple components.  A TDB2 database is a number of such 
components - one per index, and also the node table.  It could be enhanced to 
provide multiple dataset transactions and work with external indexes. The API 
on datasets is unchanged.

Along the lines of my ask above about persistent structures and TIM and TDB2, 
do you expect us eventually to be able to migrate Jena itself to use this new 
more-flexible approach? Does the new approach finally separate threads and 
transactions?

<snipped>
== Possibilities

Given this design, some features are possible, i.e. could be done but aren't.

"See into the past" - a read-transaction can be started that sees some specific 
committed state from the past, not the latest commit.  The database does not forget any 
committed changes unless storage is reclaimed.
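
The "see into the past" idea can be sketched in a few lines. This is not TDB2 code: the class and method names are hypothetical, and the sketch copies the whole set on each commit, where a persistent on-disk index would presumably share unchanged blocks between versions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: every commit is kept, so a reader can open any past state.
final class VersionedStoreSketch {
    // commits.get(n) is the immutable state after commit n; version 0 is empty.
    private final List<Set<String>> commits = new ArrayList<>();

    VersionedStoreSketch() {
        commits.add(Collections.emptySet());
    }

    /** Add triples as one commit and return the new version number. */
    synchronized int commit(String... triples) {
        Set<String> next = new HashSet<>(commits.get(commits.size() - 1));
        Collections.addAll(next, triples);
        commits.add(Collections.unmodifiableSet(next));
        return commits.size() - 1;
    }

    synchronized int latestVersion() {
        return commits.size() - 1;
    }

    /** Read view of a specific committed state; later commits never change it. */
    synchronized Set<String> readAt(int version) {
        return commits.get(version);
    }

    public static void main(String[] args) {
        VersionedStoreSketch store = new VersionedStoreSketch();
        int v1 = store.commit(":s :p :o1");
        int v2 = store.commit(":s :p :o2");
        System.out.println("at v" + v1 + ": " + store.readAt(v1).size() + " triple(s)");
        System.out.println("at v" + v2 + ": " + store.readAt(v2).size() + " triple(s)");
    }
}
```

Reclaiming storage would correspond to dropping old entries from the list, after which those versions can no longer be read.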

TDB2 is still MRSW, not MR+SW?

