Re: TDB2 - technical background

A. Soroka Tue, 03 Jan 2017 04:49:58 -0800

This is a good way to begin the year. I can probably get a test site going 
sometime early this year, but the one of which I am thinking isn't very large: 
maybe 100M triples, definitely more of the "large graph" end of things. They are 
interested in clustering more from a HA/robustness point of view. Meanwhile 
there's a good amount of code here to read.


---
A. Soroka
The University of Virginia Library

> On Jan 2, 2017, at 4:42 PM, Andy Seaborne <[email protected]> wrote:
> 
> 
> 
> On 01/01/17 19:28, A. Soroka wrote:
>> Andy, this is really great to hear about, congratulations! A great beginning 
>> to the new year. A few general questions and then a few questions in-line.
>> 
>> * Is the source in the TDB2 sections of https://github.com/afs/mantis ?
> 
> It's all of that GH repo.
> 
> What was (in TDB1) one big repo is now several: the "dboe" are technology 
> components that TDB2 pulls together into a database.
> 
> The fuseki integration and tdb2-cmds are transient.
> 
>> * What kind of trajectory do you expect this project to take this year (e.g. 
>> towards integration as part of the Jena release)?
> 
> No specific timescale in mind. Only one thing is missing: the space 
> reclamation. The most important thing now is to beta test it.
> 
>> * What kind of input are you hoping for immediately? I.e. are you looking 
>> for active development contributors, sites willing to do testing at scale, 
>> or just basic feedback on the code itself and small-scale testing?
> 
> Yes, yes and yes.
> 
> Having it tried out is the most important of those.  One reason for not 
> rushing to move to ASF is that it can be released with fixes to any small 
> annoying things that are barriers to use; a different tempo to Jena and 
> hopefully one that will not last very long.
> 
> The other area to polish is smooth working with TDB1 and TDB2, e.g. detecting 
> which kind of existing database it is and doing the right thing.
> 
>> * Do you have a sense of how far away from production-readiness this code 
>> is? Is there anything missing that could be supplied to change that timeline?
> 
> It's not "all new", as it is the TDB1 codebase, reworked, so I'm hoping it is 
> closer to Jena-ready than a clean-slate component might be.
> 
>> * Do you expect the ideas about clustering and distribution with which 
>> you've been working to come back into TDB2?
> 
> Clustering and distribution are interesting topics with lots of different 
> angles.
> 
> I hope to have the patch stuff [1] out for clustering for high availability 
> and for publishing changes.
> 
> There are different use cases, which lead to different solutions, for 
> straight scale. Large scale analytics is not the same as large knowledge 
> graphs and they lead to different designs and base technology.
> 
> The separation into "dboe" components helps - they are more reusable.
> 
> Also, one internal change in TDB2 is to clean up NodeIds to allow them to be 
> larger - either more bytes long, or using more bits for the node pointers. In 
> TDB2, they can be 63 bits.  Coupled with more compact indexes, there is also 
> the opportunity to get more RDF triples onto one machine.
> 
> So what is needed is a use case to target - real data, real queries, real 
> business problem to focus the choices.
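
To make the NodeId point above concrete, here is a minimal sketch of how a 64-bit id could carry a 63-bit pointer plus a one-bit flag. The layout (top bit used as an "inline value" flag) and the names `pack_node_id`, `unpack_node_id` and `to_bytes` are assumptions for illustration only; they are not taken from the TDB2 source.

```python
from typing import Tuple

PTR_BITS = 63
PTR_MASK = (1 << PTR_BITS) - 1   # low 63 bits hold the pointer
INLINE_FLAG = 1 << PTR_BITS      # top bit of the 64-bit word (assumed meaning)


def pack_node_id(pointer: int, inline: bool = False) -> int:
    """Pack a 63-bit pointer and a 1-bit flag into one unsigned 64-bit value."""
    if not 0 <= pointer <= PTR_MASK:
        raise ValueError("pointer does not fit in 63 bits")
    return (INLINE_FLAG if inline else 0) | pointer


def unpack_node_id(node_id: int) -> Tuple[bool, int]:
    """Return (flag, pointer) from a packed 64-bit value."""
    return bool(node_id & INLINE_FLAG), node_id & PTR_MASK


def to_bytes(node_id: int) -> bytes:
    """Fixed-width 8-byte big-endian form, as an index entry might store it."""
    return node_id.to_bytes(8, "big")
```

Whatever the real layout is, the point of the sketch is only that a wider pointer field raises the ceiling on how many nodes a single node table can address.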



> 
>    Andy
> 
> [1] 
> https://lists.apache.org/thread.html/79e0fbd41126a1d8d0b2fb3b7b837d0d1d58d568a3583701b366cfcc@%3Cdev.jena.apache.org%3E
> 
>> 
>> ---
>> A. Soroka
>> The University of Virginia Library
>> 
>>> On Jan 1, 2017, at 12:17 PM, Andy Seaborne <[email protected]> wrote:
>>> 
>>> TDB2 is an upgrade of TDB. It provides fully scalable transactions, e.g. 
>>> loading hundreds of millions of triples into a live database.
>> <snipped>
>>> So what are the differences between TDB1 and TDB2?
>>> 
>>> = Indexes
>> <details snipped>
>> 
>> Given the common use of persistent/functional structures between TIM and 
>> TDB2, do you expect us eventually to be able to factor out common behavior? 
>> Or is the difference between in-memory and on-disk budgeting too great, or 
>> is it just too soon to say?
>> 
>>> = Nodes
>>> 
>>> The node data is now held in a binary form (using RDF/Thrift).
>> 
>> Does that mean just the same as:
>> 
>> https://jena.apache.org/documentation/io/rdf-binary.html
>> 
>> ?
>> 
>> <snipped>
>>> = Transactions
>>> 
>>> There is a completely new transaction mechanism. It is now a general 
>>> framework that can work with multiple components.  A TDB2 database is a 
>>> number of such components - one per index, and also the node table.  It 
>>> could be enhanced to provide multiple dataset transactions and work with 
>>> external indexes. The API on datasets is unchanged.
>> 
>> Along the lines of my ask above about persistent structures and TIM and 
>> TDB2, do you expect us eventually to be able to migrate Jena itself to use 
>> this new more-flexible approach? Does the new approach finally separate 
>> threads and transactions?
>> 
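
As a rough illustration of the "general framework that can work with multiple components" idea quoted above, the sketch below commits one write transaction across several components with a prepare/commit/abort cycle. The class and method names (`Component`, `Coordinator`, `prepare`, `commit`, `abort`) are hypothetical and do not reflect the actual dboe transaction API.

```python
class Component:
    """One transactional part of a database, e.g. an index or the node table."""

    def __init__(self, name):
        self.name = name
        self.committed = []   # state visible after a successful commit
        self.pending = []     # changes made inside the current transaction

    def write(self, item):
        self.pending.append(item)

    def prepare(self):
        # A real component would write to a journal here; this sketch only
        # validates that no pending change is None.
        return all(item is not None for item in self.pending)

    def commit(self):
        self.committed.extend(self.pending)
        self.pending = []

    def abort(self):
        self.pending = []


class Coordinator:
    """Commits every component together, or none of them."""

    def __init__(self, components):
        self.components = list(components)

    def commit(self):
        if all(c.prepare() for c in self.components):
            for c in self.components:
                c.commit()
            return True
        for c in self.components:
            c.abort()
        return False
```

In this picture, an external index would simply be one more `Component` registered with the coordinator.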
>> <snipped>
>>> == Possibilities
>>> 
>>> Given this design, some features are possible, i.e. could be done but 
>>> aren't.
>>> 
>>> "See into the past" - a read-transaction can be started that sees some 
>>> specific committed state from the past, not the latest commit.  The 
>>> database does not forget any committed changes unless storage is reclaimed.
>> 
>> TDB2 is still MRSW, not MR+SW?
>> 
>>
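
The "see into the past" possibility quoted above can be pictured with a toy versioned store: each commit appends a new immutable snapshot, and a reader pins whichever committed version it asks for. This is a sketch under that assumption only. `VersionedStore` and its methods are invented names, and a real persistent data structure would share structure between versions rather than copy whole sets.

```python
class VersionedStore:
    """Toy append-only store: every commit produces a new immutable snapshot."""

    def __init__(self):
        self._versions = [frozenset()]   # version 0 is the empty database

    def commit(self, add=(), remove=()):
        """Apply a single writer's changes and return the new version number."""
        latest = self._versions[-1]
        self._versions.append((latest - frozenset(remove)) | frozenset(add))
        return len(self._versions) - 1

    def read(self, version=None):
        """Return the snapshot for a committed version (latest by default)."""
        if version is None:
            version = len(self._versions) - 1
        if not 0 <= version < len(self._versions):
            raise KeyError("no such committed version")
        return self._versions[version]
```

A reader holding an old version number keeps seeing that state no matter how many commits follow, which is the behaviour described; reclaiming storage would correspond to dropping early entries from the version list.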
