This is a good way to begin the year. I can probably get a test site going sometime early this year, but the one I have in mind isn't very large (maybe 100Mt) and is definitely more at the "large graph" end of things. They are interested in clustering more from an HA/robustness point of view. Meanwhile, there's a good amount of code here to read.
---
A. Soroka
The University of Virginia Library

> On Jan 2, 2017, at 4:42 PM, Andy Seaborne <[email protected]> wrote:
>
> On 01/01/17 19:28, A. Soroka wrote:
>> Andy, this is really great to hear about, congratulations! A great beginning
>> to the new year. A few general questions and then a few questions in-line.
>>
>> * Is the source in the TDB2 sections of https://github.com/afs/mantis ?
>
> It's all of that GH repo.
>
> What was (TDB1) one big repo is now several: the "dboe" are technology
> components that TDB2 pulls together into a database.
>
> The fuseki integration and tdb2-cmds are transient.
>
>> * What kind of trajectory do you expect this project to take this year (e.g.
>> towards integration as part of the Jena release)?
>
> No specific timescale in mind. Only one thing is missing - the space
> reclamation - the most important thing now is to beta test it.
>
>> * What kind of input are you hoping for immediately? I.e. are you looking
>> for active development contributors, sites willing to do testing at scale,
>> or just basic feedback on the code itself and small-scale testing?
>
> Yes, yes and yes.
>
> Having it tried out is the most important of those. One reason for not
> rushing to move to ASF is that it can be released with fixes to any small
> annoying things that are barriers to use; a different tempo to Jena and
> hopefully one that will not last very long.
>
> The other area to polish is smooth working with TDB1 and TDB2, e.g. detecting
> which kind of existing database it is and doing the right thing.
>
>> * Do you have a sense of how far away from production-readiness this code
>> is? Is there anything missing that could be supplied to change that timeline?
>
> It's not "all new" as it is the TDB1 codebase, reworked, so I'm hoping it is
> closer to Jena-ready than a clean-slate component might be.
>
>> * Do you expect the ideas about clustering and distribution with which
>> you've been working to come back into TDB2?
>
> Clustering and distribution are interesting topics with lots of different
> angles.
>
> I hope to have the patch stuff [1] out for clustering for high availability
> and for publishing changes.
>
> There are different use cases, which lead to different solutions, for
> straight scale. Large scale analytics is not the same as large knowledge
> graphs and they lead to different designs and base technology.
>
> The separation into "dboe" components helps - they are more reusable.
>
> Also, one internal change in TDB2 is to clear up NodeIds to allow them to be
> larger - either more bytes long, or using more bits for the node pointers. In
> TDB2, they can be 63 bits. Coupled with more compact indexes, there is also
> the opportunity to get more RDF triples onto one machine.
>
> So what is needed is a use case to target - real data, real queries, real
> business problem to focus the choices.
>
> Andy
>
> [1]
> https://lists.apache.org/thread.html/79e0fbd41126a1d8d0b2fb3b7b837d0d1d58d568a3583701b366cfcc@%3Cdev.jena.apache.org%3E
>
>>
>> ---
>> A. Soroka
>> The University of Virginia Library
>>
>>> On Jan 1, 2017, at 12:17 PM, Andy Seaborne <[email protected]> wrote:
>>>
>>> TDB2 is an upgrade of TDB. It provides fully scalable transactions, e.g.
>>> loading 100's of millions of triples into a live database.
>> <snipped>
>>> So what are the differences between TDB1 and TDB2?
>>>
>>> = Indexes
>> <details snipped>
>>
>> Given the common use of persistent/functional structures between TIM and
>> TDB2, do you expect us eventually to be able to factor out common behavior?
>> Or is the difference between in-memory and on-disk budgeting too great, or
>> is it just too soon to say?
>>
>>> = Nodes
>>>
>>> The node data is now held in a binary form (using RDF/Thrift).
>>
>> Does that mean just the same as:
>>
>> https://jena.apache.org/documentation/io/rdf-binary.html
>>
>> ?
>>
>> <snipped>
>>> = Transactions
>>>
>>> There is a completely new transaction mechanism.
>>> It is now a general
>>> framework that can work with multiple components. A TDB2 database is a
>>> number of such components - one per index, and also the node table. It
>>> could be enhanced to provide multiple dataset transactions and work with
>>> external indexes. The API on datasets is unchanged.
>>
>> Along the lines of my ask above about persistent structures and TIM and
>> TDB2, do you expect us eventually to be able to migrate Jena itself to use
>> this new more-flexible approach? Does the new approach finally separate
>> threads and transactions?
>>
>> <snipped>
>>> == Possibilities
>>>
>>> Given this design, some features are possible, i.e. could be done but
>>> aren't.
>>>
>>> "See into the past" - a read-transaction can be started that sees some
>>> specific committed state from the past, not the latest commit. The
>>> database does not forget any committed changes unless storage is reclaimed.
>>
>> TDB2 is still MRSW, not MR+SW?
>>
>>
