Since I received no pushback on my proposal, let's move on to discussing the database model.
I see this model is good enough for certain aspects of the proposed 3.0, but
not for all. We can't store the metadata in it: there is clearly insufficient
information to perform builds from.
Correct. That model comes from my work on Dynagump and it's a contract between stage 2 (build) and stage 3 (presentation).
That said, I am more than happy to start on a 3.0 break-up by splitting the outputs from the presentation of those outputs via this model.
Great. It would also be useful to know what kind of information you think you need (repositories come to mind... what else?)
That said, I still need more information on the contents of ids (and such), to verify the model is correct. Here are some initial reactions:
One thing I noticed you mentioned was a desire for this database model to
allow Gump to be distributed.
Correct. This is critical when we start building native code, since we can't assume to have VMware-like virtual machines running all sorts of different OSes on Brutus.
But it's also critical for Java too... kaffe is showing all sorts of weaknesses in the portability of Java code across platforms... I'm sure we might find such weaknesses even more exposed by running it on different architectures (hmmmm, makes me think that not all modules run on all operating systems... hmmmm, this requires a model change)
I like that goal. We can't assume one host can do all builds (although Brutus is doing a fine fine job) so perhaps we could allow different hosts to build and contribute data for individual aspects.
That was my thinking.
Maybe this is a goal to work towards, not focus on now, but I believe that
a "project id" including a host is not correct (it ought to be independent of
the host)
Well, I completely agree that the project IDs should be independent of the host where they are built, but I think that in order to have global uniqueness, we need to have the IDs tied to something that identifies their provenance.
I would welcome project IDs of the form
http://www.apache.org/projects/cocoon
and then
http://www.apache.org/projects/cocoon#v1.0
for a particular released version, or
http://www.apache.org/projects/cocoon#20041210
for a particular packaged snapshot of a project built at a particular time (note: NOT for the gump builds, those *need* to be identified with the host they originated from!)
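To make the ID scheme above concrete, here is a minimal sketch in Python of how such an ID could be split into the project URI and its version or snapshot tag. The function names are hypothetical, and reading an 8-digit fragment as a dated snapshot is only my assumption based on the examples above, not something the model defines.

```python
from urllib.parse import urldefrag


def split_project_id(project_id):
    """Split a project ID into (project URI, tag); tag is None when absent."""
    base, fragment = urldefrag(project_id)
    return base, fragment or None


def is_snapshot(tag):
    """Assumption: an 8-digit tag such as 20041210 denotes a dated snapshot."""
    return tag is not None and len(tag) == 8 and tag.isdigit()


if __name__ == "__main__":
    for pid in (
        "http://www.apache.org/projects/cocoon",
        "http://www.apache.org/projects/cocoon#v1.0",
        "http://www.apache.org/projects/cocoon#20041210",
    ):
        base, tag = split_project_id(pid)
        print(base, tag, is_snapshot(tag))
```

Keeping the version in the fragment means the part before the "#" stays the same for every release and snapshot of a project, so it can act as the host-independent key.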
[Q: Are we comfortable with allowing remote hosts to connect to a central MySQL database, or do we need an intermediary representation and more secure protocol for such?]
MySQL has a triple authentication scheme, which I like very much.
This assumes that you run MySQL in networking mode *and* that you bind to 0.0.0.0 (if you bind only to 127.0.0.1, or if you have networking disabled, you can only connect from localhost).
First, it checks the IP of the machine connecting. If the machine is not listed in the allowed hosts, the connection is dropped. DoS attacks are still possible, but the operation of dropping the connection is pretty fast, so an attack would saturate the bandwidth before doing any damage to the machine itself (there are far worse DoS attacks you can already mount than this, so the risk is practically zero).
Also, given how widely MySQL is used, I'm sure that any buffer-overflow bug in that check would be reported and fixed in no time, and would make so much noise that we'd hear it even if we were all on vacation ;-)
Second, if the IP is listed, it asks for a username and password. If the two match, the user is allowed in.
At this point, the user is used to look up the privileges. It is possible to define such a granular privilege system that a particular user is able only to perform a particular type of query on a particular table.
For example, we can allow hosts to perform "inserts" but not "updates". This means that even if an offender gets control of a machine that performs gump builds, it can only "add" some defective data but not modify the data that the host already dumped in the repository, preserving the validity of history for those tables (and, once we identify the intrusion, we can easily "cleanup" the database just by removing the data from that particular host from that point in time on...).
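The cleanup step described above can be sketched as follows. This is only an illustration in Python using an in-memory SQLite database as a stand-in for MySQL; the table `build_results`, its columns, and the host name `remote1` are all hypothetical and not part of the proposed model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE build_results "
    "(host TEXT, recorded_at TEXT, project TEXT, succeeded INTEGER)"
)
conn.executemany(
    "INSERT INTO build_results VALUES (?, ?, ?, ?)",
    [
        ("brutus", "2004-12-09T02:00:00", "ant", 1),
        ("brutus", "2004-12-10T02:00:00", "ant", 1),
        ("remote1", "2004-12-09T02:00:00", "ant", 1),
        ("remote1", "2004-12-10T02:00:00", "ant", 0),  # written after the intrusion
    ],
)


def purge_host_since(conn, host, since):
    """Delete every row a host inserted at or after `since`; return the row count."""
    cur = conn.execute(
        "DELETE FROM build_results WHERE host = ? AND recorded_at >= ?",
        (host, since),
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    removed = purge_host_since(conn, "remote1", "2004-12-10T00:00:00")
    print("rows removed:", removed)
```

Because a host that can only insert never touches older rows, everything recorded before the cut-off time stays trustworthy, and the purge needs nothing more than the host and a timestamp.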
Note that only the "time modelling" tables will be open for 'insertion' from the outside. Those tables that don't model time (hosts, workspaces, projects, modules) will be maintained by *US* since, if damaged, it wouldn't be possible to "roll back" automatically.
The "granting privileges" operation on MySQL is trivial and can be performed with SQL queries directly.
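As a rough illustration of the insert-only idea, the sketch below (Python) just builds the GRANT statements one would run. The database name `gump`, the table names, the account `builder` and the address 192.0.2.10 are placeholders I made up; the account is assumed to exist already, and the names are interpolated without any escaping, so this is only suitable for values we control ourselves.

```python
def grant_insert_only(database, tables, user, host):
    """Return one GRANT statement per table, allowing INSERT and nothing else."""
    return [
        f"GRANT INSERT ON `{database}`.`{table}` TO '{user}'@'{host}';"
        for table in tables
    ]


if __name__ == "__main__":
    # Hypothetical "time modelling" tables opened to a remote build host.
    for statement in grant_insert_only(
        "gump", ["builds", "runs"], "builder", "192.0.2.10"
    ):
        print(statement)
```

Since no UPDATE or DELETE privilege is granted, a compromised build host could add rows but not rewrite the history already stored.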
Do we need environment, i.e, kaffe or JDK 1.5 or whatever?
Yes, we do, and they are identified by the "packages"... keep in mind that we might decide to build kaffe before building bootstrap-ant ;-)
This creates a problem though: we said we don't want per-workspace dependencies, but if we want to build kaffe and then be able to run bootstrap-ant with it, we need to be able to say so... one thing that comes to mind is to use "polymorphic" dependencies... which is the same thing that Debian does with "virtual packages".
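A minimal sketch of what I mean by a "polymorphic" dependency, in Python. The capability name `java-runtime` and the `PROVIDES` table are invented for illustration; nothing like them exists in the model yet.

```python
# Hypothetical data: which concrete packages provide which virtual capability.
PROVIDES = {
    "kaffe": {"java-runtime"},
    "jdk-1.5": {"java-runtime"},
}


def providers(requirement, available):
    """Return the available packages that satisfy a requirement,
    either by exact name or by providing it as a virtual capability."""
    return sorted(
        pkg
        for pkg in available
        if pkg == requirement or requirement in PROVIDES.get(pkg, set())
    )


if __name__ == "__main__":
    # bootstrap-ant would depend on "java-runtime" rather than on kaffe itself.
    print(providers("java-runtime", {"kaffe", "ant"}))
    print(providers("java-runtime", {"ant"}))
```

The dependency is declared once against the virtual name, and which concrete package satisfies it is decided by what has been built or installed, so no per-workspace dependency is needed.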
Hmmmm
Ought we have hosts/workspaces as mainly informational, with environment (what ought be the only differentiator for two builds of the same stuff, at exact time) as the key to builds?
This works for Java, but wouldn't work for a general build.
Do we need to allow "build output" to be optionally outside of the database, for those of us w/o terabytes to spare?
We can get the gump database hosted and maintained over at ayax.apache.org which has a few terabytes of disk space ;-)
I like "dependency" within the database, but do we need more information (such as optional, etc.) on that?
Yeah, good point: the "type" of the dependency is needed.
Also, one key piece of information in the current object model (which is used to document from) is "cause". We didn't build this thing 'cos X failed to build. That, along with annotations (we build this, but w/o X 'cos it was an optional failed dependency), seem important. Personally I like all the information on this page being available.
http://brutus.apache.org/gump/public/ant/ant/details.html
Well, my strategy in building the database design was that duplication of information should be zero, everything else should be inferred from the model.
Since it is entirely possible to infer the "cause" of a build simply by running a query against the database, that information should not be contained explicitly.
The same thing can be said for the "percentage" of the failures, the FOG factors and all those things.
This is the biggest problem that I have with today's gump historical database: it's mainly a dump of the "post-processing" of today's gump logic... if I want to recalculate fog factors based on a different heuristic, I'm screwed, because only the results were saved, the operands were lost!
Note, since the FOG queries are expensive, those will be cached by Dynagump and eventually placed into another, temporary database, but it's important that we understand that the principle of our model design is that no "heuristics" should be placed, only facts.
And "cause" estimation, as factual as it appears to be, is still a heuristic judgement.
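To show what inferring a "cause" from stored facts could look like, here is a small sketch in Python using an in-memory SQLite database. The two tables, their columns and the sample rows are assumptions made for the example and are not the proposed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dependencies (project TEXT, depends_on TEXT);
    CREATE TABLE results (run TEXT, project TEXT, succeeded INTEGER);
    """
)
# Facts only: who depends on whom, and what happened in each run (sample data).
conn.executemany(
    "INSERT INTO dependencies VALUES (?, ?)",
    [("cocoon", "ant"), ("cocoon", "xerces"), ("ant", "bootstrap-ant")],
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("20041210", "bootstrap-ant", 1),
        ("20041210", "ant", 1),
        ("20041210", "xerces", 0),
        ("20041210", "cocoon", 0),
    ],
)


def failed_dependencies(run, project):
    """List the direct dependencies of `project` that failed in `run`."""
    rows = conn.execute(
        """
        SELECT d.depends_on
        FROM dependencies AS d
        JOIN results AS r ON r.project = d.depends_on
        WHERE d.project = ? AND r.run = ? AND r.succeeded = 0
        ORDER BY d.depends_on
        """,
        (project, run),
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(failed_dependencies("20041210", "cocoon"))
```

The database stores only the dependency edges and the per-run results; whether a failed dependency counts as "the cause" is decided by whoever writes the query, so the heuristic can change later without losing any data.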
Maybe (as a transition) we generate simple pages from the existing object model, but generate a results database (with history) and migrate more and more to it over time.
Sure, I don't mind that.
Thanks, both, for putting this together.
You're welcome.
It's actually a lot of fun :-)
-- Stefano.
