Wow, I'll try to be concise!
Sander Striker wrote:
I understand. But it has to be fairly mature before one can
deploy and recommend using it to the PMCs. Also, you can't
force all the projects to use it. For this you need some way
of hand-maintaining 'shadow' packages in java-repository,
Maven exists and is used by many Apache projects. I am in no way promoting any sort of "requirement" that projects be forced to use it. I am primarily concerned with maintaining the content properly for the projects that do use it, and with ensuring that the Apache Repository structure is convergent and that future versions of Maven are able to support it. I approach this as someone who is and will be a user of both, not as a developer of one or the other.
Currently the duplication is minimal because projects that distribute actual jars are already using Maven specifically; otherwise they package into archives (zip/tar).
That would mean that this entire area would have to be rw to all groups producing releases that are to be in there. This kind of means apcvs group ownership, which I don't really fancy doing. The other way around, with control and access of each project's dist/ area separated, and symlinking to that from java-repository, seems a bit sa[fn]er to me.
Ultimately we are seeking a convergence here between what the repository folks want to see, the maven users want to see and the infrastructure folks want to see.
1.) For the repository (and Maven) folks, we want to see the contents of dist become standardized according to the Repository URI specification. This means "all" distributables (java or not) are organized according to this specification.
But this is fairly utopic at this point, no? Is the Repository URI spec stable? Is the tool mature?
I may be able to speak only for myself at this point.
The spec is actually very "independent" of the clients/tools that may make use of it. Using Maven as an example, the tool is becoming mature, not in that it has officially gone 1.0, but that the usage (especially for java projects at apache) is becoming very prevalent. It represents a popular base of users that need accessibility to the repository.
It may be utopic in that I/We are working to unify disparate groups and existing resources into a standardized directory structure. But ultimately such a goal can only benefit Apache as a whole.
2.) For Maven users, no matter what happens, we need to maintain a functioning repository that works with the existing version of Maven.
Isn't the repository format versioned? Can Maven advise the user to upgrade?
Yes, the user can upgrade, and Maven can release new versions. On the developer side, I and others may act as a "lobbying" force on which repository formats Maven may/will be able to support. But I digress. The point of this statement is that there is a currently working Maven repository with jars that people do use, and I am working to keep it functioning while we evolve "java-repository" through its various refactorings. This became obvious to me when, for example, I placed the bad (non-rsyncable) symlinks in java-repository and reports started popping up on the Maven IRC channel that access attempts on the symlinks were throwing 404s on ibiblio. So there is a flow of content here.
3.) For Infrastructure, all this needs to be properly secured and maintained according to Apache standards.
I need a lot of help here. If I'm not doing something up to par, I need to hear about it. The help you've provided in this area is greatly appreciated.
The java-repository structure is broken down into
this would mean each project would need to maintain a separate set of symlinks for "jars", "distributables", "...".
I'm assuming you are stating this as fact, correct?
Yes, it's just the way it's implemented at this point. Not necessarily the way it will be in the future.
Given a 'regular' release in .../dist/<project>/, would Maven be able to automate the creation of the symlinks?
It already automates a lot of symlinks. Symlinks provide a way to "resolve" what is called a SNAPSHOT to an existing versioned release.
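A minimal sketch of the SNAPSHOT idea described above, assuming a SNAPSHOT is just a symlink sitting next to the versioned file it resolves to. The function names and the `example-SNAPSHOT.jar` naming are hypothetical and are not taken from Maven itself.

```python
import os
import tempfile


def point_snapshot_at(repo_dir, snapshot_name, release_name):
    """Make repo_dir/snapshot_name a relative symlink to release_name."""
    link_path = os.path.join(repo_dir, snapshot_name)
    if os.path.lexists(link_path):
        os.remove(link_path)  # replace a stale link left by an earlier build
    # A relative target keeps the link valid after the tree is copied elsewhere.
    os.symlink(release_name, link_path)
    return link_path


def resolve_snapshot(link_path):
    """Return the file name the SNAPSHOT link currently points at."""
    return os.path.basename(os.path.realpath(link_path))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as repo:
        # Hypothetical artifact names, for illustration only.
        open(os.path.join(repo, "example-1.0.jar"), "wb").close()
        link = point_snapshot_at(repo, "example-SNAPSHOT.jar", "example-1.0.jar")
        print(resolve_snapshot(link))
```

Because the link target is relative, it stays valid when rsync copies symlinks as symlinks (`-l`) to a mirror.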
Only when someone releases a build that is not intrinsically different from the last, i.e. only differing in file name, do we currently get duplicates. This is something that would/should be automated: test the directory contents and symlink properly if the md5s are identical. I've been working on simple ant scripts to do this.
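The duplicate check described above could look roughly like the following. This is a Python sketch under stated assumptions, not the ant scripts mentioned: it assumes duplicates live in the same directory, that checksum sidecar files end in `.md5`, and that the first name in sorted order should remain the real file. `symlink_duplicates` is a hypothetical name.

```python
import filecmp
import hashlib
import os


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def symlink_duplicates(directory):
    """Replace byte-identical regular files in one directory with symlinks.

    The first name in sorted order is kept as the real file; later files
    with the same content become relative symlinks to it. Returns a list
    of (link_name, target_name) pairs.
    """
    first_seen = {}  # digest -> name of the file kept as the original
    replaced = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip existing symlinks, subdirectories and checksum sidecar files.
        if os.path.islink(path) or not os.path.isfile(path) or name.endswith(".md5"):
            continue
        digest = md5_of(path)
        original = first_seen.get(digest)
        if original is None:
            first_seen[digest] = name
            continue
        # Matching digests are a strong hint; compare bytes before deleting.
        if filecmp.cmp(os.path.join(directory, original), path, shallow=False):
            os.remove(path)
            os.symlink(original, path)  # relative link within the directory
            replaced.append((name, original))
    return replaced
```

Run periodically over each jars directory, this would leave one real file per distinct content and relative symlinks for the rest.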
The dist directory maintainers and the mirrors out there represent a "control" on the whole situation: if it doesn't work for them, then it's not realistic as a strategy.
I'm assuming that you mean that Maven is the tool implementing the "control"?
No, I mean "control" as in "test group". The whole "dist/mirrors" effort represents the group that is a "control", in that if we create a directory architecture that turns out to be very problematic for you mirroring guys, then we've failed (and we'll probably hear about it from you guys). You're the "optimization force".
Is Maven using the mirrors today, like getting the list of active mirrors from the main site and finding the closest? Or is it only using the main site and perhaps ibiblio?
Currently, all Maven clients use www.ibiblio.org/maven to retrieve content.
So basically, the entire .../dist/java-repository directory is not being used at all. And all the while we are pushing all this data to all of our mirrors (~200).
No, it is being used. All Apache projects that use Maven are being instructed to publish to java-repository; its contents (plus others) are aggregate mirrored on www.ibiblio.org/maven at this time. This way, we control the vertical and horizontal when it comes to what is released into Maven and what other projects can build on.
www.ibiblio.org is also a mirror of /java-repository for all its Apache content.
Just for my clarity, there are non-ASF packages distributed in /java-repository on ibiblio? As in, ibiblio has a java-repository which contains more than the one on www.apache.org/dist/java-repository?
Yes, that's my point above.
Actually Maven users DO NOT go to www.apache.org/dist/java-repository to download files, and only Apache developers can publish to www.apache.org/dist/java-repository.
Erm, I'm confused. If no one is going to pull things from there, why do we have/need it?
Specifically and most importantly, to maintain control over what Apache projects release into ibiblio (and to other mirrors).
Which server is used is currently based on the configuration of the Maven client; servers currently do not maintain any capability to hand the client off to another mirror. I think that in the future, as the Repository comes into existence along with machine-readable metadata or mechanisms for directing clients to mirrors, clients like Maven will implement such capabilities.
That looks like a priority then. It will actually make all the mirroring worth it. And it should help the user as well, since closer hosts usually mean quicker downloads.
Yesssss, we need to make our tools take advantage of it.
When it comes to things like the ibiblio maven repository, it would only maintain full version releases of apache projects.
Can you explain why ibiblio is special here? I mean, what you describe is what is supposed to be on all the mirrors right?
Just because it is the "default" repository used by the Maven Client.
I'd say the default should be www.apache.org, and from there it should select the 'best' mirror. Note that for any mirror use, and that includes ibiblio, integrity checking is a must for an application like Maven.
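A rough sketch of what mirror use with integrity checking could mean for a client, assuming the expected MD5 is obtained separately from a trusted origin. `fetch_verified` and its injected `fetch` callable are hypothetical names and do not describe an existing Maven feature.

```python
import hashlib


def fetch_verified(base_urls, relative_path, expected_md5, fetch):
    """Try each base URL in order; return the first payload whose MD5 matches.

    `fetch` is any callable that takes a URL and returns bytes, raising
    OSError on failure. It is injected so the sketch needs no network.
    """
    for base in base_urls:
        url = base.rstrip("/") + "/" + relative_path.lstrip("/")
        try:
            payload = fetch(url)
        except OSError:
            continue  # mirror unreachable or file missing; try the next one
        if hashlib.md5(payload).hexdigest() == expected_md5.lower():
            return url, payload
        # Checksum mismatch: treat this copy as untrusted and move on.
    raise ValueError("no mirror served a copy matching the expected checksum")
```

A real client could pass a small wrapper around `urllib.request.urlopen` as `fetch`, since `urllib.error.URLError` is a subclass of `OSError`. As noted later in the thread, an MD5 only detects corruption; a PGP signature is the stronger check against tampering.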
It's a struggle because www.apache.org currently doesn't represent the canonical contents for Maven itself. I'm not sure how positive the Maven group is about "distributed" retrieval, but it is something I see as very important.
Can www.apache.org automatically reroute a download request to a mirrored location? If so, what is the service/tool/script that supports this sort of functionality (I suspect it's some sort of Apache module or CGI script)?
And the only publishing of jars by actual humans (Release Managers) would be the full releases onto
Symlinks I hope. Mirrors handle symlinks efficiently, that is, if they follow our rsync instructions.
The only mirroring that would be done would be via:
The 'only mirroring that would be done' equals pushing everything in there out to approximately 200 mirrors. And then it isn't used.
But we do want to reach a point where it is used via tools. I'd suspect that, if the redirection can be automated, then it's eventually going to be totally transparent to Maven as a client.
Er no. We advertise this as the rsync options to use:
rsync -rtlzv --delete www.apache.org::apache-dist /local/path/to/mirror
-r recurse into directories
-t preserve times
-l copy symlinks as symlinks
-z compress file data
-v increase verbosity
--delete delete files that don't exist on sender
Note that we do _not_ ask to include:
-p preserve permissions
-o preserve owner (root only)
-g preserve group
IOW, on the mirrors, the tree exists as if you cp'd it (ownership and permission wise). rsync is even sensitive to the umask setting.
Great, I stand corrected (and feel much better for it!)
I believe this creates a problem in that I cannot simply create symlinks from java-repository/excalibur-component/ to avalon/excalibur-component/ as they will not be followed by rsync.
This is simply not true, per above. You are probably thinking of the 'SymLinksIfOwnerMatch' option we are recommending in the httpd.conf of the mirrors. This is not a problem, since the copy of dist/ on the mirrors will in its entirety have the same owner. On www.apache.org we have 'FollowSymLinks' enabled, so it's not a problem there either.
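To sanity-check a tree against the 'SymLinksIfOwnerMatch' rule mentioned above (httpd follows a link only when the link and its target have the same owner), something like this sketch could be run on a mirror copy. `links_with_owner_mismatch` is a hypothetical helper name.

```python
import os


def links_with_owner_mismatch(root):
    """Yield symlinks under root that SymLinksIfOwnerMatch would refuse.

    A link is reported when its own owner differs from the owner of the
    file it resolves to, or when the target does not exist at all.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk lists symlinks to directories in dirnames without
        # descending into them, so both lists must be checked.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            link_uid = os.lstat(path).st_uid  # owner of the link itself
            try:
                target_uid = os.stat(path).st_uid  # owner of the resolved target
            except OSError:
                yield path  # dangling link
                continue
            if link_uid != target_uid:
                yield path
```

On a mirror populated with the advertised rsync options the whole tree has one owner, so this should yield nothing except dangling links.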
Wow, I'm very relieved now...I'll stop making such "brash" assumptions.
However, the other 50% of duplicates within the java-repository directory should be properly alleviated with symlinking. I can work on this as I now (as of a couple of days ago) own all the files :-). I will start working on a script I can run periodically which will accomplish this.
Like I said, don't worry about the ownership. The only thing you need to worry about is setting your umask to 002, so that other members of the group are able to do modifications like you ;).
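A small illustration of why the umask matters here, assuming a POSIX system without default ACLs: with a umask of 002 a newly created file stays group-writable, while the common default of 022 removes group write. `create_with_umask` is a hypothetical helper.

```python
import os
import stat
import tempfile


def create_with_umask(path, umask):
    """Create an empty file under the given umask; return its permission bits."""
    previous = os.umask(umask)
    try:
        # Request rw for everyone (0o666); the umask clears bits from that.
        fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_EXCL, 0o666)
        os.close(fd)
    finally:
        os.umask(previous)  # always restore the caller's umask
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        for mask in (0o002, 0o022):
            mode = create_with_umask(os.path.join(scratch, "f%o" % mask), mask)
            print("umask %03o -> mode %03o" % (mask, mode))
```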
I'll ask Henk to disable the checks for presence of md5 in the dist/java-repository, since that doesn't seem to be applicable there. It seems to me that you do want to do some verification in maven, but you are probably storing signature information somewhere in the maven 'database'?
No, it is in the directory structure (no db), and md5s should exist next to the files. There is a bug in Maven caused by the fact that on BSD checksums are generated by "md5", not "md5sum" as on Linux. This needs to be addressed; for example, you can see my md5 was bad on the math jar (which I just fixed).
Does this mean you are running maven on minotaur? Or was it that the
platform of the one who ran maven was BSD?
It's not like Maven is run on minotaur. Maven is run on a client; it establishes an ssh session and performs the md5sum command on minotaur. But the client-side script that does this isn't configurable and is hardcoded to call Linux/GNU md5sum. I will submit a bug to Maven to make it more configurable. Currently all md5 checksums generated using Maven are broken because of this; I think others have recognized this and generate them by hand on their own.
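Part of the portability problem is that the two tools print different line formats: BSD `md5` writes `MD5 (name) = digest`, while GNU `md5sum` writes `digest  name`. A tolerant parser along these lines could accept either. This is only a sketch; `parse_md5_line` is a hypothetical name, and it ignores the backslash-escaped file names GNU md5sum can emit.

```python
import re

# BSD md5:     MD5 (math-1.0.jar) = d41d8cd98f00b204e9800998ecf8427e
_BSD_LINE = re.compile(r"^MD5 \((?P<name>.+)\) = (?P<digest>[0-9a-fA-F]{32})$")
# GNU md5sum:  d41d8cd98f00b204e9800998ecf8427e  math-1.0.jar
# (the second separator character is a space in text mode, "*" in binary mode)
_GNU_LINE = re.compile(r"^(?P<digest>[0-9a-fA-F]{32}) [ *](?P<name>.+)$")


def parse_md5_line(line):
    """Return (digest, file_name) from one line of BSD md5 or GNU md5sum output."""
    text = line.strip()
    for pattern in (_BSD_LINE, _GNU_LINE):
        match = pattern.match(text)
        if match:
            return match.group("digest").lower(), match.group("name")
    raise ValueError("unrecognised checksum line: %r" % line)
```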
May I also suggest PGP signatures? You could verify if the package is signed by a trusted source, and if the package integrity is not compromised (an md5 is easily replaced, a PGP sig is somewhat harder ;).
Yes, I've been working on a signature plugin that basically uses GnuPG on the server. It's approached the same way as the md5 stuff, but with more configurable parameters for the command and options to be called. The challenge is that with GPG you don't want to store the private key on minotaur, so the file would need to be signed on the client side and then published.
See http://www.apache.org/~henkp/sig/ for stats.
Sorry for the, I imagine, somewhat critical feedback. FWIW I do appreciate the project's goal.
No, no, it's all very important to address.
Cheers,
-Mark

--
Mark Diggory
Software Developer
Harvard MIT Data Center
http://www.hmdc.harvard.edu