Re: closer (was Re: TODOs)
Adam R. B. Jack wrote:

Mark R. Diggory [EMAIL PROTECTED] wrote:

ASF Repository: About this time what I'm maintaining is the following two repository directories:

Thanks for this write-up.

I haven't yet explored the idea of getting a repository.apache.org virtual host going. Up to this point I wanted to see the existing mirroring structure reused to get these artifacts out to multiple hosts. I know that Henri and others have been working on some new download pages and scripts for redirecting downloads to mirrors. I'd be really interested in finding out how we can combine this sort of metadata with a download client so that client downloads are load-balanced across the mirrors. I suspect this would be a server-side redirect mechanism of some sort.

In Depot Update we've tinkered with picking one based off this URL's contents.

http://www.apache.org/dyn/closer.cgi/java-repository/

Are you talking about closer.cgi, or a newer script?

regards

Adam

Hmm, I was reviewing the emails, and most of the discussion was about building proper links to closer.cgi from the various download pages in Jakarta Commons, not necessarily about improving closer.cgi itself.

http://www.mail-archive.com/commons-dev@jakarta.apache.org/msg43827.html

I think what really needs to happen is that there's a machine-readable or parameterized version of this script which returns either a redirect or a machine-parsable list of locations (for instance in XML/RDF) instead of the human-readable HTML page. The current script doesn't offer this, but a template could be written for it to meet this need.

At first I thought it would be fine to just have a server-side script that returns a redirect. But after some thought I realized the client in this situation would probably just be a simple HTTP or FTP client, and if the connection failed, it would need some mechanism for recovering and retrying at a new location. In that case, getting and parsing a list would give the client multiple locations to choose from.
I think this could really be accomplished by extending the script so that it detects the user-agent or another request header/parameter and returns appropriately formatted content instead. The client would then parse the content and iterate over the list until it retrieved a successful download.

-Mark

begin:vcard
fn:Mark Diggory
n:Diggory;Mark
org:Harvard University;Harvard MIT Data Center
adr:Harvard University;;G-6 Littauer Center (North Yard);Cambridge;Ma;02138-2901;United States
email;internet:[EMAIL PROTECTED]
title:Software Engineer
tel;work:617 496 7246
tel;fax:617 495 0438
tel;home:617 718 2033
tel;cell:617 285 4106
url:http://www.hmdc.harvard.edu
version:2.1
end:vcard
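A minimal sketch of the client side of this idea, in Python. It assumes the script hands back a plain list of mirror base URLs, one per line; that format, and the names `parse_mirror_list` and `download_from_mirrors`, are hypothetical and not something closer.cgi provides today. The client walks the list and moves on to the next mirror whenever a fetch fails.

```python
from typing import Callable, Iterable, List, Optional
import urllib.request


def parse_mirror_list(text: str) -> List[str]:
    """Parse a hypothetical one-URL-per-line mirror list, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_url(url: str, timeout: float = 10.0) -> bytes:
    """Fetch one URL with the standard library; failures surface as OSError."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_from_mirrors(
    mirrors: Iterable[str],
    path: str,
    fetch: Callable[[str], bytes] = fetch_url,
) -> Optional[bytes]:
    """Try each mirror in order; return the first successful body, else None."""
    for base in mirrors:
        url = base.rstrip("/") + "/" + path.lstrip("/")
        try:
            return fetch(url)
        except OSError:
            # Refused connection, timeout or HTTP error: try the next mirror.
            continue
    return None
```

The `fetch` parameter is injectable so that the retry logic can be exercised without touching the network; a plain redirect from the server would not give the client this second chance.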
Re: ASF Repository, closer.cgi and Depot
Erik Abele wrote:

I suspect their views would include what you suggest: that distribution might yield some nominal (c.f. artifact sizes) bandwidth savings and some CPU savings, but at a significant loss of 'control' (of well-behaved clients). Central control over this seems the most appealing.

Agreed. Since I doubt the CPU cycles are worth saving (or the script would've been optimised), could we not just change the script to check for some header from the client and return XML or some structured text for non-human browsers? [BTW: viewcvs seems to do this nicely, returning the raw file to non-human clients and the presentation page to humans (as the browser identifies itself).]

This sounds promising. You have central control, you get the geoip-mapping stuff for free, and the CPU cycles as well as the bandwidth for (XML-ized) responses are a no-brainer in this case.

But then this becomes a project spanning both the Repository group and the various clients out there (Depot, Maven, etc.). And agreement on the GEO_IP request protocol, the XML format and so on becomes a touchy subject, doesn't it?

-Mark
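A rough sketch of the header check discussed above, in Python. It assumes the decision is made on the HTTP `Accept` header and that the XML uses a `<mirrors>`/`<mirror href="...">` layout; both are assumptions for illustration, since no request protocol or XML format has been agreed. Browsers usually list `text/html` ahead of `application/xml`, so the sketch only answers in XML when an XML type is requested and `text/html` is absent.

```python
import xml.etree.ElementTree as ET
from html import escape
from typing import Mapping, Sequence, Tuple


def wants_xml(headers: Mapping[str, str]) -> bool:
    """True when the client asks for XML and does not also accept text/html."""
    accept = headers.get("Accept", "").lower()
    return "xml" in accept and "text/html" not in accept


def render_mirrors(mirrors: Sequence[str], headers: Mapping[str, str]) -> Tuple[str, str]:
    """Return (content type, body) for the same mirror list in either form."""
    if wants_xml(headers):
        # Hypothetical element and attribute names; no format is standardized.
        root = ET.Element("mirrors")
        for url in mirrors:
            ET.SubElement(root, "mirror", {"href": url})
        return "application/xml", ET.tostring(root, encoding="unicode")
    items = "".join(
        '<li><a href="{0}">{0}</a></li>'.format(escape(url)) for url in mirrors
    )
    return "text/html", "<ul>" + items + "</ul>"
```

Keeping both renderings behind one function is the point: the mirror selection stays central, and only the presentation differs per client.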
Re: Download Manager
Adam R. B. Jack wrote:

Ought we simply download it using Download-Manager w/ trusted? :)

Ought we not simply copy it down to the local repository location?

- copy file to a tmp directory with a tmp name
- check the tmp file against its MD5
- if correct, copy the tmp file to the local repository with the correct name

One thing I can chime in on here is that the mirror folks sure do like the idea of uploading to a staging area and then copying to the desired location. It benefits their server-side scripts, since an upload can take much longer than a server-side copy operation. I'm not sure how rsync handles partially uploaded files, but I know the signature scripts Henning is running via cron will generate an md5 for a partial file. Then when you upload your own md5, the two clearly don't match. Though this is a shortfall of his script, uploading to a staging area just seems more robust in the long run.

2cents,
Mark
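The three steps quoted above (stage under a temporary name, verify the MD5, then move into place) can be sketched in Python as follows. The function names are hypothetical, and one choice here is an assumption rather than something from the thread: the temporary file is created inside the repository directory so that the final rename stays on one filesystem and readers never see a partial file under the real name.

```python
import hashlib
import os
import shutil
import tempfile


def md5_of(path: str) -> str:
    """Hex MD5 of a file, read in chunks so large jars do not fill memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_artifact(source: str, repo_dir: str, name: str, expected_md5: str) -> bool:
    """Copy source to a temp name, verify it, then rename it to repo_dir/name."""
    os.makedirs(repo_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=repo_dir, suffix=".tmp")
    os.close(fd)
    try:
        shutil.copyfile(source, tmp_path)
        if md5_of(tmp_path) != expected_md5.strip().lower():
            return False  # checksum mismatch: nothing is published
        os.replace(tmp_path, os.path.join(repo_dir, name))
        return True
    finally:
        # Only still present if the copy or the check failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```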
Re: duplicate data
Sorry for the late response. Currently, I think the major issues are in managing the content of java-repository in a responsible manner. The key issues I can see needing to be addressed are the following.

1.) Get projects to be as responsible for their content in java-repository as they are for the content of their project directories in dist.
2.) Resolve the only duplication left, which comes from the fact that Avalon runs its own Avalon Repository in /dist/avalon, so its contents are currently duplicated with java-repository/ibiblio.
3.) Maintain proper permissions on all the directory contents of the repository (group write; group ownership of files and new directories should be the user's primary group).
4.) Find a way around the current shortsightedness in Maven, where the commands executed on the server side are all GNU/Linux ones and do not map to BSD (BSD's md5 is the counterpart of md5sum). As a result, using the repository goals in Maven fails to produce proper md5 checksums.

Nicola Ken Barozzi wrote:

Some action items:
1- How can we make mirroring work for both of us? (IIRC Ruper already showed it's easy to do, but I need help from Adam, who already tried it.)
2- How does Mark think we can proceed in not making it compulsory to physically have jars in a defined location?

These two are a double-whammy, but I think I have a possible solution to the whole subject. Currently we think of the Repository as just that: a repository, a physical location for the jars. But what if we defined the URLs for a Repository to simply be pointers or addresses that, when resolved by a client, point to the proper location of that resource? This, in essence, makes the repository into a resolver or a naming service.
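To make the resolver idea concrete, here is a small Python sketch, not a description of any existing PURL or Handle software. It assumes the naming service is nothing more than a lookup table from a logical repository path to a physical URL, answered with an HTTP redirect; the table, the function names and the choice of status code 302 are all illustrative assumptions. The two sample entries use the xerces addresses discussed in this thread.

```python
from typing import Dict, Mapping, Optional, Tuple

# Hypothetical resolver table: logical repository path -> physical location.
RESOLVER_TABLE: Dict[str, str] = {
    "/maven/xerces/jars/xerces.jar":
        "http://www.apache.org/dist/xml/xerces-j/jars/xerces.jar",
    "/xml.apache.org/xerces/2.0/jars/xerces.jar":
        "http://www.apache.org/dist/xml/xerces-j/jars/xerces.jar",
}


def resolve(logical_path: str, table: Mapping[str, str] = RESOLVER_TABLE) -> Optional[str]:
    """Look up the physical location registered for a logical name."""
    return table.get(logical_path)


def redirect_response(
    logical_path: str, table: Mapping[str, str] = RESOLVER_TABLE
) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers): a redirect if the name is known, else 404."""
    target = resolve(logical_path, table)
    if target is None:
        return 404, {}
    return 302, {"Location": target}
```

Moving or renaming a jar then means editing one table entry, while every URL already handed out to clients keeps working.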
In my line of work (Digital Libraries) we already have a service that accomplishes this task (actually we have 2 competing/complementary naming systems):

PURLs (Persistent Uniform Resource Locators)
http://www.purl.org
http://www.oclc.org/research/projects/purl/download.htm

Handles (analogous to publicIds, ISBNs, the Dewey Decimal System, etc.)
http://www.handle.net/
http://www.handle.net/download.html

If you're wondering how PURLs and Handles stack up, here's some comparison documentation.
http://www.nclis.gov/govt/assess/handles.html
http://memory.loc.gov/ammem/award/docs/PURL-handle.html
http://web.mit.edu/handle/www/purl-eval.html

So what we are really talking about is a naming system that resolves registered names to physical locations. Interestingly, I think the lack of this sort of separation of Resolution from Storage is exactly the issue that is causing friction in our community. I think it's quite possible that one could completely and transparently replace the underlying URL-based repository syntax in both Maven and other tools with a resolving layer. To clarify this, here are a few examples.

1.) An example using PURLs.
http://www.ibiblio.org/maven/xerces/jars/xerces.jar
This is currently a URL pointing to a physical resource on ibiblio. If this were not a physical resource but a PURL in the PURL naming system, then it could redirect the client (using currently existing PURL server software) to the appropriate resource (mirrored or not).
http://repository.apache.org/maven/xerces/jars/xerces.jar
would actually resolve (through redirection) to
http://www.apache.org/dist/xml/xerces-j/jars/xerces.jar
and
http://repository.apache.org/xml.apache.org/xerces/2.0/jars/xerces.jar
could also point to
http://www.apache.org/dist/xml/xerces-j/jars/xerces.jar

This provides a layer of flexibility: it solves the issue of projects needing to place their content in a specific structure/location, and it also solves issues of name changes over time.

2.) So if we decide that we want to have different groupIDs in Maven for a specific project, the naming system maintains the old naming structure pointing to the jars, so that dependent projects can still resolve to the resource.
http://www.ibiblio.org/maven/commons-collections/jars/commons-collections-1.0.jar
We are currently planning to adopt a more hierarchical naming approach:
http://www.ibiblio.org/maven/jakarta.apache.org/commons-collections/jars/commons-collections-1.0.jar
We could (at little cost in both maintenance and disk space) maintain the old naming resolution and the new one. In fact, this is the very foundation of the PURL system: the old URIs stay persistent over time.

3.) With such a level of redirection, we can also maintain archival and production releases of the content without the actual location specifier changing. So when Apache retires commons-collections-1.0.jar from production and removes it from the mirrors, placing it on archives.apache.org instead, the resolver entry in the PURL database can be adjusted to point at the new location.
http://repository.apache.org/maven/commons-collections/jars/commons-collections-1.0.jar
now points to the following location instead:
MD5 Standards (was Re: MD5 and Mirrors (was Re: MD5 Hash))
Besides, my current experiments with GNU md5sum (2.0.21) show that the sums on the Maven contents aren't verifiable by any tool other than the Maven checksum plugin. If they aren't verifiable by external tools, that's a bad situation. I'm going to bring this up on the Maven list too.

http://www.faqs.org/rfcs/rfc1321.html

A hard, fast dig through the RFC suggests a loophole here, as there is no reference to what the contents of an md5 signature file should look like. It seems to be more of an inherent suggestion in the implementation itself.

-Mark

Mark R. Diggory wrote:

It's a tough call. Is there any standard for the structure of the md5 contents out there? I think the Maven team would be keen to play along with a standard and yet play along with any configurability as well.

-Mark Diggory

Markus M. May wrote:

Adam is perfectly right about this stuff. There is one more thing we need to think about. Some repositories treat md5 files differently. The structure on apache.org is [filename - MD5 Hash], but on ibiblio (maven-repository) it is just [MD5 Hash]. So this needs to be somehow configurable. One more thing to think about :-)

--
Mark Diggory
Software Developer
Harvard MIT Data Center
http://www.hmdc.harvard.edu
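Since no standard fixes the layout of an md5 file, one alternative to per-repository configuration is a tolerant reader. The Python sketch below is an assumption about how a client could cope, not what Maven or Depot does: it ignores the surrounding layout and pulls out the first standalone run of 32 hex digits. That covers a bare hash, the [filename - MD5 Hash] layout, the `hash  filename` lines written by GNU md5sum, and the `MD5 (file) = hash` lines written by BSD md5. The name `extract_md5` is hypothetical.

```python
import re
from typing import Optional

# A standalone run of exactly 32 hex digits, not part of a longer hex string.
_MD5_HEX = re.compile(r"(?<![0-9a-fA-F])[0-9a-fA-F]{32}(?![0-9a-fA-F])")


def extract_md5(checksum_text: str) -> Optional[str]:
    """Return the first MD5-looking token in lowercase, or None if absent."""
    match = _MD5_HEX.search(checksum_text)
    return match.group(0).lower() if match else None
```

The trade-off is that a file whose own name is 32 hex digits and precedes the hash would be misread, so a strict per-repository format setting remains the safer choice where the layout is known.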