Re: closer (was Re: TODOs)

2004-07-14 Thread Mark R. Diggory
Adam R. B. Jack wrote:
Mark R. Diggory [EMAIL PROTECTED] wrote:
 

ASF Repository:
At this time I'm maintaining the following two repository
directories:
   

Thanks for this write-up.
 

I haven't yet explored the idea of getting a repository.apache.org
virtual host going. Up to this point I wanted to see the reuse of the
existing mirroring structure to get these artifacts out to multiple
hosts. I know that Henri and others have been working on some new
download pages and scripts for redirecting to mirrors for downloads. I'd
be really interested in finding out how we can combine this sort of
metadata with a download client so that client downloads are load
balanced across the mirrors. I suspect this would be a server side
redirect mechanism of some sort.
   

In Depot Update we've tinkered with picking one based off this URL's
contents.
http://www.apache.org/dyn/closer.cgi/java-repository/
Are you talking about closer.cgi, or a newer script?
regards
Adam
 

Hmm, I was reviewing the emails and most of it was about building proper 
links to closer.cgi from the various download pages in Jakarta Commons, 
not necessarily improving upon closer.cgi.

http://www.mail-archive.com/commons-dev@jakarta.apache.org/msg43827.html
I think what really needs to happen is that there's a machine-readable 
or parameterized version of this script which returns either a redirect 
or a machine-parsable list of locations (for instance in XML/RDF) 
instead of the human-readable HTML page. The current script doesn't 
offer this, but a template could be written for it to meet this need. At 
first I thought it would be fine to just have a server-side script that 
returns a redirect. But after some thought I realized the client in this 
situation would probably just be a simple HTTP or FTP client, and if the 
connection failed, there would be a need for some sort of mechanism for 
recovery and retry at a new location, in which case getting and parsing 
a list would give the client multiple locations to choose from. I think 
this could really be accomplished by extending the script so that it 
could detect the user-agent or another request header/parameter and 
return the appropriately formatted content instead. Then the client 
would parse the content and iterate over the list until it achieved a 
successful download.
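
A minimal sketch of the client side of this idea, in Python. It assumes a 
hypothetical XML mirror list (the <mirrors>/<mirror url="..."> layout is 
invented here for illustration; closer.cgi does not return anything like 
it today) and simply walks the list until one download succeeds. The 
host names and the function names are placeholders as well.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical response format; the element and attribute names are assumptions.
SAMPLE_MIRROR_LIST = """\
<mirrors>
  <mirror url="http://mirror-a.example.org/dist/"/>
  <mirror url="ftp://mirror-b.example.org/pub/apache/"/>
</mirrors>
"""


def parse_mirrors(xml_text):
    """Return the mirror base URLs in the order the server listed them."""
    root = ET.fromstring(xml_text)
    return [m.get("url") for m in root.iter("mirror") if m.get("url")]


def fetch_from_mirrors(mirrors, path, opener=urllib.request.urlopen, timeout=30):
    """Try each mirror in turn; return the bytes of the first successful download."""
    errors = []
    for base in mirrors:
        url = base.rstrip("/") + "/" + path.lstrip("/")
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, OSError) as exc:
            # Remember the failure and fall through to the next location.
            errors.append((url, exc))
    raise RuntimeError("all mirrors failed: %r" % errors)
```

The opener argument is only there so the retry loop can be exercised 
without touching the network; a real client would leave it at its 
default.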

-Mark

begin:vcard
fn:Mark Diggory
n:Diggory;Mark
org:Harvard University;Harvard MIT Data Center
adr:Harvard University;;G-6 Littauer Center (North Yard);Cambridge;Ma;02138-2901;United States
email;internet:[EMAIL PROTECTED]
title:Software Engineer
tel;work:617 496 7246
tel;fax:617 495 0438
tel;home:617 718 2033 
tel;cell:617 285 4106
url:http://www.hmdc.harvard.edu
version:2.1
end:vcard



Re: ASF Repository, closer.cgi and Depot

2004-07-14 Thread Mark R. Diggory
Erik Abele wrote:
I suspect their views would include what you suggest, that distribution
might give some nominal (c.f. artifact sizes) bandwidth savings and some
CPU savings, but it'd be at a significant loss of 'control' (of
well-behaved clients). Central control over this seems the most appealing.

Agreed.
Since I doubt the CPU cycles are worth saving (or the script would've been
optimised), could we not just change the script to check for some header
from the client, and return XML or some structured text for non-human
browsers? [BTW: viewcvs seems to do this nicely, returning the file if
non-human and the presentation if human (as the browser identifies).]

This sounds promising. You have central control, you get the 
geoip-mapping stuff for free and the CPU cycles as well as the 
bandwidth for (XML-ized) responses are a no-brainer in this case.
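
A rough sketch, in Python, of how a script might pick the response 
format from the request. Everything here is an assumption made for 
illustration: the "format" parameter, the rule of treating any client 
that does not ask for text/html as a non-human client, and the XML 
layout. It is not how closer.cgi behaves.

```python
from xml.sax.saxutils import quoteattr


def choose_format(headers, params=None):
    """Pick "html" or "xml" for a request; an explicit parameter wins."""
    params = params or {}
    requested = params.get("format", "").lower()
    if requested in ("xml", "html"):
        return requested
    # Assumption: browsers advertise text/html, simple download clients do not.
    if "text/html" in headers.get("Accept", ""):
        return "html"
    return "xml"


def render_xml(mirrors):
    """Render the mirror list in a hypothetical machine-readable layout."""
    lines = ["<mirrors>"]
    lines += ["  <mirror url=%s/>" % quoteattr(url) for url in mirrors]
    lines.append("</mirrors>")
    return "\n".join(lines)
```

Keeping the decision in one small function means the geoip-based mirror 
selection can stay untouched; only the final rendering step changes.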

But then this becomes a project spanning both the Repository group and 
the various clients out there (Depot, Maven, etc.). And agreement on the 
GEO_IP request protocol, the XML format, etc. becomes a touchy subject, 
doesn't it?

-Mark



Re: Download Manager

2004-07-13 Thread Mark R. Diggory
Adam R. B. Jack wrote:
Ought we simply download it using Download-Manager w/ trusted? :)
Ought we not simply copy it down to the local repository location?
 

- copy file to a tmp-Directory with tmp-Name
- check tmp file against MD5
- if correct, copy tmp file to local repository with correct name
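
The three steps above can be sketched in Python roughly as follows. This 
is only an illustration: the function names are made up, and the 
temporary file is created inside the repository directory (rather than 
in a separate tmp directory) so that the final step can be a rename 
instead of a second copy.

```python
import hashlib
import os
import shutil
import tempfile


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_verified(source, expected_md5, repo_dir, final_name):
    """Copy source to a temp name, verify its MD5, then move it into place."""
    os.makedirs(repo_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=repo_dir, suffix=".part")
    os.close(fd)
    try:
        shutil.copyfile(source, tmp_path)
        if md5_of(tmp_path) != expected_md5.strip().lower():
            raise ValueError("MD5 mismatch for %s" % final_name)
        final_path = os.path.join(repo_dir, final_name)
        # The file only appears under its real name after verification.
        os.replace(tmp_path, final_path)
        return final_path
    finally:
        # Clean up the temp file if verification or the copy failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```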
   

One thing I can chime in on here is that the mirror folks sure do like 
the idea of uploading to a staging area then copying to the desired 
location. It benefits their server side scripts to do this since an 
upload could take much longer than a server side copy operation. 
I'm not sure how rsync handles partially complete uploaded files, but I 
know the signature scripts Henning is running via cron will generate an 
md5 for a partial file. Then when you upload your md5 they clearly don't 
match. Though this is a shortfall of his script, uploading to a staging 
area just seems more robust in the long run.

2cents,
Mark



Re: duplicate data

2004-03-07 Thread Mark R. Diggory
Sorry for the late response. Currently, I think the major issues are in 
managing the content of java-repository in a responsible manner.

Key issues I can see needing to be addressed are the following.
1.) Get projects to be as responsible for their content in 
java-repository as they are for the content of their project 
directories in dist.

2.) Resolve the only duplication left, which has to do with the fact that 
Avalon runs their own Avalon Repository in /dist/avalon. So its 
contents are currently duplicated with java-repository/ibiblio.

3.) Maintain proper permissions on all the directory contents of the 
repository (group write; group ownership of files and new directories 
should be the user's primary group).

4.) Find a way around the current shortsightedness in Maven where the 
commands executed on the server side are all GNU/Linux ones and do not 
map to their BSD equivalents (BSD md5 = GNU md5sum). So using the 
repository goals in Maven fails to produce proper md5 checksums.

Nicola Ken Barozzi wrote:
Some action items:
1- How can we make mirroring work for both of us? (IIRC Ruper already 
showed it's easy to do, but I need help from Adam, who already tried it.)

2- How does Mark think we can proceed in not making it compulsory to 
physically have jars in a defined location?

These two are a double-whammy, but I think I have a possible solution 
to the whole subject. Currently we think of the Repository as just 
that: a Repository, a physical location for the jars. But what if we 
defined the URLs for a Repository to simply be pointers or addresses 
that, when resolved by a client, point to the proper location of that 
resource? This, in essence, makes the repository into a resolver or a 
naming service.

In my line of work (Digital Libraries) we already have a service that 
accomplishes this task (actually we have 2 competing/complementary 
naming systems).

PURLs (Persistent Uniform Resource Locators)
http://www.purl.org
http://www.oclc.org/research/projects/purl/download.htm
Handles (analogous to publicIds, ISBNs, the Dewey Decimal System, etc.)
http://www.handle.net/
http://www.handle.net/download.html
If you're wondering how PURLs and Handles stack up, here's some 
comparison documentation.
http://www.nclis.gov/govt/assess/handles.html
http://memory.loc.gov/ammem/award/docs/PURL-handle.html
http://web.mit.edu/handle/www/purl-eval.html

So what we are talking about really is a naming system that provides for 
the resolution of registered names to physical locations. Interestingly, 
I think the lack of this sort of separation of Resolution from 
Storage is exactly the issue that is causing friction in our community.

I think it's quite possible that one could completely and transparently 
replace the underlying URL based repository syntax in both Maven and 
other tools with a resolving layer. To clarify this, here are a few 
examples.

1.) An example using PURLs.
http://www.ibiblio.org/maven/xerces/jars/xerces.jar
This is currently a URL pointing to a physical resource on ibiblio.
If this were not a physical resource but a PURL in the PURL naming 
system, then it could redirect the client (using currently existing PURL 
server software) to the appropriate resource (mirrored or not).

http://repository.apache.org/maven/xerces/jars/xerces.jar
would actually resolve (through redirection to)
http://www.apache.org/dist/xml/xerces-j/jars/xerces.jar
and
http://repository.apache.org/xml.apache.org/xerces/2.0/jars/xerces.jar
could also point to
http://www.apache.org/dist/xml/xerces-j/jars/xerces.jar
This provides a layer of flexibility. It solves the issue of projects 
needing to place their content in a specific structure/location, and it 
also solves issues of name changes over time.
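
As a small illustration of the resolver idea (not of how the PURL server 
software works internally), the xerces mapping above can be written as a 
lookup table plus a function that answers with a redirect. The names and 
the target are the ones from the example; the table, the function names 
and the use of a 302 status are assumptions made for this sketch.

```python
# Registered name -> current physical location (values from the xerces example).
RESOLVER_TABLE = {
    "/maven/xerces/jars/xerces.jar":
        "http://www.apache.org/dist/xml/xerces-j/jars/xerces.jar",
    "/xml.apache.org/xerces/2.0/jars/xerces.jar":
        "http://www.apache.org/dist/xml/xerces-j/jars/xerces.jar",
}


def resolve(path, table=RESOLVER_TABLE):
    """Return (status, headers) for a request path, redirecting known names."""
    target = table.get(path)
    if target is None:
        return 404, {}
    return 302, {"Location": target}


def retarget(table, old_location, new_location):
    """Return a new table where names bound to old_location point at new_location."""
    return {
        name: (new_location if location == old_location else location)
        for name, location in table.items()
    }
```

Because clients only ever see the registered names, moving the jar is a 
matter of calling something like retarget() on the table; the names in 
project descriptors stay the same.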

2.) So if we decide that we want to have different groupIDs in Maven 
for a specific project, the naming system maintains the old naming 
structure pointing to the jars as a means for dependent projects to 
still be able to resolve to the resource.

http://www.ibiblio.org/maven/commons-collections/jars/commons-collections-1.0.jar 

We are currently planning to adopt a more hierarchical naming approach:
http://www.ibiblio.org/maven/jakarta.apache.org/commons-collections/jars/commons-collections-1.0.jar
We could (at little cost in both maintenance and disk space) maintain 
the old naming resolution and the new one. In fact, this is the very 
foundation of the PURL system: the old URIs stay persistent over time.

3.) With such a level of redirection, we can also maintain archival 
and production releases of the content without the actual location 
specifier changing. So when Apache retires commons-collections-1.0.jar 
from production and removes it from the mirrors, instead placing it onto 
archives.apache.org, the resolver entry in the PURL database can 
be adjusted to point at the new location.

http://repository.apache.org/maven/commons-collections/jars/commons-collections-1.0.jar 

now points to the following location instead:

MD5 Standards (was Re: MD5 and Mirrors ( was Re: MD5 Hash ))

2004-02-11 Thread Mark R. Diggory
Besides, my current experiments with GNU md5sum (2.0.21) show that the 
sums on the Maven contents aren't verifiable by any tool other than the 
Maven checksum plugin.

If they aren't verifiable by external tools, that's a bad situation. I'm 
going to bring this up on the Maven list too.

http://www.faqs.org/rfcs/rfc1321.html
A quick dig through the RFC suggests a loophole here, as there is 
no reference to what the contents of an md5 signature file should look 
like. It seems more of an inherent suggestion in the implementation itself.

-Mark
Mark R. Diggory wrote:
It's a tough call. Is there any standard for the structure of the md5 
contents out there? I think the Maven team would be keen to play along 
with a standard and yet play along with any configurability as well.

-Mark Diggory
Markus M. May wrote:
Adam is perfectly right about this stuff. There is one more thing we need
to think about. Some repositories treat md5-files differently. The
structure on apache.org is [filename - MD5 Hash]. But on ibiblio
(maven-repository) it is just [MD5 Hash]. So this needs to be somehow
configurable.
One more thing to think about :-)
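
One way a client might cope with both layouts, sketched in Python: 
instead of configuring a format per repository, look for the single 
32-digit hex token in the file and ignore everything around it. This is 
an assumption-laden shortcut, not a standard; it would be confused by a 
file name that itself consists of 32 hex digits, and the function name 
is made up.

```python
import re

# An MD5 digest is 128 bits, i.e. 32 hexadecimal digits.
_MD5_TOKEN = re.compile(r"\b[0-9a-fA-F]{32}\b")


def extract_md5(md5_file_text):
    """Return the lowercase digest found in the text of an .md5 file."""
    matches = _MD5_TOKEN.findall(md5_file_text)
    if len(matches) != 1:
        raise ValueError("expected exactly one MD5 digest, found %d" % len(matches))
    return matches[0].lower()
```

The digest returned here can then be compared with one computed locally, 
whichever layout the repository happens to use.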
--
Mark Diggory
Software Developer
Harvard MIT Data Center
http://www.hmdc.harvard.edu