Here you are.

The patch normalizes the file part of a URL as follows:

1. "/aa/../" is replaced by "/"

This is repeated step by step until the URL no longer changes. This
ensures that "/aa/bb/../../" is replaced by "/" as well.
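
To illustrate rule 1, here is a minimal sketch of the repeat-until-stable idea. It is not the code from the attached BasicUrlNormalizer.java; the class name, method name and regular expression are assumptions made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the attached patch.
final class DotDotCollapseSketch {

    // One "/segment/../" pair, where the segment itself is not "..".
    private static final Pattern PARENT_PAIR =
        Pattern.compile("/(?!\\.\\./)[^/]+/\\.\\./");

    // Replace one pair at a time until the string stops changing.
    static String collapseParentSegments(String file) {
        String previous;
        do {
            previous = file;
            file = PARENT_PAIR.matcher(file).replaceFirst("/");
        } while (!file.equals(previous));
        return file;
    }

    public static void main(String[] args) {
        System.out.println(collapseParentSegments("/aa/../"));       // prints "/"
        System.out.println(collapseParentSegments("/aa/bb/../../")); // prints "/"
    }
}
```

A single pass would leave "/aa/bb/../../" as "/aa/../", which is why the loop runs until nothing changes.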

2. A leading "/../" is replaced by "/"

This is open to discussion, but I think URLs like
http://www.foo.com/../foo.html should return a 404 error.
If a web server does not return 404 (bad server configuration!) and instead
forwards to http://www.foo.com/foo.html, the problem I
described at the start of this thread will occur.
So we should replace a leading "/../" with "/".
The patch also covers the following case, where each URL is reduced to
the one below it:

http://www.foo.com/aa/../../foo.html
http://www.foo.com/../foo.html
http://www.foo.com/foo.html
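
The two rules combined could look roughly like the sketch below. Again, this is only an illustration and not the attached patch; the class, the method names and the use of java.net.URI to split off the file part are assumptions.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

// Hypothetical sketch, not the attached patch.
final class LeadingDotDotSketch {

    // Rule 1: one "/segment/../" pair, where the segment is not "..".
    private static final Pattern PARENT_PAIR =
        Pattern.compile("/(?!\\.\\./)[^/]+/\\.\\./");

    // Rule 2: a "/../" at the very start of the file part.
    private static final Pattern LEADING_PARENT = Pattern.compile("^/\\.\\./");

    // Apply both rules repeatedly until the file part stops changing.
    static String normalizeFile(String file) {
        String previous;
        do {
            previous = file;
            file = PARENT_PAIR.matcher(file).replaceFirst("/");
            file = LEADING_PARENT.matcher(file).replaceFirst("/");
        } while (!file.equals(previous));
        return file;
    }

    // Split the URL, normalize only the path, and put it back together.
    static String normalize(String url) {
        URI u = URI.create(url);
        try {
            return new URI(u.getScheme(), u.getAuthority(),
                           normalizeFile(u.getPath()),
                           u.getQuery(), u.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // prints "http://www.foo.com/foo.html"
        System.out.println(normalize("http://www.foo.com/aa/../../foo.html"));
    }
}
```

For "/aa/../../foo.html" the first pass removes "/aa/../" and then the leading "/../", so the intermediate form "/../foo.html" never leaves the loop.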


I hope I didn't miss anything and that the code is useful.

Cheers

Sven


> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On 
> Behalf Of CC Chaman
> Sent: Saturday, 13 November 2004 06:52
> To: [EMAIL PROTECTED]
> Subject: RE: [Nutch-dev] url normalization
> 
> Sven:
> 
> Yes...I would say simply attach the files. One of the 
> committers should add it to CVS in the next day or two.
>  
> 
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On 
> Behalf Of Sven Wende
> Sent: Friday, November 12, 2004 8:11 PM
> To: [EMAIL PROTECTED]
> Subject: [JNK] RE: [Nutch-dev] [SPAM] url normalization
> 
> I have developed a patch as well as some tests.
> 
> May I send the two updated files to the list so that 
> someone can review and commit them? 
> 
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of 
> > Luke Baker
> > Sent: Friday, 12 November 2004 15:23
> > To: [EMAIL PROTECTED]
> > Subject: Re: [Nutch-dev] [SPAM] url normalization
> > 
> > On 11/12/2004 09:02 AM, Matthias Jaekle wrote:
> > > Hi,
> > > I had this problem with old nutch versions.
> > > Did you checkout the newest nutch version from cvs?
> > > This should be fixed in the current version.
> > > Matthias
> > > 
> > 
> > I don't see any code in the BasicUrlNormalizer that would do this.  Is 
> > it possible that what was fixed for you didn't have to do with URL 
> > normalization but rather URL parsing?  Meaning for you, Nutch was 
> > previously not "parsing" the URLs properly when it was encountering 
> > them?
> > 
> > I believe code to normalize these URLs should be put in 
> > BasicUrlNormalizer.java (and add relevant tests).
> > 
> > 
> > Luke Baker
> > 
> > 
> > -------------------------------------------------------
> > This SF.Net email is sponsored by:
> > Sybase ASE Linux Express Edition - download now for FREE LinuxWorld 
> > Reader's Choice Award Winner for best database on Linux.
> > http://ads.osdn.com/?ad_id=5588&alloc_id=12065&op=click
> > _______________________________________________
> > Nutch-developers mailing list
> > [EMAIL PROTECTED]
> > https://lists.sourceforge.net/lists/listinfo/nutch-developers
> > 
> > 

Attachment: BasicUrlNormalizer.java
Description: Binary data

Attachment: TestBasicUrlNormalizer.java
Description: Binary data
