Here you are. The patch normalizes the file part of a URL in the following manner:

1. "/aa/../" will be replaced by "/".
   This is done step by step until the URL doesn't change anymore. This ensures that "/aa/bb/../../" will be replaced by "/", too.

2. A leading "/../" will be replaced by "/".
   We could discuss that, but I think URLs like http://www.foo.com/../foo.html should return a 404 error. If a web server does not return 404 (bad server configuration!) and forwards to http://www.foo.com/foo.html, the problem I described in the opener of this thread will occur. So we should replace these leading "/../" by "/".

The patch also covers the following case, in which each URL is reduced to the next one:

http://www.foo.com/aa/../../foo.html
http://www.foo.com/../foo.html
http://www.foo.com/foo.html

I hope I didn't miss a point and that the code is useful.

Cheers
Sven

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of CC Chaman
> Sent: Saturday, 13 November 2004 06:52
> To: [EMAIL PROTECTED]
> Subject: RE: [Nutch-dev] url normalization
>
> Sven:
>
> Yes...I would say simply attach the files. One of the committers should add it to CVS in the next day or two.
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Sven Wende
> Sent: Friday, November 12, 2004 8:11 PM
> To: [EMAIL PROTECTED]
> Subject: [JNK] RE: [Nutch-dev] [SPAM] url normalization
>
> I have developed a patch as well as some tests.
>
> May I send the two updated files to the list so that someone can review and commit?
>
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Luke Baker
> > Sent: Friday, 12 November 2004 15:23
> > To: [EMAIL PROTECTED]
> > Subject: Re: [Nutch-dev] [SPAM] url normalization
> >
> > On 11/12/2004 09:02 AM, Matthias Jaekle wrote:
> > > Hi,
> > > I had this problem with old nutch versions.
> > > Did you checkout the newest nutch version from cvs?
> > > This should be fixed in the current version.
> > > Matthias
> >
> > I don't see any code in the BasicUrlNormalizer that would do this. Is it possible that what was fixed for you didn't have to do with URL normalization but rather URL parsing? Meaning for you, Nutch was previously not "parsing" the URLs properly when it was encountering them?
> >
> > I believe code to normalize these URLs should be put in BasicUrlNormalizer.java (and add relevant tests).
> >
> > Luke Baker
> >
> > -------------------------------------------------------
> > This SF.Net email is sponsored by:
> > Sybase ASE Linux Express Edition - download now for FREE LinuxWorld Reader's Choice Award Winner for best database on Linux.
> > http://ads.osdn.com/?ad_id=5588&alloc_id=12065&op=click
> > _______________________________________________
> > Nutch-developers mailing list
> > [EMAIL PROTECTED]
> > https://lists.sourceforge.net/lists/listinfo/nutch-developers

Attachment: BasicUrlNormalizer.java (binary data)
Attachment: TestBasicUrlNormalizer.java (binary data)
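
For readers without the attachments, the two rules described in this message can be sketched roughly as follows. This is not the attached patch; the class name PathNormalizerSketch and the method normalizePath are made up for illustration. The sketch assumes it is handed only the file part of the URL (for example "/aa/../foo.html") and does nothing special for "." segments or query strings.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the code from BasicUrlNormalizer.java.
final class PathNormalizerSketch {

    // Rule 1: a real segment (anything except "..") followed by "/../".
    private static final Pattern SEGMENT_THEN_PARENT =
        Pattern.compile("/(?!\\.\\./)[^/]+/\\.\\./");

    // Rule 2: a "/../" at the very start of the file part.
    private static final Pattern LEADING_PARENT =
        Pattern.compile("^/\\.\\./");

    // Applies both rules one step at a time until the string stops changing.
    static String normalizePath(String path) {
        String previous;
        String current = path;
        do {
            previous = current;
            current = SEGMENT_THEN_PARENT.matcher(current).replaceFirst("/");
            current = LEADING_PARENT.matcher(current).replaceFirst("/");
        } while (!current.equals(previous));
        return current;
    }
}
```

Calling PathNormalizerSketch.normalizePath("/aa/../../foo.html") first drops "/aa/../", leaving "/../foo.html", and then drops the leading "/../", which matches the three-URL case listed in the message. Replacing one occurrence per pass mirrors the "step by step" wording; a segment-stack approach would be an alternative design.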