Chris Schneider wrote:
> Gang,
>
> Pardon my ignorance, but I noticed recently that some URLs were
> duplicated in my crawldb, once with a terminating slash and once
> without it. For example, both of the following URLs were found in the
> same crawldb:
>
> http://mail.python.org/mailman/listinfo/
> http://mail.python.org/mailman/listinfo
>
> As I understand it, if the URL refers to a folder on the server, a
> terminating slash should be added to the URL, since this improves
> performance of loading the page (presumably because the server
> doesn't have to check to see if it refers to a file). See
> <http://en.wikipedia.org/wiki/URL_normalization> for more details.
>
> Given this, shouldn't the default URL normalizer just add a slash to
> the end of a URL that doesn't have a file extension?
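
To make the proposed rule concrete, here is a minimal Python sketch of it. This is not Nutch's normalizer: the function name `add_trailing_slash` is hypothetical, and "has no file extension" is assumed to mean "the last path segment contains no dot".

```python
from urllib.parse import urlsplit, urlunsplit


def add_trailing_slash(url):
    """Hypothetical rule: append '/' when the last path segment has no dot.

    Illustration only; this is an assumption about what the proposal
    means, not an existing Nutch API.
    """
    parts = urlsplit(url)
    path = parts.path
    last_segment = path.rsplit("/", 1)[-1]
    # Only touch non-empty paths that do not already end in a slash
    # and whose final segment does not look like a file name.
    if path and not path.endswith("/") and "." not in last_segment:
        path += "/"
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    for u in (
        "http://mail.python.org/mailman/listinfo",
        "http://en.wikipedia.org/wiki/URL_normalization",
    ):
        print(u, "->", add_trailing_slash(u))
```

The rule is purely syntactic: it rewrites any extensionless URL the same way, with no knowledge of how the server actually treats the two forms.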
There is no way to tell from the outside whether a given URL points to a
directory, or whether it can safely be normalized in the way you describe.
For example, try:

http://en.wikipedia.org/wiki/URL_normalization
http://en.wikipedia.org/wiki/URL_normalization/

The referenced paper [http://www2006.org/programme/item.php?id=p20] presents
an interesting idea for eliminating redundant URLs from a list of URLs.

Currently, duplicate pages can be removed from search results by running
dedup on the index. If you have run dedup and still see both pages in the
search results, please check the hash of each page: dedup only catches pages
with identical hashes, and it is quite common for a web site to change a
very small part of the HTML content on every request.

It might be a good idea to extend the current functionality with some kind
of tagging of redundant (by content) URLs in the webdb, to prevent them from
being fetched again.

--
Sami Siren

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
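
The hash comparison mentioned in the reply can be illustrated with a small Python sketch. It is only a model of the idea, not Nutch's dedup code: the helper names `content_hash` and `dedup_by_hash` are hypothetical, and MD5 over the raw page text is an assumption chosen for illustration.

```python
import hashlib


def content_hash(html):
    """Hash the full page content (MD5 is an illustrative assumption)."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()


def dedup_by_hash(pages):
    """Keep one URL per distinct content hash.

    `pages` maps URL -> page content. The first URL seen for each
    hash is kept; later URLs with identical content are dropped.
    """
    kept = {}
    for url, html in pages.items():
        kept.setdefault(content_hash(html), url)
    return sorted(kept.values())


if __name__ == "__main__":
    identical = {
        "http://example.com/dir": "<html>same</html>",
        "http://example.com/dir/": "<html>same</html>",
    }
    print(dedup_by_hash(identical))  # one URL survives

    # A per-request difference (here a hypothetical timestamp comment)
    # changes the hash, so both URLs survive exact-hash dedup.
    varying = {
        "http://example.com/dir": "<html>same<!-- 10:00:01 --></html>",
        "http://example.com/dir/": "<html>same<!-- 10:00:02 --></html>",
    }
    print(dedup_by_hash(varying))
```

The second case shows why exact-hash dedup can miss pages that a reader would consider duplicates.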
