Chris Schneider wrote:
> Gang,
> 
> Pardon my ignorance, but I noticed recently that some URLs were 
> duplicated in my crawldb, once with a terminating slash and once
> without it. For example, both of the following URLs were found in the
> same crawldb:
> 
> http://mail.python.org/mailman/listinfo/ 
> http://mail.python.org/mailman/listinfo
> 
> As I understand it, if the URL refers to a folder on the server, a 
> terminating slash should be added to the URL, since this improves 
> performance of loading the page (presumably because the server
> doesn't have to check to see if it refers to a file). See 
> <http://en.wikipedia.org/wiki/URL_normalization> for more details.
> 
> Given this, shouldn't the default URL normalizer just add a slash to
> the end of a URL that doesn't have a file extension?

There's no way we can tell (from outside) whether a single URL points 
to a directory or not (or whether it could safely be normalized in the 
way you describe).

For example, try:
  http://en.wikipedia.org/wiki/URL_normalization
  http://en.wikipedia.org/wiki/URL_normalization/
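
To make the problem concrete, here is a rough sketch of the proposed rule. 
The function name naive_add_slash is made up for illustration and is not 
part of Nutch or of any existing normalizer; it only shows what "add a 
slash when there is no file extension" would do to the URLs above.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def naive_add_slash(url):
    """Hypothetical rule: append '/' if the last path segment has no extension."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/"):
        return url  # already has a terminating slash
    last_segment = posixpath.basename(path)
    if "." in last_segment:
        return url  # looks like a file name, leave it alone
    # No extension: the rule assumes a directory, which is only a guess.
    return urlunsplit(parts._replace(path=path + "/"))


if __name__ == "__main__":
    for u in ("http://mail.python.org/mailman/listinfo",
              "http://en.wikipedia.org/wiki/URL_normalization"):
        print(u, "->", naive_add_slash(u))
```

The rule rewrites both URLs the same way, although only the server knows 
whether the slashed and unslashed forms name the same resource.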

The paper referenced there [http://www2006.org/programme/item.php?id=p20] 
presents an interesting idea for eliminating redundant URLs from a list 
of URLs.

Currently, duplicate pages can be caught (in search results) by running
dedup on the index. If you have run dedup and still see those two pages
in search results, then please check the hash for each page: dedup only
catches pages with an identical hash, and it is quite common for a web 
site to change a very small part of the HTML content on every request.
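
A small sketch of why exact-hash matching misses such pages. The two page 
bodies are invented, and MD5 over the raw content is an assumption chosen 
for illustration, not a description of what dedup actually computes.

```python
import hashlib


def content_hash(html):
    """Digest of the raw page content (MD5 is an illustrative assumption)."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()


# Two made-up responses that differ only in a per-request comment.
page_a = "<html><body>List info<!-- served 12:00:01 --></body></html>"
page_b = "<html><body>List info<!-- served 12:00:02 --></body></html>"

if __name__ == "__main__":
    print(content_hash(page_a))
    print(content_hash(page_b))
    # A one-character difference yields different digests, so an
    # exact-hash comparison treats these as two distinct pages.
    print("identical hash:", content_hash(page_a) == content_hash(page_b))
```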

It might be a good idea to extend the current functionality with some 
kind of tagging of redundant (by content) URLs in the webdb to prevent 
them from being fetched again.
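
Roughly what I have in mind, as a sketch only. The function tag_redundant 
and the dict-based "db" are hypothetical stand-ins, not existing Nutch 
code, and the exact-hash comparison has the same limitation mentioned 
earlier.

```python
import hashlib


def tag_redundant(pages):
    """Map each URL to the URL it duplicates by content, or None if it is unique so far.

    `pages` is a dict of url -> fetched content (a stand-in for the real db).
    """
    first_seen = {}  # content digest -> first URL seen with that content
    tags = {}
    for url in sorted(pages):  # sorted only to make the result deterministic
        digest = hashlib.md5(pages[url].encode("utf-8")).hexdigest()
        if digest in first_seen:
            tags[url] = first_seen[digest]  # redundant: could be skipped on refetch
        else:
            first_seen[digest] = url
            tags[url] = None
    return tags


if __name__ == "__main__":
    pages = {
        "http://mail.python.org/mailman/listinfo": "same body",
        "http://mail.python.org/mailman/listinfo/": "same body",
    }
    print(tag_redundant(pages))
```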

--
  Sami Siren


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
