Niti - currently Nutch does not resolve host names back to their IP address to figure out which ones are virtual hosts. In reality, treating hosts with the same IP as one site may not be correct anyway if small sites are virtually hosted at an ISP.
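
As a rough illustration of what such a check could look like outside Nutch, here is a minimal Python sketch that groups host names by the address they resolve to. This is not Nutch code: the function names `group_hosts_by_ip` and `alias_candidates` and the injectable `resolve` parameter are made up for the example, and only the IPv4 lookup from the standard library is assumed.

```python
import socket
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_hosts_by_ip(
    hosts: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, List[str]]:
    """Map each resolved IPv4 address to the host names that point at it.

    `resolve` is a hypothetical hook so the lookup can be replaced
    (for example by a cached or fake resolver); it defaults to a DNS lookup.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for host in hosts:
        try:
            ip = resolve(host)
        except OSError:
            # Unresolvable host: skip it rather than guess an address.
            continue
        groups[ip].append(host)
    return dict(groups)


def alias_candidates(groups: Dict[str, List[str]]) -> List[List[str]]:
    """Return only the groups that contain more than one host name.

    Sharing an IP is a hint, not proof: unrelated sites hosted at the
    same ISP share addresses too, so page content still has to be compared.
    """
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Calling `alias_candidates(group_hosts_by_ip(hosts))` with the four root host names from the crawl would list them together if they really resolve to one address, but the result only narrows down which hosts are worth comparing by content.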
Figuring out whether a site is the same as another (an alias) requires a little more checking, which is not really required for the default Nutch installation. (BTW, the segment merge tool will remove pages that have the same content, so that should solve the problem after the fact.)

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Niti Witthayawiroj
Sent: Wednesday, February 23, 2005 9:11 AM
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-dev] Why found the unabsolute links from Nutch

Hi Olaf,

I have used the Intranet crawling of Nutch to crawl. The root URLs are:

http://www.l3s.uni-hannover.de/
http://www.l3s.de/
http://www.learninglab.uni-hannover.de/
http://www.learninglab.de/

The domain names of the root URLs above refer to the same IP address (host name aliases). After the crawl completed, I used the WebDBReader command line (bin/nutch readdb <db> -dumplinks) to get the link data for the URLs.

From the dumplinks output, I found that some links are not correct (see the example below). Why does the source page (/morob/Galleries/ER1/pages/09_DSCF0492.html) on the host http://www.l3s.uni-hannover.de have outlinks to pages on the other hosts (http://www.learninglab.de/ and http://www.learninglab.uni-hannover.de/)? In fact, the source page has only 3 outlinks (absolute outlinks), but according to the dumplinks output it has 9 outlinks in total (6 of them are false). The 6 false outlinks point to the same 3 pages as the absolute outlinks, but on the other host names. Is this maybe a problem with the host name aliases, and can you tell me why?

Thanks a lot!
Niti

Date: Mon, 21 Feb 2005 20:50:54 +0100
From: Olaf Thiele <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-dev] Why found the unabsolute links from Nutch
Reply-To: [EMAIL PROTECTED]

Hi Niti,

I don't get your question. Just write it in German and I will post it in English.

Bye
Olaf

On Mon, 21 Feb 2005 05:29:43 -0800 (PST), Niti Witthayawiroj <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have used Nutch to crawl four hosts and the four host names correspond to
> the same IP address. I used the WebDBReader to get the dump links of URLs.
> Why did it find the unabsolute links (pages in one host have links to pages in
> other hosts)?
>
> For example:
>
> from http://www.l3s.uni-hannover.de/morob/Galleries/ER1/pages/09_DSCF0492.html
> to http://www.l3s.uni-hannover.de/morob/Galleries/ER1/index.html
> to http://www.l3s.uni-hannover.de/morob/Galleries/ER1/pages/08_DSCF0493.html
> to http://www.l3s.uni-hannover.de/morob/Galleries/ER1/pages/10_DSCF0499.html
> to http://www.learninglab.de/morob/Galleries/ER1/index.html
> to http://www.learninglab.de/morob/Galleries/ER1/pages/08_DSCF0493.html
> to http://www.learninglab.de/morob/Galleries/ER1/pages/10_DSCF0499.html
> to http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/index.html
> to http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/pages/08_DSCF0493.html
> to http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/pages/10_DSCF0499.html
>
> regards,
> Niti

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
