Hi Olaf,
I used Nutch's intranet crawling with the following
root URLs:
http://www.l3s.uni-hannover.de/
http://www.l3s.de/
http://www.learninglab.uni-hannover.de/
http://www.learninglab.de/
The host names of these root URLs all refer to the
same IP address (host name aliases). After the crawl
completed, I used the WebDBReader command line
(bin/nutch readdb <db> -dumplinks) to dump the links
between the URLs.
From the link dump, I found that some links are not
correct (see the example below). Why does the source
page (/morob/Galleries/ER1/pages/09_DSCF0492.html) on
the host http://www.l3s.uni-hannover.de have outlinks
to pages on the other hosts
(http://www.learninglab.de/ and
http://www.learninglab.uni-hannover.de/)?
In fact, the source page has only 3 outlinks (absolute
outlinks), but according to the link dump it has 9
outlinks in total (6 of them are false). The 6 false
outlinks point to the same 3 pages as the absolute
outlinks, but under the other host names.
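The 3-versus-9 count can be checked with a small script. The following is only a sketch in Python, assuming the source URL and the nine target URLs from the example quoted further down have been copied by hand into a list; the helper names split_outlinks and paths are hypothetical and not part of Nutch.

```python
from urllib.parse import urlsplit

# Source page and outlinks, copied from the link dump example in this thread.
SOURCE = "http://www.l3s.uni-hannover.de/morob/Galleries/ER1/pages/09_DSCF0492.html"
OUTLINKS = [
    "http://www.l3s.uni-hannover.de/morob/Galleries/ER1/index.html",
    "http://www.l3s.uni-hannover.de/morob/Galleries/ER1/pages/08_DSCF0493.html",
    "http://www.l3s.uni-hannover.de/morob/Galleries/ER1/pages/10_DSCF0499.html",
    "http://www.learninglab.de/morob/Galleries/ER1/index.html",
    "http://www.learninglab.de/morob/Galleries/ER1/pages/08_DSCF0493.html",
    "http://www.learninglab.de/morob/Galleries/ER1/pages/10_DSCF0499.html",
    "http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/index.html",
    "http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/pages/08_DSCF0493.html",
    "http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/pages/10_DSCF0499.html",
]


def split_outlinks(source, outlinks):
    """Hypothetical helper: separate outlinks on the source host from the rest."""
    source_host = urlsplit(source).hostname
    same_host, other_host = [], []
    for link in outlinks:
        if urlsplit(link).hostname == source_host:
            same_host.append(link)
        else:
            other_host.append(link)
    return same_host, other_host


def paths(links):
    """Hypothetical helper: the set of URL paths, ignoring the host part."""
    return {urlsplit(link).path for link in links}


same_host, other_host = split_outlinks(SOURCE, OUTLINKS)
print(len(same_host), "outlinks on the source host")
print(len(other_host), "outlinks on other hosts")
# True if every other-host outlink repeats a path that already exists
# on the source host, i.e. it differs only in the host name.
print("other-host paths repeat same-host paths:",
      paths(other_host) <= paths(same_host))
```

If the other-host outlinks differ from the same-host ones only in the host name, that would be consistent with the alias suspicion, although the script itself says nothing about why the link dump contains them.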
Could the problem be related to the host name aliases?
Can you tell me why this happens?
Thanks a lot!
Niti
Date: Mon, 21 Feb 2005 20:50:54 +0100
From: Olaf Thiele <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-dev] Why found the unabsolute links from Nutch
Reply-To: [EMAIL PROTECTED]
Hi Niti,
I don't get your question. Just write it in German
and I will post it in English.
Bye
Olaf
On Mon, 21 Feb 2005 05:29:43 -0800 (PST), Niti Witthayawiroj <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have used Nutch to crawl four hosts and the four host names correspond to
> the same IP address. I used the WebDBReader to get the dump links of URLs.
> Why it found the unabsolute links (pages in one host have links to pages in
> other hosts).
>
> For example:
>
> from http://www.l3s.uni-hannover.de/morob/Galleries/ER1/pages/09_DSCF0492.html
> to http://www.l3s.uni-hannover.de/morob/Galleries/ER1/index.html
> to http://www.l3s.uni-hannover.de/morob/Galleries/ER1/pages/08_DSCF0493.html
> to http://www.l3s.uni-hannover.de/morob/Galleries/ER1/pages/10_DSCF0499.html
> to http://www.learninglab.de/morob/Galleries/ER1/index.html
> to http://www.learninglab.de/morob/Galleries/ER1/pages/08_DSCF0493.html
> to http://www.learninglab.de/morob/Galleries/ER1/pages/10_DSCF0499.html
> to http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/index.html
> to http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/pages/08_DSCF0493.html
> to http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/pages/10_DSCF0499.html
>
> regards,
> Niti
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers