Hi,
I used Nutch to crawl four hosts whose host names all resolve to the same IP address, then used WebDBReader to dump the links for each URL. Why does the dump contain cross-host links (pages on one host shown as linking to pages on the other hosts)?
For example:
from http://www.l3s.uni-hannover.de/morob/Galleries/ER1/pages/09_DSCF0492.html
to http://www.l3s.uni-hannover.de/morob/Galleries/ER1/index.html
to http://www.l3s.uni-hannover.de/morob/Galleries/ER1/pages/08_DSCF0493.html
to http://www.l3s.uni-hannover.de/morob/Galleries/ER1/pages/10_DSCF0499.html
to http://www.learninglab.de/morob/Galleries/ER1/index.html
to http://www.learninglab.de/morob/Galleries/ER1/pages/08_DSCF0493.html
to http://www.learninglab.de/morob/Galleries/ER1/pages/10_DSCF0499.html
to http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/index.html
to http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/pages/08_DSCF0493.html
to http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/pages/10_DSCF0499.html
to http://www.l3s.uni-hannover.de/morob/Galleries/ER1/index.html
to http://www.l3s.uni-hannover.de/morob/Galleries/ER1/pages/08_DSCF0493.html
to http://www.l3s.uni-hannover.de/morob/Galleries/ER1/pages/10_DSCF0499.html
to http://www.learninglab.de/morob/Galleries/ER1/index.html
to http://www.learninglab.de/morob/Galleries/ER1/pages/08_DSCF0493.html
to http://www.learninglab.de/morob/Galleries/ER1/pages/10_DSCF0499.html
to http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/index.html
to http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/pages/08_DSCF0493.html
to http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/pages/10_DSCF0499.html
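To make the pattern in the dump above easier to inspect, here is a minimal sketch that groups the "to" URLs by host and keeps only those pointing to a host other than the "from" page. It is not part of Nutch. The function name `cross_host_links` is hypothetical, and the code assumes the dump lines look exactly like the "from <url>" / "to <url>" lines quoted above; the real WebDBReader output may be laid out differently.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cross_host_links(dump_lines):
    """Map each 'from' URL to {other_host: set of 'to' URLs on that host}.

    Assumes each line is 'from <url>' or 'to <url>', as in the excerpt above.
    """
    result = {}
    source = None
    for line in dump_lines:
        kind, _, url = line.strip().partition(" ")
        if kind == "from":
            # Start a new group for this source page.
            source = url
            result[source] = defaultdict(set)
        elif kind == "to" and source is not None:
            target_host = urlsplit(url).hostname
            # Keep only links whose host differs from the source page's host.
            if target_host != urlsplit(source).hostname:
                result[source][target_host].add(url)
    return result
```

Because the targets are stored in sets, repeated "to" lines such as the ones in the dump collapse to a single entry per URL, so the result shows which other hosts are involved and how many distinct pages on each are referenced.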
Regards,
Niti
