Doug Cutting wrote:
> Isabel Drost wrote:
> > OK. Before doing so, I might set up the experiment with a somewhat
> > smaller segment so that anyone who wants to can repeat it easily.

I haven't had time yet to try this with a smaller segment, but:


> Also, please first try to reproduce it with the current version of
> Nutch.  Nutch is no longer maintained in CVS at sourceforge, but is now
> in Subversion at Apache.

Thanks for the URL. I downloaded the current Nutch version successfully, but 
the problem with updatedb persists. When I recreate the webdb with the admin 
and updatedb commands from the segment retrieved by the intranet crawl, the 
outlinks of some pages are still missing, although they are present in the 
webdb produced by the intranet crawl itself. So merging "intranet crawled" 
data is still a problem for me.

I have had a look at the implementation of the intranet crawl tool. It seems 
to fetch pages of the same link depth together, and it invokes updatedb after 
each such fetch. In my case updatedb is called only once, on the whole 
segment. Could this cause my problems?
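
To make the depth-wise loop described above concrete, here is a toy model in 
Python. It is only a sketch of that reading of the crawl tool, not Nutch's 
actual code: crawl_by_depth, link_graph, and the in-memory dict standing in 
for the webdb are hypothetical names made up for illustration, and the sketch 
does not claim to reproduce the missing-outlinks problem.

```python
def crawl_by_depth(link_graph, seeds, depth):
    """Toy model of a depth-wise crawl: one fetch round per link depth,
    with a db update after every round. Not Nutch's implementation."""
    # Stand-in for the webdb: url -> set of outlinks, None = not fetched yet.
    db = {url: None for url in seeds}
    rounds = []
    for _ in range(depth):
        # "generate": every page that is known but not yet fetched.
        fetchlist = sorted(url for url, links in db.items() if links is None)
        if not fetchlist:
            break
        # "fetch": one segment per round, holding each page's outlinks.
        segment = {url: set(link_graph.get(url, ())) for url in fetchlist}
        # "updatedb": store the outlinks and register newly seen pages,
        # so that the next round can fetch them.
        for url, outlinks in segment.items():
            db[url] = outlinks
            for target in outlinks:
                db.setdefault(target, None)
        rounds.append(fetchlist)
    return db, rounds


# Hypothetical four-page intranet, crawled from page "a" to depth 3.
link_graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["a"]}
db, rounds = crawl_by_depth(link_graph, ["a"], depth=3)
```

In this model each round's fetchlist depends on the update that followed the 
previous round, which is the part that differs from calling updatedb a single 
time on one large segment.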


Have a nice week,
Isabel

-- 
QOTD: The New England Journal of Medicine reports that 9 out of 10 doctors 
agree that 1 out of 10 doctors is an idiot. 
  |\      _,,,---,,_     
  /,`.-'`'    -.  ;-;;,_  More information about the
 |,4-  ) )-,_..;\ (  `'-' sender of this mail available
'---''(_/--'  `-'\_) (fL) at http://www.isabel-drost.de ;)


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
