Doug Cutting wrote:
> Isabel Drost wrote:
> > OK. Before doing so, I might set up the experiment with a somewhat
> > smaller segment so that anyone who wants to can repeat it easily.

I haven't had time yet to try this out with a smaller segment, but:

> Also, please first try to reproduce it with the current version of
> Nutch. Nutch is no longer maintained in CVS at sourceforge, but is now
> in Subversion at Apache.

Thanks for the URL. I successfully downloaded the new Nutch version, yet
the problem with updatedb persists. When I recreate the webdb with the
commands admin and updatedb from the segment retrieved by intranet
crawling, the outlinks of some pages are still missing, although they are
present in the webdb produced by the intranet crawl itself. So merging
"intranet crawled" data is still a problem for me.

I have had a look at the implementation of the intranet crawl tool: it
seems to fetch pages of the same (link) depth together, and updatedb is
invoked after each such fetch. In my case updatedb is called only once,
on the whole segment. Could this cause my problems?

Have a nice week,
Isabel

-- 
QOTD: The New England Journal of Medicine reports that 9 out of 10
doctors agree that 1 out of 10 doctors is an idiot.

More information about the sender of this mail available
at http://www.isabel-drost.de ;)

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
