Hi Vince,

Are you using the anchors in any way? If so, you would need to use the complete (and correct) linkdb. AFAIK, the anchors are the only information the indexer takes from the linkdb; the page scores from the scoring plugins (like OPIC) are stored in the crawldb.
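If you want to check what anchors your linkdb actually holds, the readlinkdb tool will dump them. A rough sketch (untested; the linkdb path and URL are placeholders, substitute your own):

    # Print the inlinks/anchors recorded for one URL (placeholder values):
    bin/nutch readlinkdb crawl/linkdb -url http://www.example.com/

    # Or dump the whole linkdb to text for inspection:
    bin/nutch readlinkdb crawl/linkdb -dump linkdb-dump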
Regards,
-vishal.

-----Original Message-----
From: Vince Filby [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 21, 2007 12:14 AM
To: [email protected]
Subject: Can't create index with merged linkdb

Hi,

I am currently writing a custom crawl script that lets us incrementally crawl our database of URLs to a certain depth (custom Nutch with page depth limited, on a Debian machine). It seems to be working well until I start merging linkdbs. On the second iteration I merge the linkdbs, then try to index my current segment using the merged linkdb, and it fails with the following error in hadoop.log:

2007-08-20 17:09:32,154 WARN mapred.LocalJobRunner - job_qr0uto
java.io.IOException: Not a file: /home/vfilby/TrueLocal/nutch-0.8-svn/wwt/linkdb/current/linkdb-merge-1441573667/data
        at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:121)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:81)

Here is the directory structure of the merged linkdb:

linkdb
`-- current
    |-- linkdb-merge-1441573667
    |   `-- part-00000
    |       |-- data
    |       `-- index
    `-- part-00000
        |-- data
        `-- index

It looks like the indexer is looking for data in linkdb-merge-1441573667 rather than in linkdb-merge-1441573667/part-00000. Ben hit this problem back in September 2006, but there was no reply to his message and he never found a solution (message copied below).

The link score doesn't matter to us; in fact we ignore it entirely in our searcher. Would there be any negative side effects to indexing with only the linkdb generated from the segment? Or is there a better way to solve this problem?

Cheers,
Vince

---

Hi all,

I am having problems recrawling our intranet. Something in the recrawl script (is it invertlinks?) creates a crawldir\linkdb\current\linkdb-merge-<number> folder, which has a part-00000 folder under it. When the indexer is invoked, it looks for crawldir\linkdb\current\linkdb-merge-<number>\data, but that file doesn't exist because it's in the part-00000 directory. How do I get the indexer to look in the part-00000 dir? Is it a configuration error?

I am running a Python port of the recrawl script on a Windows 2000 machine without Cygwin, where the crawldir and Nutch 0.8 are on a Windows 2003 server that I have very limited access to.

Here's what hadoop.log says about it:

2006-09-07 13:02:39,696 INFO indexer.Indexer - Indexer: starting
2006-09-07 13:02:39,696 INFO indexer.Indexer - Indexer: linkdb: F:/nutch-0.8/intranet-crawl/linkdb
2006-09-07 13:02:40,696 INFO indexer.Indexer - Indexer: adding segment: F:/nutch-0.8/intranet-crawl/segments/20060907130151
2006-09-07 13:02:50,804 WARN mapred.LocalJobRunner - job_fn20sr
java.io.IOException: Not a file: F:/nutch-0.8/intranet-crawl/linkdb/current/linkdb-merge-216906667/data
        at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:121)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)

If I move the contents of linkdb-merge-216906667/part-00000 to linkdb-merge-216906667, indexing works OK (well, it won't delete _0.f0, but that's another issue). The same thing happens when this linkdb-merge-* directory already exists and I run invertlinks. What am I doing wrong? I haven't been able to find anyone else with these issues, so I must be doing something wrong.
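For anyone hitting the same wall, here is Ben's workaround as a rough script (untested): it hoists the MapFile out of part-00000 so that <linkdb-merge-*>/data exists where the indexer's InputFormatBase is looking for it. The LINKDB path is the one from the log above; adjust it to your install.

    #!/bin/sh
    # Sketch of Ben's workaround (untested): move data/index up one level
    # so <linkdb-merge-*>/data exists where InputFormatBase expects it.
    # LINKDB is the path from the log above -- adjust it to your install.
    LINKDB=/home/vfilby/TrueLocal/nutch-0.8-svn/wwt/linkdb

    for merged in "$LINKDB"/current/linkdb-merge-*; do
        [ -d "$merged/part-00000" ] || continue
        mv "$merged/part-00000/data"  "$merged/data"
        mv "$merged/part-00000/index" "$merged/index"
        rmdir "$merged/part-00000"
    done

And on Vince's question about indexing with only the segment-generated linkdb: a sketch of that route, with placeholder paths. Note it loses anchor text contributed by pages in earlier segments.

    # Sketch (untested): build a throwaway linkdb from just the new segment
    # and point the indexer at it. Paths and the segment name are
    # placeholders; depending on your Nutch version, invertlinks may take a
    # segments directory rather than individual segment paths.
    SEGMENT=crawl/segments/20070820170000
    bin/nutch invertlinks crawl/linkdb-tmp "$SEGMENT"
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb-tmp "$SEGMENT"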
