Hi,
I am currently writing a custom crawl script that lets us incrementally
crawl our database of URLs to a certain depth (custom Nutch 0.8-svn
with page depth limiting, on a Debian machine). It works well until I
start merging linkdbs: on the second iteration I merge the linkdbs,
then try to index the current segment against the merged linkdb, and
it fails with the following error in hadoop.log:
2007-08-20 17:09:32,154 WARN mapred.LocalJobRunner - job_qr0uto
java.io.IOException: Not a file: /home/vfilby/TrueLocal/nutch-0.8-svn/wwt/linkdb/current/linkdb-merge-1441573667/data
        at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:121)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:81)
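For context, each iteration of the script boils down to roughly the
following (a sketch only: paths are illustrative, the depth-limiting
logic is left out, and the linkdb step is the part in question):

  # one incremental iteration (sketch, stock 0.8 command-line tools)
  bin/nutch generate crawl/crawldb crawl/segments
  SEG=`ls -d crawl/segments/* | tail -1`   # the segment just generated
  bin/nutch fetch $SEG
  bin/nutch updatedb crawl/crawldb $SEG
  bin/nutch invertlinks crawl/linkdb $SEG  # merges into existing linkdb
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $SEG

The first pass runs clean; on the second pass the merge apparently
leaves its linkdb-merge-* temp directory under current, and the index
step dies as above.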
Here is the directory structure of the merged linkdb:
linkdb
`-- current
|-- linkdb-merge-1441573667
| `-- part-00000
| |-- data
| `-- index
`-- part-00000
|-- data
`-- index
It looks like the indexer is looking for the data file directly under
linkdb-merge-1441573667 rather than under
linkdb-merge-1441573667/part-00000. Ben hit the same problem back in
September 2006, but his message got no reply and he never found a
solution (message copied below).
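For what it's worth, Ben's manual workaround translated to our layout
would be something like this (assuming the merge itself completed and
its output is merely nested one level too deep):

  cd /home/vfilby/TrueLocal/nutch-0.8-svn/wwt/linkdb/current
  mv linkdb-merge-1441573667/part-00000/* linkdb-merge-1441573667/
  rmdir linkdb-merge-1441573667/part-00000

That apparently gets the indexer past the IOException, but it just
papers over whatever leaves the directory there in the first place.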
The link score doesn't matter to us; in fact we ignore it entirely in
our searcher. Would there be any negative side effects to indexing
with only the linkdb generated from the current segment? Or is there a
better way to solve this problem?
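Concretely, by "the linkdb generated from the segment" I mean skipping
the merge and indexing against a throwaway per-segment linkdb, roughly
(a sketch; linkdb.seg is a hypothetical name):

  bin/nutch invertlinks crawl/linkdb.seg $SEG
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb.seg $SEG

My understanding is that we would lose incoming anchor text from pages
fetched in earlier segments, but since we already ignore the link
score that may be an acceptable trade-off.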
Cheers,
Vince
---
Hi all, I am having problems recrawling our intranet. Something in the
recrawl script (is it invertlinks?) creates a
crawldir\linkdb\current\linkdb-merge-<number> folder, which has a
part-00000 folder under it. When the indexer is invoked, it looks for
crawldir\linkdb\current\linkdb-merge-<number>\data, but that file
doesn't exist because it's in the part-00000 directory. How do I get
the indexer to look in the part-00000 dir? Is it a configuration
error?

I am running a Python port of the recrawl script on a Windows 2000
machine without Cygwin; the crawl dir and Nutch 0.8 are on a Windows
2003 server that I have very limited access to. Here's what hadoop.log
says about it:
2006-09-07 13:02:39,696 INFO indexer.Indexer - Indexer: starting
2006-09-07 13:02:39,696 INFO indexer.Indexer - Indexer: linkdb: F:/nutch-0.8/intranet-crawl/linkdb
2006-09-07 13:02:40,696 INFO indexer.Indexer - Indexer: adding segment: F:/nutch-0.8/intranet-crawl/segments/20060907130151
2006-09-07 13:02:50,804 WARN mapred.LocalJobRunner - job_fn20sr
java.io.IOException: Not a file: F:/nutch-0.8/intranet-crawl/linkdb/current/linkdb-merge-216906667/data
        at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:121)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)
If I move the contents of linkdb-merge-216906667/part-00000 up into
linkdb-merge-216906667, indexing works OK (well, it won't delete
_0.f0, but that's another issue).

The same thing happens when the linkdb-merge-* directory already
exists and I run invertlinks.

What am I doing wrong? I haven't been able to find anyone else with
these issues, so I must be doing something wrong.