The attached patches for Generator.java and Injector.java let the
temporary directory be set explicitly through a mapred.temp.dir
property (default "."). Nutch then passes the full path of these
temporary directories to each job, which seems to fix the "No input
directories" issue when using a local filesystem with multiple task
trackers.
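
For anyone who wants to see the idea outside the patch, here is a minimal sketch of how the temporary path is built. It is only an illustration: java.util.Properties stands in for NutchConf, the class and method names (TempDirSketch, tempDir) are made up for this example, and /shared/nutch/tmp is a hypothetical directory that would have to be visible at the same path on every host.

```java
import java.io.File;
import java.util.Properties;
import java.util.Random;

public class TempDirSketch {

    // Builds a uniquely named temp directory under the configured base.
    // "mapred.temp.dir" is the property name used in the patch; Properties
    // is an assumed stand-in for NutchConf so the sketch runs on its own.
    static File tempDir(Properties conf, String prefix, Random random) {
        // Fall back to "." (the old behaviour) when the property is unset.
        String base = conf.getProperty("mapred.temp.dir", ".");
        String name = prefix + Integer.toString(random.nextInt(Integer.MAX_VALUE));
        return new File(base, name);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Hypothetical shared location; must resolve identically on all hosts.
        conf.setProperty("mapred.temp.dir", "/shared/nutch/tmp");

        File generateTemp = tempDir(conf, "generate-temp-", new Random());
        System.out.println(generateTemp.getPath());
    }
}
```

With the property unset the result is still a relative path such as ./generate-temp-N, so the fix only helps when mapred.temp.dir names a directory that every task tracker can reach.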
On Mon, 2005-11-07 at 09:57 -0500, Rod Taylor wrote:
> On Fri, 2005-11-04 at 20:41 -0800, Doug Cutting wrote:
> > Rod Taylor wrote:
> > > Here you go. local filesystem and a single job tracker on another
> > > machine. When the tasktracker and jobtracker are on the same box there
> > > isn't a problem. When they are on different machines it runs into
> > > issues.
> > >
> > > This is using mapred.local.dir on the local machine (not shared between
> > > sbider4 and sbider5):
> >
> > > parsing /home/sitesell/localt/taskTracker/task_m_o59djj/job.xml
> > > [Fatal Error] :-1:-1: Premature end of file.
> >
> > What is mapred.system.dir? That must be shared. Also, filenames you
> > pass to commands must be pathnames that work on all hosts.
>
> I managed to get past all of the initial injection problems by running a
> local crawl (no jobtracker) which created the crawldb/current/part-00000
> files. So I was able to do a real inject, with jobtracker, for all of
> the urls system wide without any complaints about files or directories
> not existing.
>
> Now, when trying to run a generate with a jobtracker it seems to have a
> hard time finding the temporary working areas from one job to the next.
> I cannot figure out where it is creating generate-temp-908680235. With
> NDFS it would be /user/$USER/
>
> <-- nutch generate -->
> 051107 091256 topN: 10000
> 051107 091256 Generator: starting
> 051107 091256 Generator: segment: /opt/sitesell/sbider_data/test2/segments/20051107091256
> 051107 091256 Generator: Selecting most-linked urls due for fetch.
> 051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-default.xml
> 051107 091256 parsing file:/opt/nutch-0.8_7/conf/mapred-default.xml
> 051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-site.xml
> 051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-default.xml
> 051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-site.xml
> 051107 091256 Client connection to 192.168.100.14:5464: starting
> 051107 091256 Running job: job_xhvq9b
> 051107 091258 map 0%
> 051107 091300 map 5%
> 051107 091303 map 16%
> 051107 091305 map 21%
> 051107 091306 map 26%
> 051107 091308 map 32%
> 051107 091309 map 37%
> 051107 091312 map 47%
> 051107 091315 map 58%
> 051107 091318 map 68%
> 051107 091320 map 74%
> 051107 091321 map 79%
> 051107 091324 map 89%
> 051107 091327 map 100%
> 051107 091330 reduce 5%
> 051107 091332 reduce 11%
> 051107 091333 reduce 16%
> 051107 091335 reduce 21%
> 051107 091337 reduce 26%
> 051107 091339 reduce 37%
> 051107 091342 reduce 47%
> 051107 091344 reduce 53%
> 051107 091345 reduce 58%
> 051107 091347 reduce 63%
> 051107 091348 reduce 68%
> 051107 091351 reduce 79%
> 051107 091354 reduce 89%
> 051107 091357 reduce 100%
> 051107 091359 Job complete: job_xhvq9b
> 051107 091359 Generator: Partitioning selected urls by host, for politeness.
> 051107 091359 parsing file:/opt/nutch-0.8_7/conf/nutch-default.xml
> 051107 091359 parsing file:/opt/nutch-0.8_7/conf/mapred-default.xml
> 051107 091359 parsing file:/opt/nutch-0.8_7/conf/nutch-site.xml
> Exception in thread "main" java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml , /home/sitesell/local/jobTracker/job_h22fvi.xml , nutch-site.xml
> at org.apache.nutch.ipc.Client.call(Client.java:294)
> at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
> at $Proxy0.submitJob(Unknown Source)
> at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
> at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
> at org.apache.nutch.crawl.Generator.generate(Generator.java:213)
> at org.apache.nutch.crawl.Generator.main(Generator.java:258)
>
> [EMAIL PROTECTED] sbider_data]$ cat /home/sitesell/local/jobTracker/job_h22fvi.xml | grep input
> <property><name>mapred.input.format.class</name><value>org.apache.nutch.mapred.SequenceFileInputFormat</value></property>
> <property><name>mapred.input.dir</name><value>generate-temp-908680235</value></property>
> <property><name>mapred.input.value.class</name><value>org.apache.nutch.io.UTF8</value></property>
> <property><name>mapred.input.key.class</name><value>org.apache.nutch.crawl.CrawlDatum</value></property>
>
> --
> Rod Taylor <[EMAIL PROTECTED]>
>
>
--
Rod Taylor <[EMAIL PROTECTED]>
*** ./src/java/org/apache/nutch/crawl/Generator.java.orig 2005-10-31 23:35:20.000000000 -0500
--- ./src/java/org/apache/nutch/crawl/Generator.java 2005-11-07 17:06:46.000000000 -0500
***************
*** 155,161 ****
throws IOException {
File tempDir =
! new File("generate-temp-"+
Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
File segment = new File(segments, getDate());
--- 155,162 ----
throws IOException {
File tempDir =
! new File(NutchConf.get().get("mapred.temp.dir", ".") +
! "/generate-temp-"+
Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
File segment = new File(segments, getDate());
*** ./src/java/org/apache/nutch/crawl/Injector.java.orig 2005-09-24 19:29:03.000000000 -0400
--- ./src/java/org/apache/nutch/crawl/Injector.java 2005-11-07 17:34:37.000000000 -0500
***************
*** 84,90 ****
LOG.info("Injector: urlDir: " + urlDir);
File tempDir =
! new File("inject-temp-"+
Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
// map text input file to a <url,CrawlDatum> file
--- 84,91 ----
LOG.info("Injector: urlDir: " + urlDir);
File tempDir =
! new File(NutchConf.get().get("mapred.temp.dir", ".") +
! "/inject-temp-"+
Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
// map text input file to a <url,CrawlDatum> file