Thanks for the pointer to the LocalFetchRecover tool. It seems the Hadoop API has changed since Nutch 0.8.1, so the tool didn't work as-is. I've made what I believe are the correct changes and attached the updated source (hopefully the attachment gets through). I put together a runnable jar, ran it, and got the following results:

09/07/23 11:18:04 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
09/07/23 11:18:04 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/07/23 11:18:05 INFO mapred.FileInputFormat: Total input paths to process : 907
09/07/23 11:18:05 INFO mapred.JobClient: Running job: job_local_0001
09/07/23 11:18:06 INFO mapred.FileInputFormat: Total input paths to process : 907
09/07/23 11:18:06 INFO mapred.MapTask: numReduceTasks: 1
09/07/23 11:18:06 INFO mapred.MapTask: io.sort.mb = 100
09/07/23 11:18:06 INFO mapred.MapTask: data buffer = 79691776/99614720
09/07/23 11:18:06 INFO mapred.MapTask: record buffer = 262144/327680
09/07/23 11:18:06 WARN mapred.LocalJobRunner: job_local_0001
java.io.IOException: file:[path-to-files]/spillfiles/part-00000 not a SequenceFile
       at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1455)
       at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
       at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
       at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
       at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
       at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
Exception in thread "main" java.io.IOException: Job failed!
       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
       at org.apache.nutch.fetcher.LocalFetchRecover.run(LocalFetchRecover.java:75)
       at org.apache.nutch.fetcher.LocalFetchRecover.main(LocalFetchRecover.java:85)

Based on this output, the job found my files (renamed from spill[0-906].out), but their format is apparently not what it expects. The original LocalFetchRecover code sets the InputFormat to SequenceFileInputFormat... Has this changed in Nutch 1.0?
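As a sanity check, it's probably worth verifying that the renamed files are SequenceFiles at all -- a SequenceFile starts with the 3-byte magic "SEQ" followed by a version byte. Something like this little helper (hypothetical, just a sketch) should tell me whether the spill files carry that header:

import java.io.DataInputStream;
import java.io.FileInputStream;

// Hypothetical helper: reports whether a file starts with the SequenceFile
// magic bytes 'S','E','Q' followed by a one-byte version number.
public class SeqFileCheck {
  public static void main(String[] args) throws Exception {
    DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
    byte[] header = new byte[4];
    try {
      in.readFully(header);
    } finally {
      in.close();
    }
    if (header[0] == 'S' && header[1] == 'E' && header[2] == 'Q') {
      System.out.println(args[0] + ": SequenceFile, version " + header[3]);
    } else {
      System.out.println(args[0] + ": does NOT start with the SEQ magic bytes");
    }
  }
}

If they don't start with SEQ, then renaming them to part-NNNNN obviously won't be enough and the spill file format itself is the problem.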

If someone could advise me on how to change the code to make this work, I'd be most grateful...
/FjK

Doğacan Güney wrote:
Hi,

On Mon, Jul 20, 2009 at 19:55, Fred Kuipers<[email protected]> wrote:
Hello all,

I'm attempting to index a large internal website with 6.7 million URLs, and I'm
running into a map failure after the fetch phase (which ran for 5+ days):

2009-07-20 07:09:23,316 INFO  fetcher.Fetcher - -activeThreads=0
2009-07-20 07:09:23,806 WARN  mapred.LocalJobRunner - job_local_0005
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/file.out
      at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
      at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
      at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
      at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1209)
      at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:867)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)

hadoop-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<!--
We need LOTS of memory... And we need to disable the gc overhead limit, per
this page:
http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom
-->
<property>
 <name>mapred.child.java.opts</name>
 <value>-Xmx4096m -XX:-UseGCOverheadLimit</value>
</property>

</configuration>

nutch-site.xml (excluding http.agent directives for brevity):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<!-- http.agent properties excluded -->

<property>
 <name>http.timeout</name>
 <value>20000</value>
 <description>The default network timeout, in milliseconds.</description>
</property>

<property>
 <name>fetcher.threads.fetch</name>
 <value>20</value>
 <description>The number of FetcherThreads the fetcher should use.
  This also determines the maximum number of requests that are
  made at once (each FetcherThread handles one connection).</description>
</property>

<property>
 <name>fetcher.threads.per.host</name>
 <value>20</value>
 <description>This number is the maximum number of threads that
  should be allowed to access a host at one time.</description>
</property>

<property>
 <name>fetcher.server.delay</name>
 <value>0.1</value>
 <description>The number of seconds the fetcher will delay between
 successive requests to the same server.</description>
</property>

</configuration>

Relevant environment variables:
NUTCH_JAVA_HOME=/usr/lib/jvm/jre-1.7.0-icedtea.x86_64
NUTCH_HEAPSIZE=3072
JAVA_HOME=/usr/lib/jvm/jre-1.7.0-icedtea.x86_64

I ran nutch with the following command/cwd:
[/home/fred/nutch-1.0]$ bin/nutch crawl urls_wiki_mirror -dir crawl_wiki_mirror -threads 3 -depth 1

The seed file in urls_wiki_mirror contains 6739469 URLs... Those are the
only URLs I wish to crawl -- hence depth 1. The configuration I have set up
lets me crawl this local server with 3 fetcher threads at a rate that
doesn't overwhelm it.

I'm using defaults for temp directories. Thus, /tmp/hadoop-fred/ is the temp
file location. The error message notes the following partial path:
taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/file.out

I figure that equates to this full path:
/tmp/hadoop-fred/mapred/local/taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/

The contents of this directory are spill[0-906].out... Nothing else. No
file.out. There is 68G of data in this folder (i.e. it looks to have
downloaded everything I need)... There is 9+ GB of free space on the
filesystem -- is it possible this is insufficient?


It is possible that you ran out of space; it is also possible that you ran into
a Hadoop bug. From the logs, it doesn't seem like a Nutch bug.
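If it was space, then for future runs you could point hadoop.tmp.dir (which
mapred.local.dir defaults under) at a larger filesystem -- something like this
in hadoop-site.xml, with the path being just a placeholder:

<property>
 <name>hadoop.tmp.dir</name>
 <value>/path/to/bigger/disk/hadoop-${user.name}</value>
</property>

That won't recover the current run, though.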

So, what happened? Is there a way I can recover without re-crawling?


You can try this tool:

http://issues.apache.org/jira/browse/NUTCH-451

There is no guarantee that it will work though.

I am running on a Fedora Core 8 virtual machine with two cores, 4 GB memory.

Let me know if any more information is needed...


Can you try crawling in smaller units? I.e., crawl the first 1m docs, then
the second 1m docs, etc.?
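For example -- just a sketch, not something I have tested against your setup --
instead of the all-in-one crawl command you could drive the steps yourself and
cap each fetch cycle with -topN:

bin/nutch inject crawl_wiki_mirror/crawldb urls_wiki_mirror
# then repeat the following until everything is fetched:
bin/nutch generate crawl_wiki_mirror/crawldb crawl_wiki_mirror/segments -topN 1000000
bin/nutch fetch crawl_wiki_mirror/segments/<newest segment>
bin/nutch updatedb crawl_wiki_mirror/crawldb crawl_wiki_mirror/segments/<newest segment>

That way each map/reduce pass only has to handle about 1m records at a time.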

Thanks,
/FjK




package org.apache.nutch.fetcher;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;

/**
 * This class may help you to recover partial data from a failed Fetcher run.
 * <p><b>NOTE 1:</b> this works ONLY if you ran Fetcher using "local" file system,
 *  i.e. you didn't use DFS - partial output to DFS is permanently lost if a
 *  process fails to properly close the output streams.</p>
 *  <p><b>NOTE 2:</b> if Fetcher was stopped abruptly (killed or crashed), then partial
 *  SequenceFile-s will be corrupted at the end. This means that it won't be
 *  possible to recover all data from them - most likely only the data up to the
 *  last sync marker can be recovered.</p>
 * <p>The recovery process requires some preparation:
 * <ul>
 * <li>determine the map directories corresponding to the map task outputs of the
 * failed job. These map directories contain SequenceFile-s consisting of pairs of
 * &lt;Text, FetcherOutput&gt;, named e.g. <code>part-0.out</code> or
 * <code>file.out</code> or <code>spill0.out</code>.</li>
 * <li>create the new input directory, let's say <code>input/</code>. Copy all
 * SequenceFile-s into this directory, renaming them sequentially like this:
 * <pre>
 *  input/part-00000
 *  input/part-00001
 *  input/part-00002
 *  input/part-00003
 *  ...
 *  </pre>
 * </li>
 * <li>specify the <code>input</code> directory as the input to this tool.</li>
 * </ul>
 * <p>If all goes well, a new segment will be created as a subdirectory of the
 * output dir.</p>
 * 
 * @author Andrzej Bialecki
 * @author Mathijs Homminga (updates for 0.8.1)
 * @author Fred Kuipers (updates for 1.0?)
 */
public class LocalFetchRecover implements Tool, Reducer {

  private Configuration m_config = null;

  public int run(String[] args) throws Exception {
    if (args.length < 2) {
      System.err.println("Usage: LocalFetchRecover <outputDir> <inputDir1> 
...");
      return -1;
    }
    JobConf job = new NutchJob(getConf());
    for (int i = 1; i < args.length; i++) {
        FileInputFormat.addInputPath(job, new Path(args[i]));
    }
    
    job.setInputFormat(SequenceFileInputFormat.class);
    //job.setInputKeyClass(UTF8.class);
    //job.setInputValueClass(FetcherOutput.class); 

    // identity mapper
    //
    // pick-one reducer - protects from duplicate key/values from resubmitted tasks
    job.setReducerClass(LocalFetchRecover.class);

    job.setOutputKeyClass(UTF8.class);
    job.setOutputValueClass(FetcherOutput.class);
    job.setOutputFormat(FetcherOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0], Generator.generateSegmentName()));
    JobClient.runJob(job);
    return 0;
  }

  /**
   * @param args
   */
  public static void main(String[] args) throws Exception {
     Tool tool = new LocalFetchRecover();
     tool.setConf(NutchConfiguration.create());
     tool.run(args);
  }

  @Override
  public Configuration getConf() {
    return m_config;
  }

  @Override
  public void setConf(Configuration arg0) {
    m_config = arg0;
  }

  @Override
  public void reduce(Object key, Iterator values, OutputCollector output,
                     Reporter reporter) throws IOException {
    output.collect(key, (Writable) values.next());
  }

  @Override
  public void configure(JobConf arg0) {
  }

  @Override
  public void close() throws IOException {
  }

}
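For the record, here is roughly how I did the renaming step described in the
Javadoc above (907 spill files in my case; this is a sketch from memory, run
from the directory holding the spill*.out files):

mkdir spillfiles
for i in $(seq 0 906); do
  cp spill$i.out $(printf 'spillfiles/part-%05d' $i)
done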
