fetcher should track and shut down hung threads
-----------------------------------------------

                 Key: NUTCH-1182
                 URL: https://issues.apache.org/jira/browse/NUTCH-1182
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.3, 1.4
         Environment: Linux, local job runner
            Reporter: Sebastian Nagel
            Priority: Minor


While crawling a slow server hosting a couple of very large PDF documents
(30 MB), the fetcher stops after some time and a bulk of successfully fetched
documents with the message: ??Aborting with 10 hung threads.??
From then on, every cycle ends with hung threads and almost no documents are
fetched successfully. In addition, strange Hadoop errors are logged:
{noformat}
   fetch of http://.../xyz.pdf failed with: java.lang.NullPointerException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108)
    ...
{noformat}
or
{noformat}
   Exception in thread "QueueFeeder" java.lang.NullPointerException
         at org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:48)
         at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:41)
         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:214)
{noformat}

I ran the fetcher in a debugger and found:
# After the "hung threads" are reported the fetcher stops, but the threads are 
still alive and continue fetching their documents. As a consequence:
#* the already limited bandwidth of the network/server is reduced even further
#* after the document is fetched, the thread tries to write the content via 
{{output.collect()}}, which must fail because the fetcher map job has already 
finished and the associated temporary mapred directory has been deleted. The 
error message may get mixed with the progress output of the next fetch cycle, 
causing additional confusion.
# Documents/URLs causing the hung threads are neither reported nor stored. That 
is, it's hard to track them down, and they will cause hung threads again and 
again.
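
To illustrate the two findings above, here is a minimal sketch of how hung threads could be tracked and shut down. It is not a patch against the Fetcher and uses no Nutch or Hadoop classes. The class {{HungThreadTracker}} and all of its method names are hypothetical. The assumed usage is that each fetcher thread registers the URL it is working on, and that the main loop calls {{abortHung()}} right before it gives up with "Aborting with N hung threads":
{noformat}
```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper (not part of Nutch): remembers what each fetcher thread is fetching. */
final class HungThreadTracker {

    /** One in-flight fetch: the URL and the time the fetch started. */
    private static final class InFlight {
        final String url;
        final long startMillis;

        InFlight(String url, long startMillis) {
            this.url = url;
            this.startMillis = startMillis;
        }
    }

    private final Map<Thread, InFlight> inFlight = new ConcurrentHashMap<>();
    private final AtomicBoolean outputClosed = new AtomicBoolean(false);

    /** Called by a fetcher thread just before it starts fetching a URL. */
    void fetchStarted(String url) {
        inFlight.put(Thread.currentThread(), new InFlight(url, System.currentTimeMillis()));
    }

    /** Called by a fetcher thread when the fetch has completed or failed. */
    void fetchFinished() {
        inFlight.remove(Thread.currentThread());
    }

    /** Fetcher threads would check this before writing any output. */
    boolean mayWriteOutput() {
        return !outputClosed.get();
    }

    /**
     * Called by the main loop when it aborts. Closes the output for all threads,
     * interrupts every thread whose fetch has been running for at least
     * timeoutMillis, and returns the URLs those threads were fetching.
     */
    List<String> abortHung(long timeoutMillis) {
        outputClosed.set(true);
        long now = System.currentTimeMillis();
        List<String> hungUrls = new ArrayList<>();
        for (Map.Entry<Thread, InFlight> entry : inFlight.entrySet()) {
            if (now - entry.getValue().startMillis >= timeoutMillis) {
                hungUrls.add(entry.getValue().url);
                entry.getKey().interrupt();
            }
        }
        return hungUrls;
    }
}
```
{noformat}
The returned URLs could be logged or stored so that the offending documents can be tracked down. A thread that finishes late would check {{mayWriteOutput()}} and skip the write instead of failing inside Hadoop. Note that {{Thread.interrupt()}} does not unblock a thread waiting in a classic blocking socket read, so a real fix would probably also have to close the underlying connection or rely on socket timeouts.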

The problem is reproducible when fetching bigger documents and setting 
{{mapred.task.timeout}} to a low value (this will definitely cause hung 
threads).

