fetcher should track and shut down hung threads
-----------------------------------------------
Key: NUTCH-1182
URL: https://issues.apache.org/jira/browse/NUTCH-1182
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.3, 1.4
Environment: Linux, local job runner
Reporter: Sebastian Nagel
Priority: Minor
While crawling a slow server that hosts a couple of very large PDF documents
(30 MB), the fetcher runs fine for some time and fetches a batch of documents
successfully, then stops with the message: ??Aborting with 10 hung threads.??
From then on every cycle ends with hung threads and almost no documents are
fetched successfully. In addition, strange Hadoop errors are logged:
{noformat}
fetch of http://.../xyz.pdf failed with: java.lang.NullPointerException
at java.lang.System.arraycopy(Native Method)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108)
...
{noformat}
or
{noformat}
Exception in thread "QueueFeeder" java.lang.NullPointerException
at org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:48)
at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:41)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:214)
{noformat}
I ran the fetcher in a debugger and found:
# After the "hung threads" are reported, the fetcher stops, but the threads are
still alive and continue fetching their documents. As a consequence:
#* the already limited bandwidth of the network/server is reduced even further
#* once the document is fetched, the thread tries to write the content via
{{output.collect()}}, which must fail because the fetcher map job has already
finished and the associated temporary mapred directory has been deleted. The
error message may get mixed with the progress output of the next fetch cycle,
causing additional confusion.
# The documents/URLs that cause the hung threads are neither reported nor
stored. That is, they are hard to track down, and they will cause hung threads
again and again.
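To illustrate what "track and shut down" could mean, here is a minimal sketch.
It is not existing Nutch code: the class {{HungThreadTracker}} and its methods
are hypothetical names. Each fetcher thread would register the URL it is
working on, and the fetcher would ask the tracker to interrupt overdue threads
and hand back their URLs before it gives up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not part of Nutch): remembers what each fetcher thread is doing. */
class HungThreadTracker {

    /** URL and start time of the request a thread is currently handling. */
    static final class Active {
        final String url;
        final long startMillis;

        Active(String url, long startMillis) {
            this.url = url;
            this.startMillis = startMillis;
        }
    }

    private final Map<Thread, Active> active = new ConcurrentHashMap<>();

    /** Called by a fetcher thread right before it requests a URL. */
    void begin(Thread t, String url, long nowMillis) {
        active.put(t, new Active(url, nowMillis));
    }

    /** Called by a fetcher thread once the URL is fully handled. */
    void end(Thread t) {
        active.remove(t);
    }

    /**
     * Interrupts every thread whose current request is older than timeoutMillis
     * and returns the URLs those threads were stuck on, so that the caller can
     * log or store them.
     */
    List<String> abortHung(long nowMillis, long timeoutMillis) {
        List<String> hungUrls = new ArrayList<>();
        for (Map.Entry<Thread, Active> e : active.entrySet()) {
            if (nowMillis - e.getValue().startMillis > timeoutMillis) {
                hungUrls.add(e.getValue().url);
                // interrupt() is only a request; blocking socket reads may ignore it
                e.getKey().interrupt();
                active.remove(e.getKey());
            }
        }
        return hungUrls;
    }
}
```

Because an interrupt does not reliably stop blocking I/O, a thread would
probably also have to check whether it was aborted before it calls
{{output.collect()}}. That check is left out of the sketch.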
The problem is reproducible when fetching bigger documents and setting
{{mapred.task.timeout}} to a low value (this will definitely cause hung
threads).
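The following sketch shows why a low {{mapred.task.timeout}} triggers the
abort so easily. It assumes that the fetcher derives its hang timeout as a
fraction of {{mapred.task.timeout}} and compares it with the time since the
last request was started. The class {{HangCheck}}, its method names and the
divisor of 2 are illustrative assumptions, not code copied from the fetcher.

```java
/** Hypothetical illustration of a hang check driven by mapred.task.timeout. */
class HangCheck {

    /** Assumed rule: the hang timeout is the task timeout divided by a fixed divisor. */
    static long hangTimeoutMillis(long taskTimeoutMillis, int divisor) {
        return taskTimeoutMillis / divisor;
    }

    /** True if no request has been started for longer than the hang timeout. */
    static boolean looksHung(long nowMillis, long lastRequestStartMillis, long hangTimeoutMillis) {
        return nowMillis - lastRequestStartMillis > hangTimeoutMillis;
    }
}
```

Under these assumptions, {{mapred.task.timeout}} = 60000 gives a hang timeout
of 30000 ms. If all threads are busy downloading large documents and none of
them starts a new request within 30 seconds, the check reports hung threads.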