So I ran a thread dump and got what I consider pretty meaningless
output. It doesn't seem to say I'm stuck in a regex filter, although
when I printed out the URLs the reducer was processing, there was one
that had some unprintable characters in it. There were also a lot of
severely malformed URLs, so that could be a problem; I'm going to look
into it. The last URL printed (on both of the tasks) looked pretty
harmless though, a wiki entry and a .js page, so I assume there must
be a buffer that only writes out when it fills up. Where is this buffer
located, and would it be easy to dump it to stdout rather than to a
file for debugging?
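
In the meantime I'm going to try flushing after every URL I print, so
the last line in the task log really is the URL in flight when the hang
happens. Something like this; the class and method names are just
placeholders for wherever my debug print lives, not actual Nutch or
Hadoop classes:

    import java.io.FileDescriptor;
    import java.io.FileOutputStream;
    import java.io.PrintStream;

    /** Debug helper: write each URL with an immediate flush so nothing
     *  sits in a buffer waiting to fill up before reaching the task log. */
    public class UrlTrace {
      // autoFlush=true forces a flush on every println
      private static final PrintStream ERR =
          new PrintStream(new FileOutputStream(FileDescriptor.err), true);

      public static void mark(String stage, String url) {
        ERR.println(stage + "\t" + url);
      }
    }

Calling UrlTrace.mark("before-filter", url) just before the filter call
and UrlTrace.mark("after-filter", url) just after it should make the
offending URL show up as the last "before-filter" line in stderr with no
matching "after-filter" line.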
Here is the thread dump:
"[EMAIL PROTECTED]" daemon prio=1
tid=0x00002aaaab72c6b0 nid=0x4ae8 waiting on condition
[0x0000000041367000..0x0000000041367b80] at
java.lang.Thread.sleep(Native Method) at
org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:458) at
java.lang.Thread.run(Thread.java:595)
"Pinger for task_0018_r_000002_0" daemon prio=1 tid=0x00002aaaac2f1d80
nid=0x4ae5 waiting on condition
[0x0000000041165000..0x0000000041165c80] at
java.lang.Thread.sleep(Native Method) at
org.apache.hadoop.mapred.TaskTracker$Child$1.run(TaskTracker.java:1488)
at
java.lang.Thread.run(Thread.java:595)
"IPC Client connection to 0.0.0.0/0.0.0.0:50050" daemon prio=1
tid=0x00002aaaac2d0670 nid=0x4ae4 in Object.wait()
[0x0000000041064000..0x0000000041064d00] at
java.lang.Object.wait(Native Method) - waiting on
<0x00002b141d61d130>- (aorg.apache.hadoop.ipc.Client$Connection) at
java.lang.Object.wait(Object.java:474) at
org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:213)
- locked <0x00002b141d61d130> (a
org.apache.hadoop.ipc.Client$Connection) at
org.apache.hadoop.ipc.Client$Connection.run(Client.java:252)
"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=1
tid=0x00002aaaac332a20 nid=0x4ae3 waiting on condition
[0x0000000040f63000..0x0000000040f63d80] at
java.lang.Thread.sleep(Native Method) at
org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:401)
"Low Memory Detector" daemon prio=1 tid=0x00002aaaac0025a0 nid=0x4ae1
runnable [0x0000000000000000..0x0000000000000000]
"CompilerThread1" daemon prio=1 tid=0x00002aaaac000ab0 nid=0x4ae0
waiting on condition [0x0000000000000000..0x0000000040c5f3e0]
"CompilerThread0" daemon prio=1 tid=0x00002aaab00f3290 nid=0x4adf
waiting on condition [0x0000000000000000..0x0000000040b5e460]
"AdapterThread" daemon prio=1 tid=0x00002aaab00f1c70 nid=0x4ade
waiting on condition [0x0000000000000000..0x0000000000000000]
"Signal Dispatcher" daemon prio=1 tid=0x00002aaab00f07b0 nid=0x4add
runnable [0x0000000000000000..0x0000000000000000]
"Finalizer" daemon prio=1 tid=0x00002aaab00dbd70 nid=0x4adc in
Object.wait() [0x000000004085c000..0x000000004085cd00] at
java.lang.Object.wait(Native Method) - waiting on
<0x00002b141d606288> (a java.lang.ref.ReferenceQueue$Lock) at
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116) - locked
<0x00002b141d606288> (a
java.lang.ref.ReferenceQueue$Lock) at
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132) at
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
"Reference Handler" daemon prio=1 tid=0x00002aaab00db290 nid=0x4adb in
Object.wait()
On 9/6/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Ned Rockson wrote:
> > (sorry if this is a repost, I'm not sure if it sent last time).
> >
> > I have a very strange, reproducible bug that shows up when running
> > fetch across any number of documents >10000. I'm running 47 map tasks
> > and 47 reduce tasks on 24 nodes. The map phase finishes fine and so
> > does the majority of the reduce phase, however there are always two
> > segments that perpetually hang in the reduce > reduce phase. What
> > happens is the reducer gets to 85.xx% and then stops responding. Once
> > 10 minutes go by, a new worker starts the task, gets to the same
> > 85.xx(+/- .1%) and hangs. The other consistent part is that it's
> > always segment 2 and segment 5 (out of 47 segments).
> >
> > I figured I could fix it by simply copying data in from a different
> > segment and continuing on the next iteration, but lo and behold the
> > exact same problem happens in segment 2 and segment 5.
> >
> > I assume it's not IO problems because all of the nodes involved in
> > these segments finish other reduce tasks in the same iteration with no
> > problems. Furthermore, I have seen this happen persistently over the
> > last many iterations. My last iteration had 400,000 (+/-) documents
> > pulled down and I saw the same behavior.
> >
> > Does anyone have any suggestions?
> >
>
> Yes. Most likely this is a problem with urlfilter-regex getting stuck on
> an abnormal URL (e.g. an extremely long URL, or a URL that contains
> control characters).
>
> Please check in the Jobtracker UI which task is stuck, and on which
> machine it's executing. Log in to that machine, identify the pid of the
> task process, and generate a thread dump (using 'kill -SIGQUIT', which
> does NOT quit the process). If the thread dump shows some threads stuck
> in regex code, then that is likely the problem.
>
> The solution is to avoid urlfilter-regex, or to change the order of the
> urlfilters and put simpler filters in front of urlfilter-regex, in the
> hope that they will eliminate abnormal URLs before they are passed to
> urlfilter-regex.
>
>
> --
> Best regards,
> Andrzej Bialecki <><
>   ___. ___ ___ ___ _ _   __________________________________
>  [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>  ___|||__||  \|  ||  |  Embedded Unix, System Integration
>  http://www.sigram.com  Contact: info at sigram dot com
>
>
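
Regarding the suggestion to put simpler filters in front of
urlfilter-regex: something like the cheap pre-check below is roughly
what I'll try, rejecting absurdly long URLs and URLs with control
characters before they ever reach the regexes. This is just a sketch,
not the actual Nutch URLFilter plugin wiring, and the length cap is an
arbitrary number I picked:

    /** Sketch of a cheap pre-check to run ahead of urlfilter-regex: drop
     *  grossly abnormal URLs (huge length, control characters) before
     *  they can reach the backtracking-prone regexes. Names illustrative. */
    public class SanityUrlFilter {
      private static final int MAX_URL_LENGTH = 2048;  // arbitrary cap

      /** Returns the URL unchanged if it looks sane, or null to reject it. */
      public static String filter(String url) {
        if (url == null || url.length() > MAX_URL_LENGTH) {
          return null;                          // absurdly long -> reject
        }
        for (int i = 0; i < url.length(); i++) {
          char c = url.charAt(i);
          if (c < 0x20 || c == 0x7f) {          // control character -> reject
            return null;
          }
        }
        return url;
      }
    }

If I remember right the order the filter plugins run in is configurable,
so a check like this could sit ahead of urlfilter-regex; otherwise I'll
just fold the same test into my own code before calling the filters.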