On 9/7/07, Ned Rockson <[EMAIL PROTECTED]> wrote:
> So I ran a thread dump and got what I consider to be pretty
> meaningless output. It doesn't seem to say I'm stuck in a regex
> filter, although when I printed out the URLs being handled by the
> reducer, there was one that had some unprintable characters in it.
> There were also a lot of severely malformed URLs, so I assume that
> could be a problem; I'm going to look into it. The last URL printed
> (by both tasks) looked pretty harmless, though: a wiki entry and a
> .js page, so I assume there must be a buffer that writes out when it
> fills up. Where is this buffer located, and would it be easy to dump
> it to stdout rather than a file for debugging purposes?
I keep forgetting that people run fetch with parse. Can you try running
fetch with the "-noParsing" option? If the reduce phase has a problem
with URL filtering, this should solve it, since a no-parsing fetch's
reduce phase is just an identity reduce.

> Here is the thread dump:
>
> "[EMAIL PROTECTED]" daemon prio=1 tid=0x00002aaaab72c6b0 nid=0x4ae8 waiting on condition [0x0000000041367000..0x0000000041367b80]
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:458)
>         at java.lang.Thread.run(Thread.java:595)
>
> "Pinger for task_0018_r_000002_0" daemon prio=1 tid=0x00002aaaac2f1d80 nid=0x4ae5 waiting on condition [0x0000000041165000..0x0000000041165c80]
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.mapred.TaskTracker$Child$1.run(TaskTracker.java:1488)
>         at java.lang.Thread.run(Thread.java:595)
>
> "IPC Client connection to 0.0.0.0/0.0.0.0:50050" daemon prio=1 tid=0x00002aaaac2d0670 nid=0x4ae4 in Object.wait() [0x0000000041064000..0x0000000041064d00]
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00002b141d61d130> (a org.apache.hadoop.ipc.Client$Connection)
>         at java.lang.Object.wait(Object.java:474)
>         at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:213)
>         - locked <0x00002b141d61d130> (a org.apache.hadoop.ipc.Client$Connection)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:252)
>
> "org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=1 tid=0x00002aaaac332a20 nid=0x4ae3 waiting on condition [0x0000000040f63000..0x0000000040f63d80]
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:401)
>
> "Low Memory Detector" daemon prio=1 tid=0x00002aaaac0025a0 nid=0x4ae1 runnable [0x0000000000000000..0x0000000000000000]
>
> "CompilerThread1" daemon prio=1 tid=0x00002aaaac000ab0 nid=0x4ae0 waiting on condition [0x0000000000000000..0x0000000040c5f3e0]
>
> "CompilerThread0" daemon prio=1 tid=0x00002aaab00f3290 nid=0x4adf waiting on condition [0x0000000000000000..0x0000000040b5e460]
>
> "AdapterThread" daemon prio=1 tid=0x00002aaab00f1c70 nid=0x4ade waiting on condition [0x0000000000000000..0x0000000000000000]
>
> "Signal Dispatcher" daemon prio=1 tid=0x00002aaab00f07b0 nid=0x4add runnable [0x0000000000000000..0x0000000000000000]
>
> "Finalizer" daemon prio=1 tid=0x00002aaab00dbd70 nid=0x4adc in Object.wait() [0x000000004085c000..0x000000004085cd00]
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00002b141d606288> (a java.lang.ref.ReferenceQueue$Lock)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>         - locked <0x00002b141d606288> (a java.lang.ref.ReferenceQueue$Lock)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>         at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>
> "Reference Handler" daemon prio=1 tid=0x00002aaab00db290 nid=0x4adb in Object.wait()
>
> On 9/6/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> > Ned Rockson wrote:
> > > (Sorry if this is a repost; I'm not sure if it sent last time.)
> > >
> > > I have a very strange, reproducible bug that shows up when running
> > > fetch across any number of documents >10000. I'm running 47 map
> > > tasks and 47 reduce tasks on 24 nodes. The map phase finishes fine
> > > and so does the majority of the reduce phase, but there are always
> > > two segments that perpetually hang in the reduce phase. What
> > > happens is the reducer gets to 85.xx% and then stops responding.
> > > Once 10 minutes go by, a new worker starts the task, gets to the
> > > same 85.xx% (+/- .1%), and hangs. The other consistent part is
> > > that it's always segment 2 and segment 5 (out of 47 segments).
> > >
> > > I figured I could fix it by simply copying data from a different
> > > segment in and continuing on the next iteration, but lo and
> > > behold, the exact same problem happens in segment 2 and segment 5.
> > > I assume it's not I/O problems, because all of the nodes involved
> > > in these segments finish other reduce tasks in the same iteration
> > > with no problems. Furthermore, I have seen this happen
> > > persistently over the last many iterations. My last iteration
> > > pulled down roughly 400,000 documents and I saw the same behavior.
> > >
> > > Does anyone have any suggestions?
> >
> > Yes. Most likely this is a problem with urlfilter-regex getting
> > stuck on an abnormal URL (e.g. an extremely long URL, or a URL that
> > contains control characters).
> >
> > Please check in the Jobtracker UI which task is stuck, and on which
> > machine it's executing. Log in to that machine, identify the pid of
> > the task process, and generate a thread dump (using 'kill
> > -SIGQUIT', which does NOT quit the process). If the thread dump
> > shows some threads stuck in regex code, then it's likely that this
> > is the problem.
> >
> > The solution is to avoid urlfilter-regex, or to change the order of
> > urlfilters and put simpler filters in front of urlfilter-regex, in
> > the hope that they will eliminate abnormal URLs before they are
> > passed to urlfilter-regex.
> >
> > --
> > Best regards,
> > Andrzej Bialecki <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  || |   Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com


-- 
Doğacan Güney
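Andrzej's last suggestion — putting simple, cheap filters in front of urlfilter-regex so abnormal URLs never reach the regex engine — can be sketched as a plain Java method. This is only a sketch: the class name and the 512-character threshold are assumptions, and a real Nutch filter would be a plugin implementing the URLFilter interface rather than a standalone class.

```java
public class CheapUrlFilter {
    // Assumed threshold: URLs that stall regex filtering are typically
    // far longer than legitimate ones; tune this for your crawl.
    private static final int MAX_URL_LENGTH = 512;

    /**
     * Returns the URL unchanged if it looks sane, or null to reject it,
     * following the null-means-filtered convention of Nutch URL filters.
     */
    public static String filter(String url) {
        if (url == null || url.length() > MAX_URL_LENGTH) {
            return null;
        }
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            // Reject control characters, like the unprintable bytes
            // Ned saw in the URLs printed by his reducer.
            if (c < 0x20 || c == 0x7f) {
                return null;
            }
        }
        return url;
    }
}
```

Because both checks are O(length) with no backtracking, running them first means the expensive regex filter only ever sees URLs that are already plausibly well-formed.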

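On Ned's debugging approach of printing URLs from the reducer: unprintable characters are invisible in a plain-text log, so escaping them before printing makes the abnormal URLs easy to spot. A minimal sketch (the class and method names here are made up for illustration):

```java
public class UrlDebug {
    /** Replaces control and non-ASCII characters with \\uXXXX escapes
     *  so malformed URLs are visible in plain-text logs. */
    public static String escapeUnprintable(String url) {
        StringBuilder sb = new StringBuilder(url.length());
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c < 0x20 || c > 0x7e) {
                // Outside printable ASCII: emit a visible escape.
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Logging `escapeUnprintable(url)` instead of the raw URL would have shown the stray control bytes directly, instead of a URL that merely looked "harmless" in the output.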