On 9/7/07, Ned Rockson <[EMAIL PROTECTED]> wrote:
> So I ran a thread dump and got output that I consider pretty
> meaningless.  It doesn't seem to say I'm stuck in a regex filter,
> although when I printed out the URLs being handled by the reducer,
> there was one that had some unprintable characters in it.  Also,
> there were a lot of severely malformed URLs, so I assume that could
> be a problem; I'm going to look into it.  The last URL printed (by
> both tasks) looked pretty harmless though: a wiki entry and a .js
> page, so I assume there must be a buffer that writes when it fills
> up.  Where is this buffer located, and would it be easy to dump it to
> stdout rather than a file for debugging purposes?

I keep forgetting that people run fetch with parse. Can you try
running fetch with the "-noParsing" option? If the reduce phase has a
problem with URL filtering, this should solve it, since the
no-parsing fetch's reduce phase is just an identity reduce.
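For reference, the invocation would look something like this (the segment path is just an example; substitute your own crawl directory):

```shell
# Fetch without parsing; parsing can then be run as a separate step later.
bin/nutch fetch crawl/segments/20070907120000 -noParsing
```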

>
> Here is the thread dump:
>
> "[EMAIL PROTECTED]" daemon prio=1 tid=0x00002aaaab72c6b0 nid=0x4ae8
> waiting on condition [0x0000000041367000..0x0000000041367b80]
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:458)
>         at java.lang.Thread.run(Thread.java:595)
>
> "Pinger for task_0018_r_000002_0" daemon prio=1 tid=0x00002aaaac2f1d80
> nid=0x4ae5 waiting on condition [0x0000000041165000..0x0000000041165c80]
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.mapred.TaskTracker$Child$1.run(TaskTracker.java:1488)
>         at java.lang.Thread.run(Thread.java:595)
>
> "IPC Client connection to 0.0.0.0/0.0.0.0:50050" daemon prio=1
> tid=0x00002aaaac2d0670 nid=0x4ae4 in Object.wait()
> [0x0000000041064000..0x0000000041064d00]
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00002b141d61d130> (a org.apache.hadoop.ipc.Client$Connection)
>         at java.lang.Object.wait(Object.java:474)
>         at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:213)
>         - locked <0x00002b141d61d130> (a org.apache.hadoop.ipc.Client$Connection)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:252)
>
> "org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=1
> tid=0x00002aaaac332a20 nid=0x4ae3 waiting on condition
> [0x0000000040f63000..0x0000000040f63d80]
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:401)
>
> "Low Memory Detector" daemon prio=1 tid=0x00002aaaac0025a0 nid=0x4ae1
> runnable [0x0000000000000000..0x0000000000000000]
>
> "CompilerThread1" daemon prio=1 tid=0x00002aaaac000ab0 nid=0x4ae0
> waiting on condition [0x0000000000000000..0x0000000040c5f3e0]
>
> "CompilerThread0" daemon prio=1 tid=0x00002aaab00f3290 nid=0x4adf
> waiting on condition [0x0000000000000000..0x0000000040b5e460]
>
> "AdapterThread" daemon prio=1 tid=0x00002aaab00f1c70 nid=0x4ade
> waiting on condition [0x0000000000000000..0x0000000000000000]
>
> "Signal Dispatcher" daemon prio=1 tid=0x00002aaab00f07b0 nid=0x4add
> runnable [0x0000000000000000..0x0000000000000000]
>
> "Finalizer" daemon prio=1 tid=0x00002aaab00dbd70 nid=0x4adc in
> Object.wait() [0x000000004085c000..0x000000004085cd00]
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00002b141d606288> (a java.lang.ref.ReferenceQueue$Lock)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>         - locked <0x00002b141d606288> (a java.lang.ref.ReferenceQueue$Lock)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>         at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>
> "Reference Handler" daemon prio=1 tid=0x00002aaab00db290 nid=0x4adb in
> Object.wait()
>
> On 9/6/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> > Ned Rockson wrote:
> > > (sorry if this is a repost, I'm not sure if it sent last time).
> > >
> > > I have a very strange, reproducible bug that shows up when running
> > > fetch across any number of documents >10000.  I'm running 47 map tasks
> > > and 47 reduce tasks on 24 nodes.  The map phase finishes fine and so
> > > does the majority of the reduce phase, however there are always two
> > > segments that perpetually hang in the reduce > reduce phase.  What
> > > happens is the reducer gets to 85.xx% and then stops responding.  Once
> > > 10 minutes go by, a new worker starts the task, gets to the same
> > > 85.xx(+/- .1%) and hangs.  The other consistent part is that it's
> > > always segment 2 and segment 5 (out of 47 segments).
> > >
> > > I figured I could fix it by simply copying data in from a different
> > > segment and continuing on the next iteration, but lo and behold,
> > > the exact same problem happened in segment 2 and segment 5.
> > >
> > > I assume it's not IO problems because all of the nodes involved in
> > > these segments finish other reduce tasks in the same iteration with no
> > > problems.  Furthermore, I have seen this happen persistently over the
> > > last many iterations.  My last iteration had 400,000 (+/-) documents
> > > pulled down and I saw the same behavior.
> > >
> > > Does anyone have any suggestions?
> > >
> >
> > Yes. Most likely this is a problem with urlfilter-regex getting stuck on
> > an abnormal URL (e.g. an extremely long URL, or one that contains
> > control characters).
> >
> > Please check in the JobTracker UI which task is stuck, and on which
> > machine it's executing. Log in to that machine, identify the pid of
> > the task process, and then generate a thread dump (using 'kill
> > -SIGQUIT', which does NOT quit the process). If the thread dump shows
> > some threads stuck in regex code, then that is likely the problem.
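A minimal sketch of those steps on the TaskTracker machine (assuming the JDK's `jps` tool is on the PATH; `<pid>` is a placeholder):

```shell
# List local JVM processes with their main classes; the stuck task
# shows up as a TaskTracker$Child process.
jps -l

# SIGQUIT makes a HotSpot JVM print a full thread dump to its stdout
# log without terminating the process.
kill -QUIT <pid>
```

The dump ends up in the task's stdout log under the TaskTracker's logs directory.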
> >
> > The solution is to avoid urlfilter-regex, or to change the order of
> > URL filters and put simpler filters in front of urlfilter-regex, in
> > the hope that they will eliminate abnormal URLs before those URLs
> > ever reach urlfilter-regex.
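To illustrate that ordering idea (this is not Nutch's actual plugin API; the class name, method names, and the length cutoff are made up for the sketch), a cheap prefilter can reject overlong URLs and URLs with control characters before the expensive regex filter ever sees them:

```java
import java.util.regex.Pattern;

public class PrefilterSketch {
    // Hypothetical cutoff; tune to whatever your crawl considers sane.
    static final int MAX_URL_LEN = 512;

    // Cheap O(n) checks: reject before any regex engine is involved.
    static String cheapPrefilter(String url) {
        if (url.length() > MAX_URL_LEN) return null;
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c < 0x20 || c == 0x7f) return null; // control characters
        }
        return url;
    }

    // Stand-in for the expensive regex filter (a trivial pattern here).
    static final Pattern EXPENSIVE = Pattern.compile("^https?://.*");

    static String filterUrl(String url) {
        String u = cheapPrefilter(url);
        if (u == null) return null; // never reaches the regex
        return EXPENSIVE.matcher(u).matches() ? u : null;
    }

    public static void main(String[] args) {
        System.out.println(filterUrl("http://example.com/"));
        System.out.println(filterUrl("ftp://example.com/"));
    }
}
```

The point of the ordering is that the linear-time checks run first, so a pathological URL that would send a backtracking regex into exponential time is discarded in the cheap pass.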
> >
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >   ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
> >
>


-- 
Doğacan Güney
