On 9/7/07, Ned Rockson <[EMAIL PROTECTED]> wrote:
> Oh great, I didn't know that was an option. How would I go about
> running the parse by itself?

bin/nutch parse <segment>
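For example, with a segment under crawl/segments (the path below is only an
illustration), a cycle that defers parsing would look roughly like:

  # fetch without parsing; the fetcher's reduce phase is then just an identity reduce
  bin/nutch fetch crawl/segments/20070907000000 -noParsing

  # then run the parse step separately on the same segment
  bin/nutch parse crawl/segments/20070907000000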
>
> On 9/7/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> > On 9/7/07, Ned Rockson <[EMAIL PROTECTED]> wrote:
> > > So I ran a thread dump and got output that I consider pretty
> > > meaningless. It doesn't seem to say I'm stuck in a regex filter,
> > > although when I printed out the URLs being handled by the reducer,
> > > there was one that had some unprintable characters in it. Also,
> > > there were a lot of URLs that were severely malformed, so I assume
> > > that could be a problem; I'm going to look into it. The last URL
> > > that was printed (on both of the tasks) looked pretty harmless,
> > > though: a wiki entry and a .js page, so I assume there must be a
> > > buffer that writes when it fills up. Where is this buffer located,
> > > and would it be pretty easy to dump it to stdout rather than a file
> > > for debug purposes?
> >
> > I keep forgetting that people run fetch with parse. Can you try
> > running fetch with the "-noParsing" option? If the reduce phase has a
> > problem with urlfiltering, this should solve it, as a no-parsing
> > fetch's reduce phase is just identity reducing.
> >
> > >
> > > Here is the thread dump:
> > >
> > > "[EMAIL PROTECTED]" daemon prio=1 tid=0x00002aaaab72c6b0 nid=0x4ae8 waiting on condition [0x0000000041367000..0x0000000041367b80]
> > >   at java.lang.Thread.sleep(Native Method)
> > >   at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:458)
> > >   at java.lang.Thread.run(Thread.java:595)
> > >
> > > "Pinger for task_0018_r_000002_0" daemon prio=1 tid=0x00002aaaac2f1d80 nid=0x4ae5 waiting on condition [0x0000000041165000..0x0000000041165c80]
> > >   at java.lang.Thread.sleep(Native Method)
> > >   at org.apache.hadoop.mapred.TaskTracker$Child$1.run(TaskTracker.java:1488)
> > >   at java.lang.Thread.run(Thread.java:595)
> > >
> > > "IPC Client connection to 0.0.0.0/0.0.0.0:50050" daemon prio=1 tid=0x00002aaaac2d0670 nid=0x4ae4 in Object.wait() [0x0000000041064000..0x0000000041064d00]
> > >   at java.lang.Object.wait(Native Method)
> > >   - waiting on <0x00002b141d61d130> (a org.apache.hadoop.ipc.Client$Connection)
> > >   at java.lang.Object.wait(Object.java:474)
> > >   at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:213)
> > >   - locked <0x00002b141d61d130> (a org.apache.hadoop.ipc.Client$Connection)
> > >   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:252)
> > >
> > > "org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=1 tid=0x00002aaaac332a20 nid=0x4ae3 waiting on condition [0x0000000040f63000..0x0000000040f63d80]
> > >   at java.lang.Thread.sleep(Native Method)
> > >   at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:401)
> > >
> > > "Low Memory Detector" daemon prio=1 tid=0x00002aaaac0025a0 nid=0x4ae1 runnable [0x0000000000000000..0x0000000000000000]
> > >
> > > "CompilerThread1" daemon prio=1 tid=0x00002aaaac000ab0 nid=0x4ae0 waiting on condition [0x0000000000000000..0x0000000040c5f3e0]
> > >
> > > "CompilerThread0" daemon prio=1 tid=0x00002aaab00f3290 nid=0x4adf waiting on condition [0x0000000000000000..0x0000000040b5e460]
> > >
> > > "AdapterThread" daemon prio=1 tid=0x00002aaab00f1c70 nid=0x4ade waiting on condition [0x0000000000000000..0x0000000000000000]
> > >
> > > "Signal Dispatcher" daemon prio=1 tid=0x00002aaab00f07b0 nid=0x4add runnable [0x0000000000000000..0x0000000000000000]
> > >
> > > "Finalizer" daemon prio=1 tid=0x00002aaab00dbd70 nid=0x4adc in Object.wait() [0x000000004085c000..0x000000004085cd00]
> > >   at java.lang.Object.wait(Native Method)
> > >   - waiting on <0x00002b141d606288> (a java.lang.ref.ReferenceQueue$Lock)
> > >   at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
> > >   - locked <0x00002b141d606288> (a java.lang.ref.ReferenceQueue$Lock)
> > >   at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
> > >   at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> > >
> > > "Reference Handler" daemon prio=1 tid=0x00002aaab00db290 nid=0x4adb in Object.wait()
> > >
> > > On 9/6/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> > > > Ned Rockson wrote:
> > > > > (sorry if this is a repost, I'm not sure if it sent last time).
> > > > >
> > > > > I have a very strange, reproducible bug that shows up when running
> > > > > fetch across any number of documents >10000. I'm running 47 map
> > > > > tasks and 47 reduce tasks on 24 nodes. The map phase finishes fine
> > > > > and so does the majority of the reduce phase; however, there are
> > > > > always two segments that perpetually hang in the reduce phase.
> > > > > What happens is the reducer gets to 85.xx% and then stops
> > > > > responding. Once 10 minutes go by, a new worker starts the task,
> > > > > gets to the same 85.xx% (+/- .1%) and hangs. The other consistent
> > > > > part is that it's always segment 2 and segment 5 (out of 47
> > > > > segments).
> > > > >
> > > > > I figured I could fix it by simply copying data from a different
> > > > > segment in and continuing on the next iteration, but lo and behold
> > > > > the same exact problem happens in segment 2 and segment 5.
> > > > >
> > > > > I assume it's not an IO problem because all of the nodes involved
> > > > > in these segments finish other reduce tasks in the same iteration
> > > > > with no problems. Furthermore, I have seen this happen persistently
> > > > > over the last many iterations. My last iteration had 400,000 (+/-)
> > > > > documents pulled down and I saw the same behavior.
> > > > >
> > > > > Does anyone have any suggestions?
> > > > >
> > > >
> > > > Yes. Most likely this is a problem with urlfilter-regex getting stuck
> > > > on an abnormal URL (e.g. an extremely long URL, or a URL that
> > > > contains control characters).
> > > >
> > > > Please check in the JobTracker UI which task is stuck, and on which
> > > > machine it's executing. Log in to that machine, identify the pid of
> > > > the task process, and then generate a thread dump (using 'kill
> > > > -SIGQUIT', which does NOT quit the process). If the thread dump shows
> > > > some threads stuck in regex code, then it's likely that this is the
> > > > problem.
> > > >
> > > > The solution is to avoid urlfilter-regex, or to change the order of
> > > > urlfilters and put simpler filters in front of urlfilter-regex, in
> > > > the hope that they will eliminate abnormal URLs before they are
> > > > passed to urlfilter-regex.
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Andrzej Bialecki     <><
> > > >  ___. ___ ___ ___ _ _   __________________________________
> > > > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > > > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > > > http://www.sigram.com  Contact: info at sigram dot com
> > > >
> > >
> >
> > --
> > Doğacan Güney
>
>

--
Doğacan Güney
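For reference, the thread-dump procedure Andrzej describes can be scripted
roughly as follows (the grep pattern and log location are assumptions about a
typical Hadoop tasktracker setup; adjust for your installation):

  # on the node running the stuck reduce task, find the pid of the child task JVM
  ps -ef | grep 'TaskTracker\$Child' | grep -v grep

  # SIGQUIT makes the JVM print a thread dump without exiting; the dump ends up
  # in that task's stdout log under the tasktracker's userlogs directory
  kill -QUIT <pid>

If the dump shows threads sitting in java.util.regex code, the urlfilter.order
property in nutch-site.xml can be used to run cheaper filters (e.g.
urlfilter-prefix, urlfilter-suffix) before urlfilter-regex, as Andrzej suggests.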
