Hi: Although I havent found the solution to the problem, at least I could minimize the effects. First of all I'll try to explain what I think is happening here.
Lets have a look a thread dump when system goes down: Full thread dump Java HotSpot(TM) Server VM (1.5.0_07-b03 mixed mode): "Server handler 4 on 55555" daemon prio=1 tid=0x08464710 nid=0x1317 in Object.wait() [0x7a88e000..0x7a88f040] at java.lang.Object.wait(Native Method) - waiting on <0x80cefc90> (a java.util.LinkedList) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:471) - locked <0x80cefc90> (a java.util.LinkedList) "Server handler 3 on 55555" daemon prio=1 tid=0x085cd550 nid=0x1316 in Object.wait() [0x7a90f000..0x7a90f1c0] at java.lang.Object.wait(Native Method) - waiting on <0x80cefc90> (a java.util.LinkedList) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:471) - locked <0x80cefc90> (a java.util.LinkedList) "Server handler 2 on 55555" daemon prio=1 tid=0x0867ea70 nid=0x1315 in Object.wait() [0x7a98e000..0x7a98f140] at java.lang.Object.wait(Native Method) - waiting on <0x80cefc90> (a java.util.LinkedList) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:471) - locked <0x80cefc90> (a java.util.LinkedList) "Server handler 1 on 55555" daemon prio=1 tid=0x0867e440 nid=0x1314 in Object.wait() [0x7aa0e000..0x7aa0eec0] at java.lang.Object.wait(Native Method) - waiting on <0x80cefc90> (a java.util.LinkedList) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:471) - locked <0x80cefc90> (a java.util.LinkedList) "Server handler 0 on 55555" daemon prio=1 tid=0x0867ec38 nid=0x1313 in Object.wait() [0x7aa8e000..0x7aa8ee40] at java.lang.Object.wait(Native Method) - waiting on <0x80cefc90> (a java.util.LinkedList) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:471) - locked <0x80cefc90> (a java.util.LinkedList) "Server listener on port 55555" daemon prio=1 tid=0x086dafd8 nid=0x1312 in Object.wait() [0x7ab0e000..0x7ab0efc0] at java.lang.Object.wait(Native Method) - waiting on <0x80d15e40> (a java.lang.Object) at org.apache.hadoop.ipc.Server$Connection.processData(Server.java:433) - locked <0x80d15e40> (a java.lang.Object) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java :402) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:300) at org.apache.hadoop.ipc.Server$Listener.run(Server.java:213) "Low Memory Detector" daemon prio=1 tid=0x080f6c88 nid=0x1310 runnable [0x00000000..0x00000000] "CompilerThread1" daemon prio=1 tid=0x080f5800 nid=0x130f waiting on condition [0x00000000..0x7ccd5268] "CompilerThread0" daemon prio=1 tid=0x080f47a0 nid=0x130e waiting on condition [0x00000000..0x7cd553e8] "AdapterThread" daemon prio=1 tid=0x080f3660 nid=0x130d waiting on condition [0x00000000..0x00000000] "Signal Dispatcher" daemon prio=1 tid=0x080f2780 nid=0x130c waiting on condition [0x00000000..0x00000000] "Finalizer" daemon prio=1 tid=0x080e7fb0 nid=0x130b in Object.wait() [0x7d058000..0x7d058e40] at java.lang.Object.wait(Native Method) - waiting on <0x80c603c8> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116) - locked <0x80c603c8> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) "Reference Handler" daemon prio=1 tid=0x080e7af8 nid=0x130a in Object.wait() [0x7d0d8000..0x7d0d8fc0] at java.lang.Object.wait(Native Method) - waiting on <0x80c600b0> (a java.lang.ref.Reference$Lock) at java.lang.Object.wait(Object.java:474) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) - locked <0x80c600b0> (a java.lang.ref.Reference$Lock) "main" prio=1 tid=0x0805e250 nid=0x1305 in Object.wait() [0xaffab000..0xaffac158] at java.lang.Object.wait(Native Method) - waiting on <0x80cefbe0> (a org.apache.hadoop.ipc.RPC$Server) at java.lang.Object.wait(Object.java:474) at org.apache.hadoop.ipc.Server.join(Server.java:587) - locked <0x80cefbe0> (a org.apache.hadoop.ipc.RPC$Server) at org.apache.nutch.searcher.DistributedSearch$Server.main( DistributedSearch.java:75) "VM Thread" prio=1 tid=0x080e5668 nid=0x1309 runnable "VM Periodic Task Thread" prio=1 tid=0x080f8128 nid=0x1311 waiting on condition We can see that all of the handlers (I launch the server using 5 handlers) are waiting for new queued calls, we can watch it in the next piece of code from the run method that belongs to class Handler: while (running && callQueue.size()==0) { // wait for a call callQueue.wait(timeout); //line 471 } In the other hand the Listener is waiting because it thinks queue is full. This is shown in the next piece of code from the processData method that belongs to class Connection: while (running && callQueue.size() >= maxQueuedCalls) { synchronized (callDequeued) { // queue is full1 callDequeued.wait(timeout); // wait for a dequeue for 60 seconds (line 433) } } So Listener thinks queue is full and Handlers that it is empty, somebody must be wrong. As we can see in the next snippet, from the method run of the class Handler, every Handler notify a dequeued call, and this should be enough to wake up the Listener: while (running) { try { Call call; synchronized (callQueue) { while (running && callQueue.size()==0) { // wait for a call callQueue.wait(timeout); } if (!running) break; call = (Call)callQueue.removeFirst(); // pop the queue } synchronized (callDequeued) { // tell others we've dequeued callDequeued.notify(); // <-this should wake up Listener but sometimes dont } So, in order to minimize the problem of having the Listener stopped for 60 seconds I've reduced its waiting time to 500ms. I've also added some traces generated during a load test. 2007-02-13 21:11:46,910 INFO ipc.Server - ----queue is full. The server should wait. 5 calls left (line 1) 2007-02-13 21:11:46,911 INFO ipc.RPC - Return: ... fecha de 30 de junio de 2005 por la que se... 2007-02-13 21:11:46,911 INFO ipc.Server - dequeued a call, 4 calls left 2007-02-13 21:11:46,914 INFO ipc.RPC - Return: ... E, e incluida en la Oferta de Empleo Públi... 2007-02-13 21:11:46,915 INFO ipc.Server - dequeued a call, 3 calls left (line 5) 2007-02-13 21:11:46,922 INFO ipc.RPC - Return: ... PROMOCION INTERNA UN OFICIAL CELADOR OFERT... 2007-02-13 21:11:46,923 INFO ipc.Server - dequeued a call, 2 calls left 2007-02-13 21:11:46,923 INFO ipc.RPC - Return: [EMAIL PROTECTED] 2007-02-13 21:11:46,923 INFO ipc.Server - dequeued a call, 1 calls left 2007-02-13 21:11:46,923 INFO ipc.Server - notify dequeue WAKE UP LISTENER!! (line 10) 2007-02-13 21:11:46,923 INFO ipc.RPC - Call: getDetails(#14599) 2007-02-13 21:11:46,923 INFO ipc.RPC - Return: 20060925130810/http://andaluciajunta.es/portal/... 2007-02-13 21:11:46,924 INFO ipc.Server - dequeued a call, 0 calls left 2007-02-13 21:11:46,924 INFO ipc.Server - notify dequeue WAKE UP LISTENER!! 2007-02-13 21:11:46,924 INFO ipc.RPC - Call: getSummary(20060925130810/http://andaluciajunta.e... (line 15) 2007-02-13 21:11:46,928 INFO ipc.RPC - Return: ... parte de ellos. 2. La Junta se entenderá v... 2007-02-13 21:11:46,928 INFO ipc.Server - Handler waiting a call 2007-02-13 21:11:46,928 INFO ipc.Server - notify dequeue WAKE UP LISTENER!! 2007-02-13 21:11:46,928 INFO ipc.Server - notify dequeue WAKE UP LISTENER!! 2007-02-13 21:11:46,928 INFO ipc.RPC - Call: getSummary(20060925130810/http://andaluciajunta.e... (line 20) 2007-02-13 21:11:46,931 INFO ipc.RPC - Return: ... encomienden el Presidente o la Junta de Go... 2007-02-13 21:11:46,932 INFO ipc.Server - Handler waiting a call 2007-02-13 21:11:46,932 INFO ipc.Server - notify dequeue WAKE UP LISTENER!! 2007-02-13 21:11:46,932 INFO ipc.RPC - Call: getSummary(20060925130810/http://andaluciajunta.e... 2007-02-13 21:11:46,935 INFO ipc.RPC - Return: ... composición y convocatoria de la Junta Gen... (line 25) 2007-02-13 21:11:46,935 INFO ipc.Server - Handler waiting a call 2007-02-13 21:11:46,935 INFO ipc.Server - notify dequeue WAKE UP LISTENER!! 2007-02-13 21:11:46,935 INFO ipc.RPC - Call: getSummary(20060925130719/http://andaluciajunta.e... 2007-02-13 21:11:46,935 INFO ipc.RPC - Return: ... Boletín Oficial de la Junta de Andalucía. ... 2007-02-13 21:11:46,935 INFO ipc.Server - Handler waiting a call (line 30) 2007-02-13 21:11:46,935 INFO ipc.Server - server is stopped for 5 seconds 2007-02-13 21:11:46,935 INFO ipc.RPC - Call: getSummary(20060925130810/http://andaluciajunta.e... 2007-02-13 21:11:46,939 INFO ipc.RPC - Return: ... electos que formen la nueva Junta de Gobie... 2007-02-13 21:11:46,939 INFO ipc.Server - Handler waiting a call 2007-02-13 21:11:51,936 INFO ipc.Server - queued a new Call [EMAIL PROTECTED] (line 35) Observe that it tooks 5 seconds, the Listener timeout i wrote for the load test, from the last but one to the last log line. In the first line the queue is full of calls so it is satisfied the condition to stop the Listener, remember this piece of code from Connection: while (running && callQueue.size() >= maxQueuedCalls) { LOG.info("queued is full. The server shoul wait. "+callQueue.size()+" calls left"); synchronized (callDequeued) { // queue is full1 LOG.info(" server is stopped for 5 seconds"); callDequeued.wait(5000); // wait for a dequeue for 60 seconds } } But before server starts waiting all the calls are processed (from line 2 to line 31). Then, when Listener executes "callDequeued.wait(5000)" none of the handlers should have locked callDequeued, so why it waits instead of keep on running? 2007/2/5, Alvaro Cabrerizo <[EMAIL PROTECTED]>:
Hi, I've detected that kernel version on machines were 2.4 instead of 2.6. So during test It was used linuxthreads-0.10. After reading http://www.unixville.com/~moazam/stories/2004/05/18/debuggingHangsInTheJvm.html<http://www.unixville.com/%7Emoazam/stories/2004/05/18/debuggingHangsInTheJvm.html>, I`ve decided to move to kernel 2.6 in order to install NPTL (Native Posix Thread Library). So it would be very helpful to know which kernel version and which library to manage threads was used when the bug was closed. Thanks in advance. 2007/1/31, Alvaro Cabrerizo <[EMAIL PROTECTED]>: > > Hi: > > During a load test using Jmeter I get this error: > > 2007-01-31 09:27:47 StandardWrapperValve[jsp]: Servlet.service() para > servlet jsp lanzó excepción > java.lang.NullPointerException > at org.apache.nutch.searcher.DistributedSearch$Client.getSummary( > DistributedSearch.java:323) > at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java > :349) > at org.apache.jsp.search_jsp._jspService(search_jsp.java:267) > at org.apache.jasper.runtime.HttpJspBase.service ( > HttpJspBase.java:94) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) > at org.apache.jasper.servlet.JspServletWrapper.service( > JspServletWrapper.java:324) > at org.apache.jasper.servlet.JspServlet.serviceJspFile ( > JspServlet.java:292) > at org.apache.jasper.servlet.JspServlet.service(JspServlet.java > :236) > ... > 007-01-31 09:27:47 StandardWrapperValve[jsp]: Servlet.service() para > servlet jsp lanzó excepción > java.io.IOException : no responses > at org.apache.hadoop.ipc.Client.call(Client.java:344) > at org.apache.hadoop.ipc.RPC.call(RPC.java:174) > at org.apache.nutch.searcher.DistributedSearch$Client.getDetails > (DistributedSearch.java :307) > at org.apache.nutch.searcher.NutchBean.getDetails(NutchBean.java > :338) > at org.apache.jsp.search_jsp._jspService(search_jsp.java:266) > > The test was done using three machines, in the same lan segment, with > this configuration: > > -Pentium 4 3.0GHz > -1GB RAM > -Jdk 1.5 > -Tomcat 5 > -linux 2.6 kernel. > -nutch-0.8 > -hadoop-0.4.0 > > The first machine runs a "search-server" that loads a Nutch index of > 20000 documents, about 20MB of disk space for the Lucene index. The second > one runs a tomcat that communicates with the "search-server" and the third > one runs jmeter. > > The request that launches the exception takes about 60 seconds, while > rest take about 650ms. Sysstat indicates that first and second machine have > the CPU idle, close to 100%, no swapping and still lot of free memory. > > As I can see that is related with the closed bug nutch-14 (http://issues.apache.org/jira/browse/NUTCH-14 > ). > > Any help will be apreciated in order to solve this problem. >