Hi Dhruba,

Thanks for the reply.

1. We are using version 0.17.2 of Hadoop.

2. The max file-descriptor setting per process at the time the error occurred
was 1024. lsof -p <java-proc-id> confirms this, as the process ran out of file
handles after reaching the limit. Here is the snippet...

java    2171 root    0r  FIFO  0,7       23261756  pipe
java    2171 root    1w   CHR  1,3           2067  /dev/null
java    2171 root    2w  FIFO  0,7       23261747  pipe
..
..
java    2171 root 1006w  FIFO  0,7       26486656  pipe
java    2171 root 1007r  0000  0,8    0  26486657  eventpoll
java    2171 root 1008r  FIFO  0,7       26492141  pipe
java    2171 root 1009w  FIFO  0,7       26492141  pipe
java    2171 root 1010r  0000  0,8    0  26492142  eventpoll
java    2171 root 1011r  FIFO  0,7       26497184  pipe
java    2171 root 1012w  FIFO  0,7       26497184  pipe
java    2171 root 1013r  0000  0,8    0  26497185  eventpoll
java    2171 root 1014r  FIFO  0,7       26514795  pipe
java    2171 root 1015w  FIFO  0,7       26514795  pipe
java    2171 root 1016r  0000  0,8    0  26514796  eventpoll
java    2171 root 1017r  FIFO  0,7       26510109  pipe
java    2171 root 1018w  FIFO  0,7       26510109  pipe
java    2171 root 1019r  0000  0,8    0  26510110  eventpoll
java    2171 root 1020u  IPv6  27549169       TCP  server.domain.com:46551->hadoop.aol.com:9000 (ESTABLISHED)
java    2171 root 1021r  FIFO  0,7       26527653  pipe
java    2171 root 1022w  FIFO  0,7       26527653  pipe
java    2171 root 1023u  IPv6  26527645       TCP  server.domain.com:15245->hadoop.aol.com:9000 (CLOSE_WAIT)
We tried upping the limit and restarting the servers, but the problem recurred
after 1-2 days.

3. Yes, there are multiple threads in the Apache server, and they are created
dynamically.

4. The Java log writer plugged into the Apache custom log closes and reopens a
new log file periodically. The log writer has 2 threads: one that writes data
to the FSDataOutputStream, and another that wakes up periodically to close the
old stream and open a new one. I am trying to see if this is the place where
file handles could be leaking (a stripped-down sketch of the rotation logic is
at the bottom of this mail, below the quoted thread).

Another thing to note is that we have a signal-handler implementation that
uses the sun.misc package. The signal handler is installed in the Java
processes to ensure that when Apache sends the Java process SIGTERM or SIGINT,
we close the file handles properly (a simplified sketch of that handler is
also at the bottom of this mail).

I will do some more analysis of our code to find out whether the issue is in
our code or in the HDFS client. In case I find it's an HDFS client issue, I'll
move this discussion to a Hadoop JIRA.

Thanks and Regards
-Ankur

-----Original Message-----
From: Dhruba Borthakur [mailto:[EMAIL PROTECTED]
Sent: Friday, September 26, 2008 10:30 PM
To: core-dev@hadoop.apache.org
Subject: Re: IPC Client error | Too many files open

Hi Ankur,

1. which version of Hadoop are you using?

2. What is the max file-descriptor setting on the linux boxes on which the
apache server is running?

3. When you do a lsof, how many descriptors are listed as "being open"?

4. Are there multiple threads in the apache server that write logs? are these
threads created dynamically?

5. Does one Apache server open a HDFS and keep writing to it for its entire
lifetime? Or does it close and reopen a new log file periodically?

thanks,
dhruba

On Fri, Sep 26, 2008 at 8:40 AM, Raghu Angadi <[EMAIL PROTECTED]> wrote:
>
> What does jstack show for this?
>
> Probably better suited for jira discussion.
>
> Raghu.
>
> Goel, Ankur wrote:
>>
>> Hi Folks,
>>
>> We have developed a simple log writer in Java that is plugged into
>> Apache custom log and writes log entries directly to our hadoop cluster
>> (50 machines, quad core, each with 16 GB Ram and 800 GB hard-disk, 1
>> machine as dedicated Namenode, another machine as JobTracker &
>> TaskTracker + DataNode).
>>
>> There are around 8 Apache servers dumping logs into HDFS via our writer.
>> Everything was working fine and we were getting around 15 - 20 MB log
>> data per hour from each server.
>>
>> Recently we have been experiencing problems with 2-3 of our Apache
>> servers where a file is opened by log-writer in HDFS for writing but it
>> never receives any data.
>>
>> Looking at apache error logs shows the following errors
>>
>> 08/09/22 05:02:13 INFO ipc.Client: java.io.IOException: Too many open files
>>         at sun.nio.ch.IOUtil.initPipe(Native Method)
>>         at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
>>         at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
>>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:312)
>>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:227)
>>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155)
>>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:149)
>>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:122)
>>         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>>         at org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:203)
>>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>         at java.io.DataInputStream.readInt(DataInputStream.java:370)
>>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:289)
>> ...
>> ...
>>
>> Followed by connection errors saying
>> "Retrying to connect to server: hadoop-server.com:9000. Already tried
>> 'n' times".
>>
>> (same as above) ...
>> ....
>>
>> and is retrying constantly (log-writer set up so that it waits and
>> retries).
>>
>> Doing an lsof on the log writer java process shows that it got stuck in
>> a lot of pipe/event poll and eventually ran out of file handles.
>> Below is the part of the lsof output
>>
>> lsof -p 2171
>> COMMAND  PID USER   FD  TYPE DEVICE SIZE      NODE  NAME
>> ....
>> ....
>> java    2171 root  20r  FIFO  0,7       24090207  pipe
>> java    2171 root  21w  FIFO  0,7       24090207  pipe
>> java    2171 root  22r  0000  0,8    0  24090208  eventpoll
>> java    2171 root  23r  FIFO  0,7       23323281  pipe
>> java    2171 root  24r  FIFO  0,7       23331536  pipe
>> java    2171 root  25w  FIFO  0,7       23306764  pipe
>> java    2171 root  26r  0000  0,8    0  23306765  eventpoll
>> java    2171 root  27r  FIFO  0,7       23262160  pipe
>> java    2171 root  28w  FIFO  0,7       23262160  pipe
>> java    2171 root  29r  0000  0,8    0  23262161  eventpoll
>> java    2171 root  30w  FIFO  0,7       23299329  pipe
>> java    2171 root  31r  0000  0,8    0  23299330  eventpoll
>> java    2171 root  32w  FIFO  0,7       23331536  pipe
>> java    2171 root  33r  FIFO  0,7       23268961  pipe
>> java    2171 root  34w  FIFO  0,7       23268961  pipe
>> java    2171 root  35r  0000  0,8    0  23268962  eventpoll
>> java    2171 root  36w  FIFO  0,7       23314889  pipe
>> ...
>> ...
>> ...
>>
>> What in DFS client (if any) could have caused this? Could it be
>> something else?
>>
>> Is it not ideal to use an HDFS writer to directly write logs from Apache
>> into HDFS?
>>
>> Is 'Chukwa' (hadoop log collection and analysis framework contributed by
>> Yahoo) a better fit for our case?
>>
>> I would highly appreciate help on any or all of the above questions.
>>
>> Thanks and Regards
>>
>> -Ankur
>>
>
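
P.S. For reference, here is a stripped-down sketch of the rotation logic I
mentioned in point 4 above. The class, method and file names are invented for
this mail (our actual writer is more involved); the point I want to verify is
whether the old stream can escape being closed, e.g. when close() throws while
the IPC client is retrying the namenode.

    import java.io.Closeable;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative only -- not our production class.
    public class RollingHdfsLogWriter implements Closeable {

        private final FileSystem fs;
        private final Path logDir;
        private FSDataOutputStream current;  // shared by writer and roller threads

        public RollingHdfsLogWriter(Configuration conf, Path logDir) throws IOException {
            this.fs = FileSystem.get(conf);
            this.logDir = logDir;
            this.current = openNewFile();
        }

        private FSDataOutputStream openNewFile() throws IOException {
            Path p = new Path(logDir, "access." + System.currentTimeMillis() + ".log");
            return fs.create(p);
        }

        // Writer thread: called for every log line handed over by Apache.
        public synchronized void write(String line) throws IOException {
            current.write(line.getBytes("UTF-8"));
        }

        // Roller thread: wakes up periodically, closes the old stream and opens
        // a new one. This is where I suspect handles could leak in our real code,
        // e.g. if close() throws and the old FSDataOutputStream (plus the
        // DFSClient sockets/selectors behind it) is simply dropped.
        public synchronized void roll() throws IOException {
            FSDataOutputStream old = current;
            try {
                old.close();              // may throw if the namenode is unreachable
            } finally {
                current = openNewFile();  // always move on to a fresh file
            }
        }

        // Shutdown path: close whatever is currently open.
        public synchronized void close() throws IOException {
            current.close();
        }
    }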
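
And here is the shape of the sun.misc signal handler I mentioned, again
simplified and with invented names. It only makes sure the open HDFS stream is
closed when Apache sends us SIGTERM or SIGINT:

    import java.io.Closeable;

    import sun.misc.Signal;
    import sun.misc.SignalHandler;

    // Illustrative only -- not our production class.
    public class CleanShutdown {

        // Install handlers for the signals Apache sends when it stops or restarts us.
        public static void install(final Closeable logStream) {
            SignalHandler handler = new SignalHandler() {
                public void handle(Signal sig) {
                    try {
                        logStream.close();  // release the FSDataOutputStream and its sockets
                    } catch (Exception ignored) {
                        // nothing useful left to do at this point
                    } finally {
                        System.exit(0);
                    }
                }
            };
            Signal.handle(new Signal("TERM"), handler);
            Signal.handle(new Signal("INT"), handler);
        }
    }

    // Usage (in the writer's main):
    //     RollingHdfsLogWriter writer = new RollingHdfsLogWriter(conf, logDir);
    //     CleanShutdown.install(writer);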