Hi folks,

We have developed a simple log writer in Java that is plugged into Apache via a piped CustomLog and writes log entries directly to our Hadoop cluster (50 machines, quad-core, each with 16 GB RAM and an 800 GB hard disk; one machine is a dedicated Namenode, another runs the JobTracker, and the rest run a TaskTracker + DataNode).
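For context, here is roughly what our writer does. This is a simplified sketch, not our production code; the class name, log path, and command-line argument are illustrative:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Apache pipes each access-log line to this process on stdin
    // (e.g. CustomLog "|/usr/local/bin/hdfs-log-writer www1" combined);
    // we stream the lines into a per-server file on HDFS.
    public class HdfsLogWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://hadoop-server.com:9000");
            FileSystem fs = FileSystem.get(conf);
            // args[0] names the Apache server, e.g. "www1" (illustrative)
            Path file = new Path("/logs/apache/" + args[0] + "-"
                                 + System.currentTimeMillis() + ".log");
            FSDataOutputStream out = fs.create(file);
            try {
                BufferedReader in =
                    new BufferedReader(new InputStreamReader(System.in));
                String line;
                while ((line = in.readLine()) != null) {
                    out.write(line.getBytes("UTF-8"));
                    out.write('\n');
                }
            } finally {
                out.close();   // release the HDFS lease and datanode sockets
                fs.close();
            }
        }
    }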
There are around 8 Apache servers dumping logs into HDFS via this writer. Everything was working fine, and we were getting around 15-20 MB of log data per hour from each server. Recently we have been experiencing problems with 2-3 of the Apache servers: the log writer opens a file in HDFS for writing, but the file never receives any data. The Apache error log shows the following:

08/09/22 05:02:13 INFO ipc.Client: java.io.IOException: Too many open files
        at sun.nio.ch.IOUtil.initPipe(Native Method)
        at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
        at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:312)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:227)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:149)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:203)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:289)
        ...

This is followed by connection errors of the form "Retrying to connect to server: hadoop-server.com:9000. Already tried 'n' times", repeated over and over, and the writer keeps retrying constantly (it is set up to wait and retry on failure).

Running lsof on the log-writer Java process shows that it has accumulated a large number of pipe and eventpoll descriptors and eventually ran out of file handles. Here is part of the lsof output:

lsof -p 2171
COMMAND  PID USER  FD  TYPE DEVICE SIZE     NODE NAME
...
java    2171 root 20r  FIFO    0,7      24090207 pipe
java    2171 root 21w  FIFO    0,7      24090207 pipe
java    2171 root 22r  0000    0,8    0 24090208 eventpoll
java    2171 root 23r  FIFO    0,7      23323281 pipe
java    2171 root 24r  FIFO    0,7      23331536 pipe
java    2171 root 25w  FIFO    0,7      23306764 pipe
java    2171 root 26r  0000    0,8    0 23306765 eventpoll
java    2171 root 27r  FIFO    0,7      23262160 pipe
java    2171 root 28w  FIFO    0,7      23262160 pipe
java    2171 root 29r  0000    0,8    0 23262161 eventpoll
java    2171 root 30w  FIFO    0,7      23299329 pipe
java    2171 root 31r  0000    0,8    0 23299330 eventpoll
java    2171 root 32w  FIFO    0,7      23331536 pipe
java    2171 root 33r  FIFO    0,7      23268961 pipe
java    2171 root 34w  FIFO    0,7      23268961 pipe
java    2171 root 35r  0000    0,8    0 23268962 eventpoll
java    2171 root 36w  FIFO    0,7      23314889 pipe
...

My questions:

1. What in the DFS client (if anything) could have caused this?
2. Could it be something else entirely?
3. Is it a bad idea to use an HDFS writer to write logs directly from Apache into HDFS?
4. Would Chukwa (the Hadoop log collection and analysis framework contributed by Yahoo!) be a better fit for our case?

I would highly appreciate help on any or all of the above questions.

Thanks and regards,
Ankur
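P.S. In case it is relevant, the wait-and-retry logic looks roughly like the method below (again simplified and illustrative: copyStdinTo() stands in for the stdin pump loop in the sketch above, the 30-second wait is just an example, and java.io.IOException is imported in addition to the imports already shown). We are wondering whether a failed connection can keep its pipe/eventpoll descriptors alive across such retries:

    // Illustrative retry method for the HdfsLogWriter sketch above.
    // On an IOException we close whatever was open, wait, and re-open
    // the file in HDFS before trying again.
    private void writeWithRetries(FileSystem fs, Path file)
            throws InterruptedException {
        while (true) {
            FSDataOutputStream out = null;
            try {
                out = fs.create(file);
                copyStdinTo(out);     // placeholder for the stdin pump loop
                out.close();
                return;               // Apache closed the pipe; we are done
            } catch (IOException e) {
                if (out != null) {
                    try { out.close(); } catch (IOException ignored) { }
                }
                Thread.sleep(30 * 1000L);  // wait before retrying
            }
        }
    }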