Hi Ankur,

1. Which version of Hadoop are you using?
2. What is the max file-descriptor setting (ulimit -n) on the Linux
boxes on which the Apache servers are running?
3. When you do an lsof, how many descriptors are listed as open?
4. Are there multiple threads in the Apache server that write logs?
Are these threads created dynamically?
5. Does one Apache server open an HDFS file and keep writing to it for
its entire lifetime? Or does it close it and reopen a new log file
periodically? (See the sketch below.)
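
For 5, the close-and-reopen pattern I mean is roughly this (all class,
path, and interval choices here are hypothetical; error handling
omitted):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical writer that closes and reopens its HDFS log file
    // once an hour instead of holding one stream open forever.
    public class RollingHdfsLogWriter {
        private static final long ROLL_INTERVAL_MS = 60L * 60 * 1000;

        private final FileSystem fs;
        private FSDataOutputStream out;
        private long lastRoll;

        public RollingHdfsLogWriter(Configuration conf) throws IOException {
            this.fs = FileSystem.get(conf);
            roll();
        }

        // Close the old stream (freeing its sockets and lease) and
        // start a new, timestamped log file.
        private void roll() throws IOException {
            if (out != null) {
                out.close();
            }
            out = fs.create(new Path("/logs/access." + System.currentTimeMillis()));
            lastRoll = System.currentTimeMillis();
        }

        public synchronized void write(byte[] line) throws IOException {
            if (System.currentTimeMillis() - lastRoll > ROLL_INTERVAL_MS) {
                roll();
            }
            out.write(line);
        }

        public synchronized void close() throws IOException {
            out.close();
        }
    }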

thanks,
dhruba
On Fri, Sep 26, 2008 at 8:40 AM, Raghu Angadi <[EMAIL PROTECTED]> wrote:
>
> What does jstack show for this?
>
> Probably better suited for jira discussion.
> Raghu.
> Goel, Ankur wrote:
>>
>> Hi Folks,
>>
>>    We have developed a simple log writer in Java that is plugged into
>> Apache's custom log and writes log entries directly to our Hadoop
>> cluster (50 machines, quad core, each with 16 GB RAM and an 800 GB
>> hard disk; one machine as a dedicated NameNode, another machine as
>> JobTracker & TaskTracker + DataNode).
>>
>> There are around 8 Apache servers dumping logs into HDFS via our writer.
>> Everything was working fine, and we were getting around 15-20 MB of
>> log data per hour from each server.
>>
>>
>> Recently we have been experiencing problems with 2-3 of our Apache
>> servers, where a file is opened by the log writer in HDFS for writing
>> but never receives any data.
>>
>> Looking at the Apache error logs shows the following errors:
>>
>> 08/09/22 05:02:13 INFO ipc.Client: java.io.IOException: Too many open files
>>        at sun.nio.ch.IOUtil.initPipe(Native Method)
>>        at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
>>        at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
>>        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:312)
>>        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:227)
>>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155)
>>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:149)
>>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:122)
>>        at java.io.FilterInputStream.read(FilterInputStream.java:116)
>>        at org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:203)
>>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>        at java.io.DataInputStream.readInt(DataInputStream.java:370)
>>        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:289)
>>
>>        ...
>>
>>        ...
>>
>>  This is followed by connection errors saying
>> "Retrying to connect to server: hadoop-server.com:9000. Already tried
>> 'n' times".
>>
>> (same as above) ...
>>
>> ....
>>
>> The writer keeps retrying constantly (it is set up so that it waits
>> and retries; see the sketch below).
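>>
>> A simplified sketch of that retry loop (the names and wait interval
>> are illustrative, not our exact code):
>>
>>     import java.io.IOException;
>>     import org.apache.hadoop.fs.FSDataOutputStream;
>>     import org.apache.hadoop.fs.FileSystem;
>>     import org.apache.hadoop.fs.Path;
>>
>>     public class RetryingOpen {
>>         // Simplified: keep retrying the HDFS create until it
>>         // succeeds, sleeping between attempts instead of giving up.
>>         static FSDataOutputStream openWithRetry(FileSystem fs, Path p)
>>                 throws InterruptedException {
>>             while (true) {
>>                 try {
>>                     return fs.create(p);
>>                 } catch (IOException e) {
>>                     Thread.sleep(10L * 1000);  // wait, then retry
>>                 }
>>             }
>>         }
>>     }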
>>
>>
>> Doing an lsof on the log-writer Java process shows that it is stuck
>> with a lot of pipe/eventpoll descriptors and eventually ran out of
>> file handles. Below is part of the lsof output:
>>
>>
>> lsof -p 2171
>> COMMAND  PID USER   FD   TYPE             DEVICE     SIZE     NODE NAME
>> ....
>>
>> ....
>> java    2171 root   20r  FIFO                0,7          24090207 pipe
>> java    2171 root   21w  FIFO                0,7          24090207 pipe
>> java    2171 root   22r  0000                0,8        0 24090208 eventpoll
>> java    2171 root   23r  FIFO                0,7          23323281 pipe
>> java    2171 root   24r  FIFO                0,7          23331536 pipe
>> java    2171 root   25w  FIFO                0,7          23306764 pipe
>> java    2171 root   26r  0000                0,8        0 23306765 eventpoll
>> java    2171 root   27r  FIFO                0,7          23262160 pipe
>> java    2171 root   28w  FIFO                0,7          23262160 pipe
>> java    2171 root   29r  0000                0,8        0 23262161 eventpoll
>> java    2171 root   30w  FIFO                0,7          23299329 pipe
>> java    2171 root   31r  0000                0,8        0 23299330 eventpoll
>> java    2171 root   32w  FIFO                0,7          23331536 pipe
>> java    2171 root   33r  FIFO                0,7          23268961 pipe
>> java    2171 root   34w  FIFO                0,7          23268961 pipe
>> java    2171 root   35r  0000                0,8        0 23268962 eventpoll
>> java    2171 root   36w  FIFO                0,7          23314889 pipe
>>
>> ...
>>
>> ...
>>
>> ...
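>>
>> If I am reading the lsof output right, each read-FIFO/write-FIFO/
>> eventpoll triple above is one java.nio Selector: on Linux a Selector
>> holds an epoll descriptor plus a wakeup pipe pair (this matches the
>> initPipe/EPollSelectorImpl frames in the stack trace), so a Selector
>> that is opened and never closed ties up three descriptors. A trivial
>> illustration (not our code):
>>
>>     import java.nio.channels.Selector;
>>
>>     public class SelectorFdLeak {
>>         public static void main(String[] args) throws Exception {
>>             for (int i = 0; i < 5; i++) {
>>                 Selector.open();  // each leaks an epoll fd + a pipe pair
>>             }
>>             // run `lsof -p <pid>` now: pipe/eventpoll entries as above
>>             Thread.sleep(60L * 1000);
>>         }
>>     }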
>>
>> What in the DFS client (if any) could have caused this? Could it be
>> something else?
>>
>> Is it a bad idea to use an HDFS writer to write logs directly from
>> Apache into HDFS?
>>
>> Is Chukwa (the Hadoop log collection and analysis framework
>> contributed by Yahoo!) a better fit for our case?
>>
>>
>> I would greatly appreciate help on any or all of the above questions.
>>
>>
>> Thanks and Regards
>>
>> -Ankur
>>
>>
>
>