We are seeing a similar issue at Yahoo! as well. 'jmap -histo' and 'jmap
-histo:live' are turning out to be pretty helpful. Stay tuned.
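For reference, the invocations are just (substituting the client JVM's pid for <pid>):

jmap -histo <pid>         (histogram of all objects on the heap)
jmap -histo:live <pid>    (forces a full GC first, so only live objects are counted)

Diffing the two between runs shows which classes keep growing without being collected.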
How many threads do you expect to be doing HDFS I/O in your case? Both
the max and the normal cases are helpful.
Thanks,
Raghu.
Goel, Ankur wrote:
Hi Dhruba,
Thanks for the reply.
1. We are using Hadoop version 0.17.2.
2. The max file descriptor setting per process at the time the error
occurred was 1024. lsof -p <java-proc-id> confirms this, as the process
ran out of file handles after reaching the limit. Here is the snippet...
java 2171 root 0r FIFO 0,7 23261756 pipe
java 2171 root 1w CHR 1,3 2067 /dev/null
java 2171 root 2w FIFO 0,7 23261747 pipe
..
..
java 2171 root 1006w FIFO 0,7 26486656 pipe
java 2171 root 1007r 0000 0,8 0 26486657 eventpoll
java 2171 root 1008r FIFO 0,7 26492141 pipe
java 2171 root 1009w FIFO 0,7 26492141 pipe
java 2171 root 1010r 0000 0,8 0 26492142 eventpoll
java 2171 root 1011r FIFO 0,7 26497184 pipe
java 2171 root 1012w FIFO 0,7 26497184 pipe
java 2171 root 1013r 0000 0,8 0 26497185 eventpoll
java 2171 root 1014r FIFO 0,7 26514795 pipe
java 2171 root 1015w FIFO 0,7 26514795 pipe
java 2171 root 1016r 0000 0,8 0 26514796 eventpoll
java 2171 root 1017r FIFO 0,7 26510109 pipe
java 2171 root 1018w FIFO 0,7 26510109 pipe
java 2171 root 1019r 0000 0,8 0 26510110 eventpoll
java 2171 root 1020u IPv6 27549169 TCP server.domain.com:46551->hadoop.aol.com:9000 (ESTABLISHED)
java 2171 root 1021r FIFO 0,7 26527653 pipe
java 2171 root 1022w FIFO 0,7 26527653 pipe
java 2171 root 1023u IPv6 26527645 TCP server.domain.com:15245->hadoop.aol.com:9000 (CLOSE_WAIT)
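A quick way to tally what is piling up in that output (the 5th column of
lsof is the TYPE field) is:

lsof -p <java-proc-id> | awk '{print $5}' | sort | uniq -c | sort -rn

The snippet above is mostly pipe and eventpoll entries, which is what
java.nio selectors typically show up as in lsof on Linux.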
We tried upping the limit and restarting the servers, but the problem
recurred after 1-2 days.
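(By upping the limit I mean raising the open-file ulimit for the user
running the JVM before restart, e.g.

ulimit -n 8192

in the startup script, or the equivalent nofile entries in
/etc/security/limits.conf. As noted, that only postpones the failure
while the descriptors keep leaking.)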
3. Yes, there are multiple threads in the Apache server, and they are
created dynamically.
4. The Java log writer plugged into the Apache custom log closes the
current log file and opens a new one periodically. The log writer has
two threads: one writes data to an FSDataOutputStream, and another wakes
up periodically to close the old stream and open a new one. I am trying
to see if this is the place where file handles could be leaking.
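To make the rotation logic concrete, here is a stripped-down sketch of the
pattern (class and method names are illustrative, not our actual code). The
spot I am suspicious of is roll(): if the old stream's close() is ever
skipped or its exception swallowed, the sockets and pipes behind it stay open.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RollingHdfsLogWriter {

    private final FileSystem fs;
    private FSDataOutputStream out;      // current log file, guarded by 'this'

    public RollingHdfsLogWriter(Configuration conf, Path firstFile) throws IOException {
        this.fs = FileSystem.get(conf);
        this.out = fs.create(firstFile);
    }

    // Called by the writer thread for every log record.
    public synchronized void write(byte[] record) throws IOException {
        out.write(record);
    }

    // Called by the roller thread periodically.
    public synchronized void roll(Path nextFile) throws IOException {
        FSDataOutputStream old = out;
        out = fs.create(nextFile);       // open the new file first
        old.close();                     // the suspect spot: a missed close here leaks fds
    }

    // Called from the shutdown path (see the signal handler below).
    public synchronized void close() throws IOException {
        out.close();
    }
}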
Another thing to note is that we have a signal handler implementation
that uses the sun.misc package. The signal handler is installed in the
Java processes to ensure that when Apache sends the Java process SIGTERM
or SIGINT, we close the file handles properly.
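For completeness, the handler is roughly of this shape (simplified;
logWriter stands in for whatever object holds the open streams):

import java.io.Closeable;
import sun.misc.Signal;
import sun.misc.SignalHandler;

public class CleanShutdown {

    // Install handlers so that SIGTERM/SIGINT from Apache close our HDFS streams.
    public static void install(final Closeable logWriter) {
        SignalHandler handler = new SignalHandler() {
            public void handle(Signal sig) {
                try {
                    logWriter.close();   // flush and release the file handles
                } catch (Exception e) {
                    // best effort; the process is going down anyway
                } finally {
                    System.exit(0);
                }
            }
        };
        Signal.handle(new Signal("TERM"), handler);
        Signal.handle(new Signal("INT"), handler);
    }
}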
I will do some more analysis of our code to find out whether it's an
issue in our code or in the HDFS client. If I find it's an HDFS client
issue, I'll move this discussion to a Hadoop JIRA.
Thanks and Regards
-Ankur