If I understand correctly, the secondary namenode merges the edits log into the fsimage and reduces the edits log size. That is likely the root of your problems: 8.5G seems very large and is probably putting a strain on your master server's memory and I/O bandwidth.
Why do you not have a secondary namenode?

If you do not have the memory on the master, I would look into stopping a datanode/tasktracker on one of the slaves and running the secondary namenode on that machine.

Let it run for a while and watch the secondary namenode's log; you should see your edits log get smaller.
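Something along these lines should work; the exact scripts and the name directory path may differ a bit on your version/config, so double check against your install (<dfs.name.dir> below is a placeholder for wherever your image lives):

    # on the slave you freed up
    bin/hadoop-daemon.sh stop tasktracker
    bin/hadoop-daemon.sh stop datanode

    # add that host to conf/masters, then start the secondary namenode there
    bin/hadoop-daemon.sh start secondarynamenode

    # on the master, the edits file should shrink after each checkpoint
    ls -lh <dfs.name.dir>/current/edits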

I am not an expert but that would be my first action.

Billy



"Abhijit Bagri" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
Hi,

This is a long mail, as I have tried to put in as much detail as might
help any of the Hadoop devs/users to help us out. The gist is this:

We have a long-running Hadoop system (masters not restarted for about
3 months). We have recently started seeing the DFS responding very
slowly, which has resulted in failures on a system which depends on
Hadoop. Further, the DFS seems to be in an unstable state (i.e. if fsck
is a good representation, which I believe it is), and the edits file
has grown very large (8.5G, details below).

These are the details (for a quicker read, skip to the questions at
the end of the mail and return here later):

Hadoop version: 0.15.3 on 32-bit systems.

Number of slaves: 12
Slaves heap size: 1G
Namenode heap: 2G
Jobtracker heap: 2G

The namenode and jobtracker have not been restarted for about 3
months. We did restart the slaves (all of them within a few hours) a
few times for some maintenance in between, though. We do not have a
secondary namenode in place.

There is another system, X, which talks to this Hadoop cluster. X writes
to the Hadoop DFS and submits jobs to the JobTracker. The number of
jobs submitted to Hadoop so far is over 650,000 (going by the job
ids); each job may read/write multiple files and has several dependent
libraries which it loads from the Distributed Cache.

Recently, we started seeing several timeouts while X tries to
read/write to the DFS. This in turn results in the DFS becoming very
slow to respond; writes are especially slow. The trace we get in the
logs is:

java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.net.SocketInputStream.read(SocketInputStream.java:182)
        at java.io.DataInputStream.readShort(DataInputStream.java:284)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1660)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1733)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
        ...

Also, datanode logs show a lot of traces like these:

2008-11-14 21:21:49,429 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_-1310124865741110666 is valid, and cannot be written to.
        at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:551)
        at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1257)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:901)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
        at java.lang.Thread.run(Thread.java:595)

and these

2008-11-14 21:21:50,695 WARN org.apache.hadoop.dfs.DataNode: java.io.IOException: Error in deleting blocks.
        at org.apache.hadoop.dfs.FSDataset.invalidate(FSDataset.java:719)
        at org.apache.hadoop.

The first one seems to be the same as HADOOP-1845. We have been seeing
it for a long time, but as HADOOP-1845 says, it hasn't been creating
any problems. We have thus ignored it for a while, but these recent
problems have raised eyebrows again over whether it really is a
non-issue.

We also saw a spike in the number of TCP connections on system X, and
an lsof on X shows a lot of connections to datanodes in the CLOSE_WAIT
state.
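For reference, this is roughly how we are counting them on X (assuming the default datanode port, 50010; adjust if yours differs):

    # connections from X to datanodes stuck in CLOSE_WAIT
    netstat -tan | grep ':50010' | grep CLOSE_WAIT | wc -l

    # same thing via lsof
    lsof -n -i TCP | grep ':50010' | grep CLOSE_WAIT | wc -l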

While investigating the issue, the fsck output started saying that DFS
is corrupt. The output is:

Total size:    152873519485 B
Total blocks:  1963655 (avg. block size 77851 B)
Total dirs:    1390820
Total files:   1962680
Over-replicated blocks:        111 (0.0061619785 %)
Under-replicated blocks:       13 (6.6203077E-4 %)
Target replication factor:     3
Real replication factor:       3.0000503


The filesystem under path '/' is CORRUPT

After running fsck -move, the system went back to a healthy state.
However, after some time it again started showing up as corrupt, and
another fsck -move restored it to a healthy state.
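For completeness, these are the fsck invocations we have been using (standard fsck options, to the best of my knowledge):

    # overall health report
    bin/hadoop fsck /

    # more detail on the affected files and blocks
    bin/hadoop fsck / -files -blocks -locations

    # move files with missing blocks to /lost+found
    # (this is what restored the healthy state above)
    bin/hadoop fsck / -move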

We are considering restarting the masters as a possible solution. I
have earlier had issues restarting a master with a 2G heap and an
edits file of about 2.5G, so we decided to look up the size of the
edits file. It is now 8.5G! Since we are on a 32-bit system, we cannot
push the JVM heap beyond about 2.5G, so I am not sure that, if we
bring down the master, we will be able to restart it at all. Please
note that we do not have a secondary namenode.
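To make this concrete, this is roughly what we are checking on the master (<dfs.name.dir> is a placeholder for wherever dfs.name.dir points; the on-disk layout may differ across versions):

    # size of the image and edits under dfs.name.dir
    ls -lh <dfs.name.dir>/current/fsimage <dfs.name.dir>/current/edits

    # cold copy of the metadata before we touch anything
    # (ideally taken while the namenode is quiescent)
    tar czf namenode-meta-backup.tar.gz -C <dfs.name.dir> .

    # daemon heap is set via HADOOP_HEAPSIZE (in MB) in conf/hadoop-env.sh;
    # ~2.5G is about the practical ceiling on a 32-bit JVM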


Questions for hadoop-users and hadoop-dev:

About Hadoop restart:

1. Does the namenode try to load the edits file into memory on startup?
2. Will we be able to restart the namenode without a format at all?
3. Apart from taking a backup, is there a safe way to restart the
system and retain the DFS data?

About DFS being slow:

4. Is HADOOP-1845 a non-issue? If it is, what may cause the DFS to
become slow and lead to SocketTimeoutException on the client?
5. What may cause a spike in TCP connections and several connection
fds left open in CLOSE_WAIT state? Is HADOOP-3071 relevant?
6. Is a restart of the master a solution at all?

About the health of the DFS:

7. What are the implications of the DFS going into CORRUPT state every
once in a while? Should this increase my caffeine intake? More
importantly, is the Hadoop system on the verge of collapse?
8. Would restarting all the slaves within a few hours, without
restarting the masters, have an implication on DFS health? Note that
this was done to prevent any downtime for system X.

Our target is to return to a state where the DFS functions normally
(as it was doing until some time back) and, of course, to maintain a
healthy DFS with its current data. Please provide any insights.

To add to the above,

- Due to the way X depends on this Hadoop system's API, DFS data, and
internal data structures, it is non-trivial for us to jump to a higher
version of Hadoop in the very near future. We could apply some patches
that go with 0.15.3.
- Loss of DFS data would be fatal. I believe that once I format, I
will get a namespace id mismatch unless I clear the Hadoop data
directories from the slaves; is this correct? (See the sketch after
this list.)
- We did not start a secondary namenode due to HADOOP-3069. I have a
locally patched 0.15.3 with this JIRA; will it help, and is it safe,
if we add the secondary namenode now?
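To make the namespace id question concrete, this is how I understand the check works (<dfs.name.dir> and <dfs.data.dir> are placeholders for our configured directories; the VERSION file location may vary by version):

    # on the namenode
    grep namespaceID <dfs.name.dir>/current/VERSION

    # on each datanode; as far as I understand, this must match the
    # namenode's value, otherwise the datanode refuses to register,
    # which is the mismatch I would expect after a format
    grep namespaceID <dfs.data.dir>/current/VERSION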

We have tried to look through JIRA for more leads, but are not sure
which issues fit our case specifically. Pointers to JIRA issues are
also welcome. Of course, please feel free to address any part of the
problem :)

Thanks
Bagri