[ 
https://issues.apache.org/jira/browse/HADOOP-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raghu Angadi updated HADOOP-3810:
---------------------------------

    Attachment: simon-namenode.PNG

The attached simon graph shows a few stats from 19:00 to 00:00. The 
namenode was restarted around 23:07. There is not a lot of write traffic. The 
fields plotted are:

- blocks written per sec
- num of heartbeats
- num of threads blocked
- gc/min (what does this mean?)
- gctime%



> NameNode seems unstable on a cluster with little space left
> -----------------------------------------------------------
>
>                 Key: HADOOP-3810
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3810
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.1
>            Reporter: Raghu Angadi
>            Assignee: Raghu Angadi
>         Attachments: simon-namenode.PNG
>
>
> NameNode seems not very responsive and unstable when the cluster has very 
> little space left. The clients time out. The main problem is that it is not 
> clear to the user what is going on. Once I have more details about a NameNode 
> that was in this state, I will fill them in here.
> If there is not enough space left on a cluster, it is ok for clients to 
> receive something like a "DiskOutOfSpace" exception. 
> Right now it looks like NameNode tries too hard to find a node with any space 
> left and ends up being slow to respond to clients. If the CPU taken by 
> chooseTarget() is the main cause, there are two possible fixes:
> # chooseTarget() iterates and takes quite a bit of CPU for allocating 
> datanodes. Usually this is not much of a problem. It takes even more CPU when 
> it needs to search multiple racks for a datanode. We could probably reduce 
> some CPU for these searches. The benefit should be measurable.
> # Once NameNode cannot find any datanode that has space on a rack, it could 
> mark the rack as "full" and skip searching the rack for the next minute or 
> so. This flag gets cleared after a minute or if any new node is added to the 
> rack.
> #* Of course, this might not be optimal w.r.t. disk space usage, but only for 
> a short duration. Once a cluster is mostly full, the user does expect errors.
> #* On the flip side, this fix does not require an extremely CPU-optimized 
> version of chooseTarget(). 
> #* I think it is reasonable for NameNode to throw a DiskOutOfSpace exception, 
> even though it could have found space if it searched much more extensively.
> ---
> edit : minor changes
>  
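
The second fix in the description (flag a rack as "full" for about a minute, and
clear the flag early when a node joins the rack) could be sketched roughly as
below. This is only an illustration, not code from Hadoop or from any patch on
this issue: the class RackFullCache, its method names, and the injected clock
are all hypothetical, and it assumes Java 8 or later. A caller such as
chooseTarget() would be expected to check isFull() before searching a rack.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical helper (not a Hadoop class): remembers racks recently found full. */
class RackFullCache {
    // rack name -> time (ms) until which the rack is treated as full
    private final Map<String, Long> fullUntilMillis = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected so expiry can be tested

    RackFullCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Call when a search of the rack found no datanode with space. */
    void markFull(String rack) {
        fullUntilMillis.put(rack, clock.getAsLong() + ttlMillis);
    }

    /** Call when a new node joins the rack; clears the flag immediately. */
    void nodeAdded(String rack) {
        fullUntilMillis.remove(rack);
    }

    /** True while the rack should be skipped; expired flags are dropped lazily. */
    boolean isFull(String rack) {
        Long until = fullUntilMillis.get(rack);
        if (until == null) {
            return false;
        }
        if (clock.getAsLong() >= until) {
            // Remove only if the entry was not refreshed concurrently.
            fullUntilMillis.remove(rack, until);
            return false;
        }
        return true;
    }
}
```

With a one minute TTL, new RackFullCache(60_000L, System::currentTimeMillis)
would skip a flagged rack for at most a minute, which matches the trade-off
noted above: space usage may be briefly suboptimal, but no CPU is spent
re-searching a rack that was just found full.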

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
