[
https://issues.apache.org/jira/browse/HDFS-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Frode Halvorsen updated HDFS-7480:
----------------------------------
Description:
A small cluster has 8 servers with 32 GB RAM each.
Two are namenodes (HA-configured), six are datanodes (8x3 TB disks configured
with RAID as one 21 TB volume).
The cluster receives on average 400,000 small files each day. I started archiving
each day's files as a separate HAR archive. After deleting the original files for one
month, the namenodes started acting up really badly.
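For reference, the daily archiving is just the standard 'hadoop archive' tool; a
minimal sketch of the kind of job I run, with hypothetical paths
/ingest/2014-12-01 and /archive, looks like:

    # Pack one day's small files into a single HAR archive
    hadoop archive -archiveName 2014-12-01.har -p /ingest 2014-12-01 /archive
    # Once the archive is verified, remove the originals
    hdfs dfs -rm -r /ingest/2014-12-01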
When restarting them, both the active and the passive node seem to work OK for some
time, but then start reporting a lot of blocks that belong to no file, and the
namenode just spins on those messages in a massive loop. If the passive node enters
the loop first, it also influences the active node in such a way that it's no longer
possible to archive new files. If the active node also enters this loop, it
suddenly dies without any error message.
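To tell which node is in which role while this is happening, the HA admin tool can
be queried; a small sketch, assuming the NameNode IDs are nn1 and nn2 as configured
under dfs.ha.namenodes.<nameservice>:

    # Report whether each configured namenode is active or standby
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2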
The only way I'm able to get rid of the problem is to start decommissioning
nodes, watching the cluster closely to avoid downtime, and making sure every
datanode gets a 'clean' start. After all datanodes have been decommissioned (in
turns) and restarted with clean disks, the problem is gone. But if I then
delete a lot of files in a short time, the problem starts again...
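The rolling decommission itself follows the usual exclude-file procedure; a sketch,
assuming dfs.hosts.exclude points at a hypothetical /etc/hadoop/conf/dfs.exclude:

    # Mark one datanode for decommissioning
    echo "datanode1.example.com" >> /etc/hadoop/conf/dfs.exclude
    hdfs dfsadmin -refreshNodes
    # Watch replication progress until the node reports 'Decommissioned'
    hdfs dfsadmin -report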
The main problem, I think, is that receiving and reporting those blocks
takes so many resources that the namenode is too busy to tell the datanodes
to delete them.
If the active namenode enters the loop, it does the 'right' thing by
telling the datanodes to invalidate the blocks, but the number of blocks is so
massive that the namenode doesn't do anything else. Just now, I have about
1,200-1,400 log entries per second on the passive node.
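For what it's worth, the number of block invalidations the namenode hands a
datanode per heartbeat is bounded by dfs.block.invalidate.limit (1000 by default,
if I read the defaults right), which may be why the backlog drains so slowly; the
effective value can be checked with:

    # Show the configured per-heartbeat invalidation batch size
    hdfs getconf -confKey dfs.block.invalidate.limit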
> Namenode loops on 'block does not belong to any file' after deleting many
> files
> --------------------------------------------------------------------------------
>
> Key: HDFS-7480
> URL: https://issues.apache.org/jira/browse/HDFS-7480
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.5.0
> Environment: CentOS - HDFS-HA (journal), zookeeper
> Reporter: Frode Halvorsen
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)