Frode Halvorsen created HDFS-7480:
-------------------------------------
Summary: Namenodes loops on 'block does not belong to any file'
after deleting many files
Key: HDFS-7480
URL: https://issues.apache.org/jira/browse/HDFS-7480
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 2.5.0
Environment: CentOS - HDFS-HA (journal), zookeeper
Reporter: Frode Halvorsen
A small cluster has 8 servers with 32 GB of RAM each.
Two are namenodes (HA-configured) and six are datanodes (8x3 TB disks configured
with RAID as one 21 TB drive).
The cluster receives an average of 400,000 small files each day. I started
archiving (HAR) each day's files as a separate archive. After deleting the
original files for one month, the namenodes started acting up really badly.
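To put a rough number on that deletion, here is a back-of-the-envelope sketch.
The 30-day month and the replication factor of 3 (the HDFS default) are
assumptions, not values taken from this cluster, and the result is only a lower
bound that assumes every file is non-empty and fits in a single block.

```python
# Rough estimate of how many block replicas one month of deletes invalidates.
FILES_PER_DAY = 400_000   # from the report
DAYS_DELETED = 30         # assumption: "one month" taken as 30 days
REPLICATION = 3           # assumption: HDFS default, not stated in the report
DATANODES = 6             # from the report


def replicas_to_delete(files_per_day, days, replication):
    # One block per small file, so this is a lower bound on replicas.
    return files_per_day * days * replication


total = replicas_to_delete(FILES_PER_DAY, DAYS_DELETED, REPLICATION)
per_datanode = total // DATANODES
print(f"{total:,} replicas in total, about {per_datanode:,} per datanode")
```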
After a restart, both the active and the passive namenode seem to work OK for
some time, but then start to report a lot of blocks belonging to no file, and
the namenode just spins on those messages in a massive loop. If the passive node
enters the loop first, it also affects the active node in such a way that it's
no longer possible to archive new files. If the active node also enters this
loop, it suddenly dies without any error message.
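One way to see when the loop starts and how intense it gets is to count the
messages per minute in the namenode log. The following is a minimal sketch; it
assumes each log line begins with a log4j-style timestamp such as
"2014-12-05 10:23:45,123" and contains the phrase from the summary. Both the
timestamp layout and the helper name count_per_minute are assumptions.

```python
import re
from collections import Counter

# Phrase taken from the issue summary.
MARKER = "does not belong to any file"
# Assumption: lines start with "YYYY-MM-DD HH:MM:SS,mmm"; keep up to the minute.
TS_MINUTE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2})")


def count_per_minute(lines):
    """Count marker lines per minute; lines without a timestamp go to 'unknown'."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        match = TS_MINUTE.match(line)
        counts[match.group(1) if match else "unknown"] += 1
    return counts
```

The function accepts any iterable of lines, so an open log file can be passed
directly and the result printed sorted by minute.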
The only way I'm able to get rid of the problem is to start decommissioning
datanodes, watching the cluster closely to avoid downtime, and making sure every
datanode gets a 'clean' start. After all datanodes have been decommissioned (in
turn) and restarted with clean disks, the problem is gone. But if I then
delete a lot of files in a short time, the problem starts again...
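Since the trigger appears to be many deletes in a short time, one untested idea
is to spread the deletes out. The sketch below removes paths in small batches
with a pause between them, using the standard "hdfs dfs -rm -skipTrash" command.
The function name, batch size and pause length are made-up illustration values,
and whether throttling actually avoids the loop is not known.

```python
import subprocess
import time


def delete_in_batches(paths, batch_size=1000, pause_seconds=60,
                      run=subprocess.run, sleep=time.sleep):
    """Delete HDFS paths in batches, pausing between batches.

    `run` and `sleep` are injectable so the logic can be tried without a cluster.
    Returns the number of batches issued.
    """
    batches = 0
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        # -skipTrash deletes immediately instead of moving files to the trash.
        run(["hdfs", "dfs", "-rm", "-skipTrash", *batch], check=True)
        batches += 1
        # Pause only if more batches remain.
        if start + batch_size < len(paths):
            sleep(pause_seconds)
    return batches
```

Batch size and pause would have to be tuned by watching the namenode log while
the deletes run.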
The main problem (I think) is that receiving and reporting those blocks
takes so many resources that the namenodes are too busy to tell the datanodes
to delete those blocks.