[jira] [Commented] (HDFS-10301) Blocks removed by thousands due to falsely detected zombie storages

Walter Su (JIRA) Mon, 18 Apr 2016 19:18:08 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246996#comment-15246996
 ]


Walter Su commented on HDFS-10301:
----------------------------------

1. IPC reader is single-threaded by default. If it's multi-threaded, the order of 
putting rpc requests into {{callQueue}} is unspecified.
2. IPC {{callQueue}} is FIFO.
3. IPC Handler is multi-threaded. If 2 handlers are both waiting for the fsn lock, 
the entry order depends on the fairness of the lock.
bq. When constructed as fair, threads contend for entry using an 
*approximately* arrival-order policy. When the currently held lock is released 
either the longest-waiting single writer thread will be assigned the write 
lock... (quoted from 
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReentrantReadWriteLock.html)
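
To illustrate the point about handlers waiting on a fair lock, here is a minimal standalone sketch. It is not Hadoop code: the class {{FairLockOrder}}, the thread names, and the helper methods are made up for illustration, and it makes no assumption about how the fsn lock is actually constructed. The main thread holds the write lock while two "handler" threads queue up in a known order, and the sketch records the order in which they are then granted the lock.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FairLockOrder {
    // Returns the order in which two "handler" threads obtained the write lock.
    static List<String> run() throws InterruptedException {
        // true = fair mode; hypothetical stand-in for a namesystem-style lock
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
        List<String> order = new CopyOnWriteArrayList<>();

        Thread first = handler("handler-1", lock, order);
        Thread second = handler("handler-2", lock, order);

        // The main thread holds the write lock so that both handlers must wait.
        lock.writeLock().lock();
        try {
            first.start();
            awaitQueued(lock, first);   // handler-1 is now in the wait queue
            second.start();
            awaitQueued(lock, second);  // handler-2 is queued behind it
        } finally {
            lock.writeLock().unlock();
        }
        first.join();
        second.join();
        return order;
    }

    // Creates a thread that takes the write lock and records its own name.
    static Thread handler(String name, ReentrantReadWriteLock lock, List<String> order) {
        return new Thread(() -> {
            lock.writeLock().lock();
            try {
                order.add(name);
            } finally {
                lock.writeLock().unlock();
            }
        }, name);
    }

    // Polls until the given thread is waiting to acquire the lock.
    static void awaitQueued(ReentrantReadWriteLock lock, Thread t) throws InterruptedException {
        while (!lock.hasQueuedThread(t)) {
            Thread.sleep(1);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run()); // prints [handler-1, handler-2]
    }
}
```

With the lock constructed as fair, the two waiting writers are served in the order they queued. Constructing it with {{new ReentrantReadWriteLock(false)}} removes that ordering guarantee, so the result would then depend on scheduling.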

I think if the DN doesn't get an ack from the NN, it shouldn't assume anything 
about the arrival/processing order (esp. when it re-establishes a connection). 
Well, I'm still curious about how the interleaving happened. Any thoughts?

> Blocks removed by thousands due to falsely detected zombie storages
> -------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Priority: Critical
>         Attachments: zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can time out sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time, can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
