[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392309#comment-16392309 ]

Zephyr Guo edited comment on HDFS-13243 at 3/9/18 2:55 AM:
-----------------------------------------------------------

[~jojochuang], thanks for reviewing.

{quote}
1.It seems to me the root of problem is that client would call fsync() with an 
incorrect length (shorter than what it is supposed to sync). If that's the case 
you should fix the client (DFSOutputStream), rather than the NameNode.
{quote}

The client calls sync() with a *correct* length, but the sync request could be 
sent after close().
The root cause is that DFSOutputStream#flushOrSync() is not thread-safe. See the 
following simplified code:

 
{code:java}
synchronized (this) {
  ... // code before sending the request
}

// **The fsync request is sent here, but it is NOT inside a synchronized block**
if (getStreamer().getPersistBlocks().getAndSet(false) || updateLength) {
  try {
    dfsClient.namenode.fsync(src, fileId, dfsClient.clientName, lastBlockLength);
  } catch (IOException ioe) {
    // Deal with ioe
  }
}

synchronized (this) {
  ... // code after sending the request
}
{code}
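
To make the client-side direction concrete, here is a minimal self-contained 
sketch (SketchOutputStream and sendFsyncRpc() are hypothetical stand-ins, not the 
real DFSOutputStream code): the closed check and the length snapshot are taken 
under the same lock that close() uses, and only then is the RPC issued. This only 
covers the client-side ordering; the server-side check discussed in point 5 would 
still be useful.

{code:java}
import java.io.IOException;

// Hypothetical stand-in for DFSOutputStream; real HDFS classes are not used here.
public class SketchOutputStream {
  private boolean closed = false;
  private long lastBlockLength = 0;

  public synchronized void write(int numBytes) {
    lastBlockLength += numBytes;
  }

  public synchronized void close() {
    closed = true;
  }

  // One possible race-free shape of flushOrSync(): the state check and the
  // length snapshot happen under the same lock that close() takes, so the
  // value handed to the RPC can no longer be produced after a local close().
  public void flushOrSync() throws IOException {
    final long lengthToSync;
    synchronized (this) {
      if (closed) {
        throw new IOException("Stream already closed; refusing to fsync");
      }
      lengthToSync = lastBlockLength; // snapshot taken under the lock
    }
    // The RPC itself may stay outside the lock (presumably for performance);
    // it only uses the snapshot taken above.
    sendFsyncRpc(lengthToSync);
  }

  private void sendFsyncRpc(long length) {
    // Placeholder for dfsClient.namenode.fsync(src, fileId, clientName, length)
    System.out.println("fsync(length=" + length + ")");
  }
}
{code}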


{quote}
2. Looking at the log, your minimal replication number is 2, rather than 1. 
That's very unusual. In my past experience a lot of weird behavior like this 
could arise when you have that kind of configuration.
{quote}
I'm not sure whether data reliability would be affected if the minimal 
replication is set to 1. Do you have any experience with this?
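
For clarity, this is the setting I mean; a minimal MiniDFSCluster sketch 
(assuming the usual Hadoop test dependencies; the reproduction steps themselves 
are omitted) that pins the minimal replication the way my cluster has it:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class MinReplicationSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    // My cluster runs with dfs.namenode.replication.min = 2 (the value seen in
    // the log); switching it to 1 would show whether the block report still
    // flags the last block as corrupt.
    conf.setInt(DFSConfigKeys.DFS_NAMENODE_REPLICATION_MIN_KEY, 2);

    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .numDataNodes(3)
        .build();
    try {
      cluster.waitActive();
      // ... open a file, block DataNode heartbeats, then race sync() against close() ...
    } finally {
      cluster.shutdown();
    }
  }
}
{code}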

{quote}
3.And why is close() in the picture? IMHO you don't even need to close(). 
Suppose you block DataNode heartbeat, and let client keep the file open and 
then call sync(), the last block's state remains in COMMITTED. Would that cause 
the same behavior?
{quote}
close() must be called after sync() (see the root cause above). If you don't call 
close(), the last block's state can't change to COMMITTED, right?

{quote}
4.Looking at the patch, I would like to ask you to stay away from using 
reflection. You could refactor FSNamesystem and DFSOutputStream to return a new 
FSNamesystem/DFSOutputStream object and override them in the test code. That 
way, you don't need to introduce new configurations too. And it'll be much 
cleaner.
{quote}
I don't understand this. There is no API to set the FSNamesystem implementation 
in MiniCluster. Could you point me to a sample in another test case? I will 
rewrite the patch.
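
Just to check my understanding of the suggestion, is it the kind of seam shown 
below? This is a hypothetical, self-contained sketch (NameNodeLike, Namesystem 
and createNamesystem() are made-up names, not existing Hadoop APIs): the 
production class exposes a protected factory, and the test subclass overrides it 
instead of using reflection.

{code:java}
// Hypothetical production class: exposes a protected factory ("seam") that
// tests can override, instead of tests poking at private fields via reflection.
class Namesystem {
  void fsync(String src, long length) {
    // production behaviour
  }
}

class NameNodeLike {
  private final Namesystem namesystem;

  NameNodeLike() {
    this.namesystem = createNamesystem(); // seam used by subclasses
  }

  // Protected factory method: the seam I think the review is describing.
  protected Namesystem createNamesystem() {
    return new Namesystem();
  }

  Namesystem getNamesystem() {
    return namesystem;
  }
}

// In test code: subclass and override the seam; no reflection and no new
// configuration key needed.
class InstrumentedNameNode extends NameNodeLike {
  @Override
  protected Namesystem createNamesystem() {
    return new Namesystem() {
      @Override
      void fsync(String src, long length) {
        // e.g. delay or reorder the call here to reproduce the race deterministically
        super.fsync(src, length);
      }
    };
  }
}
{code}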

{quote}
5.I don't understand the following code.
{quote}
My fixed code is not the final version, because I don't know whether the current 
DFSOutputStream#flushOrSync() implementation is intentional. We send the fsync 
RPC without holding the lock in the client, maybe for a performance benefit? If 
it is for performance, we only need to fix the server side; if not, we need to 
fix both the server side and the client.
On the server side, we could log a warning for a wrong length and throw an 
exception for an invalid state. Is this better than the current version?
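
Concretely, the server-side behaviour I have in mind would look roughly like this 
sketch (hypothetical names, not the code in the current patch): warn when the 
reported length merely disagrees, and reject the fsync when the last block is no 
longer under construction.

{code:java}
import java.io.IOException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch of the server-side validation; BlockInfoLike and its
// fields stand in for the NameNode's last-block metadata.
class FsyncValidator {
  private static final Logger LOG = LoggerFactory.getLogger(FsyncValidator.class);

  enum UcState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

  static class BlockInfoLike {
    UcState state;
    long numBytes;
  }

  /** Called while handling an fsync RPC for the file's last block. */
  void checkFsync(String src, BlockInfoLike lastBlock, long reportedLength)
      throws IOException {
    if (lastBlock.state != UcState.UNDER_CONSTRUCTION) {
      // Invalid state: the file was already closed (block COMMITTED/COMPLETE),
      // so a late fsync must not change the recorded length.
      throw new IOException("Cannot fsync " + src + ": last block is "
          + lastBlock.state + ", expected UNDER_CONSTRUCTION");
    }
    if (reportedLength < lastBlock.numBytes) {
      // Wrong length: log a warning instead of silently overwriting.
      LOG.warn("fsync for {} reports length {} smaller than recorded {}",
          src, reportedLength, lastBlock.numBytes);
    }
  }
}
{code}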

 

 


> Get CorruptBlock because of calling close and sync in same time
> ---------------------------------------------------------------
>
>                 Key: HDFS-13243
>                 URL: https://issues.apache.org/jira/browse/HDFS-13243
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.2, 3.2.0
>            Reporter: Zephyr Guo
>            Assignee: Zephyr Guo
>            Priority: Critical
>             Fix For: 3.2.0
>
>         Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch
>
>
> An HDFS file might get broken because of corrupt block(s) that can be produced 
> by calling close and sync at the same time.
> When the close call is not successful, the UC block's state changes to 
> COMMITTED, and if a queued sync request is then popped and processed, the sync 
> operation changes the last block's length.
> After that, the DataNode reports all received blocks to the NameNode, which 
> checks the block length of all COMMITTED blocks. The block length recorded in 
> NameNode memory now differs from the one reported by the DataNode, and 
> consequently, the last block is marked as corrupt because of the inconsistent 
> length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:40,162 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> {panel}


