[ https://issues.apache.org/jira/browse/HBASE-28260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17796845#comment-17796845 ]

Sean Busbey commented on HBASE-28260:
-------------------------------------

I don't think HDFS defines when a reader in the same client can expect to see 
writes into a file, so we're already relying on some shady business in that 
part of replication. But yes, I would expect an increase in the latency between 
when the RS acks a write and when the replication system can see it.

bq.  I think that's exactly what this change will do. During reconstruction, 
the NameNode uses one of the existing blocks as the primary. So if we set this 
flag, the local DataNode should not be an option there. 

Looking at this definition, I think that's only indirectly true, because HDFS 
will only "try" not to place any of the blocks on the DN that's on the same 
host as the RS:

https://hadoop.apache.org/docs/r3.3.6/api/org/apache/hadoop/fs/CreateFlag.html#NO_LOCAL_WRITE

I think in the case where the count of DNs <= the WAL replication factor we're 
still going to get a local write, and then the recovery process is still going 
to choose that replica as primary since it's local.

bq. Maybe you are speaking to the later phase of reconstruction when the new 
block is replicated to 2 other datanodes? Not sure we can control that. 

I'm talking about when reconstruction is choosing which existing block replica 
is primary: it would be nice if we could hint to HDFS that it should avoid 
picking the local replica as primary, rather than having to avoid a local 
write altogether.

> Possible data loss in WAL after RegionServer crash
> --------------------------------------------------
>
>                 Key: HBASE-28260
>                 URL: https://issues.apache.org/jira/browse/HBASE-28260
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Bryan Beaudreault
>            Priority: Major
>
> We recently had a production incident:
>  # RegionServer crashes, but local DataNode lives on
>  # WAL lease recovery kicks in
>  # Namenode reconstructs the block during lease recovery (which results in a 
> new genstamp). It chooses the replica on the local DataNode as the primary.
>  # Local DataNode reconstructs the block, so NameNode registers the new 
> genstamp.
>  # Local DataNode and the underlying host dies, before the new block could be 
> replicated to other replicas.
> This leaves us with a missing block, because the new genstamp block has no 
> replicas. The old replicas still remain, but are considered corrupt due to 
> GENSTAMP_MISMATCH.
> Thankfully we were able to confirm that the lengths of the corrupt blocks 
> were identical to the newly constructed and lost block. Further, the file in 
> question was only 1 block, so we downloaded one of those corrupt block files 
> and used {{hdfs dfs -put -f}} to force that block to replace the file in 
> HDFS. So in this case we had no actual data loss, but it could easily have 
> happened if the file was more than 1 block or the replicas weren't fully in 
> sync prior to reconstruction.
> In order to avoid this issue, we should avoid writing WAL blocks to the 
> local DataNode. We can use {{CreateFlag.NO_LOCAL_WRITE}} for this. Hat tip to 
> [~weichiu] for pointing this out.
> During reading of WALs we already reorder blocks so as to avoid reading from 
> the local datanode, but avoiding writing there altogether would be better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
