[jira] [Updated] (HBASE-24469) Hedged read might hang infinitely if read data from all DN failed

Javier Akira Luca de Tena (Jira) Fri, 29 May 2020 01:49:50 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-24469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Javier Akira Luca de Tena updated HBASE-24469:
----------------------------------------------
    Description: 
After an ungraceful DataNode shutdown, the number of active HBase handlers 
kept growing until the RegionServer was stuck and could not serve any RPC.

A thread dump showed multiple read handlers in what looked like a deadlock, 
and the write handlers were stuck as well.

The memstore could not be flushed either, because the flusher was waiting for 
this lock: 
[https://github.com/apache/hbase/blob/136414dd72a80f379b80cd6f74b5b6ebd78f33ec/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java#L1225]

Because the flush never completed, I could not gracefully stop the 
RegionServer: the region being flushed cannot be moved out.

 

The real issue turned out to be in Hadoop's DFSInputStream. When no hedged 
read succeeds, the internal hedgedService.take() call hangs forever, because 
it waits on an internal BlockingQueue that never receives a result: 
[https://github.com/apache/hadoop/blob/rel/release-2.8.5/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1435]
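
The blocking behaviour described above can be reproduced outside HDFS. The 
sketch below is not the DFSInputStream code; it assumes hedgedService behaves 
like a plain java.util.concurrent.ExecutorCompletionService, and the class and 
method names are hypothetical. It shows that take() stays blocked when no 
task can ever complete, while a bounded poll() gives up after a timeout. The 
bounded poll() is only one way to avoid an unbounded wait and is not 
necessarily what the Hadoop fix does.

```java
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; none of these names come from Hadoop or HBase.
public class HedgedTakeSketch {

    /** True if take() is still blocked after waitMillis when no task can ever complete. */
    static boolean takeBlocksWhenNothingPending(long waitMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        ExecutorCompletionService<String> service = new ExecutorCompletionService<>(pool);
        Thread waiter = new Thread(() -> {
            try {
                service.take(); // nothing was submitted, so no Future ever arrives
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // only an interrupt frees this thread
            }
        });
        waiter.setDaemon(true);
        waiter.start();
        try {
            waiter.join(waitMillis);
            return waiter.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            waiter.interrupt();
            pool.shutdownNow();
        }
    }

    /** True if a bounded poll() returns null instead of blocking forever. */
    static boolean pollTimesOutWhenNothingPending(long waitMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        ExecutorCompletionService<String> service = new ExecutorCompletionService<>(pool);
        try {
            Future<String> done = service.poll(waitMillis, TimeUnit.MILLISECONDS);
            return done == null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("take() still blocked: " + takeBlocksWhenNothingPending(200));
        System.out.println("poll() timed out: " + pollTimesOutWhenNothingPending(200));
    }
}
```

In the RegionServer there is nothing to interrupt the waiting handler, so the 
equivalent of the waiter thread above simply never returns.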

 

Reproduced in HBase 1.4.13, but I think it affects other versions as well: 
 # A DataNode dies.
 # A read handler holding the read lock of an HStore is blocked by a hedged 
read that never succeeds.
 # Other read handlers try to acquire the lock and get stuck.
 # The memstore flusher tries to acquire the write lock of the HStore and is 
also blocked because of the other read locks.
 # Others, such as CompactedHFilesDischarger, also block because the memstore 
flusher holds the lock.
 # I tried graceful_stop.sh, but region_mover.rb fails because it cannot move 
out the region being flushed.
 # I forcefully killed the RegionServer because there was no other option (I 
am not sure whether data loss is possible, since HStore#updateStorefiles had 
not finished at this point).
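
The core of the chain above (one reader that never releases the read lock 
keeps the flusher from ever getting the write lock) can be sketched with a 
plain ReentrantReadWriteLock. This is a minimal illustration, assuming the 
HStore lock behaves like java.util.concurrent.locks.ReentrantReadWriteLock; 
the class and method names are hypothetical and this is not HBase code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; it only mimics the locking pattern described above.
public class StoreLockSketch {

    /**
     * Starts a "read handler" that takes the read lock and never finishes its
     * read, then reports whether a "flusher" could obtain the write lock
     * within timeoutMillis.
     */
    static boolean flusherGetsWriteLock(long timeoutMillis) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch readerHoldsLock = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread reader = new Thread(() -> {
            lock.readLock().lock();
            readerHoldsLock.countDown();
            try {
                release.await(); // stands in for the hedged read that never returns
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        });
        reader.setDaemon(true);
        reader.start();

        try {
            readerHoldsLock.await();
            // The flusher needs the write lock; it cannot get it while a reader holds on.
            boolean acquired = lock.writeLock().tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
            if (acquired) {
                lock.writeLock().unlock();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown(); // let the demo reader exit
        }
    }

    public static void main(String[] args) {
        System.out.println("flusher got write lock: " + flusherGetsWriteLock(200));
    }
}
```

The sketch uses a timed tryLock() only so that it terminates; a plain lock() 
call in the same position would wait for as long as the reader stays blocked.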

 

This is the Hadoop-side issue: https://issues.apache.org/jira/browse/HDFS-11303 
and it is fixed in 2.9.0.

This is not directly related to HBase code, but I wanted the community to be 
aware that this issue can happen with the currently used Hadoop version (2.8.5).

 

I would like to suggest upgrading the Hadoop version in use to 2.9.0.
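
To check whether a given Hadoop version string predates the fix, a small 
comparison helper is enough. The sketch below takes the 2.9.0 fix version 
from this report at face value; the class and method names are hypothetical, 
it handles only plain dotted numeric versions (no suffixes such as -SNAPSHOT), 
and it does not account for backports or vendor builds.

```java
// Hypothetical helper; not part of Hadoop or HBase.
public class HadoopVersionCheck {

    /** Compares dotted numeric versions such as "2.8.5" and "2.9.0". */
    static int compareVersions(String a, String b) {
        String[] pa = a.split("\\.");
        String[] pb = b.split("\\.");
        int n = Math.max(pa.length, pb.length);
        for (int i = 0; i < n; i++) {
            // Missing components count as 0, so "3.1" equals "3.1.0".
            int x = i < pa.length ? Integer.parseInt(pa[i]) : 0;
            int y = i < pb.length ? Integer.parseInt(pb[i]) : 0;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }

    /** Assumption from this report: releases from 2.9.0 on contain the HDFS-11303 fix. */
    static boolean hasHedgedReadFix(String hadoopVersion) {
        return compareVersions(hadoopVersion, "2.9.0") >= 0;
    }

    public static void main(String[] args) {
        System.out.println("2.8.5 has fix: " + hasHedgedReadFix("2.8.5"));
        System.out.println("2.9.0 has fix: " + hasHedgedReadFix("2.9.0"));
    }
}
```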



> Hedged read might hang infinitely if read data from all DN failed
> -----------------------------------------------------------------
>
>                 Key: HBASE-24469
>                 URL: https://issues.apache.org/jira/browse/HBASE-24469
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Javier Akira Luca de Tena
>            Priority: Critical



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
