[ 
https://issues.apache.org/jira/browse/HBASE-28155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17775793#comment-17775793
 ] 

Duo Zhang commented on HBASE-28155:
-----------------------------------

It seems that there is a race...

Actually we have two wal groups for the recovered replication queue, but when 
the first one fnishes, the second one has not been added to the source yet, so 
it will consider the source has been finished and cleanup everything, and then 
cause the shipper thread for the second queue to throw NPE.

{noformat}
2023-10-15T16:04:39,020 INFO  
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
 {}] regionserver.ReplicationSourceShipper(98): Running 
ReplicationSourceShipper Thread for wal group: 
8c5d72df62de%2C38435%2C1697385859175
2023-10-15T16:04:39,020 INFO  
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
 {}] regionserver.ReplicationSourceWALReader(109): 
peerClusterZnode=1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175,
 ReplicationSourceWALReaderThread : 1 inited, 
replicationBatchSizeCapacity=102400, replicationBatchCountCapacity=25000, 
replicationBatchQueueCapacity=1
2023-10-15T16:04:39,021 DEBUG 
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.wal-reader.8c5d72df62de%2C38435%2C1697385859175.rep,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
 {}] regionserver.WALEntryStream(249): Creating new reader 
hdfs://localhost:39027/user/jenkins/test-data/48e99b63-a68f-156c-32f2-68ddc02602e1/oldWALs/8c5d72df62de%2C38435%2C1697385859175.rep.1697385865402,
 startPosition=0, beingWritten=false
2023-10-15T16:04:39,022 DEBUG 
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.wal-reader.8c5d72df62de%2C38435%2C1697385859175,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
 {}] regionserver.WALEntryStream(451): EOF, closing 
hdfs://localhost:39027/user/jenkins/test-data/48e99b63-a68f-156c-32f2-68ddc02602e1/oldWALs/8c5d72df62de%2C38435%2C1697385859175.1697385867465
2023-10-15T16:04:39,022 DEBUG 
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.wal-reader.8c5d72df62de%2C38435%2C1697385859175,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
 {}] regionserver.WALEntryStream(236): No more WAL files in queue
2023-10-15T16:04:39,022 DEBUG 
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.wal-reader.8c5d72df62de%2C38435%2C1697385859175,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
 {}] regionserver.ReplicationSourceWALReader(118): Stopping the replication 
source wal reader
2023-10-15T16:04:39,022 DEBUG 
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
 {}] regionserver.ReplicationSourceShipper(110): Shipper from source 
1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175 got entry 
batch from reader: WALEntryBatch [walEntries=[], lastWalPath=null, 
lastWalPosition=0, nbRowKeys=0, nbHFiles=0, heapSize=0, lastSeqIds={}, 
endOfFile=false,usedBufferSize=0]
2023-10-15T16:04:39,022 DEBUG 
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
 {}] regionserver.ReplicationSourceShipper(138): Finished recovering queue for 
group 8c5d72df62de%2C38435%2C1697385859175 of peer 
1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
2023-10-15T16:04:39,022 INFO  
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
 {}] regionserver.ReplicationSourceManager(540): Done with the recovered queue 
1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
2023-10-15T16:04:39,024 INFO  
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175.rep,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
 {}] regionserver.ReplicationSourceShipper(98): Running 
ReplicationSourceShipper Thread for wal group: 
8c5d72df62de%2C38435%2C1697385859175.rep
2023-10-15T16:04:39,025 INFO  
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
 {}] regionserver.ReplicationSourceManager(553): Finished recovering queue 
1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175 with the 
following stats: Total replicated edits: 0, current progress: 
walGroup [8c5d72df62de%2C38435%2C1697385859175.rep]: currently replicating 
from: 
hdfs://localhost:39027/user/jenkins/test-data/48e99b63-a68f-156c-32f2-68ddc02602e1/oldWALs/8c5d72df62de%2C38435%2C1697385859175.rep.1697385865402
 at position: -1

2023-10-15T16:04:39,026 DEBUG 
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.wal-reader.8c5d72df62de%2C38435%2C1697385859175.rep,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
 {}] regionserver.WALEntryStream(451): EOF, closing 
hdfs://localhost:39027/user/jenkins/test-data/48e99b63-a68f-156c-32f2-68ddc02602e1/oldWALs/8c5d72df62de%2C38435%2C1697385859175.rep.1697385865402
2023-10-15T16:04:39,026 DEBUG 
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.wal-reader.8c5d72df62de%2C38435%2C1697385859175.rep,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
 {}] regionserver.ReplicationSourceWALReader(175): Read 0 WAL entries eligible 
for replication
2023-10-15T16:04:39,026 DEBUG 
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175.rep,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
 {}] regionserver.ReplicationSourceShipper(110): Shipper from source 
1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175 got entry 
batch from reader: WALEntryBatch [walEntries=[], 
lastWalPath=hdfs://localhost:39027/user/jenkins/test-data/48e99b63-a68f-156c-32f2-68ddc02602e1/oldWALs/8c5d72df62de%2C38435%2C1697385859175.rep.1697385865402,
 lastWalPosition=2701, nbRowKeys=0, nbHFiles=0, heapSize=0, lastSeqIds={}, 
endOfFile=true,usedBufferSize=0]
2023-10-15T16:04:39,026 DEBUG 
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.wal-reader.8c5d72df62de%2C38435%2C1697385859175.rep,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
 {}] regionserver.WALEntryStream(249): Creating new reader 
hdfs://localhost:39027/user/jenkins/test-data/48e99b63-a68f-156c-32f2-68ddc02602e1/oldWALs/8c5d72df62de%2C38435%2C1697385859175.rep.1697385867425,
 startPosition=0, beingWritten=false
2023-10-15T16:04:39,029 ERROR 
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175.rep,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
 {}] regionserver.ReplicationSource(452): Unexpected exception in 
RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175.rep,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
 
currentPath=hdfs://localhost:39027/user/jenkins/test-data/48e99b63-a68f-156c-32f2-68ddc02602e1/oldWALs/8c5d72df62de%2C38435%2C1697385859175.rep.1697385867425
java.lang.NullPointerException: null
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceManager.java:662)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:649)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceInterface.logPositionAndCleanOldLogs(ReplicationSourceInterface.java:211)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.updateLogPosition(ReplicationSourceShipper.java:266)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.shipEdits(ReplicationSourceShipper.java:158)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:119)
 ~[classes/:?]
{noformat}

Let me think how to fix this...

> NPE in ReplicationSourceManager.cleanOldLogs when sync replication is enabled
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-28155
>                 URL: https://issues.apache.org/jira/browse/HBASE-28155
>             Project: HBase
>          Issue Type: Bug
>          Components: Recovery, Replication
>            Reporter: Duo Zhang
>            Priority: Major
>         Attachments: 
> org.apache.hadoop.hbase.replication.TestSyncReplicationStandbyKillRS-output.txt
>
>
> Need to dig more but it seems to related to how we deal with 
> RecoveredReplicationSource and queue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to