[
https://issues.apache.org/jira/browse/HBASE-28155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17775793#comment-17775793
]
Duo Zhang commented on HBASE-28155:
-----------------------------------
It seems that there is a race...
Actually we have two wal groups for the recovered replication queue, but when
the first one fnishes, the second one has not been added to the source yet, so
it will consider the source has been finished and cleanup everything, and then
cause the shipper thread for the second queue to throw NPE.
{noformat}
2023-10-15T16:04:39,020 INFO
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
{}] regionserver.ReplicationSourceShipper(98): Running
ReplicationSourceShipper Thread for wal group:
8c5d72df62de%2C38435%2C1697385859175
2023-10-15T16:04:39,020 INFO
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
{}] regionserver.ReplicationSourceWALReader(109):
peerClusterZnode=1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175,
ReplicationSourceWALReaderThread : 1 inited,
replicationBatchSizeCapacity=102400, replicationBatchCountCapacity=25000,
replicationBatchQueueCapacity=1
2023-10-15T16:04:39,021 DEBUG
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.wal-reader.8c5d72df62de%2C38435%2C1697385859175.rep,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
{}] regionserver.WALEntryStream(249): Creating new reader
hdfs://localhost:39027/user/jenkins/test-data/48e99b63-a68f-156c-32f2-68ddc02602e1/oldWALs/8c5d72df62de%2C38435%2C1697385859175.rep.1697385865402,
startPosition=0, beingWritten=false
2023-10-15T16:04:39,022 DEBUG
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.wal-reader.8c5d72df62de%2C38435%2C1697385859175,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
{}] regionserver.WALEntryStream(451): EOF, closing
hdfs://localhost:39027/user/jenkins/test-data/48e99b63-a68f-156c-32f2-68ddc02602e1/oldWALs/8c5d72df62de%2C38435%2C1697385859175.1697385867465
2023-10-15T16:04:39,022 DEBUG
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.wal-reader.8c5d72df62de%2C38435%2C1697385859175,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
{}] regionserver.WALEntryStream(236): No more WAL files in queue
2023-10-15T16:04:39,022 DEBUG
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.wal-reader.8c5d72df62de%2C38435%2C1697385859175,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
{}] regionserver.ReplicationSourceWALReader(118): Stopping the replication
source wal reader
2023-10-15T16:04:39,022 DEBUG
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
{}] regionserver.ReplicationSourceShipper(110): Shipper from source
1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175 got entry
batch from reader: WALEntryBatch [walEntries=[], lastWalPath=null,
lastWalPosition=0, nbRowKeys=0, nbHFiles=0, heapSize=0, lastSeqIds={},
endOfFile=false,usedBufferSize=0]
2023-10-15T16:04:39,022 DEBUG
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
{}] regionserver.ReplicationSourceShipper(138): Finished recovering queue for
group 8c5d72df62de%2C38435%2C1697385859175 of peer
1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
2023-10-15T16:04:39,022 INFO
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
{}] regionserver.ReplicationSourceManager(540): Done with the recovered queue
1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
2023-10-15T16:04:39,024 INFO
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175.rep,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
{}] regionserver.ReplicationSourceShipper(98): Running
ReplicationSourceShipper Thread for wal group:
8c5d72df62de%2C38435%2C1697385859175.rep
2023-10-15T16:04:39,025 INFO
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
{}] regionserver.ReplicationSourceManager(553): Finished recovering queue
1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175 with the
following stats: Total replicated edits: 0, current progress:
walGroup [8c5d72df62de%2C38435%2C1697385859175.rep]: currently replicating
from:
hdfs://localhost:39027/user/jenkins/test-data/48e99b63-a68f-156c-32f2-68ddc02602e1/oldWALs/8c5d72df62de%2C38435%2C1697385859175.rep.1697385865402
at position: -1
2023-10-15T16:04:39,026 DEBUG
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.wal-reader.8c5d72df62de%2C38435%2C1697385859175.rep,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
{}] regionserver.WALEntryStream(451): EOF, closing
hdfs://localhost:39027/user/jenkins/test-data/48e99b63-a68f-156c-32f2-68ddc02602e1/oldWALs/8c5d72df62de%2C38435%2C1697385859175.rep.1697385865402
2023-10-15T16:04:39,026 DEBUG
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.wal-reader.8c5d72df62de%2C38435%2C1697385859175.rep,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
{}] regionserver.ReplicationSourceWALReader(175): Read 0 WAL entries eligible
for replication
2023-10-15T16:04:39,026 DEBUG
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175.rep,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
{}] regionserver.ReplicationSourceShipper(110): Shipper from source
1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175 got entry
batch from reader: WALEntryBatch [walEntries=[],
lastWalPath=hdfs://localhost:39027/user/jenkins/test-data/48e99b63-a68f-156c-32f2-68ddc02602e1/oldWALs/8c5d72df62de%2C38435%2C1697385859175.rep.1697385865402,
lastWalPosition=2701, nbRowKeys=0, nbHFiles=0, heapSize=0, lastSeqIds={},
endOfFile=true,usedBufferSize=0]
2023-10-15T16:04:39,026 DEBUG
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.wal-reader.8c5d72df62de%2C38435%2C1697385859175.rep,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
{}] regionserver.WALEntryStream(249): Creating new reader
hdfs://localhost:39027/user/jenkins/test-data/48e99b63-a68f-156c-32f2-68ddc02602e1/oldWALs/8c5d72df62de%2C38435%2C1697385859175.rep.1697385867425,
startPosition=0, beingWritten=false
2023-10-15T16:04:39,029 ERROR
[RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175.rep,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
{}] regionserver.ReplicationSource(452): Unexpected exception in
RS_CLAIM_REPLICATION_QUEUE-regionserver/8c5d72df62de:0-0.replicationSource,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175.replicationSource.shipper8c5d72df62de%2C38435%2C1697385859175.rep,1-8c5d72df62de,45999,1697385859285/8c5d72df62de,38435,1697385859175
currentPath=hdfs://localhost:39027/user/jenkins/test-data/48e99b63-a68f-156c-32f2-68ddc02602e1/oldWALs/8c5d72df62de%2C38435%2C1697385859175.rep.1697385867425
java.lang.NullPointerException: null
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceManager.java:662)
~[classes/:?]
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:649)
~[classes/:?]
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceInterface.logPositionAndCleanOldLogs(ReplicationSourceInterface.java:211)
~[classes/:?]
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.updateLogPosition(ReplicationSourceShipper.java:266)
~[classes/:?]
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.shipEdits(ReplicationSourceShipper.java:158)
~[classes/:?]
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:119)
~[classes/:?]
{noformat}
Let me think how to fix this...
> NPE in ReplicationSourceManager.cleanOldLogs when sync replication is enabled
> -----------------------------------------------------------------------------
>
> Key: HBASE-28155
> URL: https://issues.apache.org/jira/browse/HBASE-28155
> Project: HBase
> Issue Type: Bug
> Components: Recovery, Replication
> Reporter: Duo Zhang
> Priority: Major
> Attachments:
> org.apache.hadoop.hbase.replication.TestSyncReplicationStandbyKillRS-output.txt
>
>
> Need to dig more but it seems to related to how we deal with
> RecoveredReplicationSource and queue.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)