[ https://issues.apache.org/jira/browse/HBASE-22190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830373#comment-16830373 ]

Duo Zhang commented on HBASE-22190:
-----------------------------------

Even hit this...

{noformat}
2019-04-30 22:58:08,151 WARN  [snapshot-hfile-cleaner-cache-refresher] 
snapshot.SnapshotFileCache$RefreshCacheTask(294): Failed to refresh snapshot 
hfile cache!
org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: unable to parse 
data manifest Message missing required fields: table_schema
        at 
org.apache.hadoop.hbase.snapshot.SnapshotManifest.readDataManifest(SnapshotManifest.java:561)
        at 
org.apache.hadoop.hbase.snapshot.SnapshotManifest.load(SnapshotManifest.java:389)
        at 
org.apache.hadoop.hbase.snapshot.SnapshotManifest.open(SnapshotManifest.java:142)
        at 
org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.visitTableStoreFiles(SnapshotReferenceUtil.java:113)
        at 
org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.getHFileNames(SnapshotReferenceUtil.java:348)
        at 
org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.getHFileNames(SnapshotReferenceUtil.java:331)
        at 
org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner$1.filesUnderSnapshot(SnapshotHFileCleaner.java:102)
        at 
org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache.refreshCache(SnapshotFileCache.java:269)
        at 
org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache.access$0(SnapshotFileCache.java:216)
        at 
org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache$RefreshCacheTask.run(SnapshotFileCache.java:292)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
Caused by: 
org.apache.hbase.thirdparty.com.google.protobuf.InvalidProtocolBufferException: 
Message missing required fields: table_schema
        at 
org.apache.hbase.thirdparty.com.google.protobuf.UninitializedMessageException.asInvalidProtocolBufferException(UninitializedMessageException.java:79)
        at 
org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.checkMessageInitialized(AbstractParser.java:68)
        at 
org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:86)
        at 
org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:91)
        at 
org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:48)
        at 
org.apache.hbase.thirdparty.com.google.protobuf.GeneratedMessageV3.parseWithIOException(GeneratedMessageV3.java:335)
        at 
org.apache.hadoop.hbase.shaded.protobuf.generated.SnapshotProtos$SnapshotDataManifest.parseFrom(SnapshotProtos.java:5816)
        at 
org.apache.hadoop.hbase.snapshot.SnapshotManifest.readDataManifest(SnapshotManifest.java:557)
        ... 11 more
{noformat}

I think the problem is a race between SnapshotFileCache refreshing and snapshot 
file generation. The directory for the snapshot may have already been created 
while the snapshot manifest is not ready yet, so if we try to load the snapshot 
into the cache at that point we may see an empty file list, since the manifest 
has not been generated, and we may also see the above exception if the manifest 
is only half written...
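A minimal sketch of the race window (all names here are illustrative, this is not the actual HBase code): the snapshot directory becomes visible on disk before the manifest file inside it is flushed, so a refresher that lists the directory in between observes a snapshot with no manifest.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Simplified illustration of the race window (NOT the real HBase code):
// the snapshot directory exists before the manifest file inside it is
// written, so a cache refresh that runs in between sees a snapshot with
// no manifest yet.
public class SnapshotRaceSketch {
    // Hypothetical manifest file name used only for this sketch.
    static final String MANIFEST = "data.manifest";

    // What a refresher would effectively check: is the manifest present yet?
    static boolean manifestReady(Path snapshotDir) {
        return Files.exists(snapshotDir.resolve(MANIFEST));
    }

    public static void main(String[] args) throws IOException {
        // Step 1 (snapshot writer): the snapshot directory is created first.
        Path snapshotDir = Files.createTempDirectory("snapshot-s1");
        // Step 2 (cache refresher): runs before the manifest is flushed,
        // so it observes an "empty" snapshot.
        System.out.println("ready before write: " + manifestReady(snapshotDir));
        // Step 3 (snapshot writer): the manifest is only written now.
        Files.write(snapshotDir.resolve(MANIFEST), new byte[]{1, 2, 3});
        System.out.println("ready after write: " + manifestReady(snapshotDir));
    }
}
```

Reading a half-written manifest in that window is also what would produce the `Message missing required fields: table_schema` parse failure above.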

What's more, we update the recorded modification time before actually loading 
anything, and when we hit an exception like the above one we do not reset the 
recorded modification time. This can leave the cache in an incorrect state with 
no chance to correct it unless a new snapshot comes in...
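A hypothetical sketch of the intended fix (the class, method, and field names are illustrative, not the real SnapshotFileCache API): advance the recorded modification time only after the snapshot list loads successfully, so a refresh that throws is retried on the next run instead of being silently skipped.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Callable;

// Illustrative sketch, not the real SnapshotFileCache: record the
// directory's modification time only AFTER a successful load, so a
// refresh that fails (e.g. on a half-written manifest) is retried next
// time instead of leaving a stale cache that looks up to date.
class SnapshotCacheSketch {
    private long lastModifiedTime = -1L;
    private Set<String> cachedFiles = new HashSet<>();

    void refresh(long dirMtime, Callable<Set<String>> loader) {
        if (dirMtime == lastModifiedTime) {
            return; // nothing changed since the last successful refresh
        }
        try {
            Set<String> files = loader.call();
            cachedFiles = files;
            lastModifiedTime = dirMtime; // only record mtime on success
        } catch (Exception e) {
            // Leave lastModifiedTime untouched: the next refresh retries
            // instead of believing the cache is already current.
        }
    }

    long getLastModifiedTime() { return lastModifiedTime; }
    Set<String> getCachedFiles() { return cachedFiles; }
}
```

With this ordering, the failed refresh above would not poison the cache: the mtime stays stale, so the next scheduled refresh reloads the snapshot.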

I think this is a very critical bug, as snapshots are sometimes used to retain 
critical data for recovery; if the mechanism is not stable then...


> TestSnapshotFromMaster is flakey
> --------------------------------
>
>                 Key: HBASE-22190
>                 URL: https://issues.apache.org/jira/browse/HBASE-22190
>             Project: HBase
>          Issue Type: Task
>            Reporter: Duo Zhang
>            Priority: Blocker
>
> And it seems that it is not only a test issue: we do delete files under the 
> archive directory, which is incorrect.
> We need to find out why; this may be a serious bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
