[ 
https://issues.apache.org/jira/browse/HBASE-25898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347414#comment-17347414
 ] 

ramkrishna.s.vasudevan commented on HBASE-25898:
------------------------------------------------

Good one.

> RS getting aborted due to NPE in Replication WALEntryStream
> -----------------------------------------------------------
>
>                 Key: HBASE-25898
>                 URL: https://issues.apache.org/jira/browse/HBASE-25898
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Anoop Sam John
>            Assignee: Anoop Sam John
>            Priority: Critical
>
> The below sequence of events happened in a customer cluster.
> An empty WAL file got a roll request.
> The close of the file failed on the HDFS side, but since the file had all
> edits synced, we continue (see the sketch after the log snippet below).
> A new WAL file is created and the old one is rolled.
> This old WAL file got archived to oldWALs.
> {code}
> 2021-05-13 13:38:46.000  Riding over failed WAL close of hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678, cause="Unexpected EOF while trying to read response from server", errors=1; THIS FILE WAS NOT CLOSED BUT ALL EDITS SYNCED SO SHOULD BE OK
> 2021-05-13 13:38:46.000  Rolled WAL /xx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678 with entries=0, filesize=90 B; new WAL /xx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620913126549
> 2021-05-13 13:38:46.000  Archiving hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678 to hdfs://xxx/oldWALs/xxxt%2C16020%2C1620828102351.1620910673678
> 2021-05-13 13:38:46.000  Log hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678 was moved to hdfs://xxx/oldWALs/xxx%2C16020%2C1620828102351.1620910673678
> {code}
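> For context, the "Riding over" above comes from the WAL roll path, which
> deliberately tolerates a failed close as long as every edit was already
> synced. A simplified sketch of that decision (method and field names are
> approximate, not the exact FSHLog source):
> {code}
> try {
>   oldWriter.close(); // close the WAL file that is being rolled
> } catch (IOException ioe) {
>   if (isUnflushedEntries()) {
>     // Some appends were never synced; we cannot safely ride over this.
>     throw new FailedLogCloseException(oldPath + ", unflushedEntries", ioe);
>   }
>   // All edits were synced before the close attempt, so the data is safe
>   // even though HDFS never saw a clean close; warn and continue the roll.
>   LOG.warn("Riding over failed WAL close of " + oldPath
>     + "; THIS FILE WAS NOT CLOSED BUT ALL EDITS SYNCED SO SHOULD BE OK", ioe);
> }
> // The roll then proceeds: the new WAL is created and the old, never-closed
> // file is archived to oldWALs, which is what sets up the failure below.
> {code}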
> As the file was moved, the WALEntryStream gets an IOException and we
> recreate the stream.
> {code}
> ReplicationSourceWALReader#run
> while (isReaderRunning()) {
>   try {
>     entryStream = new WALEntryStream(logQueue, conf, currentPosition,
>       source.getWALFileLengthProvider(), source.getServerWALsBelongTo(),
>       source.getSourceMetrics(), walGroupId);
>     while (isReaderRunning()) {
>       ...
>     }
>   } catch (IOException e) { // stream related
>     if (handleEofException(e, batch)) {
>       sleepMultiplier = 1;
>     } else {
>       LOG.warn("Failed to read stream of replication entries", e);
>       if (sleepMultiplier < maxRetriesMultiplier) {
>         sleepMultiplier++;
>       }
>       Threads.sleep(sleepForRetries * sleepMultiplier);
>     }
>   }
> }
> {code}
> eofAutoRecovery is turned off anyway, so handleEofException() returns false
> (a sketch of that gate follows) and we go back to the outer while loop and
> create a new WALEntryStream object.
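> A sketch of that gate (approximate, from memory of the branch-2 code; the
> flag is read in the reader's constructor from the config key
> "replication.source.eof.autorecovery", default false, and
> tryToSkipEmptyWALAtQueueHead is a hypothetical helper standing in for the
> actual recovery logic):
> {code}
> private boolean handleEofException(IOException e, WALEntryBatch batch) {
>   if (!eofAutoRecovery) {
>     // Disabled: the caller sleeps and loops back to the outer while,
>     // re-creating a brand-new WALEntryStream on every iteration.
>     return false;
>   }
>   // Otherwise try to detect a 0-length WAL at the head of the queue
>   // and skip past it.
>   return tryToSkipEmptyWALAtQueueHead(e);
> }
> {code}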
> Then we do readWALEntries:
> {code}
> protected WALEntryBatch readWALEntries(WALEntryStream entryStream,
>       WALEntryBatch batch) throws IOException, InterruptedException {
>     Path currentPath = entryStream.getCurrentPath();
>     if (!entryStream.hasNext()) {
> {code}
> Here the currentPath will still be null (the stream was just re-created and
> nothing has been opened on it yet). The call chain is
> WALEntryStream#hasNext -> tryAdvanceEntry -> checkReader -> openNextLog:
> {code}
> private boolean openNextLog() throws IOException {
>     PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId);
>     Path nextPath = queue.peek();
>     if (nextPath != null) {
>       openReader(nextPath);
>       ...
> 
> private void openReader(Path path) throws IOException {
>     try {
>       // Detect if this is a new file, if so get a new reader else
>       // reset the current reader so that we see the new data
>       if (reader == null || !getCurrentPath().equals(path)) {
>         closeReader();
>         reader = WALFactory.createReader(fs, path, conf);
>         seek();
>         setCurrentPath(path);
>       } else {
>         resetReader();
>       }
>     } catch (FileNotFoundException fnfe) {
>       handleFileNotFound(path, fnfe);
>     } catch (RemoteException re) {
>       IOException ioe = re.unwrapRemoteException(FileNotFoundException.class);
>       if (!(ioe instanceof FileNotFoundException)) {
>         throw ioe;
>       }
>       handleFileNotFound(path, (FileNotFoundException) ioe);
>     } catch (LeaseNotRecoveredException lnre) {
>       // HBASE-15019 the WAL was not closed due to some hiccup.
>       LOG.warn("Try to recover the WAL lease " + currentPath, lnre);
>       recoverLease(conf, currentPath);
>       reader = null;
>     } catch (NullPointerException npe) {
>       // Workaround for race condition in HDFS-4380 which throws a NPE if we
>       // open a file before any data node has the most recent block.
>       // Just sleep and retry. Will require re-reading compressed WALs for
>       // compressionContext.
>       LOG.warn("Got NPE opening reader, will retry.");
>       reader = null;
>     }
>   }
> {code}
> Here the call to WALFactory.createReader() is not able to complete because
> of the issue on the HDFS side (yes, it is the same WAL file whose close
> failed). createReader() has a retry mechanism of about 5 minutes, but
> eventually it throws LeaseNotRecoveredException, which we do try to handle.
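> For reference, the retry inside createReader() is roughly the following
> (a simplified sketch, not the exact WALFactory source; the timeout key and
> default are as I recall them, and openReaderOnce() is a hypothetical helper
> standing in for the actual open):
> {code}
> long timeoutMillis = conf.getInt("hbase.hlog.open.timeout", 300000); // ~5 minutes
> long deadline = EnvironmentEdgeManager.currentTime() + timeoutMillis;
> while (true) {
>   try {
>     return openReaderOnce(fs, path, conf);
>   } catch (IOException ioe) {
>     if (EnvironmentEdgeManager.currentTime() >= deadline) {
>       throw ioe; // out of retries; here this surfaced as LeaseNotRecoveredException
>     }
>     Threads.sleep(1000); // pause briefly before retrying the open
>   }
> }
> {code}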
> But the problem is that in that handler we pass the state variable
> currentPath, which is still null here! This throws the NPE:
> {code}
> java.lang.NullPointerException
>       at org.apache.hadoop.fs.FileSystem.fixRelativePart(FileSystem.java:2635)
>       at org.apache.hadoop.hdfs.DistributedFileSystem.fixRelativePart(DistributedFileSystem.java:3087)
>       at org.apache.hadoop.hdfs.DistributedFileSystem.recoverLease(DistributedFileSystem.java:297)
>       at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverLease(FSHDFSUtils.java:283)
>       at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:216)
>       at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.java:163)
>       at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.recoverLease(WALEntryStream.java:387)
>       at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:370)
>       at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:343)
>       at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:362)
>       at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:303)
>       at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:294)
>       at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:175)
>       at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:101)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:192)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:138)
> {code}
> This kills the RS!
> When the next RS takes over this replication queue source from the failed
> RS, it hits the same NPE.
> In the recoverLease() call we should have been passing the method parameter
> 'path'. All the other calls from this method do it that way, but it seems
> this one was missed!
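> For clarity, the fix as described is a one-liner in the
> LeaseNotRecoveredException handler of the openReader() snippet above:
> {code}
> } catch (LeaseNotRecoveredException lnre) {
>   // HBASE-15019 the WAL was not closed due to some hiccup.
>   // Use the method parameter 'path' here: 'currentPath' is only assigned
>   // after a successful open, so it is still null on the very first
>   // openReader() call of a freshly created WALEntryStream.
>   LOG.warn("Try to recover the WAL lease " + path, lnre);
>   recoverLease(conf, path);
>   reader = null;
> }
> {code}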



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
