[ https://issues.apache.org/jira/browse/HBASE-25898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350778#comment-17350778 ]
Hudson commented on HBASE-25898:
--------------------------------
Results for branch branch-2
[build #259 on
builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/259/]:
(x) *{color:red}-1 overall{color}*
----
details (if available):
(/) {color:green}+1 general checks{color}
-- For more information [see general
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/259/General_20Nightly_20Build_20Report/]
(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- Something went wrong running this stage, please [check relevant console
output|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/259//console].
(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3)
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/259/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(x) {color:red}-1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/259/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(x) {color:red}-1 client integration test{color}
-- Failed when running client tests on top of Hadoop 2. [see log for
details|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/259//artifact/output-integration/hadoop-2.log].
(note that this means we didn't run on Hadoop 3)
> RS getting aborted due to NPE in Replication WALEntryStream
> -----------------------------------------------------------
>
> Key: HBASE-25898
> URL: https://issues.apache.org/jira/browse/HBASE-25898
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Reporter: Anoop Sam John
> Assignee: Anoop Sam John
> Priority: Critical
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.4
>
>
> The below sequence of events happened in a customer cluster:
> An empty WAL file got a roll request.
> The close of the file failed on the HDFS side, but as the file had all its
> edits synced, we continued.
> A new WAL file was created and the old one rolled.
> This old WAL file got archived to the oldWALs directory.
> {code}
> 2021-05-13 13:38:46.000 Riding over failed WAL close of
> hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678,
> cause="Unexpected EOF while trying to read response from server", errors=1;
> THIS FILE WAS NOT CLOSED BUT ALL EDITS SYNCED SO SHOULD BE OK
> 2021-05-13 13:38:46.000 Rolled WAL
> /xx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678
> with entries=0, filesize=90 B; new WAL
> /xx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620913126549
> 2021-05-13 13:38:46.000 Archiving
> hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678
> to hdfs://xxx/oldWALs/xxxt%2C16020%2C1620828102351.1620910673678
> 2021-05-13 13:38:46.000 Log
> hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678
> was moved to hdfs://xxx/oldWALs/xxx%2C16020%2C1620828102351.1620910673678
> {code}
> As the file was moved, the WALEntryStream got an IOException and we will
> recreate the stream:
> {code}
> ReplicationSourceWALReader#run
>
>   while (isReaderRunning()) {
>     try {
>       entryStream =
>         new WALEntryStream(logQueue, conf, currentPosition,
>           source.getWALFileLengthProvider(), source.getServerWALsBelongTo(),
>           source.getSourceMetrics(), walGroupId);
>       while (isReaderRunning()) {
>         ...
>         ...
>       }
>     } catch (IOException e) { // stream related
>       if (handleEofException(e, batch)) {
>         sleepMultiplier = 1;
>       } else {
>         LOG.warn("Failed to read stream of replication entries", e);
>         if (sleepMultiplier < maxRetriesMultiplier) {
>           sleepMultiplier++;
>         }
>         Threads.sleep(sleepForRetries * sleepMultiplier);
>       }
>     }
>   }
> {code}
> eofAutoRecovery is turned off here anyway, so handleEofException() returns
> false and we go back to the outer while loop and create a new WALEntryStream
> object.
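> For context, a condensed sketch of the gate inside handleEofException()
> (paraphrased, not verbatim HBase code; the config key and default shown are
> assumptions based on branch-2):
> {code}
> // Paraphrased sketch, not verbatim HBase code.
> // eofAutoRecovery is assumed to be read at reader construction time from
> // "replication.source.eof.autorecovery", defaulting to false.
> private boolean handleEofException(IOException e, WALEntryBatch batch) {
>   // Auto-recovery only applies to EOF-type failures, and only when the
>   // operator explicitly enabled it.
>   if (!eofAutoRecovery || !(e instanceof EOFException)) {
>     return false; // caller logs the failure, sleeps with backoff, and loops
>   }
>   // ... otherwise trim the empty/truncated WAL from the queue ...
>   return true;
> }
> {code}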
> Then we call readWALEntries():
> {code}
> protected WALEntryBatch readWALEntries(WALEntryStream entryStream,
>     WALEntryBatch batch) throws IOException, InterruptedException {
>   Path currentPath = entryStream.getCurrentPath();
>   if (!entryStream.hasNext()) {
> {code}
> Here currentPath will still be null.
> The call chain is WALEntryStream#hasNext -> tryAdvanceEntry -> checkReader ->
> openNextLog:
> {code}
> private boolean openNextLog() throws IOException {
>   PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId);
>   Path nextPath = queue.peek();
>   if (nextPath != null) {
>     openReader(nextPath);
>     ...
>
> private void openReader(Path path) throws IOException {
>   try {
>     // Detect if this is a new file, if so get a new reader else
>     // reset the current reader so that we see the new data
>     if (reader == null || !getCurrentPath().equals(path)) {
>       closeReader();
>       reader = WALFactory.createReader(fs, path, conf);
>       seek();
>       setCurrentPath(path);
>     } else {
>       resetReader();
>     }
>   } catch (FileNotFoundException fnfe) {
>     handleFileNotFound(path, fnfe);
>   } catch (RemoteException re) {
>     IOException ioe = re.unwrapRemoteException(FileNotFoundException.class);
>     if (!(ioe instanceof FileNotFoundException)) {
>       throw ioe;
>     }
>     handleFileNotFound(path, (FileNotFoundException) ioe);
>   } catch (LeaseNotRecoveredException lnre) {
>     // HBASE-15019 the WAL was not closed due to some hiccup.
>     LOG.warn("Try to recover the WAL lease " + currentPath, lnre);
>     recoverLease(conf, currentPath);
>     reader = null;
>   } catch (NullPointerException npe) {
>     // Workaround for race condition in HDFS-4380 which throws a NPE if we
>     // open a file before any data node has the most recent block.
>     // Just sleep and retry. Will require re-reading compressed WALs for
>     // compressionContext.
>     LOG.warn("Got NPE opening reader, will retry.");
>     reader = null;
>   }
> }
> {code}
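> Note the ordering inside openReader() above: setCurrentPath(path) only runs
> after WALFactory.createReader() and seek() succeed. So on a freshly
> constructed WALEntryStream whose very first open fails, the currentPath field
> has never been assigned. A condensed timeline derived from the snippet above:
> {code}
> // Timeline for a fresh WALEntryStream (derived from the snippet above):
> //   new WALEntryStream(...)             -> currentPath field == null
> //   openNextLog() -> openReader(path)   -> createReader(fs, path, conf) throws
> //   catch (LeaseNotRecoveredException)  -> recoverLease(conf, currentPath)
> //                                          ^^^^^^^^^^^ still null here
> // The method parameter 'path' is valid the whole time; only the field is null.
> {code}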
> Here the call to WALFactory.createReader() is not able to complete because of
> an issue on the HDFS side. (Yes, it is on the same WAL file for which the
> close had an issue.) We have a retry mechanism in createReader() for 5
> minutes, but eventually it throws LeaseNotRecoveredException. Yes, we try to
> handle it. But the problem is that in that call we pass the state variable
> currentPath, which is still null here!
> This will throw an NPE:
> {code}
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.FileSystem.fixRelativePart(FileSystem.java:2635)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.fixRelativePart(DistributedFileSystem.java:3087)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.recoverLease(DistributedFileSystem.java:297)
>   at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverLease(FSHDFSUtils.java:283)
>   at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:216)
>   at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.java:163)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.recoverLease(WALEntryStream.java:387)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:370)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:343)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:362)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:303)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:294)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:175)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:101)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:192)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:138)
> {code}
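> The NPE itself is mechanical: FileSystem.fixRelativePart() dereferences the
> Path it is handed before any RPC happens, so a null path fails immediately. A
> minimal standalone illustration (hypothetical snippet, not HBase code):
> {code}
> // Hypothetical standalone illustration, not HBase code.
> // fixRelativePart() inspects the given Path, so a null argument throws NPE
> // before any call to the NameNode is attempted.
> FileSystem fs = FileSystem.get(new Configuration());
> DistributedFileSystem dfs = (DistributedFileSystem) fs;
> dfs.recoverLease(null); // -> java.lang.NullPointerException in fixRelativePart
> {code}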
> This will kill the RS!
> When the next RS takes over this replication queue source from the failed RS,
> it also got an NPE.
> Here, in the call to recoverLease(), we should have been passing the method
> param 'path'. All other calls from this method do it that way, but it seems
> this one was a miss!
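> In other words, the fix is a one-liner in the LeaseNotRecoveredException
> handler of openReader(): use the method parameter instead of the field (a
> sketch of the intended change; the committed patch may differ in detail):
> {code}
> } catch (LeaseNotRecoveredException lnre) {
>   // HBASE-15019 the WAL was not closed due to some hiccup.
>   LOG.warn("Try to recover the WAL lease " + path, lnre);
>   recoverLease(conf, path); // was: recoverLease(conf, currentPath)
>   reader = null;
> }
> {code}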