[ 
https://issues.apache.org/jira/browse/HDFS-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723950#comment-14723950
 ] 

Colin Patrick McCabe commented on HDFS-8964:
--------------------------------------------

Thank you, [~zhz].

JNStorage: {{import com.google.common.annotations.VisibleForTesting;}} was 
added, but not used.

bq. I also found scanLog is identical to validateLog and removed it from two 
places.

They're not quite identical.  {{validateLog}} ends up repeatedly calling 
{{FSEditLogInputStream#readOp}}, whereas scanLog ends up repeatedly calling 
{{FSEditLogInputStream#scanNextOp}}.  In the past, {{scanNextOp}} did not 
validate checksums, whereas {{readOp}} did.  However, HDFS-8965 is changing 
{{FSEditLogInputStream#scanNextOp}} to validate checksums.  So I agree that it 
now makes sense to merge these two.

FileJournalManager#getRemoteEditLogs: this isn't going to work.
{code}
  public List<RemoteEditLog> getRemoteEditLogs(long firstTxId,
      boolean inProgressOk) throws IOException {
...
          elf.validateLog(firstTxId);
{code}
The problem is that by doing this, we set {{EditLogFile#lastTxId}} to 
{{firstTxid}}, and then all subsequent edits in the in-progress log are lost.  
Instead of doing this, you should pass in the known highest txid from 
Journal.java (taking care to pass {{Long.MAX_VALUE}} if this is not known).

Same issue in {{FileJournalManager#addStreamsToCollectionFromFiles}}.

{code}
    /** 
     * Find out where the edit log ends.
     * This will update the lastTxId of the EditLogFile or
     * mark it as corrupt if it is.
     */
    public void validateLog() throws IOException {
      validateLog(Long.MAX_VALUE);
    }

    public void validateLog(long maxTxId) throws IOException {
{code} 
I would prefer not to have this kind of function overloading, since it tends to 
obscure what is going on.  In that vein, I think there are more places where we 
are calling validateLog without a maxTxid, where we should actually have a max 
txid, like {{Journal#getSegmentInfo}}.

{code}
      if (maxTxId > 0 && lastTxId > maxTxId) {
        break;
      }
{code}
Why is maxTxid == 0 special here?  If we want to have no limit, we can just 
pass in {{Long.MAX_VALUE}}.  0 is a bad value to overload to mean "there is no 
limit" because it means you have to special-case it.  I also don't think this 
function correctly handles the case where we don't expect any edits in the log 
currently.  In this case, it is quite valid for the caller to pass 0 as 
maxTxid, and we should always return an EditLogValidate with {{maxTxid == 
startTxId == 0}}.  The check should be at the start of the loop, not the end.

{code} 
     File[] files = new File(f1, "current").listFiles(new FilenameFilter() {
-        @Override
-        public boolean accept(File dir, String name) {
-          if 
(name.startsWith(NNStorage.getFinalizedEditsFileName(startErrorTxId, 
-                                  endErrorTxId))) {
-            return true;
-          }
-          return false;
+      @Override
+      public boolean accept(File dir, String name) {
+        if (name.startsWith(NNStorage.getFinalizedEditsFileName(startErrorTxId,
+            endErrorTxId))) {
+          return true;
         }
-      });
+        return false;
+      }
+    });
{code}
Let's not mess with the whitespace here.  I don't see anything wrong with the 
original whitespace, and it makes backports harder.  Similar with {{writeInt}}

The unit test looks good to me.  Can you also test that if we call 
{{EditLogFileInputStream#validateEditLOg(editLogFile, finalTxId - 1)}} we don't 
get the last edit log op, but if we call {{validateEditLOg(editLogFile, 
finalTxId)}}, we do?  It should be easy to do by keeping track of the last edit 
log transaction id in the loop where we're writing transactions (probably can 
get it out of FSNamesystem).

> Provide max TxId when validating in-progress edit log files
> -----------------------------------------------------------
>
>                 Key: HDFS-8964
>                 URL: https://issues.apache.org/jira/browse/HDFS-8964
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: journal-node, namenode
>    Affects Versions: 2.7.1
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>         Attachments: HDFS-8964.00.patch, HDFS-8964.01.patch, 
> HDFS-8964.02.patch
>
>
> NN/JN validates in-progress edit log files in multiple scenarios, via 
> {{EditLogFile#validateLog}}. The method scans through the edit log file to 
> find the last transaction ID.
> However, an in-progress edit log file could be actively written to, which 
> creates a race condition and causes incorrect data to be read (and later we 
> attempt to interpret the data as ops).
> Currently {{validateLog}} is used in 3 places:
> # NN {{getEditsFromTxid}}
> # JN {{getEditLogManifest}}
> # NN/JN {{recoverUnfinalizedSegments}}
> In the first two scenarios we should provide a maximum TxId to validate in 
> the in-progress file. The 3rd scenario won't cause a race condition because 
> only non-current in-progress edit log files are validated.
> {{validateLog}} is actually only used with in-progress files, and could use a 
> better name and Javadoc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to