ResilientOutputStreamBase does not recover from transient network outage when 
the appender uses the LayoutWrappingEncoder
-------------------------------------------------------------------------------------------------------------------------

                 Key: LBCORE-246
                 URL: http://jira.qos.ch/browse/LBCORE-246
             Project: logback-core
          Issue Type: Bug
          Components: Layout
    Affects Versions: 0.9.30
         Environment: Windows 
            Reporter: Thomas P Rees
            Assignee: Logback dev list


Our customer has a problem with network glitches, so we upgraded from 0.9.17 to 
0.9.30 to take advantage of the ResilientOutputStream. The particular appender 
that caused problems at the customer site used a custom layout class, so we 
wrapped that class in a LayoutWrappingEncoder in build.xml:

  <appender name="AUDIT" 
class="ch.qos.logback.core.rolling.RollingFileAppender">
    <File>${logging.audit.root}/audit/audit.log</File>
    <encoder class="ch.qos.logback.core.encoder.LayoutWrappingEncoder">
      <layout class="com.chiliad.log.util.PropertiesLayout" />
    </encoder>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
     <!-- stuff deleted here -->
    </rollingPolicy>
  </appender>

We tested the ResilientOutputStream by editing logback.xml to write the logfile 
to another system, then we removed the ethernet cable from that other system. A 
couple minutes later we plugged the ethernet cable back into the other system. 
Stdout of the logging system reported

    2012/02/14 16:33:47 | INFO   | jvm 1    | srvmain | 16:33:47,414 |-ERROR in 
c.q.l.c.recovery.ResilientFileOutputStream@12297237 - IO failure while writing 
to file [\\cushman\e\logs\audit\audit.log] java.io.IOException: The specified 
network name is no longer available

There was also a long stack trace from Java with more detail about the 
IOException.

However immediately after this stack trace the following message appeared

    2012/02/14 16:33:47 | INFO   | jvm 1    | srvmain | 16:33:47,408 |-INFO in 
c.q.l.c.recovery.ResilientFileOutputStream@12297237 - Recovered from IO failure 
on file [\\cushman\e\logs\audit\audit.log]

Furthermore, logging did not actually recover. The audit log on the remote 
server did not get any more messages, and stdout continued to record the 
messaages shown above, always paired with the complaint about the IO failure 
and then saying that it recovered from the IO failure.

I downloaded the source code and found the source of the problem. The 
LayoutWrappingEncoder always calls write and then flush together:

    outputStream.write(convertToBytes(txt));
    outputStream.flush();

In ResilientOutputStreamBase the write method checks for a bad state in the 
output stream, but the flush method does not:

  public void write(int b) {
    if (isPresumedInError()) {
      if (!recoveryCoordinator.isTooSoon()) {
        attemptRecovery();
      }
      return; // return regardless of the success of the recovery attempt
    }
    try {
      os.write(b);
      postSuccessfulWrite();
    } catch (IOException e) {
      postIOFailure(e);
    }
  }

  public void flush() {
    if (os != null) {
      try {
        os.flush();
        postSuccessfulWrite();
      } catch (IOException e) {
        postIOFailure(e);
      }
    }
  }

So, the write method encounters the IOException and sets the presumedClean flag 
to false. However, the flush method does not call isPresumedInError and the 
flush succeeds (I presume because there is no data to flush). The flush method 
then calls postSuccessfulWrite, which sets the presumedClean flag to true.

I fixed this problem by changing the flush method to call isPresumedInError::

  public void flush() {
    if (os != null) {
      if (isPresumedInError()) {
        if (!recoveryCoordinator.isTooSoon()) {
          attemptRecovery();
        }
        return; // return regardless of the success of the recovery attempt
      }
      try {
        os.flush();
        postSuccessfulWrite();
      } catch (IOException e) {
        postIOFailure(e);
      }
    }
  }

Now when we run the test of unplugging the ethernet cable on the system where 
the audit logs are written and then plug the cable back in the logger starts 
logging again and the logfile resumes recording messages.

Does this solution look correct to you?
thanks
Tom


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://jira.qos.ch/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
_______________________________________________
logback-dev mailing list
[email protected]
http://mailman.qos.ch/mailman/listinfo/logback-dev

Reply via email to