Hi Edward,
Did the fixes in LOG4J2-254 resolve your file rolling issue?

What are your thoughts on how to improve the spooling directory source's error
handling when it detects a change in the file? Just bail and retry later? I
suppose that's a pretty reasonable approach.
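
To make the "bail and retry later" idea concrete, here is a minimal sketch of a poll loop that skips the offending file and re-queues it for the next pass instead of stopping the whole source. The names here (pollOnce, FileChangedException, the ".inflight" check) are illustrative only, not Flume's actual API:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class BailAndRetry {
    // Stand-in for the "file changed under us" condition the source detects.
    static class FileChangedException extends RuntimeException {}

    // Drains the pending queue; a file that fails its consistency check is
    // moved to retryLater rather than aborting the loop. Returns how many
    // files were consumed this pass.
    public static int pollOnce(Queue<String> pending, Queue<String> retryLater) {
        int consumed = 0;
        while (!pending.isEmpty()) {
            String file = pending.remove();
            try {
                consume(file);
                consumed++;
            } catch (FileChangedException e) {
                // Bail on this file only; it gets another chance next pass.
                retryLater.add(file);
            }
        }
        return consumed;
    }

    // Toy consumer: treats ".inflight" files as still being copied.
    static void consume(String file) {
        if (file.endsWith(".inflight")) {
            throw new FileChangedException();
        }
    }
}
```

The point of the sketch is simply that the exception stays local to one file, so one violated assumption cannot wedge the entire source the way the "Stream closed" loop below does.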

Regards,
Mike


On Tue, May 14, 2013 at 4:50 PM, Edward Sargisson <[email protected]> wrote:

> Unless I'm mistaken (and concurrent code is easy to be mistaken about) this
> is a race condition in apache-log4j-extras RollingFileAppender. I live in
> hope that when log4j2 becomes GA we can move to it and then be able to use
> it to log Flume itself.
>
> Evidence:
> File: castellan-reader.20130514T2058.log.COMPLETED
> 2013-05-14 20:57:05,330  INFO ...
>
> File: castellan-reader.20130514T2058.log
> 2013-05-14 21:23:05,709 DEBUG ...
>
> Why would an event from 2123 be written into a file from 2058?
>
> My understanding of log4j shows that the RollingFileAppenders end up
> calling this:
> FileAppender:
> public  synchronized  void setFile(String fileName, boolean append, boolean
> bufferedIO, int bufferSize)
>
> Which shortly calls:
> this.qw = new QuietWriter(writer, errorHandler);
>
> However, the code to actually write to the writer is this:
> protected
>   void subAppend(LoggingEvent event) {
>     this.qw.write(this.layout.format(event));
>
> Unless I'm mistaken there's no happens-before edge between setting the qw
> and calling subAppend. The code path to get to subAppend appears not to go
> through any method synchronized on FileAppender's monitor, and this.qw is
> not volatile.
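
As a minimal model of the hazard described above (not the actual log4j source), the sketch below shows how declaring the shared writer field volatile supplies the missing happens-before edge: a thread that observes the new reference also observes the state it was constructed with.

```java
import java.io.StringWriter;

public class QwRace {
    // volatile supplies the happens-before edge that a plain field lacks;
    // StringWriter stands in for log4j's QuietWriter here.
    private volatile StringWriter qw = new StringWriter();

    // Models FileAppender.setFile(...) swapping in a fresh writer on roll.
    public void setFile() {
        this.qw = new StringWriter(); // safe publication via volatile write
    }

    // Models subAppend(LoggingEvent): may run on a different thread than
    // setFile, with no common monitor between them.
    public void subAppend(String msg) {
        this.qw.write(msg); // volatile read sees a fully constructed writer
    }

    public String contents() {
        return qw.toString();
    }
}
```

Without volatile (or a shared lock on both paths), an appending thread could keep writing through a stale writer after the roll, which would explain a 21:23 event landing in the 20:58 file.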
>
> Oh, and based on my cursory inspection of the log4j2 code this exists in
> log4j2 as well. I've just raised log4j2-254 to cover it. We'll see if I'm
> actually right...
>
> Cheers,
> Edward
>
>
>
>
> On Mon, May 13, 2013 at 8:45 AM, Edward Sargisson <[email protected]>
> wrote:
>
> > Hi Mike,
> > Based on my reading of the various logging frameworks' source code and the
> > Java documentation I have come to the conclusion that relying on an atomic
> > move is not wise. (Next time I see this I might try to prove that the
> > spooled file is incomplete.)
> >
> > So I suggest two things:
> > 1) A breach of that check should not cause the entire Flume instance to
> > stop passing traffic.
> > 2) A configurable wait time might work. If you're using the spooling
> > source then you've already decided to have some latency, so a little more
> > is fine. However, there is still a risk of a race condition because there
> > is no signal that the copy is finished.
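
A rough sketch of that configurable-wait idea (a hypothetical helper, not Flume code): only files whose last-modified timestamp has been stable for a configurable quiet period are handed to the reader, and the rest wait for a later pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class QuietPeriodCheck {
    // Returns the files whose last-modified time is at least quietMillis
    // older than nowMillis; anything fresher is assumed to still be
    // mid-copy and is left for the next polling pass. This narrows the
    // race window but, as noted above, cannot close it: there is still
    // no positive signal that the copy has finished.
    public static List<String> ripeFiles(Map<String, Long> lastModified,
                                         long nowMillis, long quietMillis) {
        List<String> ripe = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastModified.entrySet()) {
            if (nowMillis - e.getValue() >= quietMillis) {
                ripe.add(e.getKey());
            }
        }
        return ripe;
    }
}
```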
> >
> > Cheers,
> > Edward
> >
> > "Hi Edward,
> > Thanks for investigating. I'm definitely open to suggestions for
> > improvement with this. Maybe dying is a bit extreme… the goal was to
> > ensure that people could not possibly try to use it to tail a file, which
> > will definitely not work correctly! :)
> >
> > Mike
> >
> >
> >
> > On Fri, May 10, 2013 at 5:02 PM, Edward Sargisson <[email protected]>
> > wrote:
> >
> > > Hi Mike,
> > > I was curious so I went on a bit of a hunt through logger source code.
> > > The result is that loggers can't be relied on to atomically roll the
> > > file so a feature to allow a delay before checking the file would be
> > > of great utility.
> > >
> > > For that matter, having Flume not die completely in this scenario
> > > would also be good.
> > >
> > > apache-log4j-extras does this [1]:
> > > return source.renameTo(destination);
> > >
> > > logback does this [2]:
> > > boolean result = srcFile.renameTo(targetFile);
> > >
> > > log4j2 does this [3]:
> > > srcStream = new FileInputStream(source);
> > > destStream = new FileOutputStream(destination);
> > > srcChannel = srcStream.getChannel();
> > > destChannel = destStream.getChannel();
> > > destChannel.transferFrom(srcChannel, 0, srcChannel.size());
> > >
> > > The JavaDoc for File.renameTo says:
> > >  Many aspects of the behavior of this method are inherently
> > >  platform-dependent: The rename operation might not be able to move a
> > >  file from one filesystem to another, it might not be atomic, and it
> > >  might not succeed if a file with the destination abstract pathname
> > >  already exists.  The return value should always be checked to make
> > >  sure that the rename operation was successful.
> > >
> > > My conclusion is that the loggers (except possibly log4j2) can't be
> > > relied on to atomically roll the file.
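
One hedged alternative, assuming Java 7+ NIO is available to the logger: java.nio.file.Files.move with ATOMIC_MOVE either performs the rename atomically or throws AtomicMoveNotSupportedException (for example, across filesystems), so the non-atomic case is at least detectable. A sketch:

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicRoll {
    // Rolls source to destination. Returns true if the move was atomic;
    // false if we had to fall back to a plain move, in which case a
    // spooling consumer may observe a partially copied file.
    public static boolean rollAtomically(Path source, Path destination)
            throws IOException {
        try {
            Files.move(source, destination, StandardCopyOption.ATOMIC_MOVE);
            return true;
        } catch (AtomicMoveNotSupportedException e) {
            Files.move(source, destination, StandardCopyOption.REPLACE_EXISTING);
            return false;
        }
    }
}
```

Unlike File.renameTo, this makes the atomicity question explicit instead of silently degrading, which is exactly the guarantee the spooldir source is assuming.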
> > >
> > > Cheers,
> > > Edward
> > >
> > >
> > > Links:
> > > [1]
> > >
> >
> http://svn.apache.org/viewvc/logging/log4j/companions/extras/trunk/src/main/java/org/apache/log4j/rolling/helper/FileRenameAction.java?view=markup
> > > l77
> > >
> > > [2]
> > >
> >
> https://github.com/qos-ch/logback/blob/master/logback-core/src/main/java/ch/qos/logback/core/rolling/helper/RenameUtil.java
> > > ,
> > > l63
> > > [3]
> > >
> >
> https://svn.apache.org/repos/asf/logging/log4j/log4j2/trunk/core/src/main/java/org/apache/logging/log4j/core/appender/rolling/helper/FileRenameAction.java
> > >
> > >
> > > >Hi Edward,
> > > >Spooldir assumes that the file is atomically moved into the directory
> > > >in question. Maybe we should make it wait around until the last mod
> > > >time is more than 10 seconds old or something, to avoid situations
> > > >where a readdir + stat catches it during a file copy operation and
> > > >thinks you're changing the files. If you're sure that the file name is
> > > >not being reused then maybe it's a bug.
> > >
> > > >We require the file name to be unique because we don't have a great
> > > >way to read the inode with Java.
> > >
> > > >Mike
> > >
> > >
> > > >On Wed, May 8, 2013 at 9:04 AM, Edward Sargisson <[email protected]>
> > > wrote:
> > >
> > > > Hi all,
> > > > I've been experimenting with using Flume to transport Flume's own
> logs.
> > > >
> > > > My setup is that I have the apache-log4j-extras RollingFileAppender
> > > rolling
> > > > the file every minute into a spool directory. Then I have Flume with
> > the
> > > > spooldir source reading that directory.
> > > >
> > > > Very often I see:
> > > > java.lang.IllegalStateException: File name has been re-used with
> > > > different files. Spooling assumption violated for
> > > > /opt/castellan/log/spool/castellan-reader.20130430T2055.log.COMPLETED
> > > >         at
> > > > org.apache.flume.client.avro.SpoolingFileLineReader.retireCurrentFile(SpoolingFileLineReader.java:272)
> > > >
> > > > Then it never recovers, repeatedly saying:
> > > > 2013-05-08 15:57:19,233 ERROR [pool-10-thread-1]
> > > > org.apache.flume.source.SpoolDirectorySource Uncau
> > > > ght exception in Runnable
> > > > java.io.IOException: Stream closed
> > > >         at java.io.BufferedReader.ensureOpen(BufferedReader.java:115)
> > > >         at java.io.BufferedReader.readLine(BufferedReader.java:310)
> > > >         at java.io.BufferedReader.readLine(BufferedReader.java:382)
> > > >         at
> > > >
> > > >
> > >
> >
> org.apache.flume.client.avro.SpoolingFileLineReader.readLines(SpoolingFileLineReader.java:180)
> > > >
> > > > I note that 1.4 has extensive changes to the spooldir source but
> > > > nothing that really seemed to speak to this.
> > > >
> > > > Does anybody else see this? Anything I should look for? Or is this
> > > > some kind of race condition between log4j and spooldir?
> > > >
> > > > Cheers,
> > > > Edward
> > > >
> > >"
> >
>
