Hi Edward,

Did the fixes in LOG4J2-254 fix your file rolling issue?

What are your thoughts on how to improve the spooling directory source's error handling when it detects a change in the file? Just bail and retry later? I suppose that's a pretty reasonable approach.
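For what it's worth, the "bail and retry later" idea could be as simple as counting consecutive failed checks on the current file and only giving up after a few polls. A minimal sketch — the class and method names here are hypothetical, not part of the Flume API:

```java
// Hypothetical sketch of "bail and retry later" for the spooldir source.
// SpoolRetryPolicy is an illustrative name, not a Flume class.
class SpoolRetryPolicy {
    private final int maxAttempts;
    private int failures = 0;

    SpoolRetryPolicy(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Record a failed consistency check on the current file. Returns true
     *  if the caller should leave the file in place and retry on the next
     *  poll, false if it should give up on that file. */
    boolean shouldRetry() {
        failures++;
        return failures < maxAttempts;
    }

    /** Call when the file finally passes its checks. */
    void reset() {
        failures = 0;
    }
}
```

The point is that a transient mismatch (e.g. a copy still in flight) stops the current poll rather than killing the whole source.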
Regards,
Mike

On Tue, May 14, 2013 at 4:50 PM, Edward Sargisson <[email protected]> wrote:
> Unless I'm mistaken (and concurrent code is easy to be mistaken about), this
> is a race condition in the apache-log4j-extras RollingFileAppender. I live in
> hope that when log4j2 becomes GA we can move to it and then be able to use
> it to log Flume itself.
>
> Evidence:
> File: castellan-reader.20130514T2058.log.COMPLETED
> 2013-05-14 20:57:05,330 INFO ...
>
> File: castellan-reader.20130514T2058.log
> 2013-05-14 21:23:05,709 DEBUG ...
>
> Why would an event from 21:23 be written into a file from 20:58?
>
> My understanding of log4j is that the RollingFileAppenders end up
> calling this:
>
> FileAppender:
> public synchronized void setFile(String fileName, boolean append,
>     boolean bufferedIO, int bufferSize)
>
> which shortly calls:
>
> this.qw = new QuietWriter(writer, errorHandler);
>
> However, the code that actually writes to the writer is this:
>
> protected void subAppend(LoggingEvent event) {
>     this.qw.write(this.layout.format(event));
>
> Unless I'm mistaken, there's no happens-before edge between setting the qw
> and calling subAppend. The code path to get to subAppend appears not to go
> through any method synchronized on FileAppender's monitor, and this.qw is
> not volatile.
>
> Oh, and based on my cursory inspection of the log4j2 code, this exists in
> log4j2 as well. I've just raised LOG4J2-254 to cover it. We'll see if I'm
> actually right...
>
> Cheers,
> Edward
>
> On Mon, May 13, 2013 at 8:45 AM, Edward Sargisson <[email protected]> wrote:
>
> > Hi Mike,
> > Based on my reading of the various logging frameworks' source code and
> > the Java documentation, I have come to the conclusion that relying on an
> > atomic move is not wise. (Next time I see this I might try to prove that
> > the spooled file is incomplete.)
> >
> > So I suggest two things:
> > 1) A breach of that check should not cause the entire Flume instance to
> > stop passing traffic.
> > 2) A configurable wait time might work. If you're using the spooling
> > source then you've already decided to have some latency, so a little
> > more is fine. However, there is still a risk of a race condition because
> > there is no signal that the copy is finished.
> >
> > Cheers,
> > Edward
> >
> > "Hi Edward,
> > Thanks for investigating. I'm definitely open to suggestions for
> > improvement with this. Maybe dying is a bit extreme… the goal was to
> > ensure that people could not possibly try to use it to tail a file,
> > which will definitely not work correctly! :)
> >
> > Mike
> >
> > On Fri, May 10, 2013 at 5:02 PM, Edward Sargisson <[email protected]> wrote:
> >
> > > Hi Mike,
> > > I was curious, so I went on a bit of a hunt through logger source code.
> > > The result is that loggers can't be relied on to atomically roll the
> > > file, so a feature to allow a delay before checking the file would be
> > > of great utility.
> > >
> > > For that matter, having Flume not die completely in this scenario
> > > would also be good.
> > >
> > > apache-log4j-extras does this [1]:
> > > return source.renameTo(destination);
> > >
> > > logback does this [2]:
> > > boolean result = srcFile.renameTo(targetFile);
> > >
> > > log4j2 does this [3]:
> > > srcStream = new FileInputStream(source);
> > > destStream = new FileOutputStream(destination);
> > > srcChannel = srcStream.getChannel();
> > > destChannel = destStream.getChannel();
> > > destChannel.transferFrom(srcChannel, 0, srcChannel.size());
> > >
> > > The JavaDoc for File.renameTo says:
> > > Many aspects of the behavior of this method are inherently
> > > platform-dependent: The rename operation might not be able to move a
> > > file from one filesystem to another, it might not be atomic, and it
> > > might not succeed if a file with the destination abstract pathname
> > > already exists.
> > > The return value should always be checked to make sure
> > > that the rename operation was successful.
> > >
> > > My conclusion is that the loggers (except possibly log4j2) can't be
> > > relied on to atomically roll the file.
> > >
> > > Cheers,
> > > Edward
> > >
> > > Links:
> > > [1]
> > > http://svn.apache.org/viewvc/logging/log4j/companions/extras/trunk/src/main/java/org/apache/log4j/rolling/helper/FileRenameAction.java?view=markup
> > > line 77
> > > [2]
> > > https://github.com/qos-ch/logback/blob/master/logback-core/src/main/java/ch/qos/logback/core/rolling/helper/RenameUtil.java
> > > line 63
> > > [3]
> > > https://svn.apache.org/repos/asf/logging/log4j/log4j2/trunk/core/src/main/java/org/apache/logging/log4j/core/appender/rolling/helper/FileRenameAction.java
> > >
> > > > Hi Edward,
> > > > Spooldir assumes that the file is atomically moved into the directory
> > > > in question. Maybe we should make it wait until the last mod time is
> > > > >10 seconds old, or something, to avoid situations where a readdir +
> > > > stat catches it during a file copy operation and thinks you're
> > > > changing the files. If you're sure that the file name is not being
> > > > reused, then maybe it's a bug.
> > > >
> > > > We require file names to be unique because we don't have a great way
> > > > to read the inode with Java.
> > > >
> > > > Mike
> > > >
> > > > On Wed, May 8, 2013 at 9:04 AM, Edward Sargisson <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > > I've been experimenting with using Flume to transport Flume's own logs.
> > > > >
> > > > > My setup is that I have the apache-log4j-extras RollingFileAppender
> > > > > rolling the file every minute into a spool directory. Then I have
> > > > > Flume with the spooldir source reading that directory.
> > > > >
> > > > > Very often I see:
> > > > > java.lang.IllegalStateException: File name has been re-used with
> > > > > different files. Spooling assumption violated for
> > > > > /opt/castellan/log/spool/castellan-reader.20130430T2055.log.COMPLETED
> > > > >     at org.apache.flume.client.avro.SpoolingFileLineReader.retireCurrentFile(SpoolingFileLineReader.java:272)
> > > > >
> > > > > Then it never recovers, repeatedly saying:
> > > > > 2013-05-08 15:57:19,233 ERROR [pool-10-thread-1]
> > > > > org.apache.flume.source.SpoolDirectorySource Uncaught exception in Runnable
> > > > > java.io.IOException: Stream closed
> > > > >     at java.io.BufferedReader.ensureOpen(BufferedReader.java:115)
> > > > >     at java.io.BufferedReader.readLine(BufferedReader.java:310)
> > > > >     at java.io.BufferedReader.readLine(BufferedReader.java:382)
> > > > >     at org.apache.flume.client.avro.SpoolingFileLineReader.readLines(SpoolingFileLineReader.java:180)
> > > > >
> > > > > I note that 1.4 has extensive changes to the spooldir source but
> > > > > nothing that really seemed to speak to this.
> > > > >
> > > > > Does anybody else see this? Anything I should look for? Or is this
> > > > > some kind of race condition between log4j and spooldir?
> > > > >
> > > > > Cheers,
> > > > > Edward
> > >"
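[Editorial note on the visibility issue Edward describes: one conventional fix is to publish the writer through a volatile field, which creates the missing happens-before edge between the rollover thread and the appending thread. The sketch below is an illustration of that technique only, not the actual log4j source.]

```java
import java.io.IOException;
import java.io.Writer;

// Illustration of the hazard: setFile() is synchronized, but if subAppend()
// does not synchronize on the same monitor and the writer field is not
// volatile, an appending thread may keep writing to the pre-rollover writer.
// Declaring the field volatile guarantees appenders see the latest rollover.
class SketchAppender {
    private volatile Writer qw;   // volatile publishes rollovers to appenders

    synchronized void setFile(Writer newWriter) {
        this.qw = newWriter;      // rollover installs a fresh writer
    }

    void subAppend(String formatted) throws IOException {
        qw.write(formatted);      // volatile read sees the latest writer
    }
}
```

Synchronizing subAppend on the appender's monitor would also work, at the cost of contending with rollovers on every write.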

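[Editorial note: since Java 7, `java.nio.file.Files.move` with `ATOMIC_MOVE` avoids the `File.renameTo` pitfalls quoted above — it either moves the file in one step or throws, instead of returning a boolean that is easy to ignore. A sketch of how a producer could hand rolled files to the spool directory; this is not code from any of the loggers discussed, and the fallback strategy is an assumption:]

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch: hand a rolled log file to the spool directory without the spooldir
// source ever observing a half-written file.
final class AtomicSpool {
    static void moveIntoSpool(Path rolled, Path spoolDir) throws IOException {
        Path target = spoolDir.resolve(rolled.getFileName().toString());
        try {
            // Atomic where the filesystem supports it; throws on failure
            // rather than returning false like File.renameTo.
            Files.move(rolled, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // Cross-filesystem move: stage a copy under a temporary name the
            // source ignores, then rename within the same filesystem.
            Path tmp = spoolDir.resolve(target.getFileName() + ".tmp");
            Files.copy(rolled, tmp, StandardCopyOption.REPLACE_EXISTING);
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
            Files.delete(rolled);
        }
    }
}
```

This addresses the producer side only; the spooldir source still benefits from a grace period on last-modified time, as suggested in the thread.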