Hi Mike,
I haven't tried log4j2 in my environments, but from my review of the log4j2
change it should work.

What would I change?
Phil Scala may have some thoughts.

It would be nice if we thought through the file locking. I want to be able
to put a file in the spooldir and know that Flume isn't going to start
processing it until I'm ready. This certainly involves thinking about what
the file-putting process is doing, but it's not clear to me how to make
this whole sequence safe.
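One convention that would make this safe (assuming the producer writes on the same filesystem as the spool directory, so a rename is atomic, and that the source is configured to ignore the temporary suffix) is to write under a name the source skips and rename into place only when the file is complete. A minimal sketch; the class name, suffix, and file names are made up for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SpoolDrop {
    // Write the payload under a ".tmp" name, then atomically rename it into
    // place. A spooldir source configured to ignore "*.tmp" never sees a
    // half-written file.
    public static Path drop(Path spoolDir, String name, byte[] payload)
            throws IOException {
        Path tmp = spoolDir.resolve(name + ".tmp");
        Path dst = spoolDir.resolve(name);
        Files.write(tmp, payload);
        // ATOMIC_MOVE throws AtomicMoveNotSupportedException rather than
        // silently falling back to a copy when the rename can't be atomic.
        return Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

The key property is that the destination name appears in the directory only after all bytes are on disk, so no "wait and hope" delay is needed.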

The thing that is currently annoying is handling stack traces. All logging
systems I've seen (except recent log4j2) output the stack trace with each
frame on a new line. This means that each frame gets its own log event and
the timestamp has to be added by Flume (instead of taken from the original
event). That Flume timestamp might be delayed by up to 1 minute because of
log rolling, so it's pretty crap. Logstash has a multiline filter that
somewhat solves this.
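That kind of multiline grouping can be sketched as: any line that does not start with a timestamp is a continuation of the previous event. This is a rough illustration of the idea, not Logstash's actual filter; the timestamp pattern is an assumption matching a typical log4j layout:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class MultilineJoin {
    // Lines beginning "yyyy-MM-dd HH:mm:ss" start a new event; anything else
    // (stack frames, "Caused by:" lines) is appended to the previous event,
    // so the whole trace keeps the original event's timestamp.
    private static final Pattern EVENT_START =
        Pattern.compile("^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}.*");

    public static List<String> join(List<String> lines) {
        List<String> events = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (current == null || EVENT_START.matcher(line).matches()) {
                if (current != null) {
                    events.add(current.toString());
                }
                current = new StringBuilder(line);
            } else {
                current.append('\n').append(line);
            }
        }
        if (current != null) {
            events.add(current.toString());
        }
        return events;
    }
}
```

A real implementation would also need a timeout to flush the last event, since the final trace has no following timestamped line to close it.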

My current approach is to try and get the Log4j2 FlumeAppender and Flume
1.3.1 reliable and trustworthy.

Cheers,
Edward

"Hi Edward,
Did the fixes in LOG4J2-254 fix your file rolling issue?

What are your thoughts on how to improve spooling directory source's error
handling when it detects a change in the file? Just bail and retry later? I
suppose that's a pretty reasonable approach.

Regards,
Mike


On Tue, May 14, 2013 at 4:50 PM, Edward Sargisson <[email protected]> wrote:

> Unless I'm mistaken (and concurrent code is easy to be mistaken about)
> this is a race condition in apache-log4j-extras RollingFileAppender. I
> live in hope that when log4j2 becomes GA we can move to it and then be
> able to use it to log Flume itself.
>
> Evidence:
> File: castellan-reader.20130514T2058.log.COMPLETED
> 2013-05-14 20:57:05,330  INFO ...
>
> File: castellan-reader.20130514T2058.log
> 2013-05-14 21:23:05,709 DEBUG ...
>
> Why would an event from 21:23 be written into a file from 20:58?
>
> My understanding of log4j shows that the RollingFileAppenders end up
> calling this:
> FileAppender:
> public synchronized void setFile(String fileName, boolean append,
>                                  boolean bufferedIO, int bufferSize)
>
> Which shortly calls:
> this.qw = new QuietWriter(writer, errorHandler);
>
> However, the code to actually write to the writer is this:
> protected void subAppend(LoggingEvent event) {
>     this.qw.write(this.layout.format(event));
>
> Unless I'm mistaken there's no happens-before edge between setting the qw
> and calling subAppend. The code path to get to subAppend appears not to
> go through any method synchronized on FileAppender's monitor, and this.qw
> is not volatile.
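The shape of that bug, reduced to a toy class (illustrative only, not the actual log4j source): the field is written under the monitor but read without it, so the Java memory model provides no happens-before edge and the appending thread may keep seeing a stale, or unsafely published, writer.

```java
public class PublicationRace {
    // Stand-in for log4j's QuietWriter.
    static class Writer {
        final String name;
        Writer(String name) { this.name = name; }
    }

    // Not volatile -- mirrors the field in question.
    private Writer qw = new Writer("old");

    // The write is published under 'this' monitor, as in setFile()...
    public synchronized void setFile(String fileName) {
        this.qw = new Writer(fileName);
    }

    // ...but this read takes no lock, so there is no happens-before edge
    // with setFile(): a rolling thread's new Writer may never become
    // visible here.
    public String subAppend() {
        return qw.name;
    }

    // Minimal fixes: declare qw volatile, or make subAppend() synchronized
    // so both access paths share the same monitor.
}
```

Single-threaded use behaves as expected; the hazard only appears when rolling and appending happen on different threads, which is exactly the RollingFileAppender situation.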
>
> Oh, and based on my cursory inspection of the log4j2 code this exists in
> log4j2 as well. I've just raised LOG4J2-254 to cover it. We'll see if I'm
> actually right...
>
> Cheers,
> Edward
>
>
>
>
> On Mon, May 13, 2013 at 8:45 AM, Edward Sargisson <[email protected]>
> wrote:
>
> > Hi Mike,
> > Based on my reading of the various logging frameworks' source code and
> > the Java documentation, I have come to the conclusion that relying on
> > an atomic move is not wise. (Next time I see this I might try to prove
> > that the spooled file is incomplete.)
> >
> > So I suggest two things:
> > 1) A breach of that check should not cause the entire Flume instance to
> > stop passing traffic.
> > 2) A configurable wait time might work. If you're using the spooling
> > source then you've already decided to tolerate some latency, so a
> > little more is fine. However, there is still a risk of a race condition
> > because there is no signal that the copy is finished.
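Such a configurable wait might look like the following sketch (the class name and threshold are hypothetical): a file is only considered ready once it has gone unmodified for a quiet interval.

```java
import java.io.File;

public class QuietFileFilter {
    private final long quietMillis;

    public QuietFileFilter(long quietMillis) {
        this.quietMillis = quietMillis;
    }

    // Consider a spooled file ready only once it has been quiet (unmodified)
    // for the configured interval. This narrows the window in which a slow,
    // non-atomic copy is mistaken for a complete file, but it is only a
    // heuristic: nothing actually signals that the writer has finished.
    public boolean isReady(File f, long nowMillis) {
        return nowMillis - f.lastModified() >= quietMillis;
    }
}
```

As the email notes, this cannot close the race entirely; it only trades latency for a smaller probability of reading a half-copied file.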
> >
> > Cheers,
> > Edward
> >
> > "Hi Edward,
> > Thanks for investigating. I'm definitely open to suggestions for
> > improvement with this. Maybe dying is a bit extreme… the goal was to
> > ensure that people could not possibly try to use it to tail a file,
> > which will definitely not work correctly! :)
> >
> > Mike
> >
> >
> >
> > On Fri, May 10, 2013 at 5:02 PM, Edward Sargisson <[email protected]>
> > wrote:
> >
> > > Hi Mike,
> > > I was curious so I went on a bit of a hunt through logger source code.
> > > The result is that loggers can't be relied on to atomically roll the
> > > file so a feature to allow a delay before checking the file would be
> > > of great utility.
> > >
> > > For that matter, having Flume not die completely in this scenario
> > > would also be good.
> > >
> > > apache-log4j-extras does this [1]:
> > > return source.renameTo(destination);
> > >
> > > logback does this [2]:
> > > boolean result = srcFile.renameTo(targetFile);
> > >
> > > log4j2 does this [3]:
> > > srcStream = new FileInputStream(source);
> > > destStream = new FileOutputStream(destination);
> > > srcChannel = srcStream.getChannel();
> > > destChannel = destStream.getChannel();
> > > destChannel.transferFrom(srcChannel, 0, srcChannel.size());
> > >
> > > The JavaDoc for File.renameTo says:
> > >  Many aspects of the behavior of this method are inherently
> > >  platform-dependent: The rename operation might not be able to move a
> > >  file from one filesystem to another, it might not be atomic, and it
> > >  might not succeed if a file with the destination abstract pathname
> > >  already exists.  The return value should always be checked to make
> > >  sure that the rename operation was successful.
> > >
> > > My conclusion is that the loggers (except possibly log4j2) can't be
> > > relied on to atomically roll the file.
> > >
> > > Cheers,
> > > Edward
> > >
> > >
> > > Links:
> > > [1] http://svn.apache.org/viewvc/logging/log4j/companions/extras/trunk/src/main/java/org/apache/log4j/rolling/helper/FileRenameAction.java?view=markup (line 77)
> > > [2] https://github.com/qos-ch/logback/blob/master/logback-core/src/main/java/ch/qos/logback/core/rolling/helper/RenameUtil.java (line 63)
> > > [3] https://svn.apache.org/repos/asf/logging/log4j/log4j2/trunk/core/src/main/java/org/apache/logging/log4j/core/appender/rolling/helper/FileRenameAction.java
> > >
> > >
> > > >Hi Edward,
> > > >Spooldir assumes that the file is atomically moved into the
> > > >directory in question. Maybe we should make it wait around until the
> > > >last mod time is >10 seconds old, or something, to avoid situations
> > > >where a readdir + stat catches it during a file copy operation and
> > > >thinks you're changing the files. If you're sure that the file name
> > > >is not being reused then maybe it's a bug.
> > >
> > > >We require the file name to be unique because we don't have a great
> > > >way to read the inode with Java.
> > >
> > > >Mike
> > >
> > >
> > > >On Wed, May 8, 2013 at 9:04 AM, Edward Sargisson <[email protected]>
> > > wrote:
> > >
> > > > Hi all,
> > > > I've been experimenting with using Flume to transport Flume's own
> logs.
> > > >
> > > > My setup is that I have the apache-log4j-extras RollingFileAppender
> > > rolling
> > > > the file every minute into a spool directory. Then I have Flume with
> > the
> > > > spooldir source reading that directory.
> > > >
> > > > Very often I see:
> > > > java.lang.IllegalStateException: File name has been re-used with
> > > > different files. Spooling assumption violated for
> > > > /opt/castellan/log/spool/castellan-reader.20130430T2055.log.COMPLETED
> > > >         at org.apache.flume.client.avro.SpoolingFileLineReader.retireCurrentFile(SpoolingFileLineReader.java:272)
> > > >
> > > > Then it never recovers, repeatedly saying:
> > > > 2013-05-08 15:57:19,233 ERROR [pool-10-thread-1]
> > > > org.apache.flume.source.SpoolDirectorySource Uncaught exception in
> > > > Runnable
> > > > java.io.IOException: Stream closed
> > > >         at java.io.BufferedReader.ensureOpen(BufferedReader.java:115)
> > > >         at java.io.BufferedReader.readLine(BufferedReader.java:310)
> > > >         at java.io.BufferedReader.readLine(BufferedReader.java:382)
> > > >         at org.apache.flume.client.avro.SpoolingFileLineReader.readLines(SpoolingFileLineReader.java:180)
> > > >
> > > > I note that 1.4 has extensive changes to the spooldir source but
> > > > nothing that really seemed to speak to this.
> > > >
> > > > Does anybody else see this? Anything I should look for? Or is this
> > > > some kind of race condition between log4j and spooldir?
> > > >
> > > > Cheers,
> > > > Edward
> > > >
> > >"
> >
>"
