Hi Phil,

Since this is more of a dev discussion I'll just continue the conversation here on this list.
FYI, the latest Spooling Directory Source has support for resuming reading files. Trunk / Flume 1.4 have some new code around this aspect.

Regarding Linux, doing lsof is a pretty cool idea but not portable to all systems. Also, in Linux two processes are allowed to have the same file open for write (it's still a bad idea though). I don't know of a portable way to check whether some other process has a given file open. We could, however, check to see if the file changed, and if so just stop processing that file for a while and try again later. I just don't want people to think spooldir is good for "tailing" a file, because it's not... :)

Regards,
Mike

On Wed, May 22, 2013 at 5:24 PM, Phil Scala <[email protected]> wrote:

> Hey Ed / Mike
>
> Besides a comment in the users mailing list, where I noted that the file
> spool starts over from the beginning of the file if there was a failure.
> The code does have that well commented (in retireCurrentFile) [if you
> don't retire the file then you run the risk of duplicates... fine with my
> use :)]
>
> As Ed mentioned, we have been chatting about ensuring there are no
> invariants muddled up during file spool processing. I see this as 2 or 3
> pieces... I think the code is pretty solid, with one area I want to look
> into.
>
> I would like to give this more thought...
>
> The file the spool source has decided is the "next file"... is it in
> use? Has the "upload" to the spool directory completed?
>
> Discussions mentioned some "time" delay -> that could be artificial and
> still never solve the problem. I need to do some learning here; coming
> from Windows, the file locking was pretty exclusive. I want to look at
> FileChannel locks in NIO and Linux file management. This could maybe be
> an area to look at. Right now there are no locks obtained for the file
> being processed.
>
> I will come back with something a little better formulated soon...
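Mike's change-detection idea above could be sketched roughly like this. The class and method names are mine for illustration, not anything in Flume: snapshot a file's size and mtime before processing, re-check later, and back off if either changed.

```java
import java.io.File;

// Hypothetical sketch: record a file's (length, lastModified) pair and
// later check whether the file still matches. If it doesn't, stop
// processing that file for a while and retry it on a later pass.
public class FileSnapshot {
    private final long length;
    private final long lastModified;

    public FileSnapshot(File f) {
        this.length = f.length();
        this.lastModified = f.lastModified();
    }

    /** Returns true if the file still matches the snapshot. */
    public boolean unchanged(File f) {
        return f.length() == length && f.lastModified() == lastModified;
    }
}
```

Note this only detects a change between two observations; it cannot tell whether some other process still has the file open, which is exactly the portability gap Mike describes.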
> Thanks
>
> Phil Scala
> Software Developer / Architect
> Global Relay
>
> [email protected]
>
> 866.484.6630 | [email protected] | globalrelay.com
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Edward Sargisson
> Sent: Wednesday, May 22, 2013 12:22 PM
> To: Mike Percy; [email protected]
> Subject: Re: spooldir source reading Flume itself and thinking the file has changed (1.3.1)
>
> Hi Mike,
> I haven't tried log4j2 in my environments but my review of the log4j2
> change is that it should work.
>
> What would I change? Phil Scala may have some thoughts.
>
> It would be nice if we thought through the file locking. I want to be
> able to put a file in the spooldir and know that Flume isn't going to get
> started until I'm ready. This certainly involves thinking about what the
> file-putting process is doing, but it's not clear to me how to ensure
> this whole part is safe.
>
> The thing that is currently annoying is handling stack traces. All
> logging systems I've seen (except recent log4j2) output the stack trace
> with each frame on a new line. This means that each frame gets its own
> log event and the timestamp has to be added by Flume (instead of taken
> from the original event). That Flume timestamp might be delayed by up to
> 1 minute (because of log rolling), so it's pretty crap. Logstash has a
> multiline filter that somewhat solves this.
>
> My current approach is to try and get the Log4j2 FlumeAppender and Flume
> 1.3.1 reliable and trustworthy.
>
> Cheers,
> Edward
>
> "Hi Edward,
> Did the fixes in LOG4J2-254 fix your file rolling issue?
>
> What are your thoughts on how to improve the spooling directory source's
> error handling when it detects a change in the file? Just bail and retry
> later? I suppose that's a pretty reasonable approach.
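The multiline stack-trace problem Edward raises is usually attacked with a continuation heuristic, which is roughly what Logstash's multiline filter does. A rough Java sketch of such a heuristic (illustrative only, not Flume or Logstash code): a line that looks like a stack frame is folded into the previous event rather than becoming a new one.

```java
// Hypothetical heuristic: treat a log line as a continuation of the
// previous event if it looks like part of a stack trace, so a multi-line
// trace can be folded into one event instead of one event per frame.
public class MultilineHeuristic {
    public static boolean isContinuation(String line) {
        return line.startsWith("\t")           // indented frame (tab)
            || line.startsWith(" ")            // indented frame (spaces)
            || line.startsWith("at ")          // bare stack frame
            || line.startsWith("Caused by:")   // chained exception
            || line.startsWith("...");         // "... N more"
    }
}
```

A heuristic like this is lossy (a message that happens to start with whitespace gets folded too), which is why appending the whole trace as one event at the source, as recent log4j2 can, is the cleaner fix.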
> Regards,
> Mike
>
> On Tue, May 14, 2013 at 4:50 PM, Edward Sargisson <[email protected]>
> wrote:
>
> > Unless I'm mistaken (and concurrent code is easy to be mistaken about)
> > this is a race condition in the apache-log4j-extras
> > RollingFileAppender. I live in hope that when log4j2 becomes GA we can
> > move to it and then be able to use it to log Flume itself.
> >
> > Evidence:
> > File: castellan-reader.20130514T2058.log.COMPLETED
> > 2013-05-14 20:57:05,330 INFO ...
> >
> > File: castellan-reader.20130514T2058.log
> > 2013-05-14 21:23:05,709 DEBUG ...
> >
> > Why would an event from 21:23 be written into a file from 20:58?
> >
> > My understanding of log4j is that the RollingFileAppenders end up
> > calling this in FileAppender:
> >
> > public synchronized void setFile(String fileName, boolean append,
> >     boolean bufferedIO, int bufferSize)
> >
> > which shortly calls:
> >
> > this.qw = new QuietWriter(writer, errorHandler);
> >
> > However, the code that actually writes to the writer is this:
> >
> > protected void subAppend(LoggingEvent event) {
> >     this.qw.write(this.layout.format(event));
> >
> > Unless I'm mistaken there's no happens-before edge between setting the
> > qw and calling subAppend. The code path to get to subAppend appears
> > not to go through any method synchronized on FileAppender's monitor,
> > and this.qw is not volatile.
> >
> > Oh, and based on my cursory inspection of the log4j2 code this exists
> > in log4j2 as well. I've just raised LOG4J2-254 to cover it. We'll see
> > if I'm actually right...
> >
> > Cheers,
> > Edward
> >
> > On Mon, May 13, 2013 at 8:45 AM, Edward Sargisson <[email protected]>
> > wrote:
> >
> > > Hi Mike,
> > > Based on my reading of the various logging frameworks' source code
> > > and the Java documentation, I have come to the conclusion that
> > > relying on an atomic move is not wise. (Next time I see this I might
> > > try to prove that the spooled file is incomplete.)
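The visibility hazard Edward describes can be reduced to a minimal sketch. The names below are illustrative, not the real log4j classes, and subAppend here returns the observed writer so the sketch is testable: the field is written under the monitor but read without it, so the Java memory model gives a reader thread no happens-before edge and it may see a stale reference.

```java
// Minimal sketch of the hazard: a writer field swapped inside a
// synchronized method but read without synchronization.
public class AppenderSketch {
    private Object qw;                // not volatile: the problem
    // private volatile Object qw;    // one possible fix; another is to
    //                                // synchronize the reading path too

    public synchronized void setFile(Object newWriter) {
        this.qw = newWriter;          // write under the monitor
    }

    public Object subAppend() {
        return this.qw;               // read WITHOUT the monitor: another
                                      // thread may observe the old writer
    }
}
```

On a single thread the read always sees the write (as in the test below); the bug only shows up across threads, which is what makes it so easy to miss in review.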
> > > So I suggest two things:
> > > 1) A breach of that check should not cause the entire Flume instance
> > >    to stop passing traffic.
> > > 2) A configurable wait time might work. If you're using the spooling
> > >    source then you've already decided to have some latency, so a
> > >    little more is fine. However, there is still a risk of a race
> > >    condition because there is no signal that the copy is finished.
> > >
> > > Cheers,
> > > Edward
> > >
> > > "Hi Edward,
> > > Thanks for investigating. I'm definitely open to suggestions for
> > > improvement with this. Maybe dying is a bit extreme. The goal was to
> > > ensure that people could not possibly try to use it to tail a file,
> > > which will definitely not work correctly! :)
> > >
> > > Mike
> > >
> > > On Fri, May 10, 2013 at 5:02 PM, Edward Sargisson <[email protected]>
> > > wrote:
> > >
> > > > Hi Mike,
> > > > I was curious so I went on a bit of a hunt through logger source
> > > > code. The result is that loggers can't be relied on to atomically
> > > > roll the file, so a feature to allow a delay before checking the
> > > > file would be of great utility.
> > > >
> > > > For that matter, having Flume not die completely in this scenario
> > > > would also be good.
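Edward's configurable-wait suggestion (and Mike's related "last mod time" idea elsewhere in this thread) might look something like this minimal sketch; the class and method names are hypothetical, not Flume code. A file is only considered ready once its last-modified time is at least a configurable quiet period in the past.

```java
import java.io.File;

// Hypothetical quiescence check: skip a spooled file until it has been
// untouched for quietMillis, to avoid catching a copy in progress.
// nowMillis is passed in rather than read from the clock, for testability.
public class QuiescenceCheck {
    public static boolean isQuiet(File f, long quietMillis, long nowMillis) {
        return nowMillis - f.lastModified() >= quietMillis;
    }
}
```

As Edward notes, this shrinks the race window but does not close it, since there is still no positive signal that the copy has finished.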
> > > > apache-log4j-extras does this [1]:
> > > >
> > > > return source.renameTo(destination);
> > > >
> > > > logback does this [2]:
> > > >
> > > > boolean result = srcFile.renameTo(targetFile);
> > > >
> > > > log4j2 does this [3]:
> > > >
> > > > srcStream = new FileInputStream(source);
> > > > destStream = new FileOutputStream(destination);
> > > > srcChannel = srcStream.getChannel();
> > > > destChannel = destStream.getChannel();
> > > > destChannel.transferFrom(srcChannel, 0, srcChannel.size());
> > > >
> > > > The JavaDoc for File.renameTo says:
> > > >
> > > > "Many aspects of the behavior of this method are inherently
> > > > platform-dependent: The rename operation might not be able to move
> > > > a file from one filesystem to another, it might not be atomic, and
> > > > it might not succeed if a file with the destination abstract
> > > > pathname already exists. The return value should always be checked
> > > > to make sure that the rename operation was successful."
> > > >
> > > > My conclusion is that the loggers (except possibly log4j2) can't
> > > > be relied on to atomically roll the file.
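For comparison, Java 7's NIO.2 lets a roller request an atomic move explicitly and learn when the filesystem cannot provide one, instead of relying on File.renameTo's platform-dependent boolean. A hedged sketch under that assumption (Java 7+; the class and method names are mine, not any logger's):

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical roll helper: ask for an atomic move and report explicitly
// when the filesystem cannot do it (e.g. src and dest are on different
// filesystems), rather than silently falling back to a non-atomic copy.
public class AtomicRoll {
    public static boolean rollAtomically(Path src, Path dest) throws IOException {
        try {
            Files.move(src, dest, StandardCopyOption.ATOMIC_MOVE);
            return true;
        } catch (AtomicMoveNotSupportedException e) {
            // A copy + delete here would NOT be atomic, so surface the
            // condition to the caller instead of papering over it.
            return false;
        }
    }
}
```

None of the loggers surveyed above could use this at the time (log4j 1.x and logback targeted older JVMs), which is consistent with Edward's conclusion that the roll cannot be assumed atomic.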
> > > > Cheers,
> > > > Edward
> > > >
> > > > Links:
> > > > [1] http://svn.apache.org/viewvc/logging/log4j/companions/extras/trunk/src/main/java/org/apache/log4j/rolling/helper/FileRenameAction.java?view=markup, line 77
> > > > [2] https://github.com/qos-ch/logback/blob/master/logback-core/src/main/java/ch/qos/logback/core/rolling/helper/RenameUtil.java, line 63
> > > > [3] https://svn.apache.org/repos/asf/logging/log4j/log4j2/trunk/core/src/main/java/org/apache/logging/log4j/core/appender/rolling/helper/FileRenameAction.java
> > > >
> > > > > Hi Edward,
> > > > > Spooldir assumes that the file is atomically moved into the
> > > > > directory in question. Maybe we should make it wait around until
> > > > > the last mod time is > 10 seconds old or something, to avoid
> > > > > situations where a readdir + stat catches it during a file copy
> > > > > operation and thinks you're changing the files. If you're sure
> > > > > that the file name is not being reused then maybe it's a bug.
> > > > >
> > > > > We require file names to be unique because we don't have a great
> > > > > way to read the inode with Java.
> > > > >
> > > > > Mike
> > > > >
> > > > > On Wed, May 8, 2013 at 9:04 AM, Edward Sargisson
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi all,
> > > > > > I've been experimenting with using Flume to transport Flume's
> > > > > > own logs.
> > > > > >
> > > > > > My setup is that I have the apache-log4j-extras
> > > > > > RollingFileAppender rolling the file every minute into a spool
> > > > > > directory. Then I have Flume with the spooldir source reading
> > > > > > that directory.
> > > > > >
> > > > > > Very often I see:
> > > > > > java.lang.IllegalStateException: File name has been re-used
> > > > > > with different files.
> > > > > > Spooling assumption violated for
> > > > > > /opt/castellan/log/spool/castellan-reader.20130430T2055.log.COMPLETED
> > > > > >     at org.apache.flume.client.avro.SpoolingFileLineReader.retireCurrentFile(SpoolingFileLineReader.java:272)
> > > > > >
> > > > > > Then it never recovers, repeatedly saying:
> > > > > > 2013-05-08 15:57:19,233 ERROR [pool-10-thread-1]
> > > > > > org.apache.flume.source.SpoolDirectorySource Uncaught exception
> > > > > > in Runnable
> > > > > > java.io.IOException: Stream closed
> > > > > >     at java.io.BufferedReader.ensureOpen(BufferedReader.java:115)
> > > > > >     at java.io.BufferedReader.readLine(BufferedReader.java:310)
> > > > > >     at java.io.BufferedReader.readLine(BufferedReader.java:382)
> > > > > >     at org.apache.flume.client.avro.SpoolingFileLineReader.readLines(SpoolingFileLineReader.java:180)
> > > > > >
> > > > > > I note that 1.4 has extensive changes to the spooldir source
> > > > > > but nothing that really seemed to speak to this.
> > > > > >
> > > > > > Does anybody else see this? Anything I should look for? Or is
> > > > > > this some kind of race condition between log4j and spooldir?
> > > > > >
> > > > > > Cheers,
> > > > > > Edward
> > > > >"
> > > >"
