Right. Flume will miss any data that was logged while it was down because
Flume simply uses tail -F with ExecSource.
*Your implementation remembers the last file (inode?) it tailed + position
in that file?*
-It remembers only the last rotation; if the file rotates more than once, you'll
lose data. I couldn't use inodes because that requires Java 7 and Flume is
developed in Java 6, so I save the last-modification date to identify the last
rotated file, plus the offset I had read up to.
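
A minimal sketch of what such a checkpoint record could look like, not the code from the patch. The class name TailCheckpoint, the property keys and the properties-file format are all assumptions made for illustration. It sticks to java.io so it stays within Java 6:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Properties;

/** Hypothetical checkpoint record; names are illustrative, not taken from the patch. */
final class TailCheckpoint {
    final String watchedFile;       // path of the log being tailed
    final long offset;              // number of bytes already consumed
    final long lastRotationMillis;  // lastModified() of the most recent rotated file

    TailCheckpoint(String watchedFile, long offset, long lastRotationMillis) {
        this.watchedFile = watchedFile;
        this.offset = offset;
        this.lastRotationMillis = lastRotationMillis;
    }

    /** Writes the three values to 'target' as a small properties file. */
    void save(File target) throws IOException {
        Properties p = new Properties();
        p.setProperty("file", watchedFile);
        p.setProperty("offset", Long.toString(offset));
        p.setProperty("lastRotation", Long.toString(lastRotationMillis));
        OutputStream out = new FileOutputStream(target);
        try {
            p.store(out, "tail checkpoint");
        } finally {
            out.close();
        }
    }

    /** Reads back a checkpoint previously written by save(). */
    static TailCheckpoint load(File source) throws IOException {
        Properties p = new Properties();
        InputStream in = new FileInputStream(source);
        try {
            p.load(in);
        } finally {
            in.close();
        }
        return new TailCheckpoint(
            p.getProperty("file"),
            Long.parseLong(p.getProperty("offset")),
            Long.parseLong(p.getProperty("lastRotation")));
    }
}
```

The source would call save() after each batch it hands off and load() once at startup. Using the modification date instead of an inode is the trade-off described above: it works on Java 6 but can only tell one rotation apart.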
* What happens when multiple log files are rotated while Flume agent was
down? Does your implementation know how to:
1) read the last tailed file from where it stopped all the way to the end
2) read all files that were completely missed from beginning to the end
3) start tailing the "active" log file*
-I do case 1 and half of case 2. When it finishes reading the rotated file,
it starts reading the new file. So it can read XXX.log.1 and, when it finishes,
continue with XXX.log. Reading all the rotated files was outside our scope,
so XXX.log.2 is lost if it wasn't read.
If you want, you could do case 3 as well by manually editing XXX.checkpoint.
That is where I save the information about the file being watched, the last
offset read and the date of the last rotation.
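
The restart order described above could be sketched roughly as follows. This is an illustration under assumptions, not the actual source: the class and method names are hypothetical, and a rotation is assumed to be detected by comparing the rotated file's modification date with the one stored in the checkpoint.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical recovery step run once when the source starts. */
final class RotationRecovery {

    /** Appends every remaining line of 'file', starting at byte 'offset', to 'sink'. */
    static void readFrom(File file, long offset, List<String> sink) throws IOException {
        if (!file.exists()) {
            return; // nothing to read yet
        }
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        try {
            raf.seek(offset);
            String line;
            while ((line = raf.readLine()) != null) {
                sink.add(line);
            }
        } finally {
            raf.close();
        }
    }

    /**
     * Assumption: if the rotated file (e.g. XXX.log.1) has a modification date
     * different from the one in the checkpoint, the file we were tailing was
     * rotated while we were down, so the saved offset belongs to the rotated file.
     */
    static List<String> recover(File active, File rotated, long offset,
                                long lastRotationMillis) throws IOException {
        List<String> lines = new ArrayList<String>();
        boolean rotatedWhileDown =
            rotated.exists() && rotated.lastModified() != lastRotationMillis;
        if (rotatedWhileDown) {
            readFrom(rotated, offset, lines); // finish the file we were tailing
            readFrom(active, 0L, lines);      // then the new active file from the start
        } else {
            readFrom(active, offset, lines);  // no rotation: resume where we stopped
        }
        return lines;
    }
}
```

Only one rotated file is consulted, so anything that had already moved on to XXX.log.2 is not recovered, which matches the limitation described above.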
*Assuming yes, yes, and yes, can one configure:
A) if 3) should start happening right away (while 1) and 2) are happening
"in the background)
B) or whether 1), 2), and 3) should happen sequentially*
-Case B, with the restrictions I mentioned.
*The A) use case is very handy when the most recent data is much more
valuable than old data (e.g. performance metrics) and thus you'd rather
start sending new data first and backfill old data later (or in parallel).*
-The point is that I need the data sorted by date, and I assume that if our
server is down long enough for the files to rotate more than once, we have been
too careless. So I only tried to avoid losing data during a short gap, and to
solve the problem with tail on Linux. I really like your suggestion number 2;
I could think about it, but in the future.
*
Have you compared your approach+impl with
http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/Tailer.html?
http://grepcode.com/file/repo1.maven.org/maven2/commons-io/commons-io/2.4/org/apache/commons/io/input/Tailer.java
*
-As far as I can tell, it works like the Unix tail command, so it has the same
problem as a source that execs tail.
About the problem with tail on UNIX, I'm not completely sure because I didn't
check that code, but I read somewhere that there was a patch for that command,
for a case where it could lose data, which had not been applied in some
distributions, or something like that. I should look for more information about it.
On 10/03/14 20:50, Otis Gospodnetic wrote:
Hi,
On Mon, Mar 10, 2014 at 3:35 AM, gortiz <[email protected]> wrote:
Hi,
About the tailing, I was checking the code of tail on Linux and
there's some chance of losing data when the file rotates.
In case of Linux's own tail?
Plus, if Flume is stopped, there's no chance to recover the data that was
logged while it was down. I have implemented a checkpoint mechanism to
recover as much data as possible if this happens.
Right. Flume will miss any data that was logged while it was down because
Flume simply uses tail -F with ExecSource.
Your implementation remembers the last file (inode?) it tailed + position
in that file?
What happens when multiple log files are rotated while Flume agent was
down? Does your implementation know how to:
1) read the last tailed file from where it stopped all the way to the end
2) read all files that were completely missed from beginning to the end
3) start tailing the "active" log file
Assuming yes, yes, and yes, can one configure:
A) if 3) should start happening right away (while 1) and 2) are happening
"in the background)
B) or whether 1), 2), and 3) should happen sequentially
The A) use case is very handy when the most recent data is much more
valuable than old data (e.g. performance metrics) and thus you'd rather
start sending new data first and backfill old data later (or in parallel).
Have you compared your approach+impl with
http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/Tailer.html?
http://grepcode.com/file/repo1.maven.org/maven2/commons-io/commons-io/2.4/org/apache/commons/io/input/Tailer.java
Thanks,
Otis
I think that tailing in Flume is good enough if you're not worried about losing
any data, but I needed to improve this feature a little.
If you have more questions, let me know.
Guillermo Ortiz.
On 07/03/14 21:47, Otis Gospodnetic wrote:
Hi Guillermo,
I don't have the need for FLUME-2321, but maybe one of the devs can have a
look.
I am curious about that new tail source you mentioned, though. Can you
tell us more about what you are working on, how it is going to work, and
how it will be better than the tailer from Apache Commons and ExecSource
with tail -F ?
Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
On Sun, Mar 2, 2014 at 4:01 PM, Guillermo Ortiz <[email protected]> wrote:
Hi,
I implemented a new feature for Flume
(FLUME-2321 <https://issues.apache.org/jira/browse/FLUME-2321>).
I'd like to know what people think about it and what the process is for
getting a new feature accepted.
It's the first time I've contributed to an Apache project and I don't
really know how it works. Or maybe it's because nobody is interested in it,
hehe.
On the other hand, I'm coding a new "tail" source, and I don't want to make
the same mistakes in the future.
Thank you,
Guillermo Ortiz.
--
*Guillermo Ortiz*
/Big Data Developer/
Telf.: +34 917 680 490
Fax: +34 913 833 301
C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
_http://www.bidoop.es_