Re: What do people think about the patch FLUME-2321?

gortiz Wed, 12 Mar 2014 00:24:24 -0700

Hi,

A patch for this has been available since yesterday: FLUME-2344-0.patch.

If I upgrade the Java version, people on Java 6 couldn't use Flume, and there are a lot of people on Java 6.

It would be possible to read old log files based on their dates; if the patch is accepted, I'll think about that small improvement.


Thank you.

On 11/03/14 19:01, Otis Gospodnetic wrote:

Hi,

On Tue, Mar 11, 2014 at 4:05 AM, gortiz <[email protected]> wrote:

Right.  Flume will miss any data that was logged while it was down because
Flume simply uses tail -F with ExecSource.

*Your implementation remembers the last file (inode?) it tailed + position
in that file?*
-It remembers only the last rotation; if the file rotates more than once,
you'll lose data. I couldn't use inodes because that requires Java 7 and Flume
is developed in Java 6, so I save the last-modification date to track
the last rotated file, plus the offset up to which I last read.
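The "offset up to which I last read" part can be sketched in Java 6 compatible code as below. This is only an illustration, not the code from the patch: the class name `OffsetReader`, the method `readFrom`, and the rule of restarting from 0 when the file is shorter than the saved offset are all assumptions.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper: resumes reading a log file from a saved byte offset.
class OffsetReader {
    // Appends every line found after 'offset' to 'sink' and returns the new offset.
    // Assumption: if the file is shorter than the saved offset, it was replaced
    // or truncated, so reading restarts from the beginning.
    static long readFrom(String path, long offset, StringBuilder sink) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(path, "r");
        try {
            long start = offset <= raf.length() ? offset : 0L;
            raf.seek(start);
            String line;
            // Note: RandomAccessFile.readLine maps each byte to one char,
            // so this sketch is only safe for single-byte encodings.
            while ((line = raf.readLine()) != null) {
                sink.append(line).append('\n');
            }
            // The position after the last byte read is what a checkpoint would store.
            return raf.getFilePointer();
        } finally {
            raf.close();
        }
    }
}
```

The returned value would be written to the checkpoint after each batch, so a restart loses at most the lines read since the last save.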

Is Java 6 support really necessary?  Java 6 is EOL.  Java 8 release is
imminent, I believe.
If tracking inode vs. last mod date brings in some advantage, I'd consider
switching to that.
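For comparison, this is roughly what inode-style tracking looks like with the Java 7 NIO API. `BasicFileAttributes.fileKey()` is a real JDK method; on typical Unix file systems the key is derived from device and inode, but it may be null on platforms that do not support it. The class `FileIdentity` and its methods are hypothetical names used only for this sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper: identifies a file independently of its name (Java 7+).
class FileIdentity {
    // Returns the platform file key, or null when the file system has none.
    static Object fileKey(Path path) throws IOException {
        return Files.readAttributes(path, BasicFileAttributes.class).fileKey();
    }

    // True when 'path' still refers to the file whose key was saved earlier,
    // for example after XXX.log has been renamed to XXX.log.1.
    static boolean sameFile(Object savedKey, Path path) throws IOException {
        return savedKey != null && savedKey.equals(fileKey(path));
    }
}
```

The advantage over a last-modified date is that a rename during rotation does not change the key, so the tailed file can be found again under its new name.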

*  What happens when multiple log files are rotated while Flume agent was
down?  Does your implementation know how to:
1) read the last tailed file from where it stopped all the way to the end
2) read all files that were completely missed from beginning to the end
3) start tailing the "active" log file*

-I do case 1 and half of case 2. When it finishes reading the rotated
file, it starts reading the new file. So it can read XXX.log.1 and, when
that ends, continue with XXX.log. Reading all the rotated files was outside
our scope, so XXX.log.2 is lost if it wasn't read.
If you want, you can do case 3 as well by manually editing
XXX.checkpoint. There I save the file being watched, the last offset read,
and the date of the last rotation.
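A checkpoint holding those three values could look something like the sketch below. The on-disk format of the real XXX.checkpoint is not described in this thread, so the properties layout, the key names, and the class `TailCheckpoint` are assumptions made for illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

// Hypothetical checkpoint record; the key names are illustrative only.
class TailCheckpoint {
    final String path;        // file being watched
    final long offset;        // last byte offset read
    final long lastRotation;  // last-modified time (ms) of the last rotation seen

    TailCheckpoint(String path, long offset, long lastRotation) {
        this.path = path;
        this.offset = offset;
        this.lastRotation = lastRotation;
    }

    // Writes the three values as a small, hand-editable properties file.
    void save(File target) throws IOException {
        Properties p = new Properties();
        p.setProperty("path", path);
        p.setProperty("offset", Long.toString(offset));
        p.setProperty("lastRotation", Long.toString(lastRotation));
        FileOutputStream out = new FileOutputStream(target);
        try {
            p.store(out, "tail checkpoint");
        } finally {
            out.close();
        }
    }

    // Reads the values back, for example when the agent restarts.
    static TailCheckpoint load(File source) throws IOException {
        Properties p = new Properties();
        FileInputStream in = new FileInputStream(source);
        try {
            p.load(in);
        } finally {
            in.close();
        }
        return new TailCheckpoint(
                p.getProperty("path"),
                Long.parseLong(p.getProperty("offset")),
                Long.parseLong(p.getProperty("lastRotation")));
    }
}
```

A plain-text format keeps the manual edit mentioned above practical: pointing the checkpoint at the active file with a chosen offset is a one-line change.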

If you save the date when you last read from a file, then I think you could
just finish reading the file you were last reading and then look at
all other files and read all files with a newer last mod date, beginning to
end. No?
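That suggestion can be sketched as follows: given the candidate log files and the saved timestamp, pick the ones modified after it and order them oldest first so events stay roughly in date order. `MissedFiles` and `newerThan` are hypothetical names, and the sketch assumes last-modified times are reliable enough to order the rotated files, which may not hold on file systems with coarse timestamp granularity.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: finds files that changed after the last checkpoint.
class MissedFiles {
    // Returns the regular files modified after 'lastReadMillis', oldest first.
    static List<File> newerThan(File[] candidates, long lastReadMillis) {
        List<File> result = new ArrayList<File>();
        for (File f : candidates) {
            if (f.isFile() && f.lastModified() > lastReadMillis) {
                result.add(f);
            }
        }
        // Sort by modification time so older rotated files are replayed first.
        Collections.sort(result, new Comparator<File>() {
            public int compare(File a, File b) {
                long x = a.lastModified();
                long y = b.lastModified();
                return x < y ? -1 : (x > y ? 1 : 0);
            }
        });
        return result;
    }
}
```

Each returned file would then be read from beginning to end, after the partially read file has been finished from its saved offset.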

*Assuming yes, yes, and yes, can one configure:

A) if 3) should start happening right away (while 1) and 2) are happening
"in the background")
B) or whether 1), 2), and 3) should happen sequentially*
-Case B, with the restrictions I mentioned.


  *The A) use case is very handy when the most recent data is much more
valuable than old data (e.g. performance metrics) and thus you'd rather
start sending new data first and backfill old data later (or in parallel).*
-The point is that I need to get the data sorted by date, and I assume that
if our server is down long enough for the files to rotate more than once,
we have been too careless. So I just tried to avoid losing data during
short outages and to work around the problem with tail on Linux. I really
like your suggestion number 2; I could think about it, but in the future.

+1 for the simple version now and more sophisticated later.

*Have you compared your approach+impl with
http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/Tailer.html?
http://grepcode.com/file/repo1.maven.org/maven2/commons-io/commons-io/2.4/org/apache/commons/io/input/Tailer.java*
-They really call the Unix tail command, so it has the same problem as
the Source that execs a tail.

Ahaaa, I did not realize this!

About the problem with tail on UNIX, I'm not entirely sure because I
didn't check that code, but I read somewhere that there was a patch for
that command, for a case where it could lose data, which some
distributions hadn't applied, or something like that. I should look for
more information about it.

I don't know anything about this....

Is this project available anywhere?
Would it make sense to add a new Flume Source that uses it?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



On 10/03/14 20:50, Otis Gospodnetic wrote:

Hi,

On Mon, Mar 10, 2014 at 3:35 AM, gortiz <[email protected]> wrote:

  Hi,

About the tailing, I was checking the code of tail on Linux and
there's some chance of losing data when the file rotates.

  In case of Linux's own tail?


  Plus, if Flume is stopped, there's no chance to recover the data logged
while it isn't running. I have implemented a checkpoint mechanism to
recover as much data as possible if this happens.

  Right.  Flume will miss any data that was logged while it was down because
Flume simply uses tail -F with ExecSource.

Your implementation remembers the last file (inode?) it tailed + position
in that file?

What happens when multiple log files are rotated while Flume agent was
down?  Does your implementation know how to:
1) read the last tailed file from where it stopped all the way to the end
2) read all files that were completely missed from beginning to the end
3) start tailing the "active" log file

Assuming yes, yes, and yes, can one configure:
A) if 3) should start happening right away (while 1) and 2) are happening
"in the background")
B) or whether 1), 2), and 3) should happen sequentially

The A) use case is very handy when the most recent data is much more
valuable than old data (e.g. performance metrics) and thus you'd rather
start sending new data first and backfill old data later (or in parallel).

Have you compared your approach+impl with
http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/Tailer.html?

http://grepcode.com/file/repo1.maven.org/maven2/commons-io/commons-io/2.4/org/apache/commons/io/input/Tailer.java

Thanks,
Otis



  I think that tailing in Flume is good enough if you're not worried about
losing any data, but I needed to improve this feature a little bit.

If you have more questions, let me know.

Guillermo Ortiz.

On 07/03/14 21:47, Otis Gospodnetic wrote:

  Hi Guillermo,

I don't have the need for FLUME-2321, but maybe one of the devs can have a
look.

I am curious about that new tail source you mentioned, though.  Can you
tell us more about what you are working on, how it is going to work, and
how it will be better than the tailer from Apache Commons and ExecSource
with tail -F ?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Sun, Mar 2, 2014 at 4:01 PM, Guillermo Ortiz <[email protected]> wrote:

Hi,

I did a new feature for Flume
(FLUME-2321<https://issues.apache.org/jira/browse/FLUME-2321>).
I'd like to know what people think about it and what the process is for
getting a new feature accepted.
It's the first time that I've contributed to an Apache project and I don't
really know how it works. Or maybe it's because nobody is interested in it,
hehe.

On the other hand, I'm coding a new "tail" source, and I don't want to
make the same mistakes in the future.

Thank you,

Guillermo Ortiz.


  --

*Guillermo Ortiz*
/Big Data Developer/

Telf.: +34 917 680 490
Fax: +34 913 833 301
C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain

_http://www.bidoop.es_

