Yes, I agree fully. Tailing is a useful mechanism, but since we also have to 
deliver on time and reliably, the core team decided to remove that feature. In 
your case tailing makes sense, but not in a session-based application (banking, 
travel, car rental, pizza delivery and so on), where a single lost token or 
session can do harm.

In Flume NG another source is implemented for this, the exec source. With it 
you can easily run a tail command, but then it is up to you to make sure 
everything runs reliably. I would point new users to Flume NG, because Flume 
and Flume NG are not compatible; Flume NG is a complete rewrite. I think once 
Flume NG releases its next milestone, support for the old Flume will slowly 
wind down.
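
For illustration, a minimal Flume NG configuration using the exec source to 
run a tail could look like this (agent, channel, and path names are 
placeholders, and the caveat above still applies: events produced while the 
agent is down are lost):

```properties
# Hypothetical agent "a1"; adjust names and paths for your setup.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# exec source wrapping a tail; no offset marker is kept across restarts
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/events.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
a1.sinks.k1.channel = c1
```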

best, and thanks for the discussion,
 Alex 

--
Alexander Lorenz
http://mapredit.blogspot.com

On Feb 7, 2012, at 1:16 PM, Michal Taborsky wrote:

> Hi Alex,
> 
> truth be told, I am quite satisfied with the file tailing, and I'll try to 
> explain why I like it. The main reason is that, at least for us, the web 
> application itself is business critical, while the event collection is not. 
> Writing to a plain file is something that can rarely fail, and if it fails, 
> it fails quickly and in a controlled fashion. But piping to a Flume agent, 
> for example? How sure can I be that the write will work all the time or fail 
> immediately? That it will not wait for some timeout or other? Or throw some 
> unexpected error and bring down the app?
> 
> The other aspect is simple development and debugging. Any developer can read 
> a plain file and check whether the data he's writing is correct, but with 
> any more sophisticated method you either need a more complicated testing 
> environment or redirection switches that write to files in development and 
> to Flume in testing and production, which complicates things.
> 
> --
> Michal Táborský
> chief systems architect
> Netretail Holding, BV
> nrholding.com
> 
> 
> 
> 
> 2012/2/7 alo alt <[email protected]>
> Hi,
> 
> sorry for pitching in, but Flume NG will not support tailing sources, 
> because we had a lot of problems with them. The first, and worst, problem is 
> the marker (the offset) in the tailed file. If the agent, the server, or the 
> collector crashes, the marker is lost, so after a restart you get all the 
> events delivered again. Sure, you can use append mode instead, but then you 
> lose events.
> 
> For an easy migration from Flume to Flume NG, use sources which are 
> supported in NG, syslog for example. More supported sources are listed here: 
> https://cwiki.apache.org/FLUME/flume-ng.html
> 
> You could use Avro for the sessions, and you could pipe directly to a local 
> Flume agent. Syslog with a buffering mode could also work. Flume NG now also 
> has an HBase handler and Thrift support.
> Another idea for collecting sessions could be WebHDFS, a REST API for HDFS: 
> http://hadoop.apache.org/common/docs/r1.0.0/webhdfs.html
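> 
> As an illustration only (hostnames, ports, and path below are placeholders), 
> writing a file over WebHDFS is a two-step exchange, because the NameNode 
> first answers with a 307 redirect to a DataNode:

```
# Step 1: ask the NameNode where to write (returns 307 with a Location header)
curl -i -X PUT "http://namenode:50070/webhdfs/v1/events/sessions.log?op=CREATE"

# Step 2: send the payload to the DataNode URL from that Location header
curl -i -X PUT -T sessions.log "<Location-URL-from-step-1>"
```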
> 
> - Alex
> 
> 
> --
> Alexander Lorenz
> http://mapredit.blogspot.com
> 
> On Feb 7, 2012, at 11:14 AM, Alain RODRIGUEZ wrote:
> 
> > Thank you for your answer; it helps a lot to know I'm doing things the 
> > right way.
> >
> > I've got another question: how do you manage restarting the service after 
> > a crash? You tail the log file, so if your server crashes or you stop the 
> > tail for any reason, how do you avoid re-tailing all the logs from the 
> > start? How do you restart from the exact point where you left your tail 
> > process?
> >
> > Thanks again for your help, I really appreciate it :-).
> >
> > Alain
> >
> > 2012/2/2 Michal Taborsky <[email protected]>
> > Hello Alain,
> >
> > we are using Flume for probably the same purposes. We are writing JSON 
> > encoded event data to flat file on every application server. Since each 
> > application server writes only maybe tens of events per second, the 
> > performance hit of writing to disk is negligible (and the events are 
> > written to disk only after the content is generated and sent to the user, 
> > so there is no latency for the end user). This file is tailed by Flume and 
> > delivered through collectors to HDFS. The collectors fork the events to 
> > RabbitMQ as well. We have a Node.js application, that picks up these events 
> > and does some real-time analytics on them. The delay between event 
> > origination and analytics is below 10 seconds, usually 1-3 seconds in total.
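> >
> > As a minimal sketch of the flat-file logging described above (field names 
> > and file path are made up, not our actual format):

```python
import json
import time

def log_event(event_type, payload, path="events.log"):
    """Append one JSON-encoded event per line to a flat file.

    Field names are illustrative; the file would then be tailed by
    Flume and shipped to HDFS and RabbitMQ downstream.
    """
    record = {"ts": time.time(), "type": event_type, "data": payload}
    # Append-only writes rarely fail, and when they do they fail fast.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_event("page_view", {"session": "abc123", "url": "/checkout"})
```

> > Any developer can then read the file directly, and a plain tail -F is the 
> > whole integration point.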
> >
> > Hope this helps.
> >
> > --
> > Michal Táborský
> > chief systems architect
> > Netretail Holding, BV
> > nrholding.com
> >
> >
> >
> >
> > 2012/2/2 Alain RODRIGUEZ <[email protected]>
> > Hi,
> >
> > I'm new to Flume and I'd like to use it to get a stable flow of data into 
> > my database (to be able to handle rush hours by delaying the database 
> > writes, without introducing any timeout or latency for the user).
> >
> > My questions are :
> >
> > What is the best way to create the log file that will be used as the 
> > source for Flume?
> >
> > Our production environment runs Apache servers and PHP scripts.
> > I can't just use the access log, because some information is stored in the 
> > session, so I need to build a custom source.
> > Another point is that writing a file seems primitive and not very 
> > efficient, since it hits the disk instead of memory for every event I 
> > store (many events every second).
> >
> > How can I use this system (as Facebook does with Scribe) to do real-time 
> > analytics?
> >
> > I'm open to hearing about HDFS, HBase or whatever could help me reach my 
> > goals, which are a stable flow to the database and near real-time 
> > analytics (seconds to minutes).
> >
> > Thanks for your help.
> >
> > Alain
> >
> >
> 
> 
