Ah, so a source would make more sense than a decorator then. I see... We are definitely open to the idea of contributing back, and this is indeed probably something that a lot of people could use...
We are still evaluating what we will do, as we have a lot of stuff going on at once, but if we do decide to develop a custom source, then I'll do as you suggest and open a JIRA issue. Thanks for your time and info :) -- Felix On Tue, Aug 9, 2011 at 4:06 PM, Eric Sammer <[email protected]> wrote: > Felix: > > You definitely need to implement a custom source that knows how to > read the bin logs and pack each transaction into an event rather than > just tailing it. This will give you a discreet event for each > transaction that can be treated as a single unit and make downstream > processing MUCH easier. > > Things to keep in mind: > * Flume does NOT guarantee order so make sure each event has a > timestamp or transaction ID that you can order by. > * Flume does NOT guarantee that you won't get duplicates so make sure > you have a globally unique transaction ID so you can deduplicate > transactions. > > This would be interesting functionality to get back into Flume. If you > can / want to contribute it back in the form of a custom source, feel > free to open a JIRA so others can help / watch progress. > > Thanks! > > On Tue, Aug 9, 2011 at 11:42 AM, Felix Giguere Villegas > <[email protected]> wrote: > > Hi :) ! > > > > I have a use case where I want to keep a historical record of all the > > changes (insert/update/delete) happening on a MySQL DB. > > > > I am able to tail the bin logs and record them in HDFS, but they are not > > easy to parse because one operation is split on many lines. There are > some > > comments that include the timestamp, the total time it took to execute > the > > query and other stuff. A lot of this extra info is not relevant, but the > > timestamp is important for me, and I thought I might as well keep the > rest > > of the info as well since the raw data gives me the option of going back > to > > look for these other fields if I determine later on that I need them. > > > > Now, the fact that it's split over many lines makes it harder to use with > > Map/Reduce. > > > > I have thought of using a custom M/R RecordReader but I still have the > > problem that some of the lines related to one operation will be at the > end > > of one HDFS file and the rest will be at the beginning of the next HDFS > > file, since I am opening and closing those files at an arbitrary roll > time. > > > > I think the easiest way would be to do some minimal ETL at the source. I > > think I could use a custom decorator for this. Basically, that decorator > > would group together on a single line all the bin log lines that relate > to a > > single DB operation. The original lines would be separated by semi-colons > or > > some other character in the final output. > > > > I wanted to check with you guys to see if that approach made sense. If > you > > have better suggestions, then I'm all ears, of course. Also, if you think > > there is an easier way than reading the bin logs to accomplish my > original > > goal, then I'd like to hear about it as well :) > > > > Thanks :) ! > > > > -- > > Felix > > > > > > > > -- > Eric Sammer > twitter: esammer > data: www.cloudera.com >
