Felix:

You definitely need to implement a custom source that knows how to read the bin logs and pack each transaction into an event rather than just tailing them. This will give you a discrete event for each transaction that can be treated as a single unit and make downstream processing MUCH easier.
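A rough sketch of the grouping step, in Python for brevity (a real Flume source would be written in Java against Flume's source API, which is not shown here). The function name `group_transactions`, the `end_marker` parameter, and the simplified sample lines are all hypothetical; the sketch assumes the decoded bin log text ends each transaction with a line starting with COMMIT.

```python
from typing import Iterable, Iterator, List


def group_transactions(lines: Iterable[str],
                       end_marker: str = "COMMIT") -> Iterator[List[str]]:
    """Yield one list of lines per completed transaction.

    Assumption: a transaction ends at a line starting with `end_marker`.
    Lines seen before that marker (comments, BEGIN, statements) are kept
    together so the raw details stay available downstream.
    """
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        buffer.append(line)
        if line.lstrip().startswith(end_marker):
            yield buffer
            buffer = []
    # Anything left in `buffer` is an incomplete transaction. It is not
    # emitted here; a real source would hold it until the rest arrives.


# Simplified, made-up input; real bin log output carries more detail.
sample = [
    "# at 100",
    "BEGIN",
    "INSERT INTO t VALUES (1)",
    "COMMIT",
    "# at 200",
    "BEGIN",
    "UPDATE t SET a = 2",
    "COMMIT",
    "# at 300",
    "BEGIN",
]

events = list(group_transactions(sample))
```

Because an event is only emitted once its end marker has been seen, a transaction is never split across two events, which in turn avoids the problem of one operation straddling two HDFS files.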
Things to keep in mind:

* Flume does NOT guarantee order so make sure each event has a timestamp or transaction ID that you can order by.
* Flume does NOT guarantee that you won't get duplicates so make sure you have a globally unique transaction ID so you can deduplicate transactions.

This would be interesting functionality to get back into Flume. If you can / want to contribute it back in the form of a custom source, feel free to open a JIRA so others can help / watch progress.

Thanks!

On Tue, Aug 9, 2011 at 11:42 AM, Felix Giguere Villegas <[email protected]> wrote:

> Hi :) !
>
> I have a use case where I want to keep a historical record of all the
> changes (insert/update/delete) happening on a MySQL DB.
>
> I am able to tail the bin logs and record them in HDFS, but they are not
> easy to parse because one operation is split on many lines. There are some
> comments that include the timestamp, the total time it took to execute the
> query and other stuff. A lot of this extra info is not relevant, but the
> timestamp is important for me, and I thought I might as well keep the rest
> of the info as well since the raw data gives me the option of going back to
> look for these other fields if I determine later on that I need them.
>
> Now, the fact that it's split over many lines makes it harder to use with
> Map/Reduce.
>
> I have thought of using a custom M/R RecordReader but I still have the
> problem that some of the lines related to one operation will be at the end
> of one HDFS file and the rest will be at the beginning of the next HDFS
> file, since I am opening and closing those files at an arbitrary roll time.
>
> I think the easiest way would be to do some minimal ETL at the source. I
> think I could use a custom decorator for this. Basically, that decorator
> would group together on a single line all the bin log lines that relate to a
> single DB operation. The original lines would be separated by semi-colons or
> some other character in the final output.
>
> I wanted to check with you guys to see if that approach made sense. If you
> have better suggestions, then I'm all ears, of course. Also, if you think
> there is an easier way than reading the bin logs to accomplish my original
> goal, then I'd like to hear about it as well :)
>
> Thanks :) !
>
> --
> Felix

--
Eric Sammer
twitter: esammer
data: www.cloudera.com
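For the ordering and de-duplication points in the reply, a minimal downstream sketch in Python. The `TxEvent` record, its field names, and `dedupe_and_order` are hypothetical; it assumes every event carries a globally unique transaction ID and a timestamp, as suggested in the reply.

```python
from typing import Dict, Iterable, List, NamedTuple


class TxEvent(NamedTuple):
    tx_id: str       # assumed globally unique per transaction
    timestamp: int   # assumed to come from the bin log entry
    body: str


def dedupe_and_order(events: Iterable[TxEvent]) -> List[TxEvent]:
    """Keep the first copy of each transaction, then sort by time."""
    first_seen: Dict[str, TxEvent] = {}
    for ev in events:
        first_seen.setdefault(ev.tx_id, ev)  # later duplicates are ignored
    # tx_id breaks ties between events that share a timestamp
    return sorted(first_seen.values(), key=lambda ev: (ev.timestamp, ev.tx_id))


# Events as they might arrive: out of order and with one duplicate.
received = [
    TxEvent("tx-2", 20, "UPDATE ..."),
    TxEvent("tx-1", 10, "INSERT ..."),
    TxEvent("tx-2", 20, "UPDATE ..."),
]

ordered = dedupe_and_order(received)
```

This holds everything in memory, so it only illustrates the idea; on real volumes the same keying (group by transaction ID, sort by timestamp) would more likely be done in a Map/Reduce job.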
