Hi :) ! I have a use case where I want to keep a historical record of all the changes (insert/update/delete) happening on a MySQL DB.
I am able to tail the bin logs and record them in HDFS, but they are not easy to parse because one operation is split on many lines. There are some comments that include the timestamp, the total time it took to execute the query and other stuff. A lot of this extra info is not relevant, but the timestamp is important for me, and I thought I might as well keep the rest of the info as well since the raw data gives me the option of going back to look for these other fields if I determine later on that I need them. Now, the fact that it's split over many lines makes it harder to use with Map/Reduce. I have thought of using a custom M/R RecordReader but I still have the problem that some of the lines related to one operation will be at the end of one HDFS file and the rest will be at the beginning of the next HDFS file, since I am opening and closing those files at an arbitrary roll time. I think the easiest way would be to do some minimal ETL at the source. I think I could use a custom decorator for this. Basically, that decorator would group together on a single line all the bin log lines that relate to a single DB operation. The original lines would be separated by semi-colons or some other character in the final output. I wanted to check with you guys to see if that approach made sense. If you have better suggestions, then I'm all ears, of course. Also, if you think there is an easier way than reading the bin logs to accomplish my original goal, then I'd like to hear about it as well :) Thanks :) ! -- Felix
