Yes I'm planning to do that once we get something a little more concrete working from end to end.
I'll let you guys know for sure :) -- Felix On Thu, Aug 11, 2011 at 7:25 PM, Jonathan Hsieh <[email protected]> wrote: > Good to hear. I think it would be interesting if you wrote up a little > blog post about your usecase. Sounds interesting! > > Jon. > > > On Thu, Aug 11, 2011 at 2:00 PM, Felix Giguere Villegas < > [email protected]> wrote: > >> Quick update. >> >> Sometimes the simplest solutions are the best hehe... We decided to just >> use a perl script to modify the data as it's being outputted by mysqlbinlog >> >> perl -e 'while (<STDIN>) {s/\r|\n|\r\n/ /g;s/# at/\n# at/g;print;}' >> >> We write this modified binlog to a file and tail it using flume. This way, >> mysql events are grouped per line, and even if they get to their destination >> in the wrong order, it's easy to sort them correctly downstream :) >> >> Eventually, we will probably wrap it all up in a script and use the >> execStream >> Flume source. This way, we won't need to write an intermediate/temporary >> file to disk, which will allow us to have greater disk IO throughput for DFO >> or E2E reliability... >> >> -- >> Felix >> >> >> >> >> On Tue, Aug 9, 2011 at 4:34 PM, Felix Giguere Villegas < >> [email protected]> wrote: >> >>> Ah, so a source would make more sense than a decorator then. I see... >>> >>> We are definitely open to the idea of contributing back, and this is >>> indeed probably something that a lot of people could use... >>> >>> We are still evaluating what we will do, as we have a lot of stuff going >>> on at once, but if we do decide to develop a custom source, then I'll do as >>> you suggest and open a JIRA issue. >>> >>> Thanks for your time and info :) >>> >>> -- >>> Felix >>> >>> >>> >>> >>> On Tue, Aug 9, 2011 at 4:06 PM, Eric Sammer <[email protected]>wrote: >>> >>>> Felix: >>>> >>>> You definitely need to implement a custom source that knows how to >>>> read the bin logs and pack each transaction into an event rather than >>>> just tailing it. This will give you a discreet event for each >>>> transaction that can be treated as a single unit and make downstream >>>> processing MUCH easier. >>>> >>>> Things to keep in mind: >>>> * Flume does NOT guarantee order so make sure each event has a >>>> timestamp or transaction ID that you can order by. >>>> * Flume does NOT guarantee that you won't get duplicates so make sure >>>> you have a globally unique transaction ID so you can deduplicate >>>> transactions. >>>> >>>> This would be interesting functionality to get back into Flume. If you >>>> can / want to contribute it back in the form of a custom source, feel >>>> free to open a JIRA so others can help / watch progress. >>>> >>>> Thanks! >>>> >>>> On Tue, Aug 9, 2011 at 11:42 AM, Felix Giguere Villegas >>>> <[email protected]> wrote: >>>> > Hi :) ! >>>> > >>>> > I have a use case where I want to keep a historical record of all the >>>> > changes (insert/update/delete) happening on a MySQL DB. >>>> > >>>> > I am able to tail the bin logs and record them in HDFS, but they are >>>> not >>>> > easy to parse because one operation is split on many lines. There are >>>> some >>>> > comments that include the timestamp, the total time it took to execute >>>> the >>>> > query and other stuff. A lot of this extra info is not relevant, but >>>> the >>>> > timestamp is important for me, and I thought I might as well keep the >>>> rest >>>> > of the info as well since the raw data gives me the option of going >>>> back to >>>> > look for these other fields if I determine later on that I need them. >>>> > >>>> > Now, the fact that it's split over many lines makes it harder to use >>>> with >>>> > Map/Reduce. >>>> > >>>> > I have thought of using a custom M/R RecordReader but I still have the >>>> > problem that some of the lines related to one operation will be at the >>>> end >>>> > of one HDFS file and the rest will be at the beginning of the next >>>> HDFS >>>> > file, since I am opening and closing those files at an arbitrary roll >>>> time. >>>> > >>>> > I think the easiest way would be to do some minimal ETL at the source. >>>> I >>>> > think I could use a custom decorator for this. Basically, that >>>> decorator >>>> > would group together on a single line all the bin log lines that >>>> relate to a >>>> > single DB operation. The original lines would be separated by >>>> semi-colons or >>>> > some other character in the final output. >>>> > >>>> > I wanted to check with you guys to see if that approach made sense. If >>>> you >>>> > have better suggestions, then I'm all ears, of course. Also, if you >>>> think >>>> > there is an easier way than reading the bin logs to accomplish my >>>> original >>>> > goal, then I'd like to hear about it as well :) >>>> > >>>> > Thanks :) ! >>>> > >>>> > -- >>>> > Felix >>>> > >>>> > >>>> >>>> >>>> >>>> -- >>>> Eric Sammer >>>> twitter: esammer >>>> data: www.cloudera.com >>>> >>> >>> >> > > > -- > // Jonathan Hsieh (shay) > // Software Engineer, Cloudera > // [email protected] > > >
