Resending my query below... it didn't seem to post the first time. To make the question more concrete, I have also added a few rough sketches of what I have in mind after the quoted mail. Thanks,
Tom

On Apr 8, 2012 11:37 AM, "Tom Ferguson" <tomfergu...@gmail.com> wrote:
> Hello,
>
> I'm very new to Hadoop and I am trying to carry out a proof of concept
> for processing some trading data. I am from a .NET background, so I am
> trying to prove whether it can be done primarily using C#; therefore I am
> looking at a Hadoop Streaming job (from the Hadoop examples) to call into
> some C# executables.
>
> My problem is that I am not certain of the best way to structure my jobs
> to process the data in the way I want.
>
> I have data stored in an RDBMS in the following format:
>
> ID   TradeID   Date         Value
> ---------------------------------
> 1    1         2012-01-01   12.34
> 2    1         2012-01-02   12.56
> 3    1         2012-01-03   13.78
> 4    2         2012-01-04   18.94
> 5    2         2012-05-17   19.32
> 6    2         2012-05-18   19.63
> 7    3         2012-05-19   17.32
>
> What I want to do is pass all the Dates & Values for a given TradeID into
> a mathematical function that will spit out the same set of Dates but with
> all the Values recalculated. I hope that makes sense, e.g.
>
> Date         Value
> ------------------
> 2012-01-01   12.34
> 2012-01-02   12.56
> 2012-01-03   13.78
>
> will have the mathematical function applied and spit out
>
> Date         Value
> ------------------
> 2012-01-01   28.74
> 2012-01-02   31.29
> 2012-01-03   29.93
>
> I am not exactly sure how to achieve this using Hadoop Streaming, but my
> thoughts so far are:
>
> 1. Use Sqoop to take the data out of the RDBMS and into HDFS, split by
>    TradeID - will this guarantee that all the data points for a given
>    TradeID will be processed by the same Map task?
> 2. Write a Map task as a C# executable that will stream data in in the
>    format (ID, TradeID, Date, Value).
> 3. Gather all the data points for a given TradeID together into an
>    array (or other data structure).
> 4. Pass the array into the mathematical function.
> 5. Get the results back as another array.
> 6. Stream the results back out in the format (TradeID, Date,
>    ResultValue).
>
> I will have around 500,000 TradeIDs, with up to 3,000 data points each,
> so I am hoping that the data/processing will be distributed appropriately
> by Hadoop.
>
> Now, this seems a little bit long-winded, but is this the best way of
> doing it, given the constraint of having to use C# for writing my tasks?
> In the example above I do not have a Reduce job at all. Is that right in
> my scenario?
>
> Thanks for any help you can give, and apologies if I am asking stupid
> questions here!
>
> Kind Regards,
>
> Tom
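For step 1, I was picturing a Sqoop import along these lines (the connection string, table name and directory are made up, so treat this only as a sketch):

    sqoop import \
        --connect jdbc:mysql://dbhost/tradesdb \
        --table Trades \
        --split-by TradeID \
        --target-dir /trades/input

My (possibly wrong) understanding is that --split-by only controls how Sqoop parallelises the import itself, so I suspect it would not on its own guarantee that one Map task later sees all the rows for a given TradeID.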
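For steps 2-6, below is a rough, untested sketch of the Map executable I have in mind. It assumes Sqoop has written tab-separated lines in the order (ID, TradeID, Date, Value), that all the rows for one TradeID reach the mapper contiguously (the very thing I am unsure about in step 1), and Recalculate() is just a placeholder for my real mathematical function:

    // TradeMapper.cs - sketch of a Hadoop Streaming map task in C#.
    // Reads tab-separated rows (ID, TradeID, Date, Value) from stdin,
    // groups consecutive rows by TradeID, applies the function to each
    // group and writes (TradeID, Date, ResultValue) to stdout.
    using System;
    using System.Collections.Generic;
    using System.Globalization;

    class TradeMapper
    {
        static void Main()
        {
            string currentTradeId = null;
            var dates = new List<string>();
            var values = new List<double>();

            string line;
            while ((line = Console.ReadLine()) != null)
            {
                var fields = line.Split('\t');
                if (fields.Length < 4) continue;   // skip malformed rows

                string tradeId = fields[1];
                if (currentTradeId != null && tradeId != currentTradeId)
                {
                    Emit(currentTradeId, dates, values);  // finished one trade
                    dates.Clear();
                    values.Clear();
                }
                currentTradeId = tradeId;
                dates.Add(fields[2]);
                values.Add(double.Parse(fields[3], CultureInfo.InvariantCulture));
            }
            if (currentTradeId != null)
                Emit(currentTradeId, dates, values);      // flush the last trade
        }

        static void Emit(string tradeId, List<string> dates, List<double> values)
        {
            double[] results = Recalculate(values.ToArray());
            for (int i = 0; i < dates.Count; i++)
                Console.WriteLine("{0}\t{1}\t{2}", tradeId, dates[i],
                    results[i].ToString(CultureInfo.InvariantCulture));
        }

        // Placeholder for the real mathematical function - identity for now.
        static double[] Recalculate(double[] values)
        {
            return values;
        }
    }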
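Finally, I assume the job itself would be launched with something like the command below (again untested - the streaming jar path varies by Hadoop version, and I believe I would need Mono to run a .NET executable on a Linux cluster). I have set mapred.reduce.tasks=0 on the assumption that this is a map-only job, which ties back to my question about whether I need a Reduce step at all:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -D mapred.reduce.tasks=0 \
        -input /trades/input \
        -output /trades/output \
        -mapper "mono TradeMapper.exe" \
        -file TradeMapper.exe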