Resending my query below... it didn't seem to post the first time. To make the question more concrete, I have also added a few rough sketches of what I have in mind after the quoted mail. Thanks,
Tom

On Apr 8, 2012 11:37 AM, "Tom Ferguson" <tomfergu...@gmail.com> wrote:
> Hello,
>
> I'm very new to Hadoop and I am trying to carry out a proof of concept
> for processing some trading data. I am from a .NET background, so I am
> trying to prove whether it can be done primarily using C#; therefore I am
> looking at a Hadoop Streaming job (from the Hadoop examples) to call into
> some C# executables.
>
> My problem is that I am not certain of the best way to structure my jobs
> to process the data in the way I want.
>
> I have data stored in an RDBMS in the following format:
>
> ID   TradeID   Date         Value
> ---------------------------------
> 1    1         2012-01-01   12.34
> 2    1         2012-01-02   12.56
> 3    1         2012-01-03   13.78
> 4    2         2012-01-04   18.94
> 5    2         2012-05-17   19.32
> 6    2         2012-05-18   19.63
> 7    3         2012-05-19   17.32
>
> What I want to do is pass all the Dates & Values for a given TradeID into
> a mathematical function that will spit out the same set of Dates but with
> all the Values recalculated. I hope that makes sense, e.g.
>
> Date         Value
> ------------------
> 2012-01-01   12.34
> 2012-01-02   12.56
> 2012-01-03   13.78
>
> will have the mathematical function applied and spit out
>
> Date         Value
> ------------------
> 2012-01-01   28.74
> 2012-01-02   31.29
> 2012-01-03   29.93
>
> I am not exactly sure how to achieve this using Hadoop Streaming, but my
> thoughts so far are:
>
> 1. Use Sqoop to take the data out of the RDBMS and into HDFS, split by
>    TradeID - will this guarantee that all the data points for a given
>    TradeID will be processed by the same Map task?
> 2. Write a Map task as a C# executable that will stream data in in the
>    format (ID, TradeID, Date, Value).
> 3. Gather all the data points for a given TradeID together into an
>    array (or other data structure).
> 4. Pass the array into the mathematical function.
> 5. Get the results back as another array.
> 6. Stream the results back out in the format (TradeID, Date,
>    ResultValue).
>
> I will have around 500,000 TradeIDs, with up to 3,000 data points each,
> so I am hoping that the data/processing will be distributed appropriately
> by Hadoop.
>
> Now, this seems a little bit long-winded, but is this the best way of
> doing it, given the constraint of having to use C# for writing my tasks?
> In the example above I do not have a Reduce job at all. Is that right in
> my scenario?
>
> Thanks for any help you can give, and apologies if I am asking stupid
> questions here!
>
> Kind Regards,
>
> Tom
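For step 1, I was picturing a Sqoop import along these lines (the connection string, table name and directory are made up, so treat this only as a sketch):

    sqoop import \
        --connect jdbc:mysql://dbhost/tradesdb \
        --table Trades \
        --split-by TradeID \
        --target-dir /trades/input

My (possibly wrong) understanding is that --split-by only controls how Sqoop parallelises the import itself, so I suspect it would not on its own guarantee that one Map task later sees all the rows for a given TradeID.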
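For steps 2-6, below is a rough, untested sketch of the Map executable I have in mind. It assumes Sqoop has written tab-separated lines in the order (ID, TradeID, Date, Value), that all the rows for one TradeID reach the mapper contiguously (the very thing I am unsure about in step 1), and Recalculate() is just a placeholder for my real mathematical function:

    // TradeMapper.cs - sketch of a Hadoop Streaming map task in C#.
    // Reads tab-separated rows (ID, TradeID, Date, Value) from stdin,
    // groups consecutive rows by TradeID, applies the function to each
    // group and writes (TradeID, Date, ResultValue) to stdout.
    using System;
    using System.Collections.Generic;
    using System.Globalization;

    class TradeMapper
    {
        static void Main()
        {
            string currentTradeId = null;
            var dates = new List<string>();
            var values = new List<double>();

            string line;
            while ((line = Console.ReadLine()) != null)
            {
                var fields = line.Split('\t');
                if (fields.Length < 4) continue;   // skip malformed rows

                string tradeId = fields[1];
                if (currentTradeId != null && tradeId != currentTradeId)
                {
                    Emit(currentTradeId, dates, values);  // finished one trade
                    dates.Clear();
                    values.Clear();
                }
                currentTradeId = tradeId;
                dates.Add(fields[2]);
                values.Add(double.Parse(fields[3], CultureInfo.InvariantCulture));
            }
            if (currentTradeId != null)
                Emit(currentTradeId, dates, values);      // flush the last trade
        }

        static void Emit(string tradeId, List<string> dates, List<double> values)
        {
            double[] results = Recalculate(values.ToArray());
            for (int i = 0; i < dates.Count; i++)
                Console.WriteLine("{0}\t{1}\t{2}", tradeId, dates[i],
                    results[i].ToString(CultureInfo.InvariantCulture));
        }

        // Placeholder for the real mathematical function - identity for now.
        static double[] Recalculate(double[] values)
        {
            return values;
        }
    }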
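Finally, I assume the job itself would be launched with something like the command below (again untested - the streaming jar path varies by Hadoop version, and I believe I would need Mono to run a .NET executable on a Linux cluster). I have set mapred.reduce.tasks=0 on the assumption that this is a map-only job, which ties back to my question about whether I need a Reduce step at all:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -D mapred.reduce.tasks=0 \
        -input /trades/input \
        -output /trades/output \
        -mapper "mono TradeMapper.exe" \
        -file TradeMapper.exe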