Hey Chris,

I think it would be appropriate. Look at it this way: at 400 records a second, a single mapper processes about 24k records a minute, so it should take about 17 mappers to chew through your largest problem (400,000 records) in one minute.
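A quick back-of-the-envelope check of that arithmetic, using the numbers from this thread:

```python
# Mapper count estimate from the figures in this thread.
records_per_second = 400                              # Chris's single-threaded rate
records_per_mapper_minute = records_per_second * 60   # = 24,000 records/minute/mapper
largest_file = 400_000                                # upper bound on records per file

# Mappers needed to finish the largest file in about one minute
# (ceiling division, since a fractional mapper still needs a slot).
mappers = -(-largest_file // records_per_mapper_minute)
print(mappers)  # 17
```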

Even if you still think your problem is too small, consider:
1) The possibility of growth in your application. Your processing becomes "future proof" -- you have a pretty solid way to scale out as your task grows. Just add new machines; you don't have to invest in a "small scale" framework and then rewrite in a year.

2) The benefits of having a framework do the heavy lifting. There's a surprising amount of "roll your own" that you end up doing when you decide to break out of a single thread. By framing your problem as a map-reduce problem, you get to skip a lot of those steps and just focus on solving your problem. (Also: beware that building your own MapReduce framework is very sexy, and anything "very sexy" takes up more time and money than you think possible at the outset.)
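To give a feel for how little you end up writing once the framework does the heavy lifting: with Hadoop Streaming, a mapper is just a script that reads one record per line from stdin and emits tab-separated key/value pairs. A minimal sketch (process_record is a hypothetical stand-in for the real per-record work, not anything from Chris's application):

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch: one input record per line on stdin,
# "key<TAB>value" pairs on stdout. The framework handles splitting, scheduling,
# retries, and shuffling -- this script only does the per-record work.
import sys

def process_record(record):
    # Hypothetical placeholder for the actual per-record computation.
    return record.strip().upper()

def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        result = process_record(line)
        stdout.write("%s\t%s\n" % ("rec", result))

if __name__ == "__main__":
    main()
```

You would launch it with the streaming jar, something along the lines of `hadoop jar hadoop-streaming.jar -input infile -output outdir -mapper mapper.py` (exact paths depend on your install).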

Brian

On Feb 3, 2009, at 8:34 AM, cdwillie76 wrote:


I have an application I would like to apply Hadoop to, but I'm not sure if the tasking is too small. I have a file that contains between 70,000 and 400,000 records. All the records can be processed in parallel, and I can currently process them at 400 records a second single threaded (give or take). I thought I read somewhere (one of the tutorials) that mapper tasks should run for at least a minute to offset the overhead of creating them. Is this really the case? I am pretty sure that one record per mapper is overkill, but I am wondering whether batching them up for the mapper is still the way to go, or if I should look at some other framework to help split up the processing.

Any insight would be appreciated.

Thanks
Chris
--
View this message in context: 
http://www.nabble.com/Is-hadoop-right-for-my-problem-tp21811122p21811122.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
