Hey Chris,
I think it would be appropriate. Look at it this way: at 400 records a
second, one mapper processes about 24k records a minute, so roughly 17
mappers could chew through your largest file (400k records) in one minute.
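Spelled out, that back-of-envelope math (the 400 records/sec and 400k-record figures come from Chris's message below) is just:

```python
# Quick check of the mapper count above, using the numbers from the thread.
import math

records_per_second = 400                              # Chris's single-threaded rate
records_per_mapper_minute = records_per_second * 60   # ~24k records per mapper-minute
largest_input = 400_000                               # upper bound on his file size

mappers = math.ceil(largest_input / records_per_mapper_minute)
print(mappers)  # -> 17
```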
Even if you still think your problem is too small, consider:
1) The possibility of growth in your application. Your processing
becomes "future proof" - you have a pretty solid way to scale out as
your task grows. Just add new machines -- you don't have to invest in
a "small scale" framework and then rewrite it in a year.
2) The benefits of having a framework do the heavy lifting. There's a
surprising amount of "roll your own" that you end up doing when you
decide to break out of a single thread. By framing your problem as a
map-reduce problem, you get to skip a lot of these steps and just
focus on solving your problem (also: beware that it's very sexy to
build your own MapReduce framework. Anything which is "very sexy"
takes up more time and money than you think possible at the outset).
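On the batching question: with the old-style mapred API, Hadoop ships an NLineInputFormat that hands each map task a fixed number of input lines, so you can tune the batch size until a task runs about a minute. A rough sketch of the driver configuration (class and property names as I remember them from recent Hadoop releases - verify against your version's docs):

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class BatchingDriver {
    public static JobConf configure(JobConf conf) {
        // Give each map task a fixed-size batch of records (one record per line).
        conf.setInputFormat(NLineInputFormat.class);
        // ~24k lines per map ~= one minute of work at 400 records/sec.
        conf.setInt("mapred.line.input.format.linespermap", 24000);
        return conf;
    }
}
```

The point is that batching is a one-line knob, not something you have to build yourself.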
Brian
On Feb 3, 2009, at 8:34 AM, cdwillie76 wrote:
I have an application I would like to apply Hadoop to, but I'm not sure
if the tasking is too small. I have a file that contains between 70,000
and 400,000 records. All the records can be processed in parallel, and I
can currently process them at 400 records a second single-threaded (give
or take). I thought I read somewhere (in one of the tutorials) that
mapper tasks should run for at least a minute to offset the overhead of
creating them. Is this really the case? I'm pretty sure that one record
per mapper is overkill, but I'm wondering whether batching them up for
the mapper is still the way to go, or if I should look at some other
framework to help split up the processing.
Any insight would be appreciated.
Thanks
Chris
Sent from the Hadoop core-user mailing list archive at Nabble.com.