Thanks for the detailed answer; this will be useful to know once I'm
optimizing and tuning. I'm actually still at the stage of figuring out
how to approach applying the MapReduce pattern to the task, so I'll take
your suggestion and ask again on common-user. Thanks!
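In case it's useful to anyone searching the archives, here's the rough
shape I'm considering: a map-only job that reads a pre-generated list of
URLs (one per line), fetches each one, and writes (url, status) pairs
straight to HDFS. This is just an untested sketch against the 0.20
mapreduce API; the class names are placeholders, and plain
HttpURLConnection stands in for whichever thread-safe client library we
end up picking.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UrlFetchJob {

  // Sketch only: each input line is a URL; the mapper fetches it and
  // emits (url, HTTP status). Failures are recorded rather than thrown,
  // so one bad URL doesn't kill the whole task.
  public static class FetchMapper
      extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String url = value.toString().trim();
      if (url.isEmpty()) {
        return;
      }
      try {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        int status = conn.getResponseCode();
        // A real job would read conn.getInputStream() and store the body.
        context.write(new Text(url), new Text(Integer.toString(status)));
        conn.disconnect();
      } catch (IOException e) {
        context.write(new Text(url), new Text("ERROR: " + e.getMessage()));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "url-fetch");
    job.setJarByClass(UrlFetchJob.class);
    job.setMapperClass(FetchMapper.class);
    job.setNumReduceTasks(0);  // map-only: output goes straight to HDFS
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // URL list
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // results
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

If one request per map() call turns out to be too slow, I gather
MultithreadedMapper can run several fetch threads per task, which is
where your point about a thread-safe client would come in.
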
On Sun, Mar 7, 2010 at 8:28 AM, Kay Kay <kaykay.uni...@gmail.com> wrote:
> On 03/06/2010 09:29 AM, Phil McCarthy wrote:
>>
>> Hi,
>>
>> I'm new to Hadoop, and I'm trying to figure out the best way to use it
>> with EC2 to make a large number of calls to a web API,
>
> Consider using an HTTP client library / connection that is thread-safe.
>
>> and then process and store the results. I'm completely new to Hadoop,
>> so I'm wondering what the best high-level approach is, in terms of
>> using MapReduce to parallelize the process. The calls will be regular
>> HTTP requests, and the URLs follow a known format, so they can be
>> generated easily.
>
> Profile the mappers/reducers for memory usage (primarily), and watch
> the GC graph for any crazy peaks or maximum range of memory used; then
> do the same for CPU. While the programming language may be Java, it is
> best to write as though for an embedded environment: conserve bytes,
> go easy on new(), and go slow on regexes, etc. The bandwidth of
> intermediate results written to the context by the mappers (to local
> disk, during the intermediate stage) and transferred to the reducers
> is a different thing altogether, also worth considering.
>
>> This seems like it'd be a pretty common type of task, so apologies if
>> I've missed something obvious in the docs etc.
>
> Good luck! As you might have figured out from the history of the list,
> common-u...@hadoop.apache.org is busier than this one and, despite the
> name, is still very relevant to HDFS/M-R questions.
>
>> Cheers,
>> Phil McCarthy