Thanks Aaron,

The processing libraries we use, which take time to load, are all C++-based .so libraries. Can I invoke them from the JVM during the configure stage of the mapper and keep them loaded, as you suggested? Can you point me to some documentation on this?
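Something along these lines is what I had in mind -- just a rough sketch; the wrapper class, library name, and native method names below are placeholders, not an existing API:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class HeavyInitMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Shared by every task that runs in this JVM (relevant once JVM reuse is on).
  private static boolean initialized = false;

  // Hypothetical JNI bindings into the C++ .so -- placeholders only.
  private static native void nativeInit();                   // the expensive 1-1.5 minute load
  private static native String nativeProcess(String record); // cheap per-record call

  public void configure(JobConf job) {
    synchronized (HeavyInitMapper.class) {
      if (!initialized) {
        System.loadLibrary("myprocessor"); // loads libmyprocessor.so from java.library.path
        nativeInit();                      // one-time heavy load-in
        initialized = true;
      }
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Each record is now just a quick call into the already-loaded library.
    output.collect(new Text(key.toString()), new Text(nativeProcess(value.toString())));
  }
}

Is that roughly the pattern you meant?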
Regards,
Amit

On Sat, Apr 25, 2009 at 1:42 PM, Aaron Kimball <aa...@cloudera.com> wrote:
> Amit,
>
> This can be made to work with Hadoop. Basically, your mapper would do the
> heavy load-in during its "configure" stage, then process your individual
> work items as records during the actual "map" stage. A map task can
> comprise many records, so you'll be fine here.
>
> If you use Hadoop 0.19 or 0.20, you can also enable JVM reuse, where
> multiple map tasks are performed serially in the same JVM instance. In
> this case, the first task in the JVM would do the heavy load-in into
> static fields or other globally accessible objects; subsequent tasks
> could recognize that the system state is already initialized and would
> not need to repeat it.
>
> The number of mapper/reducer tasks that run in parallel on a given node
> can be configured with a simple setting; setting this to 6 will work just
> fine. The capacity / fair-share schedulers are not what you need here --
> their main function is to ensure that multiple jobs (separate sets of
> tasks) can all make progress simultaneously by sharing cluster resources
> across jobs, rather than running jobs in a FIFO fashion.
>
> - Aaron
>
> On Sat, Apr 25, 2009 at 2:36 PM, amit handa <amha...@gmail.com> wrote:
>
> > Hi,
> >
> > We are planning to use Hadoop for some very expensive, long-running
> > processing tasks.
> > The compute nodes that we plan to use are very heavy in terms of CPU
> > and memory requirements, e.g. one process instance takes almost 100%
> > of a CPU (1 core) and around 300-400 MB of RAM.
> > The first time the process loads it can take around 1-1.5 minutes, but
> > after that we can feed it data and it takes only a few seconds to
> > process.
> > Can I model this on Hadoop?
> > Can I have my processes pre-loaded on the task-processing machines and
> > the data provided by Hadoop? This would save the 1-1.5 minutes of
> > initial load time that it would otherwise take for each task.
> > I want to run a number of these processes in parallel based on the
> > machine's capacity (e.g. 6 instances on an 8-CPU box) or using the
> > capacity scheduler.
> >
> > Please let me know if this is possible, or any pointers to how it can
> > be done.
> >
> > Thanks,
> > Amit
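P.S. For the JVM reuse and per-node settings mentioned above, this is the driver-side sketch I am working from -- again only a sketch, assuming the old mapred API in Hadoop 0.19/0.20; the job and class names are placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class HeavyInitJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(HeavyInitJob.class);
    conf.setJobName("heavy-init-processing");

    conf.setMapperClass(HeavyInitMapper.class);
    conf.setNumReduceTasks(0);              // map-only processing
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // Reuse each task JVM for an unlimited number of tasks, so the native
    // library only has to be loaded once per JVM rather than once per task.
    conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

If I understand correctly, the per-node slot count (e.g. mapred.tasktracker.map.tasks.maximum set to 6 on an 8-core box) is a tasktracker-side setting in the cluster configuration, not something set per job -- please correct me if that's wrong.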