You can do whatever you want (including spawning threads) in the Mapper process (which is forked/exec'ed by the TaskTracker). But this doesn't help.
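As an aside, spawning worker threads inside a single task process can be sketched in plain Java (no Hadoop dependencies; the sum-of-squares work here is a made-up stand-in for per-record processing):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadedTask {
    // Farm per-record work out to a thread pool, then collect the results.
    // This is what "spawning threads inside the Mapper process" amounts to:
    // parallelism within one JVM, bounded by the cores of one machine.
    static int parallelSum(int[] values) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Integer>> futures = new ArrayList<>();
        for (int v : values) {
            futures.add(pool.submit(() -> v * v)); // hypothetical per-record work
        }
        int sum = 0;
        for (Future<Integer> f : futures) {
            sum += f.get(); // blocks until that worker thread finishes
        }
        pool.shutdown();
        return sum;
    }
}
```

Note that nothing here changes how many map tasks the framework runs; it only adds concurrency inside one task.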
I think you need to understand the fundamental difference between the two parallel-processing models.

1) Multi-threading: small-scale parallelism, limited to the number of cores within a single machine. Multiple execution threads share a memory model, with a lot of synchronization primitives to coordinate access to shared data.

2) Map/Reduce: large-scale parallelism involving a large number of machines (hundreds to thousands). Data is shuffled through two layers of machines via a special topology (output records with the same key from layer 1 land in the same place in layer 2). The first layer performs a per-record transformation (map) and the second layer performs a consolidation (reduce).

These two models have very different notions of synchronization: one uses fine-grained locking, the other is share-nothing, so you shouldn't consider them alternatives to each other. They are intended to solve very different problems.

Can they be used together? Absolutely yes. But you need to design how you want to partition your problem. For example, you can consider partitioning your graph into sub-graphs so each Mapper/Reducer deals with a larger sub-graph rather than with individual nodes. Of course, you then need to think about how to combine the sub-graph results, and whether you need an absolutely accurate answer or whether an approximation is good enough. I bet you are in the latter case, which should make things easier.

Ted, can you point me to matrix algorithms that are tuned for sparse graphs? What I mean is going from O(v^3) to O(v*e), where v = number of vertices and e = number of edges.

Rgds,
Ricky

-----Original Message-----
From: Peng, Wei [mailto:[email protected]]
Sent: Wednesday, December 22, 2010 8:58 AM
To: [email protected]
Subject: RE: breadth-first search

Can someone tell me whether we can run multiple threads in Hadoop?
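To illustrate the sparse-representation point above: a breadth-first search over an adjacency list touches each vertex and each edge at most once, giving O(v + e), whereas expressing the same traversal as repeated multiplication of a dense v-by-v adjacency matrix costs O(v^3) or worse. A minimal single-machine sketch in plain Java (the graph shape is a made-up example):

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class SparseBfs {
    // BFS hop counts from a source over an adjacency-list graph: O(v + e).
    static int[] distances(List<List<Integer>> adj, int source) {
        int[] dist = new int[adj.size()];
        Arrays.fill(dist, -1);            // -1 marks "unreached"
        Deque<Integer> queue = new ArrayDeque<>();
        dist[source] = 0;
        queue.add(source);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            for (int w : adj.get(u)) {
                if (dist[w] == -1) {      // first visit fixes the hop count
                    dist[w] = dist[u] + 1;
                    queue.add(w);
                }
            }
        }
        return dist;
    }
}
```

The same idea carries over to the sub-graph partitioning suggested above: each Mapper/Reducer can run a traversal like this over its own sub-graph and emit frontier vertices for the next round.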
Thanks,
Wei

-----Original Message-----
From: Peng, Wei [mailto:[email protected]]
Sent: Tuesday, December 21, 2010 9:07 PM
To: [email protected]
Subject: RE: breadth-first search

I was just trying to run 100 source nodes in multiple threads, but the MapReduce tasks still appear to run sequentially. Do I need to configure Hadoop somehow for multiple threads? Assign more task slots? How?

Thanks
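On the task-slots question: individual tasks run one per slot, and parallelism comes from running many tasks at once. On classic (pre-YARN) Hadoop, the number of concurrent map and reduce slots per TaskTracker is set in mapred-site.xml. A sketch, assuming the pre-YARN property names; the values are illustrative, not recommendations:

```xml
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
```

The TaskTrackers must be restarted for the change to take effect, and the number of map tasks actually launched is also bounded by the number of input splits, so 100 source nodes in one small input file may still yield only one mapper.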
