Okay, that sounds like what I expected. But isn't there a strong likelihood of competition for HDFS resources between an M/R task running on a TaskTracker and the RegionServer running on the same machine?

In other words, let's say a Hadoop M/R task is running on a given TaskTracker and it's actively reading data from HDFS via the DataNode (and both are on the same machine for locality reasons). At the same time, another client is running an HBase BatchUpdate that affects the data stored on that very same DataNode. Won't that create a bottleneck? Or do the HBase operations like BatchUpdate actually run as M/R tasks? Or am I overestimating the data-retrieval problem?

Thanks!

-Sean

On Tue, Feb 3, 2009 at 4:42 PM, Jonathan Gray <[email protected]> wrote:

> Sean,
>
> You're going to want to run your TaskTrackers local to your DataNodes and
> RegionServers, again for locality reasons. That's one of the primary
> advantages of MapReduce, moving computation to data.
>
> Otherwise, you are on track. Of course the setup depends on what you're
> doing, but what you describe is on a majority of the HBase setups I'm aware
> of.
>
> JG
>
> > -----Original Message-----
> > From: Sean Laurent [mailto:[email protected]]
> > Sent: Tuesday, February 03, 2009 2:13 PM
> > To: [email protected]
> > Subject: HBase and Hadoop MapReduce - Common setups?
> >
> > Howdy folks,
> >
> > We're evaluating HBase and we're trying to get a good solid picture of how
> > everything fits together... specifically, we're wondering how people
> > commonly setup HBase. I'm imagining you typically run the region servers on
> > the same machines as the HDFS data nodes to gain data locality benefits. And
> > from what I've seen on the mailing list, it's typically recommended
> > (although it sounds like it's up for debate in terms of SPoF issues) to run
> > separate machines for the HBaseMaster and NameNode servers.
> >
> > Is it something along the following lines?
> >
> > 1x HBaseMaster
> > 1x HDFS NameNode
> > N machines with both HRegionServer and DataNode
> >
> > Now what about Hadoop and task trackers? Do people typically run completely
> > separate clusters for their M/R tasks? Do they run task trackers along side
> > the region server and data nodes? Or add machines that run TaskTracker and
> > DataNode servers but ~not~ HRegionServer?
> >
> > Any thoughts or opinions would be greatly appreciated!
