> In general, I do not recommend running with VMs... Running two hbase
> nodes on a single node in VMs vs running one hbase node on the same node
> w/o VM, I don't really see where you'd get any benefit.
We use a mixed deployment model where we move HDFS and HBase into Xen's
dom0, then deploy service components packaged as VM instances into domUs.
We reserve 8 GB of RAM for dom0 (out of 32 GB), 4 GB of which goes to the
region servers, and also one quad core CPU. The remainder of the RAM and
the other quad core CPU supports the domUs. At first glance this may seem
kind of crazy, but it provides the benefit of avoiding (para)virtualization
overheads and other VMM quirks in the distributed cluster storage layer,
while still allowing simplified component and service deployment and the
other virtualization benefits on a co-located dynamic compute cluster. It
also enables auto scaling and load-aware repacking/migration, using Ganglia
as a metrics bus feeding into a control layer for deployment
reoptimization, but that's getting out of scope. Meanwhile the Hadoop and
HBase daemons are starved for neither CPU nor RAM. Oh, and dom0 runs
effectively swapless (vm.swappiness=0).

Of course the trade-off here is that exporting HDFS and HBase services
from the privileged domain enlarges its attack surface. The services
themselves are shared cluster wide, so they have a lot of exposure.
Currently HBase has no security model beyond HDFS file permissions, which
is itself minimally protective. Deployment automation can help by blocking
access to HBase and HDFS services via iptables as appropriate. However,
you should not run untrusted code in any domU.

(Discretionary access control for HBase is on the roadmap for 0.22:
http://issues.apache.org/jira/browse/HBASE-1697 . There are also several
issues open for security related enhancements to Hadoop and HDFS.)

   - Andy

________________________________
From: Jonathan Gray <[email protected]>
To: [email protected]
Sent: Wed, October 21, 2009 3:35:59 PM
Subject: Re: Table Upload Optimization

That depends on how much memory you have for each node. I recommend
setting the heap to 1/2 of total memory.

In general, I do not recommend running with VMs... Running two hbase
nodes on a single node in VMs vs running one hbase node on the same node
w/o VM, I don't really see where you'd get any benefit.

You should install something like Ganglia to help monitor the cluster.
Swap is reported through free, top, just about anything (as well as
ganglia).

JG

Mark Vigeant wrote:
> Also, I updated the configuration and things seem to be working a bit
> better.
>
> What's a good heap size to set?
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of stack
> Sent: Wednesday, October 21, 2009 12:46 PM
> To: [email protected]
> Subject: Re: Table Upload Optimization
>
> On Wed, Oct 21, 2009 at 8:53 AM, Mark Vigeant
> <[email protected]> wrote:
>
>>> I saw this in your first posting: 10/21/09 10:22:52 INFO mapred.JobClient:
>>> map 100% reduce 0%.
>>> Is your job writing hbase in the map task or in reducer? Are you using
>>> TableOutputFormat?
>>
>> I am using table output format and only a mapper. There is no reducer.
>> Would a reducer make things more efficient?
>
> No. Unless you need the reduce step for some reason, avoid it.
>
>>>> I'm using Hadoop 0.20.1 and HBase 0.20.0
>>>>
>>>> Each node is a virtual machine with 2 CPU, 4 GB host memory and 100 GB
>>>> storage.
>>>
>>> You are running DN, TT, HBase, and ZK on the above? One disk shared by
>>> all?
>>
>> I'm only running zookeeper on 2 of the above nodes, and then a TT, DN and
>> regionserver on all.
>
> The zk cluster should have an odd number of nodes.
>
> One disk shared by all?
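For concreteness, here is a minimal sketch of the dom0 settings Andy
describes above. The ports are the 0.20-era defaults for the region server
and datanode; the xenbr0 bridge name and the exact iptables policy are
assumptions to adapt to your own cluster:

  # /etc/sysctl.conf -- run dom0 effectively swapless
  vm.swappiness = 0

  # Drop domU traffic to the HBase region server (60020) and HDFS
  # datanode (50010) service ports, per Andy's "block via iptables
  # as appropriate"
  iptables -A INPUT -i xenbr0 -p tcp --dport 60020 -j DROP
  iptables -A INPUT -i xenbr0 -p tcp --dport 50010 -j DROP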
>>> Children running at any one time on a TaskTracker. You should start
>>> with one only since you have such an anemic platform.
>>
>> Ah, and I can set that in the hadoop config?
>
> <property>
>   <name>mapred.tasktracker.map.tasks.maximum</name>
>   <value>2</value>
>   <description>The maximum number of map tasks that will be run
>   simultaneously by a task tracker.
>   </description>
> </property>
>
> St.Ack
>
>>> You've upped filedescriptors and xceivers, all the stuff in 'Getting
>>> Started'?
>>
>> And no, it appears as though I accidentally overlooked that beginning
>> stuff. Yikes. Ok.
>>
>> I will take care of those and get back to you.
>
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]] On Behalf Of
>>> Jean-Daniel Cryans
>>> Sent: Wednesday, October 21, 2009 11:04 AM
>>> To: [email protected]
>>> Subject: Re: Table Upload Optimization
>>>
>>> Well, the XMLStreamingInputFormat lets you map XML files, which is
>>> neat, but it has a problem and always needs to be patched. I wondered
>>> if that was missing, but in your case it's not the problem.
>>>
>>> Did you check the logs of the master and region servers? Also I'd like
>>> to know:
>>>
>>> - Version of Hadoop and HBase
>>> - Nodes' hardware
>>> - How many map slots per TT
>>> - HBASE_HEAPSIZE from conf/hbase-env.sh
>>> - Special configuration you use
>>>
>>> Thx,
>>>
>>> J-D
>>>
>>> On Wed, Oct 21, 2009 at 7:57 AM, Mark Vigeant
>>> <[email protected]> wrote:
>>>> No. Should I?
>>>>
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:[email protected]] On Behalf Of
>>>> Jean-Daniel Cryans
>>>> Sent: Wednesday, October 21, 2009 10:55 AM
>>>> To: [email protected]
>>>> Subject: Re: Table Upload Optimization
>>>>
>>>> Are you using the Hadoop Streaming API?
>>>>
>>>> J-D
>>>>
>>>> On Wed, Oct 21, 2009 at 7:52 AM, Mark Vigeant
>>>> <[email protected]> wrote:
>>>>> Hey
>>>>>
>>>>> So I want to upload a lot of XML data into an HTable. I have a class
>>>>> that successfully maps up to about 500 MB of data or so (on one
>>>>> regionserver) into a table, but if I go for much bigger than that it
>>>>> takes forever and eventually just stops. I tried uploading a big XML
>>>>> file (about 7 GB) into my 4 regionserver cluster and it's been a day
>>>>> and it's still going at it.
>>>>>
>>>>> What I get when I run the job on the 4 node cluster is:
>>>>>
>>>>> 10/21/09 10:22:35 INFO mapred.LocalJobRunner:
>>>>> 10/21/09 10:22:38 INFO mapred.LocalJobRunner:
>>>>> (then it does that for a while until...)
>>>>> 10/21/09 10:22:52 INFO mapred.TaskRunner: Task
>>>>> attempt_local_0001_m_000117_0 is done. And is in the process of
>>>>> committing
>>>>> 10/21/09 10:22:52 INFO mapred.LocalJobRunner:
>>>>> 10/21/09 10:22:52 INFO mapred.TaskRunner: Task
>>>>> 'attempt_local_0001_m_000117_0' is done.
>>>>> 10/21/09 10:22:52 INFO mapred.JobClient: map 100% reduce 0%
>>>>> 10/21/09 10:22:58 INFO mapred.LocalJobRunner:
>>>>> 10/21/09 10:22:59 INFO mapred.JobClient: map 99% reduce 0%
>>>>>
>>>>> I'm convinced I'm not configuring hbase or hadoop correctly. Any
>>>>> suggestions?
>>>>>
>>>>> Mark Vigeant
>>>>> RiskMetrics Group, Inc.
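stack's question above about file descriptors and xceivers refers to the
settings from the HBase 'Getting Started' guide. A minimal sketch, using
values commonly recommended in the 0.20 era (the hadoop user name is an
assumption; check the guide for current numbers):

  # /etc/security/limits.conf -- raise the open-file limit for the
  # user that runs the Hadoop and HBase daemons
  hadoop  -  nofile  32768

  <!-- conf/hdfs-site.xml -- raise the datanode xceiver limit;
       note the property name's historical misspelling -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>2047</value>
  </property>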

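Finally, since the thread discusses writing to HBase from a map-only job
with TableOutputFormat, here is a minimal sketch of such a job against the
0.20-era APIs. The table name, column family, and row-key scheme are
placeholders; a real uploader would parse the XML rather than store raw
lines:

  import java.io.IOException;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  public class XmlUpload {

    // Map-only: emit one Put per input line. No reducer, per stack's
    // advice ("Unless you need the reduce step... avoid it").
    static class UploadMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
      @Override
      public void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Placeholder row key: the byte offset of the line in the file.
        byte[] row = Bytes.toBytes(offset.get());
        Put put = new Put(row);
        put.add(Bytes.toBytes("content"), Bytes.toBytes("xml"),
                Bytes.toBytes(line.toString()));
        context.write(new ImmutableBytesWritable(row), put);
      }
    }

    public static void main(String[] args) throws Exception {
      HBaseConfiguration conf = new HBaseConfiguration();
      // Assumed table name; must already exist with a 'content' family.
      conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
      Job job = new Job(conf, "xml upload");
      job.setJarByClass(XmlUpload.class);
      job.setMapperClass(UploadMapper.class);
      job.setOutputKeyClass(ImmutableBytesWritable.class);
      job.setOutputValueClass(Put.class);
      job.setOutputFormatClass(TableOutputFormat.class);
      job.setNumReduceTasks(0); // map-only
      FileInputFormat.addInputPath(job, new Path(args[0]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }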