> In general, I do not recommend running with VMs... Running two HBase 
> nodes in VMs on a single machine vs. running one HBase node on the same 
> machine w/o a VM, I don't really see where you'd get any benefit.

We use a mixed deployment model: we move HDFS and HBase into Xen's dom0, then 
deploy service components packaged as VM instances into domUs. We reserve 
8 GB of RAM (out of 32 GB) and one quad-core CPU for dom0; 4 GB of that RAM 
goes to the region servers. The remainder of the RAM and the other quad-core 
CPU support the domUs. 
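
Roughly, the dom0 reservation is set at hypervisor boot time; the relevant 
GRUB entry looks something like the sketch below (Xen 3.x parameters, kernel 
file names illustrative):

    kernel /boot/xen.gz dom0_mem=8192M dom0_max_vcpus=4 dom0_vcpus_pin
    module /boot/vmlinuz-2.6-xen ro root=/dev/sda1
    module /boot/initrd-2.6-xen.img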

At first glance this may seem kind of crazy, but it provides the benefit of 
avoiding (para)virtualization overheads and other VMM quirks in the distributed 
cluster storage layer, while still allowing simplified component and service 
deployment and other virtualization benefits on a co-located dynamic compute 
cluster. It also enables auto-scaling and load-aware repacking/migration, using 
Ganglia as a metrics bus feeding into a control layer for deployment 
reoptimization, but that's getting out of scope. Meanwhile, the Hadoop and HBase 
daemons are starved for neither CPU nor RAM. Oh, and dom0 runs effectively 
swapless (vm.swappiness=0). 
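
That last bit is just the usual sysctl knob:

    # /etc/sysctl.conf on dom0
    vm.swappiness = 0

    # or apply immediately without a reboot
    sysctl -w vm.swappiness=0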

Of course, the trade-off here is that the attack surface of the privileged 
domain is enlarged by exporting the HDFS and HBase services from it. The 
services themselves are shared cluster-wide, so they have a lot of exposure. 
Currently HBase has no security model beyond HDFS file permissions, which is 
itself minimally protective. Deployment automation can help by blocking access 
to HBase and HDFS services via iptables as appropriate. However, you should not 
run untrusted code in any domU. (Discretionary access control for HBase is on 
the roadmap for 0.22: http://issues.apache.org/jira/browse/HBASE-1697 . There 
are also several issues open for security related enhancements to Hadoop and 
HDFS.)
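
For example, something along these lines on each node (the ports are the 
stock 0.20 defaults and the subnet is illustrative; adjust to your site):

    # allow the trusted cluster subnet to reach HDFS and HBase...
    iptables -A INPUT -p tcp -s 10.0.0.0/24 -m multiport \
      --dports 8020,50010,50070,60000,60020 -j ACCEPT
    # ...and drop everyone else
    iptables -A INPUT -p tcp -m multiport \
      --dports 8020,50010,50070,60000,60020 -j DROP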

    - Andy




________________________________
From: Jonathan Gray <[email protected]>
To: [email protected]
Sent: Wed, October 21, 2009 3:35:59 PM
Subject: Re: Table Upload Optimization

That depends on how much memory you have for each node.  I recommend 
setting the heap to 1/2 of total memory.
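
(That's HBASE_HEAPSIZE in conf/hbase-env.sh, in MB; on a 4 GB node that works 
out to roughly:)

    # conf/hbase-env.sh
    export HBASE_HEAPSIZE=2000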

In general, I do not recommend running with VMs... Running two HBase 
nodes in VMs on a single machine vs. running one HBase node on the same 
machine w/o a VM, I don't really see where you'd get any benefit.

You should install something like Ganglia to help monitor the cluster. 
Swap usage is reported by free, top, and just about anything else (as well 
as Ganglia).
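
For a quick check on any node:

    $ free -m     # the Swap: row shows total/used/free swap in MB
    $ vmstat 5    # the si/so columns should stay at or near zero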

JG

Mark Vigeant wrote:
> Also, I updated the configuration and things seem to be working a bit better.
> 
> What's a good heap size to set?
> 
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of stack
> Sent: Wednesday, October 21, 2009 12:46 PM
> To: [email protected]
> Subject: Re: Table Upload Optimization
> 
> On Wed, Oct 21, 2009 at 8:53 AM, Mark Vigeant
> <[email protected]>wrote:
> 
>>> I saw this in your first posting: 10/21/09 10:22:52 INFO mapred.JobClient:
>>> map 100% reduce 0%.
>>> Is your job writing hbase in the map task or in reducer?  Are you using
>>> TableOutputFormat?
>> I am using table output format and only a mapper. There is no reducer.
>> Would a reducer make things more efficient?
>>
>>
> No.  Unless you need the reduce step for some reason avoid it.
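> 
> (For reference, a map-only setup against the 0.20 TableOutputFormat API 
> looks roughly like the sketch below; the mapper class and table name are 
> placeholders:)
> 
>   // classes from org.apache.hadoop.hbase.{mapreduce,io,client}
>   Job job = new Job(conf, "xml upload");
>   job.setJarByClass(XmlUploadMapper.class);
>   job.setMapperClass(XmlUploadMapper.class);   // emits (ImmutableBytesWritable, Put)
>   job.setOutputFormatClass(TableOutputFormat.class);
>   job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "mytable");
>   job.setOutputKeyClass(ImmutableBytesWritable.class);
>   job.setOutputValueClass(Put.class);
>   job.setNumReduceTasks(0);   // map-only: Puts go straight to the table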
> 
> 
> 
> 
>>>> I'm using Hadoop 0.20.1 and HBase 0.20.0
>>>>
>>>> Each node is a virtual machine with 2 CPU, 4 GB host memory and 100 GB
>>>> storage.
>>>>
>>>>
>>> You are running DN, TT, HBase, and ZK on above?  One disk shared by all?
>> I'm only running zookeeper on 2 of the above nodes, and then a TT DN and
>> regionserver on all.
>>
>>
> zk cluster should be an odd number.
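> 
> (That is, 1, 3, or 5 hosts listed in hbase.zookeeper.quorum in 
> hbase-site.xml; the host names below are placeholders:)
> 
> <property>
>   <name>hbase.zookeeper.quorum</name>
>   <value>node1,node2,node3</value>
> </property>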
> 
> One disk shared by all?
> 
> 
> 
>>> Children running at any one time on a TaskTracker.  You should start with
>>> one only since you have such an anemic platform.
>> Ah, and I can set that in the hadoop config?
>>
>>
> 
> <property>
>   <name>mapred.tasktracker.map.tasks.maximum</name>
>   <value>2</value>
>   <description>The maximum number of map tasks that will be run
>   simultaneously by a task tracker.
>   </description>
> </property>
> 
> 
> 
> St.Ack
> 
> 
> 
>>> You've upped filedescriptors and xceivers, all the stuff in 'Getting
>>> Started'?
>> And no it appears as though I accidentally overlooked that beginning stuff.
>> Yikes. Ok.
>>
>> I will take care of those and get back to you.
>>
>>
> 
> 
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]] On Behalf Of
>>> Jean-Daniel Cryans
>>> Sent: Wednesday, October 21, 2009 11:04 AM
>>> To: [email protected]
>>> Subject: Re: Table Upload Optimization
>>>
>>> Well the XMLStreamingInputFormat lets you map XML files which is neat
>>> but it has a problem and always needs to be patched. I wondered if
>>> that was missing but in your case it's not the problem.
>>>
>>> Did you check the logs of the master and region servers? Also I'd like to
>>> know
>>>
>>> - Version of Hadoop and HBase
>>> - Nodes's hardware
>>> - How many map slots per TT
>>> - HBASE_HEAPSIZE from conf/hbase-env.sh
>>> - Special configuration you use
>>>
>>> Thx,
>>>
>>> J-D
>>>
>>> On Wed, Oct 21, 2009 at 7:57 AM, Mark Vigeant
>>> <[email protected]> wrote:
>>>> No. Should I?
>>>>
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:[email protected]] On Behalf Of
>>> Jean-Daniel Cryans
>>>> Sent: Wednesday, October 21, 2009 10:55 AM
>>>> To: [email protected]
>>>> Subject: Re: Table Upload Optimization
>>>>
>>>> Are you using the Hadoop Streaming API?
>>>>
>>>> J-D
>>>>
>>>> On Wed, Oct 21, 2009 at 7:52 AM, Mark Vigeant
>>>> <[email protected]> wrote:
>>>>> Hey
>>>>>
>>>>> So I want to upload a lot of XML data into an HTable. I have a class
>>>>> that successfully maps up to about 500 MB of data or so (on one
>>>>> regionserver) into a table, but if I go for much bigger than that it
>>>>> takes forever and eventually just stops. I tried uploading a big XML
>>>>> file into my 4 regionserver cluster (about 7 GB) and it's been a day
>>>>> and it's still going at it.
>>>>> What I get when I run the job on the 4 node cluster is:
>>>>> 10/21/09 10:22:35 INFO mapred.LocalJobRunner:
>>>>> 10/21/09 10:22:38 INFO mapred.LocalJobRunner:
>>>>> (then it does that for a while until...)
>>>>> 10/21/09 10:22:52 INFO mapred.TaskRunner: Task
>>>>> attempt_local_0001_m_000117_0 is done. And is in the process of
>>>>> committing
>>>>> 10/21/09 10:22:52 INFO mapred.LocalJobRunner:
>>>>> 10/21/09 10:22:52 mapred.TaskRunner: Task
>>>>> 'attempt_local_0001_m_000117_0' is done.
>>>>> 10/21/09 10:22:52 INFO mapred.JobClient:   map 100% reduce 0%
>>>>> 10/21/09 10:22:58 INFO mapred.LocalJobRunner:
>>>>> 10/21/09 10:22:59 INFO mapred.JobClient: map 99% reduce 0%
>>>>>
>>>>>
>>>>> I'm convinced I'm not configuring hbase or hadoop correctly. Any
>>>>> suggestions?
>>>>> Mark Vigeant
>>>>> RiskMetrics Group, Inc.
>>>>>
> 
