You are running all of these virtual machines on a single host node? And they are all sharing 4GB of memory?

That is a major issue. First, GC pauses will start to lock things up and create timeouts. Then swapping will totally kill the performance of everything. Is that happening on your cluster?

Virtualized clusters have some odd performance characteristics, and if you are starving each virtual node as it is, you will never see solid behavior. Virtualized I/O can also be problematic, if not simply slow (especially during upload scenarios).

JG

Mark Vigeant wrote:
I saw this in your first posting: 10/21/09 10:22:52 INFO mapred.JobClient:
map 100% reduce 0%.

Is your job writing hbase in the map task or in reducer?  Are you using
TableOutputFormat?

I am using TableOutputFormat and only a mapper; there is no reducer. Would a
reducer make things more efficient?
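For readers following along, a map-only load through TableOutputFormat is wired up roughly like the sketch below. This is an illustrative sketch against the HBase 0.20 `mapreduce` API, not code from the thread; the class name `XmlUpload`, the table name `mytable`, and the mapper `XmlToPutMapper` are all hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class XmlUpload {
  public static void main(String[] args) throws Exception {
    // HBaseConfiguration picks up hbase-site.xml from the classpath.
    Configuration conf = new HBaseConfiguration();
    // Tell TableOutputFormat which table receives the Puts.
    conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");

    Job job = new Job(conf, "xml-upload");
    job.setJarByClass(XmlUpload.class);
    // Hypothetical mapper that parses XML records and emits
    // (ImmutableBytesWritable rowKey, Put put) pairs.
    job.setMapperClass(XmlToPutMapper.class);
    job.setOutputFormatClass(TableOutputFormat.class);
    // Map-only: for pure inserts a reducer adds a shuffle/sort
    // without adding value, so zero reduce tasks is typical here.
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

With no reducer, each mapper writes Puts directly to the region servers as it goes, which is why misconfigured region servers stall the map phase itself.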


I'm using Hadoop 0.20.1 and HBase 0.20.0

Each node is a virtual machine with 2 CPU, 4 GB host memory and 100 GB
storage.


You are running DN, TT, HBase, and ZK on above?  One disk shared by all?

I'm only running zookeeper on 2 of the above nodes, and then a TT, DN, and regionserver on all.
Children running at any one time on a TaskTracker.  You should start with
one only since you have such an anemic platform.

Ah, and I can set that in the hadoop config?
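(For reference: yes, the number of concurrent map children per TaskTracker is controlled by `mapred.tasktracker.map.tasks.maximum` in `conf/mapred-site.xml`; this fragment is an editorial example, not from the thread.)

```xml
<!-- conf/mapred-site.xml: run at most one map task at a time per TaskTracker -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
```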


You've upped filedescriptors and xceivers, all the stuff in 'Getting
Started'?

And no, it appears I accidentally overlooked that beginning stuff. Yikes. OK.
I will take care of those and get back to you.
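(Editorial note for readers: the 'Getting Started' items referred to here are, roughly, raising the open-file limit for the user running Hadoop/HBase and raising the DataNode xceiver count. The values below are the ones the HBase documentation of that era suggested; treat them as a starting point.)

```
# /etc/security/limits.conf -- raise the open-file limit for the hadoop user
hadoop  -  nofile  32768
```

```xml
<!-- conf/hdfs-site.xml: note the property's historical spelling, "xcievers" -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```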


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of
Jean-Daniel Cryans
Sent: Wednesday, October 21, 2009 11:04 AM
To: [email protected]
Subject: Re: Table Upload Optimization

Well, the XMLStreamingInputFormat lets you map XML files, which is neat,
but it has a problem and always needs to be patched. I wondered if
that was missing, but in your case it's not the problem.

Did you check the logs of the master and region servers? Also, I'd like to know:

- Version of Hadoop and HBase
- Nodes' hardware
- How many map slots per TT
- HBASE_HEAPSIZE from conf/hbase-env.sh
- Special configuration you use
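(For context on the HBASE_HEAPSIZE question: it is the JVM heap, in megabytes, given to each HBase daemon, set in `conf/hbase-env.sh`; 1000 was the 0.20 default. Example fragment, not from the thread:)

```sh
# conf/hbase-env.sh -- heap in MB for HBase daemons (0.20 default is 1000)
export HBASE_HEAPSIZE=1000
```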

Thx,

J-D

On Wed, Oct 21, 2009 at 7:57 AM, Mark Vigeant
<[email protected]> wrote:
No. Should I?

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of
Jean-Daniel Cryans
Sent: Wednesday, October 21, 2009 10:55 AM
To: [email protected]
Subject: Re: Table Upload Optimization

Are you using the Hadoop Streaming API?

J-D

On Wed, Oct 21, 2009 at 7:52 AM, Mark Vigeant
<[email protected]> wrote:
Hey

So I want to upload a lot of XML data into an HTable. I have a class
that successfully maps up to about 500 MB of data or so (on one
regionserver) into a table, but if I go for much bigger than that, it takes
forever and eventually just stops. I tried uploading a big XML file (about
7 GB) into my 4-regionserver cluster, and it's been a day and it's still going
at it.
What I get when I run the job on the 4 node cluster is:
10/21/09 10:22:35 INFO mapred.LocalJobRunner:
10/21/09 10:22:38 INFO mapred.LocalJobRunner:
(then it does that for a while until...)
10/21/09 10:22:52 INFO mapred.TaskRunner: Task
attempt_local_0001_m_000117_0 is done. And is in the process of committing
10/21/09 10:22:52 INFO mapred.LocalJobRunner:
10/21/09 10:22:52 mapred.TaskRunner: Task
'attempt_local_0001_m_000117_0' is done.
10/21/09 10:22:52 INFO mapred.JobClient:   map 100% reduce 0%
10/21/09 10:22:58 INFO mapred.LocalJobRunner:
10/21/09 10:22:59 INFO mapred.JobClient: map 99% reduce 0%


I'm convinced I'm not configuring hbase or hadoop correctly. Any
suggestions?
Mark Vigeant
RiskMetrics Group, Inc.

