Ok, poked around at this a little more with a few experiments.
The most interesting one: I re-ran a couple of the jobs that generate
this data in HBase, one for the existing table where I had seen the
problem and one for a new table with the same configuration as the
old one.
When the analysis job reads from HBase, the counts are doubled only
for the older table; using the new table as input produces the
correct results.
While doing this I also noticed that only a single mapper is created
when using the new table, whereas two mappers are created for the old
table (I checked, and the data comes from only a single region in
either case).
So something is causing each HBase entry to be passed to a mapper twice
for the older table, but only once for the newer table.
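In case it helps anyone spot something, this is roughly the kind of
throwaway diagnostic mapper I've been using to see which split each map
task gets and how many rows it sees. It's just a sketch, not the real
job; the class name, the "Debug"/"ROWS_SEEN" counter, and the logging
output are all placeholders.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Throwaway diagnostic mapper (sketch only): logs the split each task is
// handed and counts the rows it sees, so the task logs show whether two
// tasks end up covering the same row range on the old table.
public class SplitLoggingMapper extends TableMapper<Text, LongWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Each map task handles exactly one split; log its boundaries.
        TableSplit split = (TableSplit) context.getInputSplit();
        System.err.println("split location=" + split.getRegionLocation()
            + " start=" + Bytes.toStringBinary(split.getStartRow())
            + " end=" + Bytes.toStringBinary(split.getEndRow()));
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        // Count every row handed to this task; if the job-level total is double
        // the table's row count, rows are being delivered more than once.
        context.getCounter("Debug", "ROWS_SEEN").increment(1);
    }
}

Comparing the start/end rows logged by the two tasks that run against the
old table should show whether the two splits overlap or are duplicates.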
Anyone have further thoughts on this? I'm basically out of ideas for
figuring this out.
- Adam
On 11/5/10 4:01 PM, Adam Phelps wrote:
Yeah, it wasn't the combiner. The repeated entries are actually seen by
the mapper, so the duplication happens before the combiner comes into
play. Is there any other info that would help narrow down what's causing this?
- Adam
On 11/5/10 11:35 AM, Adam Phelps wrote:
No, the system is actually much larger than two nodes, but the number of
mappers used here tends to be fairly small (I suspect it's based on the
HBase regions being accessed, though usually more than two). I'll try
turning off the combiner to see if that changes anything.
Thanks
- Adam
On 11/5/10 9:23 AM, Niels Basjes wrote:
Hi,
I don't know the answer (there's simply not enough information in your
email), but I'm willing to make a guess: are you running on a system
with two processing nodes?
If so, then try removing the Combiner. The combiner is a performance
optimization, and the whole job should work without it. Sometimes there
is a design fault in the processing, and the combiner then disrupts it.
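In the driver that can be as simple as guarding the setCombinerClass
call, for example behind a configuration flag so you can compare runs
with and without it. Just a sketch based on your snippet; the
"myjob.use.combiner" flag name is made up, not a standard property.

// A combiner may run zero, one, or many times on the map output, so a
// correct job must produce identical results with or without it.
if (job.getConfiguration().getBoolean("myjob.use.combiner", false)) {
    job.setCombinerClass(LongSumReducer.class);
}
job.setReducerClass(reducer);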
HTH
Niels Basjes
2010/11/5 Adam Phelps <a...@opendns.com>
I've noticed an odd behavior with a map-reduce job I've written that
reads data out of an HBase table. After a couple of days of poking at
this I haven't been able to figure out the cause of the problem, so I
figured I'd ask here.
(For reference, I'm running the cdh3b2 release.)
The problem is that every row from the HBase table seems to be passed to
the mappers twice, resulting in counts that are exactly double what they
should be.
I set up the job like this:
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes(scanFamily));  // limit the scan to the family being counted

// Wire the HBase table and Scan into the job as the map input.
TableMapReduceUtil.initTableMapperJob(table,
                                      scan,
                                      mapper,
                                      Text.class,          // map output key class
                                      LongWritable.class,  // map output value class
                                      job);

job.setCombinerClass(LongSumReducer.class);
job.setReducerClass(reducer);
I've set up counters in the mapper to verify what is happening, so I
know for certain that the mapper is being called twice with the same bit
of data. I've also confirmed (using the HBase shell) that each entry
appears only once in the table.
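(The same check can also be done outside the shell with a plain
client-side scan and count, roughly like the sketch below. The class
name and command-line arguments are placeholders, and it assumes the
plain HTable client API.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: count the rows in a table/family with a direct client-side scan,
// as a cross-check against the counts the MapReduce job produces.
public class ScanCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, args[0]);   // table name
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes(args[1]));     // same family the job scans
        ResultScanner scanner = table.getScanner(scan);
        long rows = 0;
        for (Result r : scanner) {
            rows++;
        }
        scanner.close();
        table.close();
        System.out.println("rows = " + rows);       // compare to the job's counts
    }
}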
Is there a known bug along these lines? If not, does anyone have any
thoughts on what might be causing this, or where I should start looking
to diagnose it?
Thanks
- Adam
--
Kind regards,
Niels Basjes