Hi David,

We use the "logical" schema approach successfully and have not seen issues
yet. Of course, it all depends on the use case, and saying it will work for
you just because it works for us would be naive. However, if it does work, it
will make your life much easier, because with a logical schema other problems
become simpler: you can be sure that one map function will process an entire
row rather than a row going to multiple mappers, and if you use filters that
restrict queries to only a small subset of the data, you may not even need
setBatch for those use cases. I did run into cases where I did not use
setBatch and my mappers ran out of memory, but that was a simpler problem to
solve. (By the way, if you are on CDH4, the HBase Export utility also does
not use setBatch, and your mapper will run out of memory if you have a large
row. It's easy to put that line in as a config param, though, and the feature
is available in later releases on HBase trunk.)
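
For illustration, here is a minimal sketch of wiring setBatch into a
TableMapReduceUtil job, with the batch size read from a config param. The
table name, the config key "my.scan.batch", and the identity mapper are
assumptions for the example, not something from this thread:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class BatchedScanJob {

    // Identity mapper: passes each (possibly partial) row straight through.
    static class IdentityMapper
            extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value,
                Context context) throws IOException, InterruptedException {
            context.write(row, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "batched-scan");
        job.setJarByClass(BatchedScanJob.class);

        Scan scan = new Scan();
        // Cap cells per Result so one very wide row cannot exhaust the
        // mapper's heap; "my.scan.batch" is a hypothetical config key.
        scan.setBatch(conf.getInt("my.scan.batch", 1000));
        scan.setCaching(100);        // Results fetched per RPC
        scan.setCacheBlocks(false);  // usually off for full-table MR scans

        TableMapReduceUtil.initTableMapperJob(
                "mytable",                     // assumed table name
                scan,
                IdentityMapper.class,
                ImmutableBytesWritable.class,
                Result.class,
                job);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}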

Regards,
Dhaval
 

________________________________
 From: David Koch <[email protected]>
To: [email protected] 
Sent: Sunday, 6 January 2013 12:53 PM
Subject: Re: Controlling TableMapReduceUtil table split points
  
Hi Dhaval,

Good call on setBatch. I had forgotten about it. Just like changing the
schema, it would involve changing map(...) to reflect the fact that only
part of the user's data is returned in each call, but I would not have to
manipulate table splits.
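
To make that concrete, here is a rough sketch of what such a map(...) could
look like; it assumes setBatch is in effect, so one wide row arrives as
several consecutive map() calls sharing the same row key. The cell-counting
aggregation and all names are made up for the example:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class PartialRowMapper extends TableMapper<Text, LongWritable> {

    private byte[] currentRow = null; // row key being accumulated
    private long cellCount = 0;       // running aggregate for that row

    @Override
    protected void map(ImmutableBytesWritable key, Result value,
            Context context) throws IOException, InterruptedException {
        byte[] row = key.copyBytes();
        // A new row key means the previous row is complete: flush it.
        if (currentRow != null && !Arrays.equals(currentRow, row)) {
            flush(context);
        }
        currentRow = row;
        cellCount += value.size(); // each call delivers one batch of cells
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        if (currentRow != null) {
            flush(context); // emit the last row as well
        }
    }

    private void flush(Context context)
            throws IOException, InterruptedException {
        context.write(new Text(currentRow), new LongWritable(cellCount));
        cellCount = 0;
    }
}

This relies on all the batched Results for one row going to the same mapper
and arriving back to back, which is the guarantee described in the quoted
message below.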

The HBase book does suggest that it's bad practice to use the "logical"
schema of lumping all user data into a single row (*), but I'll do some
testing to see what works.
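
For reference, the two layouts differ only in where the per-event identity
lives; a rough sketch of the contrast, where the column family "d" and the
key shapes are illustrative assumptions rather than anything from the book:

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SchemaShapes {

    // Flat-wide ("logical") schema: one row per user, one column per event.
    static Put flatWide(String userId, long eventTs, byte[] payload) {
        Put put = new Put(Bytes.toBytes(userId));
        put.add(Bytes.toBytes("d"), Bytes.toBytes(eventTs), payload);
        return put;
    }

    // Tall-narrow schema: one row per event, identity in the row key.
    static Put tallNarrow(String userId, long eventTs, byte[] payload) {
        Put put = new Put(Bytes.add(Bytes.toBytes(userId),
                Bytes.toBytes(eventTs)));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), payload);
        return put;
    }
}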

Thank you,

/David

(*) Chapter 9, section "Tall-Narrow Versus Flat-Wide Tables", 3rd ed., page
359


On Sun, Jan 6, 2013 at 6:29 PM, Dhaval Shah <[email protected]> wrote:

> Another option to avoid the timeout/OOME issues is to use scan.setBatch()
> so that the scanner functions normally for small rows but breaks up large
> rows into multiple Result objects, which you can then use in conjunction
> with scan.setCaching() to control how much data you get back.
>
> This approach would not need a change in your schema design and would
> ensure that only one mapper processes the entire row (but in multiple
> calls to the map function).
>
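
A minimal client-side sketch of the behaviour described in the quoted
message: with setBatch(), a wide row comes back as several Result objects,
each holding at most that many cells. The table name "mytable" and the batch
and caching sizes are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedScanDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");
        try {
            Scan scan = new Scan();
            scan.setBatch(500);  // at most 500 cells per Result
            scan.setCaching(20); // Results fetched per RPC
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result partial : scanner) {
                    // A single wide row may appear here several times,
                    // always under the same row key.
                    System.out.println(Bytes.toStringBinary(partial.getRow())
                            + " -> " + partial.size() + " cells");
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}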
