HBase Client.

2013-03-20 Thread Pradeep Kumar Mantha
Hi,

I would like to benchmark HBase using some of our distributed
applications using custom developed benchmarking scripts/programs.
I found the following clients are available. Could you please let me
know which of them provides the best performance.

1. Java direct interface to HBase
2. HBase Shell
3. via REST
4. HappyBase
5. Kundera

Please let me know if there is any other client which provides better
performance.

thanks
pradeep


Re: HBase Client.

2013-03-20 Thread Viral Bajaria
Most of the clients listed below are language-specific, so if your
benchmarking scripts are written in Java, you are better off running the
Java client.
HBase Shell is more for interactive use; I'm not sure how you would
benchmark with that.
REST is something that you could use, but I can't comment on its
performance since I have not used it.
HappyBase is for Python.
Kundera: I can't comment, since I have not used it.

You can look at AsyncHBase, if you don't mind wrapping your head around it.
But it's a bigger rewrite since the API is not compatible with the existing
client.
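
If you go the Java route, a minimal timing loop is easy to put together; a
sketch against the 0.94-era API (table/family/qualifier names are made up,
tune the counts to your workload):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutBench {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "bench");  // hypothetical table
    table.setAutoFlush(false);                 // buffer puts client-side
    long start = System.nanoTime();
    for (int i = 0; i < 100000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v-" + i));
      table.put(put);
    }
    table.flushCommits();                      // push the remaining buffer
    long ms = (System.nanoTime() - start) / 1000000;
    System.out.println("100000 puts in " + ms + " ms");
    table.close();
  }
}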

On Tue, Mar 19, 2013 at 11:25 PM, Pradeep Kumar Mantha pradeep...@gmail.com
 wrote:

 Hi,

 I would like to benchmark HBase using some of our distributed
 applications using custom developed benchmarking scripts/programs.
  I found the following clients are available. Could you please let me
 know which of the following provides best performance.

 1. Java direct interface to  HBASE.
 2. HBase Shell
 3. via Rest
 4. HappyBase
 5. Kundera

 Please let me know if there is any other client which provides better
 performance.

 thanks
 pradeep



Re: Truncate hbase table based on column family

2013-03-20 Thread Ted Yu
Can you clarify your question?

Did you mean that you only want to drop certain column families?
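
If so, the usual approach is to drop the family and re-add it empty; a
sketch with the 0.94 admin API (table/family names hypothetical, given a
Configuration conf):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

HBaseAdmin admin = new HBaseAdmin(conf);
admin.disableTable("t1");
admin.deleteColumn("t1", "cf2");                     // removes cf2 and all its data
admin.addColumn("t1", new HColumnDescriptor("cf2")); // re-create it empty
admin.enableTable("t1");
admin.close();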

Thanks

On Wed, Mar 20, 2013 at 7:15 AM, varaprasad.bh...@polarisft.com wrote:

 Hi All,

 Can we truncate a table in HBase based on a column family?
 Please give your comments.


 Thanks & Regards,
 Varaprasada Reddy





Re: Welcome our newest Committer Anoop

2013-03-20 Thread Jimmy Xiang
Congratulations!

On Wed, Mar 20, 2013 at 6:11 AM, Jonathan Hsieh j...@cloudera.com wrote:

 welcome welcome!

 On Wed, Mar 13, 2013 at 10:23 AM, Sergey Shelukhin
 ser...@hortonworks.comwrote:

  Congrats!
 
  On Tue, Mar 12, 2013 at 10:38 PM, xkwang bruce bruce.xkwa...@gmail.com
  wrote:
 
   Congratulations, Anoop!
  
  
   2013/3/13 Devaraj Das d...@hortonworks.com
  
Hey Anoop, Congratulations!
Devaraj.
   
   
On Mon, Mar 11, 2013 at 10:50 AM, Enis Söztutar enis@gmail.com
wrote:
   
 Congrats and welcome.


 On Mon, Mar 11, 2013 at 2:21 AM, Nicolas Liochon 
 nkey...@gmail.com
 wrote:

  Congrats, Anoop!
 
 
  On Mon, Mar 11, 2013 at 5:35 AM, rajeshbabu chintaguntla 
  rajeshbabu.chintagun...@huawei.com wrote:
 
    Congratulations Anoop!
  
   
   From: Anoop Sam John [anoo...@huawei.com]
   Sent: Monday, March 11, 2013 9:00 AM
   To: user@hbase.apache.org
   Subject: RE: Welcome our newest Committer Anoop
  
   Thanks to all.. Hope to work more and more for HBase!
  
   -Anoop-
  
   
   From: Andrew Purtell [apurt...@apache.org]
   Sent: Monday, March 11, 2013 7:33 AM
   To: user@hbase.apache.org
   Subject: Re: Welcome our newest Committer Anoop
  
   Congratulations Anoop. Welcome!
  
  
   On Mon, Mar 11, 2013 at 12:42 AM, ramkrishna vasudevan 
   ramkrishna.s.vasude...@gmail.com wrote:
  
Hi All
   
Pls welcome Anoop, our newest committer.  Anoop's work in
 HBase
   has
  been
great and he has helped a lot of users on the mailing list.
   
He has contributed features related to Endpoints and CPs.
   
Welcome Anoop and best wishes for your future work.
   
Hope to see your continuing efforts to the community.
   
Regards
Ram
   
  
  
  
   --
   Best regards,
  
  - Andy
  
   Problems worthy of attack prove their worth by hitting back. -
  Piet
 Hein
   (via Tom White)
  
 

   
  
 



 --
 // Jonathan Hsieh (shay)
 // Software Engineer, Cloudera
 // j...@cloudera.com



Re: How to catch java.net.ConnectException and when

2013-03-20 Thread Jean-Marc Spaggiari
Hi Gaurhari,

Can you please tell us a bit more about what you want to achieve? When
do you want to catch this exception? On which operation?
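
In the meantime, a minimal sketch of catching it around a client call (0.94
API; the client usually wraps the raw ConnectException once its retries are
exhausted, so check the cause chain as well):

import java.io.IOException;
import java.net.ConnectException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

try {
  HTable table = new HTable(conf, "t1");      // hypothetical table
  table.get(new Get(Bytes.toBytes("row1")));
} catch (ConnectException e) {
  // raw socket-level failure surfaced directly
  System.err.println("cannot connect: " + e.getMessage());
} catch (IOException e) {
  if (e.getCause() instanceof ConnectException) {
    // connection failure wrapped by the client's retry machinery
    System.err.println("cannot connect after retries: " + e.getCause());
  } else {
    throw e;
  }
}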

JM

2013/3/20 gaurhari dass gaurharid...@gmail.com:
 I want to catch connect exception in hbase


Re: HBase Client.

2013-03-20 Thread James Taylor
Another one to add to your list:
6. Phoenix (https://github.com/forcedotcom/phoenix)

Thanks,
James

On Mar 20, 2013, at 2:50 AM, Vivek Mishra vivek.mis...@impetus.co.in wrote:

 I have used Kundera; the persistence overhead on top of the HBase API is
 minimal considering the feature set available within Kundera.
 
 -Vivek
 
 From: Viral Bajaria [viral.baja...@gmail.com]
 Sent: 20 March 2013 12:30
 To: user@hbase.apache.org
 Subject: Re: HBase Client.
 
 Most of the clients listed below are language specific, so if your
 benchmarking scripts are written in JAVA, you are better off running the
 java client.
 HBase Shell is more for running something interactive, not sure how you
 plan to benchmark that.
 REST is something that you could use, but I can't comment on its
 performance since I have not used it.
 HappyBase is for python.
 Kundera, can't comment since I have not used it.
 
 You can look at AsyncHBase, if you don't mind wrapping your head around it.
 But it's a bigger rewrite since the API is not compatible with existing
 client.
 
 On Tue, Mar 19, 2013 at 11:25 PM, Pradeep Kumar Mantha pradeep...@gmail.com
 wrote:
 
 Hi,
 
I would like to benchmark HBase using some of our distributed
 applications using custom developed benchmarking scripts/programs.
 I found the following clients are available. Could you please let me
 know which of the following provides best performance.
 
 1. Java direct interface to  HBASE.
 2. HBase Shell
 3. via Rest
 4. HappyBase
 5. Kundera
 
 Please let me know if there is any other client which provides better
 performance.
 
 thanks
 pradeep
 
 
 
 
 
 
 
 
 


Does HBase RegionServer benefit from OS Page Cache

2013-03-20 Thread Pankaj Gupta
Given that HBase has its own cache (block cache and bloom filters) and that
all the table data is stored in HDFS, I'm wondering if HBase benefits from the
OS page cache at all. In the setup I'm using, HBase region servers run on the
same boxes as the HDFS data nodes. In such a scenario, if the underlying HLog
files live on the same machine, then having a healthy memory surplus may mean
that the data node can serve the underlying file from the page cache, thus
improving HBase performance. Is this really the case? (I guess the page cache
should also help in the case where the HLog file lives on a different machine,
but in that case network I/O will probably drown out the speedup gained by not
hitting the disk.)

I'm asking because, if the page cache were useful, then not utilizing all the
memory on the machine for the region server may not be that bad. The reason one
would not want to use all the memory for the region server is the long
garbage-collection pauses that a large heap may induce. I understand that work
has been done to fix the long pauses caused by memory fragmentation in the old
generation of the mostly-concurrent garbage collector, by using a slab
allocator for the memstore, but that feature is marked experimental and we're
not ready to take risks yet. So if the page cache were useful in any way on
region servers, we could go with less memory for the RegionServer process,
with the understanding that the free memory on the machine is not completely
going to waste. Hence my curiosity about the utility of the OS page cache to
the performance of HBase.

Thanks in Advance,
Pankaj

Re: HBase Client.

2013-03-20 Thread Ian Varley
Pradeep -

One more to add to your list of clients is Phoenix:

https://github.com/forcedotcom/phoenix

It's a SQL skin, built on top of the standard Java client with various 
optimizations; it exposes HBase via a standard JDBC interface, and thus might 
let you easily plug into other tools for testing performance.
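
A sketch of the JDBC side (driver class and URL as best I recall from the
forcedotcom releases; 'bench_table' is a made-up name):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver");
Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost"); // ZK quorum after the colon
ResultSet rs = conn.createStatement().executeQuery("SELECT count(*) FROM bench_table");
while (rs.next()) {
  System.out.println(rs.getLong(1));
}
conn.close();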

Ian

On Mar 20, 2013, at 4:49 AM, Vivek Mishra wrote:

I have used Kundera; the persistence overhead on top of the HBase API is
minimal considering the feature set available within Kundera.

-Vivek

From: Viral Bajaria [viral.baja...@gmail.com]
Sent: 20 March 2013 12:30
To: user@hbase.apache.org
Subject: Re: HBase Client.

Most of the clients listed below are language specific, so if your
benchmarking scripts are written in JAVA, you are better off running the
java client.
HBase Shell is more for running something interactive, not sure how you
plan to benchmark that.
REST is something that you could use, but I can't comment on its
performance since I have not used it.
HappyBase is for python.
Kundera, can't comment since I have not used it.

You can look at AsyncHBase, if you don't mind wrapping your head around it.
But it's a bigger rewrite since the API is not compatible with existing
client.

On Tue, Mar 19, 2013 at 11:25 PM, Pradeep Kumar Mantha pradeep...@gmail.com
wrote:

Hi,

   I would like to benchmark HBase using some of our distributed
applications using custom developed benchmarking scripts/programs.
I found the following clients are available. Could you please let me
know which of the following provides best performance.

1. Java direct interface to  HBASE.
2. HBase Shell
3. via Rest
4. HappyBase
5. Kundera

Please let me know if there is any other client which provides better
performance.

thanks
pradeep












Re: Scanner timeout -- any reason not to raise?

2013-03-20 Thread Dan Crosta
I'm confused -- I only see one setting in CDH manager; what is the name of the
other setting?

Our load is moderately frequent small writes (in batches of 1000 cells at a 
time, typically split over a few hundred rows -- these complete very fast, we 
haven't seen any timeouts there), and infrequent batches of large reads 
(scans), which is where we do see timeouts. My guess is that the timeout is 
more due to our application taking some time -- apparently more than 60s -- to 
process the results of each scan's output, rather than due to slowness in HBase 
itself, which tends to be only moderately loaded (judging by CPU, network, and 
disk) while we do the reads.

Thanks,
- Dan

On Mar 17, 2013, at 2:20 PM, Ted Yu wrote:

 The lease timeout is used by row locking too.
 That's the reason behind splitting the setting into two config parameters.
 
 How is your load composition ? Do you mostly serve reads from HBase ?
 
 Cheers
 
 On Sun, Mar 17, 2013 at 1:56 PM, Dan Crosta d...@magnetic.com wrote:
 
 Ah, thanks Ted -- I was wondering what that setting was for.
 
 We are using CDH 4.2.0, which is HBase 0.94.2 (give or take a few
 backports from 0.94.3).
 
 Is there any harm in setting the lease timeout to something larger, like 5
 or 10 minutes?
 
 Thanks,
 - Dan
 
 On Mar 17, 2013, at 1:46 PM, Ted Yu wrote:
 
 Which HBase version are you using ?
 
 In 0.94 and prior, the config param is hbase.regionserver.lease.period
 
 In 0.95, it is different. See release notes of HBASE-6170
 
 On Sun, Mar 17, 2013 at 11:46 AM, Dan Crosta d...@magnetic.com wrote:
 
We occasionally get scanner timeout errors such as "66698ms passed since
the last invocation, timeout is currently set to 60000" when iterating a
scanner through the Thrift API. Is there any reason not to raise the
 timeout to something larger than the default 60s? Put another way, what
 resources (and how much of them) does a scanner take up on a thrift
 server
 or region server?
 
 Also, to confirm -- I believe hbase.rpc.timeout is the setting in
 question here, but someone please correct me if I'm wrong.
 
 Thanks,
 - Dan
 
 
 
 
 



Re: Scanner timeout -- any reason not to raise?

2013-03-20 Thread Ted Yu
In 0.94, there is only one setting.
See release notes of HBASE-6170 which is in 0.95

Looks like this should help (in 0.95):

https://issues.apache.org/jira/browse/HBASE-2214
Do HBASE-1996 -- setting size to return in scan rather than count of rows
-- properly

From your description, you should be able to raise the timeout since the
writes are relatively fast.
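
For 0.94 that would be a single property in the region servers'
hbase-site.xml (value in milliseconds; region servers need a restart to pick
it up) -- e.g. for the 5 minutes you asked about earlier:

<property>
  <name>hbase.regionserver.lease.period</name>
  <value>300000</value>
</property>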

Cheers

On Wed, Mar 20, 2013 at 9:32 AM, Dan Crosta d...@magnetic.com wrote:

 I'm confused -- I only see one setting in CDH manager, what is the name of
 the other setting?

 Our load is moderately frequent small writes (in batches of 1000 cells at
 a time, typically split over a few hundred rows -- these complete very
 fast, we haven't seen any timeouts there), and infrequent batches of large
 reads (scans), which is where we do see timeouts. My guess is that the
 timeout is more due to our application taking some time -- apparently more
 than 60s -- to process the results of each scan's output, rather than due
 to slowness in HBase itself, which tends to be only moderately loaded
 (judging by CPU, network, and disk) while we do the reads.

 Thanks,
 - Dan

 On Mar 17, 2013, at 2:20 PM, Ted Yu wrote:

  The lease timeout is used by row locking too.
  That's the reason behind splitting the setting into two config
 parameters.
 
  How is your load composition ? Do you mostly serve reads from HBase ?
 
  Cheers
 
  On Sun, Mar 17, 2013 at 1:56 PM, Dan Crosta d...@magnetic.com wrote:
 
  Ah, thanks Ted -- I was wondering what that setting was for.
 
  We are using CDH 4.2.0, which is HBase 0.94.2 (give or take a few
  backports from 0.94.3).
 
  Is there any harm in setting the lease timeout to something larger,
 like 5
  or 10 minutes?
 
  Thanks,
  - Dan
 
  On Mar 17, 2013, at 1:46 PM, Ted Yu wrote:
 
  Which HBase version are you using ?
 
  In 0.94 and prior, the config param is hbase.regionserver.lease.period
 
  In 0.95, it is different. See release notes of HBASE-6170
 
  On Sun, Mar 17, 2013 at 11:46 AM, Dan Crosta d...@magnetic.com wrote:
 
  We occasionally get scanner timeout errors such as 66698ms passed
 since
  the last invocation, timeout is currently set to 60000 when
 iterating a
  scanner through the Thrift API. Is there any reason not to raise the
  timeout to something larger than the default 60s? Put another way,
 what
  resources (and how much of them) does a scanner take up on a thrift
  server
  or region server?
 
  Also, to confirm -- I believe hbase.rpc.timeout is the setting in
  question here, but someone please correct me if I'm wrong.
 
  Thanks,
  - Dan
 
 
 
 
 




Re: Scanner timeout -- any reason not to raise?

2013-03-20 Thread Bryan Beaudreault
Typically it is better to use caching and batch size to limit the number of
rows returned and thus the amount of processing required between calls to
next() during a scan, but it would be nice if HBase provided a way to
manually refresh a lease similar to Hadoop's context.progress().  In a
cluster that is used for many different applications, upping the global
lease timeout is a heavy-handed solution.  Even being able to override the
timeout on a per-scan basis would be nice.
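
For reference, those two knobs on the 0.94 client (the lease is renewed on
every next() RPC, so a smaller caching value also means more frequent
renewals):

Scan scan = new Scan();
scan.setCaching(100); // rows fetched per next() RPC
scan.setBatch(10);    // max columns per Result; splits very wide rows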

Thoughts on that, Ted?


On Wed, Mar 20, 2013 at 1:00 PM, Ted Yu yuzhih...@gmail.com wrote:

 In 0.94, there is only one setting.
 See release notes of HBASE-6170 which is in 0.95

 Looks like this should help (in 0.95):

 https://issues.apache.org/jira/browse/HBASE-2214
 Do HBASE-1996 -- setting size to return in scan rather than count of rows
 -- properly

 From your description, you should be able to raise the timeout since the
 writes are relatively fast.

 Cheers

 On Wed, Mar 20, 2013 at 9:32 AM, Dan Crosta d...@magnetic.com wrote:

  I'm confused -- I only see one setting in CDH manager, what is the name
 of
  the other setting?
 
  Our load is moderately frequent small writes (in batches of 1000 cells at
  a time, typically split over a few hundred rows -- these complete very
  fast, we haven't seen any timeouts there), and infrequent batches of
 large
  reads (scans), which is where we do see timeouts. My guess is that the
  timeout is more due to our application taking some time -- apparently
 more
  than 60s -- to process the results of each scan's output, rather than due
  to slowness in HBase itself, which tends to be only moderately loaded
  (judging by CPU, network, and disk) while we do the reads.
 
  Thanks,
  - Dan
 
  On Mar 17, 2013, at 2:20 PM, Ted Yu wrote:
 
   The lease timeout is used by row locking too.
   That's the reason behind splitting the setting into two config
  parameters.
  
   How is your load composition ? Do you mostly serve reads from HBase ?
  
   Cheers
  
   On Sun, Mar 17, 2013 at 1:56 PM, Dan Crosta d...@magnetic.com wrote:
  
   Ah, thanks Ted -- I was wondering what that setting was for.
  
   We are using CDH 4.2.0, which is HBase 0.94.2 (give or take a few
   backports from 0.94.3).
  
   Is there any harm in setting the lease timeout to something larger,
  like 5
   or 10 minutes?
  
   Thanks,
   - Dan
  
   On Mar 17, 2013, at 1:46 PM, Ted Yu wrote:
  
   Which HBase version are you using ?
  
   In 0.94 and prior, the config param is
 hbase.regionserver.lease.period
  
   In 0.95, it is different. See release notes of HBASE-6170
  
   On Sun, Mar 17, 2013 at 11:46 AM, Dan Crosta d...@magnetic.com
 wrote:
  
   We occasionally get scanner timeout errors such as 66698ms passed
  since
   the last invocation, timeout is currently set to 60000 when
  iterating a
   scanner through the Thrift API. Is there any reason not to raise the
   timeout to something larger than the default 60s? Put another way,
  what
   resources (and how much of them) does a scanner take up on a thrift
   server
   or region server?
  
   Also, to confirm -- I believe hbase.rpc.timeout is the setting in
   question here, but someone please correct me if I'm wrong.
  
   Thanks,
   - Dan
  
  
  
  
  
 
 



Re: Scanner timeout -- any reason not to raise?

2013-03-20 Thread Ted Yu
bq.  if HBase provided a way to manually refresh a lease similar to
Hadoop's context.progress()

Can you outline how the above would work for a long scan?

bq. Even being able to override the timeout on a per-scan basis would be
nice.

Agreed.

On Wed, Mar 20, 2013 at 10:05 AM, Bryan Beaudreault 
bbeaudrea...@hubspot.com wrote:

 Typically it is better to use caching and batch size to limit the number of
 rows returned and thus the amount of processing required between calls to
 next() during a scan, but it would be nice if HBase provided a way to
 manually refresh a lease similar to Hadoop's context.progress().  In a
 cluster that is used for many different applications, upping the global
 lease timeout is a heavy handed solution.  Even being able to override the
 timeout on a per-scan basis would be nice.

 Thoughts on that, Ted?


 On Wed, Mar 20, 2013 at 1:00 PM, Ted Yu yuzhih...@gmail.com wrote:

  In 0.94, there is only one setting.
  See release notes of HBASE-6170 which is in 0.95
 
  Looks like this should help (in 0.95):
 
  https://issues.apache.org/jira/browse/HBASE-2214
  Do HBASE-1996 -- setting size to return in scan rather than count of rows
  -- properly
 
  From your description, you should be able to raise the timeout since the
  writes are relatively fast.
 
  Cheers
 
  On Wed, Mar 20, 2013 at 9:32 AM, Dan Crosta d...@magnetic.com wrote:
 
   I'm confused -- I only see one setting in CDH manager, what is the name
  of
   the other setting?
  
   Our load is moderately frequent small writes (in batches of 1000 cells
 at
   a time, typically split over a few hundred rows -- these complete very
   fast, we haven't seen any timeouts there), and infrequent batches of
  large
   reads (scans), which is where we do see timeouts. My guess is that the
   timeout is more due to our application taking some time -- apparently
  more
   than 60s -- to process the results of each scan's output, rather than
 due
   to slowness in HBase itself, which tends to be only moderately loaded
   (judging by CPU, network, and disk) while we do the reads.
  
   Thanks,
   - Dan
  
   On Mar 17, 2013, at 2:20 PM, Ted Yu wrote:
  
The lease timeout is used by row locking too.
That's the reason behind splitting the setting into two config
   parameters.
   
How is your load composition ? Do you mostly serve reads from HBase ?
   
Cheers
   
On Sun, Mar 17, 2013 at 1:56 PM, Dan Crosta d...@magnetic.com
 wrote:
   
Ah, thanks Ted -- I was wondering what that setting was for.
   
We are using CDH 4.2.0, which is HBase 0.94.2 (give or take a few
backports from 0.94.3).
   
Is there any harm in setting the lease timeout to something larger,
   like 5
or 10 minutes?
   
Thanks,
- Dan
   
On Mar 17, 2013, at 1:46 PM, Ted Yu wrote:
   
Which HBase version are you using ?
   
In 0.94 and prior, the config param is
  hbase.regionserver.lease.period
   
In 0.95, it is different. See release notes of HBASE-6170
   
On Sun, Mar 17, 2013 at 11:46 AM, Dan Crosta d...@magnetic.com
  wrote:
   
We occasionally get scanner timeout errors such as 66698ms passed
   since
the last invocation, timeout is currently set to 60000 when
   iterating a
scanner through the Thrift API. Is there any reason not to raise
 the
timeout to something larger than the default 60s? Put another way,
   what
resources (and how much of them) does a scanner take up on a
 thrift
server
or region server?
   
Also, to confirm -- I believe hbase.rpc.timeout is the setting
 in
question here, but someone please correct me if I'm wrong.
   
Thanks,
- Dan
   
   
   
   
   
  
  
 



Re: Does HBase RegionServer benefit from OS Page Cache

2013-03-20 Thread Jean-Daniel Cryans
First, MSLAB has been enabled by default since 0.92.0 as it was deemed
stable enough. So, unless you are on 0.90, you are already using it.
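
For reference, the switch lives in hbase-site.xml, should you ever want to
flip it explicitly:

<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>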

Also, I'm not sure why you are referencing the HLog in your first
paragraph in the context of reading from disk, because the HLogs are
rarely read (only on recovery). Maybe you meant HFile?

In any case, your email covers most arguments except for one:
checksumming. Retrieving a block from HDFS, even when using short
circuit reads to go directly to the OS instead of passing through the
DN, will take quite a bit more time than reading directly from the
block cache. This is why even if you disable block caching on a family
that the index and root blocks will still be block cached, as reading
those very hot blocks from disk would take way too long.
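
As an illustration, disabling the data-block cache is a per-family flag
(0.94 API, names hypothetical) -- the index and bloom blocks stay cached
regardless:

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;

HTableDescriptor desc = new HTableDescriptor("bulk_table");
HColumnDescriptor family = new HColumnDescriptor("raw");
family.setBlockCacheEnabled(false); // data blocks bypass the block cache
desc.addFamily(family);
admin.createTable(desc);            // admin: an HBaseAdmin instance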

Regarding your main question (how does the OS buffer help?), I don't
have a good answer. It kind of depends on the amount of RAM you have
and what your workload is like. As a data point, I've been successful
running with 24GB of heap (50% dedicated to the block cache) with a
workload consisting mainly of small writes, short scans, and a typical
random read distribution for a website. I can't remember the last time
I saw a full GC and it's been running for more than a year like this.

Hope this somehow helps,

J-D

On Wed, Mar 20, 2013 at 12:34 AM, Pankaj Gupta pankaj.ro...@gmail.com wrote:
 Given that HBase has it's own cache (block cache and bloom filters) and that 
 all the table data is stored in HDFS, I'm wondering if HBase benefits from OS 
 page cache at all. In the set up I'm using HBase Region Servers run on the 
 same boxes as the HDFS data node. In such a scenario if the underlying HLog 
 files lives on the same machine then having a healthy memory surplus may mean 
 that the data node can serve underlying file from page cache and thus 
 improving HBase performance. Is this really the case? (I guess page cache 
 should also help in case where HLog file lives on a different machine but in 
 that case network I/O will probably drown the speedup achieved due to not 
 hitting the disk.

 I'm asking because if page cache were useful then in an HBase set up not 
 utilizing all the memory on the machine for the region server may not be that 
 bad. The reason one would not want to use all the memory for region server 
 would be long garbage collection pauses that large heap size may induce. I 
 understand that work has been done to fix the long pauses caused due to 
 memory fragmentation in the old generation, mostly concurrent garbage 
 collector by using slab cache allocator for memstore but that feature is 
 marked experimental and we're not ready to take risks yet. So if the page 
 cache was useful in any way on Region Servers we could go with less memory 
 for RegionServer process with the understanding that free memory on the 
 machine is not completely going to waste. Thus my curiosity about utility of 
 os page cache to performance of HBase.

 Thanks in Advance,
 Pankaj


Re: Scanner timeout -- any reason not to raise?

2013-03-20 Thread Bryan Beaudreault
I was thinking something like this:

Scan scan = new Scan(startRow, endRow);
scan.setCaching(someVal); // based on what we expect most rows to take for
                          // processing time

ResultScanner scanner = table.getScanner(scan);
for (Result r : scanner) {
  // usual processing, the time for which we accounted for in our caching
  // and global lease timeout settings
  if (someCondition) {
    // More time-intensive processing necessary on this record, which is
    // hard to account for in the caching
    scanner.progress();
  }
}


--

I'm not sure how we could expose this in the context of a Hadoop job, since
I don't believe we have access to the underlying scanner, but that would be
great also.


On Wed, Mar 20, 2013 at 1:11 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq.  if HBase provided a way to manually refresh a lease similar to
 Hadoop's context.progress()

 Can you outline how the above works for long scan ?

 bq. Even being able to override the timeout on a per-scan basis would be
 nice.

 Agreed.

 On Wed, Mar 20, 2013 at 10:05 AM, Bryan Beaudreault 
 bbeaudrea...@hubspot.com wrote:

  Typically it is better to use caching and batch size to limit the number
 of
  rows returned and thus the amount of processing required between calls to
  next() during a scan, but it would be nice if HBase provided a way to
  manually refresh a lease similar to Hadoop's context.progress().  In a
  cluster that is used for many different applications, upping the global
  lease timeout is a heavy handed solution.  Even being able to override
 the
  timeout on a per-scan basis would be nice.
 
  Thoughts on that, Ted?
 
 
  On Wed, Mar 20, 2013 at 1:00 PM, Ted Yu yuzhih...@gmail.com wrote:
 
   In 0.94, there is only one setting.
   See release notes of HBASE-6170 which is in 0.95
  
   Looks like this should help (in 0.95):
  
   https://issues.apache.org/jira/browse/HBASE-2214
   Do HBASE-1996 -- setting size to return in scan rather than count of
 rows
   -- properly
  
   From your description, you should be able to raise the timeout since
 the
   writes are relatively fast.
  
   Cheers
  
   On Wed, Mar 20, 2013 at 9:32 AM, Dan Crosta d...@magnetic.com wrote:
  
I'm confused -- I only see one setting in CDH manager, what is the
 name
   of
the other setting?
   
Our load is moderately frequent small writes (in batches of 1000
 cells
  at
a time, typically split over a few hundred rows -- these complete
 very
fast, we haven't seen any timeouts there), and infrequent batches of
   large
reads (scans), which is where we do see timeouts. My guess is that
 the
timeout is more due to our application taking some time -- apparently
   more
than 60s -- to process the results of each scan's output, rather than
  due
to slowness in HBase itself, which tends to be only moderately loaded
(judging by CPU, network, and disk) while we do the reads.
   
Thanks,
- Dan
   
On Mar 17, 2013, at 2:20 PM, Ted Yu wrote:
   
 The lease timeout is used by row locking too.
 That's the reason behind splitting the setting into two config
parameters.

 How is your load composition ? Do you mostly serve reads from
 HBase ?

 Cheers

 On Sun, Mar 17, 2013 at 1:56 PM, Dan Crosta d...@magnetic.com
  wrote:

 Ah, thanks Ted -- I was wondering what that setting was for.

 We are using CDH 4.2.0, which is HBase 0.94.2 (give or take a few
 backports from 0.94.3).

 Is there any harm in setting the lease timeout to something
 larger,
like 5
 or 10 minutes?

 Thanks,
 - Dan

 On Mar 17, 2013, at 1:46 PM, Ted Yu wrote:

 Which HBase version are you using ?

 In 0.94 and prior, the config param is
   hbase.regionserver.lease.period

 In 0.95, it is different. See release notes of HBASE-6170

 On Sun, Mar 17, 2013 at 11:46 AM, Dan Crosta d...@magnetic.com
   wrote:

 We occasionally get scanner timeout errors such as 66698ms
 passed
since
 the last invocation, timeout is currently set to 60000 when
iterating a
 scanner through the Thrift API. Is there any reason not to raise
  the
 timeout to something larger than the default 60s? Put another
 way,
what
 resources (and how much of them) does a scanner take up on a
  thrift
 server
 or region server?

 Also, to confirm -- I believe hbase.rpc.timeout is the setting
  in
 question here, but someone please correct me if I'm wrong.

 Thanks,
 - Dan





   
   
  
 



Re: Scanner timeout -- any reason not to raise?

2013-03-20 Thread Ted Yu
Bryan:
Interesting idea.

You can log a JIRA with the following two suggestions.

On Wed, Mar 20, 2013 at 10:39 AM, Bryan Beaudreault 
bbeaudrea...@hubspot.com wrote:

 I was thinking something like this:

 Scan scan = new Scan(startRow, endRow);

 scan.setCaching(someVal); // based on what we expect most rows to take for
 processing time

  ResultScanner scanner = table.getScanner(scan);

   for (Result r : scanner) {

   // usual processing, the time for which we accounted for in our caching
 and global lease timeout settings

   if (someCondition) {

 // More time-intensive processing necessary on this record, which is
 hard to account for in the caching

 scanner.progress();

   }

  }


 --

 I'm not sure how we could expose this in the context of a hadoop job, since
 I don't believe we have access to the underlying scanner, but that would be
 great also.


 On Wed, Mar 20, 2013 at 1:11 PM, Ted Yu yuzhih...@gmail.com wrote:

  bq.  if HBase provided a way to manually refresh a lease similar to
  Hadoop's context.progress()
 
  Can you outline how the above works for long scan ?
 
  bq. Even being able to override the timeout on a per-scan basis would be
  nice.
 
  Agreed.
 
  On Wed, Mar 20, 2013 at 10:05 AM, Bryan Beaudreault 
  bbeaudrea...@hubspot.com wrote:
 
   Typically it is better to use caching and batch size to limit the
 number
  of
   rows returned and thus the amount of processing required between calls
 to
   next() during a scan, but it would be nice if HBase provided a way to
   manually refresh a lease similar to Hadoop's context.progress().  In a
   cluster that is used for many different applications, upping the global
   lease timeout is a heavy handed solution.  Even being able to override
  the
   timeout on a per-scan basis would be nice.
  
   Thoughts on that, Ted?
  
  
   On Wed, Mar 20, 2013 at 1:00 PM, Ted Yu yuzhih...@gmail.com wrote:
  
In 0.94, there is only one setting.
See release notes of HBASE-6170 which is in 0.95
   
Looks like this should help (in 0.95):
   
https://issues.apache.org/jira/browse/HBASE-2214
Do HBASE-1996 -- setting size to return in scan rather than count of
  rows
-- properly
   
From your description, you should be able to raise the timeout since
  the
writes are relatively fast.
   
Cheers
   
On Wed, Mar 20, 2013 at 9:32 AM, Dan Crosta d...@magnetic.com
 wrote:
   
 I'm confused -- I only see one setting in CDH manager, what is the
  name
of
 the other setting?

 Our load is moderately frequent small writes (in batches of 1000
  cells
   at
 a time, typically split over a few hundred rows -- these complete
  very
 fast, we haven't seen any timeouts there), and infrequent batches
 of
large
 reads (scans), which is where we do see timeouts. My guess is that
  the
 timeout is more due to our application taking some time --
 apparently
more
 than 60s -- to process the results of each scan's output, rather
 than
   due
 to slowness in HBase itself, which tends to be only moderately
 loaded
 (judging by CPU, network, and disk) while we do the reads.

 Thanks,
 - Dan

 On Mar 17, 2013, at 2:20 PM, Ted Yu wrote:

  The lease timeout is used by row locking too.
  That's the reason behind splitting the setting into two config
 parameters.
 
  How is your load composition ? Do you mostly serve reads from
  HBase ?
 
  Cheers
 
  On Sun, Mar 17, 2013 at 1:56 PM, Dan Crosta d...@magnetic.com
   wrote:
 
  Ah, thanks Ted -- I was wondering what that setting was for.
 
  We are using CDH 4.2.0, which is HBase 0.94.2 (give or take a
 few
  backports from 0.94.3).
 
  Is there any harm in setting the lease timeout to something
  larger,
 like 5
  or 10 minutes?
 
  Thanks,
  - Dan
 
  On Mar 17, 2013, at 1:46 PM, Ted Yu wrote:
 
  Which HBase version are you using ?
 
  In 0.94 and prior, the config param is
hbase.regionserver.lease.period
 
  In 0.95, it is different. See release notes of HBASE-6170
 
  On Sun, Mar 17, 2013 at 11:46 AM, Dan Crosta d...@magnetic.com
 
wrote:
 
  We occasionally get scanner timeout errors such as 66698ms
  passed
 since
  the last invocation, timeout is currently set to 60000 when
 iterating a
  scanner through the Thrift API. Is there any reason not to
 raise
   the
  timeout to something larger than the default 60s? Put another
  way,
 what
  resources (and how much of them) does a scanner take up on a
   thrift
  server
  or region server?
 
  Also, to confirm -- I believe hbase.rpc.timeout is the
 setting
   in
  question here, but someone please correct me if I'm wrong.
 
  Thanks,
  - Dan
 
 
 
 
 


   
  
 



Re: Scanner timeout -- any reason not to raise?

2013-03-20 Thread Bryan Beaudreault
Thanks Ted, I've submitted https://issues.apache.org/jira/browse/HBASE-8157.



On Wed, Mar 20, 2013 at 1:56 PM, Ted Yu yuzhih...@gmail.com wrote:

 Bryan:
 Interesting idea.

 You can log a JIRA with the following two suggestions.

 On Wed, Mar 20, 2013 at 10:39 AM, Bryan Beaudreault 
 bbeaudrea...@hubspot.com wrote:

  I was thinking something like this:
 
  Scan scan = new Scan(startRow, endRow);
 
  scan.setCaching(someVal); // based on what we expect most rows to take
 for
  processing time
 
   ResultScanner scanner = table.getScanner(scan);
 
for (Result r : scanner) {
 
// usual processing, the time for which we accounted for in our caching
  and global lease timeout settings
 
if (someCondition) {
 
  // More time-intensive processing necessary on this record, which is
  hard to account for in the caching
 
  scanner.progress();
 
}
 
   }
 
 
  --
 
  I'm not sure how we could expose this in the context of a hadoop job,
 since
  I don't believe we have access to the underlying scanner, but that would
 be
  great also.
 
 
  On Wed, Mar 20, 2013 at 1:11 PM, Ted Yu yuzhih...@gmail.com wrote:
 
   bq.  if HBase provided a way to manually refresh a lease similar to
   Hadoop's context.progress()
  
   Can you outline how the above works for long scan ?
  
   bq. Even being able to override the timeout on a per-scan basis would
 be
   nice.
  
   Agreed.
  
   On Wed, Mar 20, 2013 at 10:05 AM, Bryan Beaudreault 
   bbeaudrea...@hubspot.com wrote:
  
Typically it is better to use caching and batch size to limit the
  number
   of
rows returned and thus the amount of processing required between
 calls
  to
next() during a scan, but it would be nice if HBase provided a way to
manually refresh a lease similar to Hadoop's context.progress().  In
 a
cluster that is used for many different applications, upping the
 global
lease timeout is a heavy handed solution.  Even being able to
 override
   the
timeout on a per-scan basis would be nice.
   
Thoughts on that, Ted?
   
   
On Wed, Mar 20, 2013 at 1:00 PM, Ted Yu yuzhih...@gmail.com wrote:
   
 In 0.94, there is only one setting.
 See release notes of HBASE-6170 which is in 0.95

 Looks like this should help (in 0.95):

 https://issues.apache.org/jira/browse/HBASE-2214
 Do HBASE-1996 -- setting size to return in scan rather than count
 of
   rows
 -- properly

 From your description, you should be able to raise the timeout
 since
   the
 writes are relatively fast.

 Cheers

 On Wed, Mar 20, 2013 at 9:32 AM, Dan Crosta d...@magnetic.com
  wrote:

  I'm confused -- I only see one setting in CDH manager, what is
 the
   name
 of
  the other setting?
 
  Our load is moderately frequent small writes (in batches of 1000
   cells
at
  a time, typically split over a few hundred rows -- these complete
   very
  fast, we haven't seen any timeouts there), and infrequent batches
  of
 large
  reads (scans), which is where we do see timeouts. My guess is
 that
   the
  timeout is more due to our application taking some time --
  apparently
 more
  than 60s -- to process the results of each scan's output, rather
  than
due
  to slowness in HBase itself, which tends to be only moderately
  loaded
  (judging by CPU, network, and disk) while we do the reads.
 
  Thanks,
  - Dan
 
  On Mar 17, 2013, at 2:20 PM, Ted Yu wrote:
 
   The lease timeout is used by row locking too.
   That's the reason behind splitting the setting into two config
  parameters.
  
   How is your load composition ? Do you mostly serve reads from
   HBase ?
  
   Cheers
  
   On Sun, Mar 17, 2013 at 1:56 PM, Dan Crosta d...@magnetic.com
wrote:
  
   Ah, thanks Ted -- I was wondering what that setting was for.
  
   We are using CDH 4.2.0, which is HBase 0.94.2 (give or take a
  few
   backports from 0.94.3).
  
   Is there any harm in setting the lease timeout to something
   larger,
  like 5
   or 10 minutes?
  
   Thanks,
   - Dan
  
   On Mar 17, 2013, at 1:46 PM, Ted Yu wrote:
  
   Which HBase version are you using ?
  
   In 0.94 and prior, the config param is
 hbase.regionserver.lease.period
  
   In 0.95, it is different. See release notes of HBASE-6170
  
   On Sun, Mar 17, 2013 at 11:46 AM, Dan Crosta 
 d...@magnetic.com
  
 wrote:
  
   We occasionally get scanner timeout errors such as 66698ms
   passed
  since
   the last invocation, timeout is currently set to 60000 when
  iterating a
   scanner through the Thrift API. Is there any reason not to
  raise
the
   timeout to something larger than the default 60s? Put
 another
   way,
  what
   resources (and how much of them) does a scanner take up on a

Evenly splitting the table

2013-03-20 Thread Cole
I was wondering how I can go about evenly splitting an entire table in 
HBase during table creation[1]. I tried providing the empty byte arrays
HConstants.EMPTY_START_ROW and HConstants.EMPTY_END_ROW 
as parameters to the method I linked below, and got an error: "Start
key must be smaller than end key". Is there a way to go about splitting
the entire table without having specific start and end keys? Thanks in
advance.


[1] 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[], byte[], int)



Re: Evenly splitting the table

2013-03-20 Thread Ted Yu
Take a look at TestAdmin#testCreateTableRPCTimeOut() where
hbaseadmin.createTable() is called.

bq. Is there a way to go about splitting the entire table without having
specific start and end keys?

I don't think so.
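
With the createTable() overload you linked, the boundary keys have to be
non-empty; HBase then interpolates the split points between them. A sketch,
assuming keys whose leading byte is roughly uniformly distributed (e.g.
hashed keys; names hypothetical):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("presplit");
desc.addFamily(new HColumnDescriptor("f"));
// 16 regions, boundaries interpolated between 0x00 and 0xFF:
admin.createTable(desc, new byte[] { 0x00 }, new byte[] { (byte) 0xFF }, 16);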

On Wed, Mar 20, 2013 at 3:32 PM, Cole cole.skov...@cerner.com wrote:

 I was wondering how I can go about evenly splitting an entire table in
 HBase during table creation[1]. I tried providing the empty byte arrays
 HConstants.EMPTY_START_ROW and HConstants.EMPTY_END_ROW
 as parameters to the method I linked below, and got an error: Start
 key must be smaller than end key. Is there a way to go about splitting
 the entire table without having specific start and end keys? Thanks in
 advance.


 [1]

 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html
 #createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[], byte[], int)




Fwd: Questions about versions and timestamp

2013-03-20 Thread Benyi Wang
Hi,

Please forgive me if my questions have already been asked and answered many
times; I could not google answers to any of them.

If I do the following commands in hbase shell,

hbase(main):048:0> create 'test_ts_ver', 'data'
0 row(s) in 1.0550 seconds

hbase(main):049:0> describe 'test_ts_ver'
DESCRIPTION                                          ENABLED
 {NAME => 'test_ts_ver', FAMILIES => [{NAME => 'data true
 ', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
  VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIO
 NS => '0', TTL => '2147483647', BLOCKSIZE => '65536
 ', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0940 seconds

hbase(main):052:0> put 'test_ts_ver', 'row_1', 'data:name', 'benyi_w', 100
0 row(s) in 0.0040 seconds

hbase(main):053:0> put 'test_ts_ver', 'row_1', 'data:name', 'benyi_1', 110
0 row(s) in 0.0050 seconds

hbase(main):054:0> put 'test_ts_ver', 'row_1', 'data:name', 'benyi_2', 120
0 row(s) in 0.0040 seconds

hbase(main):055:0> put 'test_ts_ver', 'row_1', 'data:name', 'benyi_3', 130
0 row(s) in 0.0040 seconds

hbase(main):056:0> put 'test_ts_ver', 'row_1', 'data:name', 'benyi_4', 140
0 row(s) in 0.0040 seconds

hbase(main):057:0> get 'test_ts_ver', 'row_1', { TIMERANGE => [0, 200] }
COLUMN                    CELL
 data:name                timestamp=140, value=benyi_4
1 row(s) in 0.0140 seconds

hbase(main):058:0> get 'test_ts_ver', 'row_1', { TIMERANGE => [0, 200], VERSIONS => 5 }
COLUMN                    CELL
 data:name                timestamp=140, value=benyi_4
 data:name                timestamp=130, value=benyi_3
 data:name                timestamp=120, value=benyi_2
3 row(s) in 0.0050 seconds

So far so good. But if I try to get timestamp 100 or 110, I can still get
them:

hbase(main):059:0> get 'test_ts_ver', 'row_1', { TIMESTAMP => 100 }
COLUMN                    CELL
 data:name                timestamp=100, value=benyi_w
1 row(s) in 0.0120 seconds

hbase(main):060:0> get 'test_ts_ver', 'row_1', { TIMESTAMP => 110 }
COLUMN                    CELL
 data:name                timestamp=110, value=benyi_1
1 row(s) in 0.0060 seconds

My questions:

1. When will all those old versions be removed?
2. Will compact or major_compact remove those old versions?
3. Is there a section/chapter talking about this behavior in the HBase
Reference Guide?

Thanks.

Ben


Re: Questions about versions and timestamp

2013-03-20 Thread Ted Yu
A few pointers so that you can find the answer yourself:

http://hbase.apache.org/book.html
Take a look at 2.5.2.8. Managed Compactions and
http://hbase.apache.org/book.html#compaction

You can also use search-hadoop.com

e.g. 'Possible to delete a specific cell?'
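
The short answer to 1 and 2: versions beyond VERSIONS (3 here) are
physically discarded when the store files are rewritten during compaction;
until then, a get with an explicit TIMESTAMP can still see them, as you
observed. You can verify from the shell (major_compact is asynchronous, so
give it a moment to finish):

hbase> flush 'test_ts_ver'          # move the row out of the memstore first
hbase> major_compact 'test_ts_ver'
hbase> get 'test_ts_ver', 'row_1', { TIMESTAMP => 100 }  # should come back empty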

Cheers

On Wed, Mar 20, 2013 at 3:55 PM, Benyi Wang bewang.t...@gmail.com wrote:

 Hi,

 Please forgive me if my questions have been already asked and answered many
 times because I could not googled any of them.

 If I do the following commands in hbase shell,

 hbase(main):048:0 create test_ts_ver, data
 0 row(s) in 1.0550 seconds

 hbase(main):049:0 describe test_ts_ver
 DESCRIPTION  ENABLED

  {NAME = 'test_ts_ver', FAMILIES = [{NAME = 'data true

  ', BLOOMFILTER = 'NONE', REPLICATION_SCOPE = '0',

   VERSIONS = '3', COMPRESSION = 'NONE', MIN_VERSIO

  NS = '0', TTL = '2147483647', BLOCKSIZE = '65536

  ', IN_MEMORY = 'false', BLOCKCACHE = 'true'}]}

 1 row(s) in 0.0940 seconds

 hbase(main):052:0 put test_ts_ver, row_1, data:name, benyi_w, 100
 0 row(s) in 0.0040 seconds

 hbase(main):053:0 put test_ts_ver, row_1, data:name, benyi_1, 110
 0 row(s) in 0.0050 seconds

 hbase(main):054:0 put test_ts_ver, row_1, data:name, benyi_2, 120
 0 row(s) in 0.0040 seconds

 hbase(main):055:0 put test_ts_ver, row_1, data:name, benyi_3, 130
 0 row(s) in 0.0040 seconds

 hbase(main):056:0 put test_ts_ver, row_1, data:name, benyi_4, 140
 0 row(s) in 0.0040 seconds

 hbase(main):057:0 get test_ts_ver, row_1, { TIMERANGE=[0,200] }
 COLUMNCELL

  data:nametimestamp=140, value=benyi_4

 1 row(s) in 0.0140 seconds

 hbase(main):058:0 get test_ts_ver, row_1, { TIMERANGE=[0,200],
 VERSIONS=5 }
 COLUMNCELL

  data:nametimestamp=140, value=benyi_4

  data:nametimestamp=130, value=benyi_3

  data:nametimestamp=120, value=benyi_2

 3 row(s) in 0.0050 seconds

 So far so good. But if I try to get timestamp=100 or 110, I still can get
 them

 hbase(main):059:0 get test_ts_ver, row_1, { TIMESTAMP= 100 }
 COLUMNCELL

  data:nametimestamp=100, value=benyi_w

 1 row(s) in 0.0120 seconds

 hbase(main):060:0 get test_ts_ver, row_1, { TIMESTAMP= 110 }
 COLUMNCELL

  data:nametimestamp=110, value=benyi_1

 1 row(s) in 0.0060 seconds

 My questions:

 1. When all those old versions will be removed?
 2. Will compact or major_compact remove those old versions?
 3. Is there a section/chapter talking about this behavior In HBase
 Reference Guide?

 Thanks.

 Ben



Re: Evenly splitting the table

2013-03-20 Thread Aaron Kimball
Hi Cole,

How are your keys structured? In Kiji, we default to using hashed row keys
where each key starts with two bytes of salt. This makes it a lot easier to
pre-split the table since you can make stronger guarantees about the key
distribution.

If your keys are raw text like, say, plaintext email addresses, it is
significantly more difficult to guess the right splits a priori.
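
As a generic illustration (not necessarily Kiji's exact scheme), a two-byte
salt can be as simple as:

import org.apache.hadoop.hbase.util.Bytes;

byte[] saltedKey(String key) {
  int h = key.hashCode();
  byte[] salt = new byte[] { (byte) (h >>> 8), (byte) h }; // two leading salt bytes
  return Bytes.add(salt, Bytes.toBytes(key));              // salt + original key
}

Pre-splitting then just means one region per salt bucket (or per range of
buckets).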

cheers,
- Aaron



On Wed, Mar 20, 2013 at 3:43 PM, Ted Yu yuzhih...@gmail.com wrote:

 Take a look at TestAdmin#testCreateTableRPCTimeOut() where
 hbaseadmin.createTable() is called.

 bq. Is there a way to go about splitting the entire table without having
 specific start and end keys?

 I don't think so.

 On Wed, Mar 20, 2013 at 3:32 PM, Cole cole.skov...@cerner.com wrote:

  I was wondering how I can go about evenly splitting an entire table in
  HBase during table creation[1]. I tried providing the empty byte arrays
  HConstants.EMPTY_START_ROW and HConstants.EMPTY_END_ROW
  as parameters to the method I linked below, and got an error: Start
  key must be smaller than end key. Is there a way to go about splitting
  the entire table without having specific start and end keys? Thanks in
  advance.
 
 
  [1]
 
 
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html
  #createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[], byte[],
 int)