HBase Client.
Hi, I would like to benchmark HBase from some of our distributed applications, using custom-developed benchmarking scripts/programs. I found that the following clients are available. Could you please let me know which of them provides the best performance?
1. Java direct interface to HBase
2. HBase Shell
3. REST
4. HappyBase
5. Kundera
Please also let me know if there is any other client that provides better performance. thanks pradeep
Re: HBase Client.
Most of the clients listed below are language specific, so if your benchmarking scripts are written in Java, you are better off running the Java client. HBase Shell is more for running something interactive; I'm not sure how you plan to benchmark that. REST is something that you could use, but I can't comment on its performance since I haven't used it. HappyBase is for Python. Kundera I can't comment on either, since I have not used it. You can also look at AsyncHBase, if you don't mind wrapping your head around it, but that's a bigger rewrite since its API is not compatible with the existing client. On Tue, Mar 19, 2013 at 11:25 PM, Pradeep Kumar Mantha pradeep...@gmail.com wrote: [...]
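For what it's worth, a timing loop through the 0.94-era Java client looks roughly like the sketch below. The table name "benchmark", family "f", and row count are assumptions for illustration (the table must be pre-created, e.g. from the shell), not a tuned benchmark:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutBenchmark {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        HTable table = new HTable(conf, "benchmark");     // assumed pre-created with family "f"
        table.setAutoFlush(false);                        // buffer puts client-side for throughput
        int rows = 100000;
        long start = System.nanoTime();
        for (int i = 0; i < rows; i++) {
            Put put = new Put(Bytes.toBytes(String.format("row%08d", i)));
            put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value" + i));
            table.put(put);
        }
        table.flushCommits(); // push any buffered puts before stopping the clock
        table.close();
        long elapsedMs = (System.nanoTime() - start) / 1000000L;
        System.out.println(rows + " puts in " + elapsedMs + " ms");
    }
}
```

Whether autoflush is on or off changes the numbers dramatically, so it's worth measuring both modes.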
Re: Truncate hbase table based on column family
Can you clarify your question? Did you mean that you only want to drop certain column families? Thanks On Wed, Mar 20, 2013 at 7:15 AM, varaprasad.bh...@polarisft.com wrote: Hi All, Can we truncate a table in HBase based on a column family? Please give your comments. Thanks Regards, Varaprasada Reddy
Re: Welcome our newest Committer Anoop
Congratulations! On Wed, Mar 20, 2013 at 6:11 AM, Jonathan Hsieh j...@cloudera.com wrote: welcome welcome! On Wed, Mar 13, 2013 at 10:23 AM, Sergey Shelukhin ser...@hortonworks.com wrote: Congrats! On Tue, Mar 12, 2013 at 10:38 PM, xkwang bruce bruce.xkwa...@gmail.com wrote: Congratulations, Anoop! 2013/3/13 Devaraj Das d...@hortonworks.com Hey Anoop, Congratulations! Devaraj. On Mon, Mar 11, 2013 at 10:50 AM, Enis Söztutar enis@gmail.com wrote: Congrats and welcome. On Mon, Mar 11, 2013 at 2:21 AM, Nicolas Liochon nkey...@gmail.com wrote: Congrats, Anoop! On Mon, Mar 11, 2013 at 5:35 AM, rajeshbabu chintaguntla rajeshbabu.chintagun...@huawei.com wrote: Congratulations Anoop! From: Anoop Sam John [anoo...@huawei.com] Sent: Monday, March 11, 2013 9:00 AM To: user@hbase.apache.org Subject: RE: Welcome our newest Committer Anoop Thanks to all.. Hope to work more and more for HBase! -Anoop- From: Andrew Purtell [apurt...@apache.org] Sent: Monday, March 11, 2013 7:33 AM To: user@hbase.apache.org Subject: Re: Welcome our newest Committer Anoop Congratulations Anoop. Welcome! On Mon, Mar 11, 2013 at 12:42 AM, ramkrishna vasudevan ramkrishna.s.vasude...@gmail.com wrote: Hi All, Please welcome Anoop, our newest committer. Anoop's work in HBase has been great and he has helped a lot of users on the mailing list. He has contributed features related to Endpoints and CPs. Welcome Anoop and best wishes for your future work. Hope to see your continuing efforts in the community. Regards Ram -- Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White) -- // Jonathan Hsieh (shay) // Software Engineer, Cloudera // j...@cloudera.com
Re: How to catch java.net.ConnectException and when
Hi Gaurhari, Can you please tell us a bit more about what you want to achieve? When do you want to catch this exception? On which operation? JM 2013/3/20 gaurhari dass gaurharid...@gmail.com: I want to catch connect exception in hbase
Re: HBase Client.
Another one to add to your list: 6. Phoenix (https://github.com/forcedotcom/phoenix) Thanks, James On Mar 20, 2013, at 2:50 AM, Vivek Mishra vivek.mis...@impetus.co.in wrote: I have used Kundera; the persistence overhead on top of the HBase API is minimal, considering the feature set available within Kundera. -Vivek From: Viral Bajaria [viral.baja...@gmail.com] Sent: 20 March 2013 12:30 To: user@hbase.apache.org Subject: Re: HBase Client. [...]
Does HBase RegionServer benefit from OS Page Cache
Given that HBase has its own cache (block cache and bloom filters) and that all the table data is stored in HDFS, I'm wondering if HBase benefits from the OS page cache at all. In the setup I'm using, HBase region servers run on the same boxes as the HDFS data nodes. In such a scenario, if the underlying HLog files live on the same machine, then having a healthy memory surplus may mean that the data node can serve the underlying files from the page cache and thus improve HBase performance. Is this really the case? (I guess the page cache should also help in the case where the HLog file lives on a different machine, but there the network I/O will probably drown out the speedup from not hitting the disk.) I'm asking because, if the page cache were useful, then not dedicating all the memory on the machine to the region server may not be that bad. The reason one would not want to give all the memory to the region server is the long garbage collection pauses that a large heap may induce. I understand that work has been done to fix the long pauses caused by memory fragmentation in the old generation of the mostly-concurrent garbage collector, by using a slab allocator for the memstore, but that feature is marked experimental and we're not ready to take risks yet. So if the page cache were useful in any way on region servers, we could go with less memory for the RegionServer process, with the understanding that the free memory on the machine is not completely going to waste. Thus my curiosity about the utility of the OS page cache to HBase performance. Thanks in Advance, Pankaj
Re: HBase Client.
Pradeep - One more to add to your list of clients is Phoenix: https://github.com/forcedotcom/phoenix It's a SQL skin built on top of the standard Java client, with various optimizations; it exposes HBase via a standard JDBC interface, and thus might let you easily plug into other tools for testing performance. Ian On Mar 20, 2013, at 4:49 AM, Vivek Mishra wrote: [...]
Re: Scanner timeout -- any reason not to raise?
I'm confused -- I only see one setting in CDH manager; what is the name of the other setting? Our load is moderately frequent small writes (in batches of 1000 cells at a time, typically split over a few hundred rows -- these complete very fast, and we haven't seen any timeouts there), and infrequent batches of large reads (scans), which is where we do see timeouts. My guess is that the timeout is more due to our application taking some time -- apparently more than 60s -- to process the results of each scan's output, rather than due to slowness in HBase itself, which tends to be only moderately loaded (judging by CPU, network, and disk) while we do the reads. Thanks, - Dan On Mar 17, 2013, at 2:20 PM, Ted Yu wrote: The lease timeout is used by row locking too. That's the reason behind splitting the setting into two config parameters. How is your load composition? Do you mostly serve reads from HBase? Cheers On Sun, Mar 17, 2013 at 1:56 PM, Dan Crosta d...@magnetic.com wrote: Ah, thanks Ted -- I was wondering what that setting was for. We are using CDH 4.2.0, which is HBase 0.94.2 (give or take a few backports from 0.94.3). Is there any harm in setting the lease timeout to something larger, like 5 or 10 minutes? Thanks, - Dan On Mar 17, 2013, at 1:46 PM, Ted Yu wrote: Which HBase version are you using? In 0.94 and prior, the config param is hbase.regionserver.lease.period In 0.95, it is different. See release notes of HBASE-6170 On Sun, Mar 17, 2013 at 11:46 AM, Dan Crosta d...@magnetic.com wrote: We occasionally get scanner timeout errors such as "66698ms passed since the last invocation, timeout is currently set to 60000" when iterating a scanner through the Thrift API. Is there any reason not to raise the timeout to something larger than the default 60s? Put another way, what resources (and how much of them) does a scanner take up on a thrift server or region server?
Also, to confirm -- I believe hbase.rpc.timeout is the setting in question here, but someone please correct me if I'm wrong. Thanks, - Dan
Re: Scanner timeout -- any reason not to raise?
In 0.94, there is only one setting; see the release notes of HBASE-6170, which is in 0.95. Looks like this should help (in 0.95): https://issues.apache.org/jira/browse/HBASE-2214 (Do HBASE-1996 -- setting size to return in scan rather than count of rows -- properly). From your description, you should be able to raise the timeout since the writes are relatively fast. Cheers On Wed, Mar 20, 2013 at 9:32 AM, Dan Crosta d...@magnetic.com wrote: [...]
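For reference, on 0.94/CDH4 the lease period mentioned above is a single property in hbase-site.xml on each region server (it needs a restart to take effect); a sketch raising it to 5 minutes -- the value is in milliseconds:

```xml
<property>
  <name>hbase.regionserver.lease.period</name>
  <!-- scanner/row-lock lease timeout in ms; the default is 60000 (60s) -->
  <value>300000</value>
</property>
```

Note this is global for the region server, which is exactly the heavy-handedness discussed later in this thread.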
Re: Scanner timeout -- any reason not to raise?
Typically it is better to use caching and batch size to limit the number of rows returned, and thus the amount of processing required between calls to next() during a scan, but it would be nice if HBase provided a way to manually refresh a lease, similar to Hadoop's context.progress(). In a cluster that is used for many different applications, upping the global lease timeout is a heavy-handed solution. Even being able to override the timeout on a per-scan basis would be nice. Thoughts on that, Ted? On Wed, Mar 20, 2013 at 1:00 PM, Ted Yu yuzhih...@gmail.com wrote: [...]
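The caching/batch tuning described above looks like this with the standard 0.94 Java client; the table name and the numeric values are illustrative, and the right values depend on how long your per-row processing takes relative to the lease period:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ScanTuning {
    public static void main(String[] args) throws Exception {
        // "mytable" is illustrative; assumes hbase-site.xml on the classpath.
        HTable table = new HTable(HBaseConfiguration.create(), "mytable");
        Scan scan = new Scan();
        scan.setCaching(100); // rows fetched per next() RPC; lower it if per-row processing is slow
        scan.setBatch(1000);  // cap on cells per Result, useful for very wide rows
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                // keep per-row work short: the scanner lease clock runs between next() RPCs
            }
        } finally {
            scanner.close(); // release the server-side scanner lease promptly
            table.close();
        }
    }
}
```

Smaller caching values mean more round trips but less client-side processing time between them, which is what keeps you under the lease timeout.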
Re: Scanner timeout -- any reason not to raise?
bq. if HBase provided a way to manually refresh a lease similar to Hadoop's context.progress() Can you outline how the above would work for a long scan? bq. Even being able to override the timeout on a per-scan basis would be nice. Agreed. On Wed, Mar 20, 2013 at 10:05 AM, Bryan Beaudreault bbeaudrea...@hubspot.com wrote: [...]
Re: Does HBase RegionServer benefit from OS Page Cache
First, MSLAB has been enabled by default since 0.92.0, as it was deemed stable enough. So, unless you are on 0.90, you are already using it. Also, I'm not sure why you are referencing the HLog in your first paragraph in the context of reading from disk, because the HLogs are rarely read (only on recovery). Maybe you meant HFiles? In any case, your email covers most arguments except for one: checksumming. Retrieving a block from HDFS, even when using short-circuit reads to go directly to the OS instead of passing through the DN, will take quite a bit more time than reading directly from the block cache. This is why, even if you disable block caching on a family, the index and root blocks will still be block cached: reading those very hot blocks from disk would take way too long. Regarding your main question (how does the OS page cache help?), I don't have a good answer. It kind of depends on the amount of RAM you have and what your workload is like. As a data point, I've been successfully running with 24GB of heap (50% dedicated to the block cache) with a workload consisting mainly of small writes, short scans, and a typical random read distribution for a website. I can't remember the last time I saw a full GC, and it's been running like this for more than a year. Hope this somehow helps, J-D On Wed, Mar 20, 2013 at 12:34 AM, Pankaj Gupta pankaj.ro...@gmail.com wrote: [...]
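For context on the "50% dedicated to the block cache" data point above: the block-cache share of the region server heap is a single property, shown here as a sketch in hbase-site.xml (0.94-era name; the value is a fraction of heap, and a larger block cache leaves less heap for memstores):

```xml
<property>
  <name>hfile.block.cache.size</name>
  <!-- fraction of region server heap used for the block cache; default is 0.25 -->
  <value>0.5</value>
</property>
```

Whatever heap you do not give to HBase remains available to the OS page cache on the shared DataNode/RegionServer boxes, which is the trade-off this thread is weighing.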
Re: Scanner timeout -- any reason not to raise?
I was thinking something like this:

Scan scan = new Scan(startRow, endRow);
scan.setCaching(someVal); // based on what we expect most rows to take for processing time
ResultScanner scanner = table.getScanner(scan);
for (Result r : scanner) {
  // usual processing, the time for which we accounted for in our
  // caching and global lease timeout settings
  if (someCondition) {
    // More time-intensive processing necessary on this record,
    // which is hard to account for in the caching
    scanner.progress();
  }
}

-- I'm not sure how we could expose this in the context of a hadoop job, since I don't believe we have access to the underlying scanner, but that would be great also. On Wed, Mar 20, 2013 at 1:11 PM, Ted Yu yuzhih...@gmail.com wrote: [...]
Re: Scanner timeout -- any reason not to raise?
Bryan: Interesting idea. You can log a JIRA with the following two suggestions. On Wed, Mar 20, 2013 at 10:39 AM, Bryan Beaudreault bbeaudrea...@hubspot.com wrote: I was thinking something like this: Scan scan = new Scan(startRow, endRow); scan.setCaching(someVal); // based on what we expect most rows to take for processing time ResultScanner scanner = table.getScanner(scan); for (Result r : scanner) { // usual processing, the time for which we accounted for in our caching and global lease timeout settings if (someCondition) { // More time-intensive processing necessary on this record, which is hard to account for in the caching scanner.progress(); } } -- I'm not sure how we could expose this in the context of a hadoop job, since I don't believe we have access to the underlying scanner, but that would be great also. On Wed, Mar 20, 2013 at 1:11 PM, Ted Yu yuzhih...@gmail.com wrote: bq. if HBase provided a way to manually refresh a lease similar to Hadoop's context.progress() Can you outline how the above works for long scan ? bq. Even being able to override the timeout on a per-scan basis would be nice. Agreed. On Wed, Mar 20, 2013 at 10:05 AM, Bryan Beaudreault bbeaudrea...@hubspot.com wrote: Typically it is better to use caching and batch size to limit the number of rows returned and thus the amount of processing required between calls to next() during a scan, but it would be nice if HBase provided a way to manually refresh a lease similar to Hadoop's context.progress(). In a cluster that is used for many different applications, upping the global lease timeout is a heavy handed solution. Even being able to override the timeout on a per-scan basis would be nice. Thoughts on that, Ted? On Wed, Mar 20, 2013 at 1:00 PM, Ted Yu yuzhih...@gmail.com wrote: In 0.94, there is only one setting. 
See release notes of HBASE-6170 which is in 0.95 Looks like this should help (in 0.95): https://issues.apache.org/jira/browse/HBASE-2214 Do HBASE-1996 -- setting size to return in scan rather than count of rows -- properly From your description, you should be able to raise the timeout since the writes are relatively fast. Cheers On Wed, Mar 20, 2013 at 9:32 AM, Dan Crosta d...@magnetic.com wrote: I'm confused -- I only see one setting in CDH manager, what is the name of the other setting? Our load is moderately frequent small writes (in batches of 1000 cells at a time, typically split over a few hundred rows -- these complete very fast, we haven't seen any timeouts there), and infrequent batches of large reads (scans), which is where we do see timeouts. My guess is that the timeout is more due to our application taking some time -- apparently more than 60s -- to process the results of each scan's output, rather than due to slowness in HBase itself, which tends to be only moderately loaded (judging by CPU, network, and disk) while we do the reads. Thanks, - Dan On Mar 17, 2013, at 2:20 PM, Ted Yu wrote: The lease timeout is used by row locking too. That's the reason behind splitting the setting into two config parameters. How is your load composition ? Do you mostly serve reads from HBase ? Cheers On Sun, Mar 17, 2013 at 1:56 PM, Dan Crosta d...@magnetic.com wrote: Ah, thanks Ted -- I was wondering what that setting was for. We are using CDH 4.2.0, which is HBase 0.94.2 (give or take a few backports from 0.94.3). Is there any harm in setting the lease timeout to something larger, like 5 or 10 minutes? Thanks, - Dan On Mar 17, 2013, at 1:46 PM, Ted Yu wrote: Which HBase version are you using ? In 0.94 and prior, the config param is hbase.regionserver.lease.period In 0.95, it is different. 
See release notes of HBASE-6170.

On Sun, Mar 17, 2013 at 11:46 AM, Dan Crosta d...@magnetic.com wrote:

We occasionally get scanner timeout errors such as "66698ms passed since the last invocation, timeout is currently set to 6" when iterating a scanner through the Thrift API. Is there any reason not to raise the timeout to something larger than the default 60s? Put another way, what resources (and how much of them) does a scanner take up on a Thrift server or region server? Also, to confirm -- I believe hbase.rpc.timeout is the setting in question here, but someone please correct me if I'm wrong. Thanks, - Dan
Re: Scanner timeout -- any reason not to raise?
Thanks Ted, I've submitted https://issues.apache.org/jira/browse/HBASE-8157.

On Wed, Mar 20, 2013 at 1:56 PM, Ted Yu yuzhih...@gmail.com wrote: Bryan: Interesting idea. You can log a JIRA with the following two suggestions.
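For the 0.94.x clusters discussed above, the lease period Ted names is set in hbase-site.xml. A minimal sketch, assuming you want 5 minutes (hbase.regionserver.lease.period is the 0.94-and-prior name per the thread above; the value is in milliseconds and defaults to 60000):

```xml
<!-- hbase-site.xml fragment: raise the scanner/row-lock lease period (0.94.x) -->
<property>
  <name>hbase.regionserver.lease.period</name>
  <!-- default is 60000 (60s); 300000 = 5 minutes -->
  <value>300000</value>
</property>
```

Keep in mind Ted's caveat: in 0.94 this one lease also governs row locking, so raising it cluster-wide is exactly the heavy-handed change Bryan was trying to avoid.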
Evenly splitting the table
I was wondering how I can go about evenly splitting an entire table in HBase during table creation [1]. I tried providing the empty byte arrays HConstants.EMPTY_START_ROW and HConstants.EMPTY_END_ROW as parameters to the method linked below, and got an error: "Start key must be smaller than end key." Is there a way to split the entire table without having specific start and end keys? Thanks in advance. [1] http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[], byte[], int)
Re: Evenly splitting the table
Take a look at TestAdmin#testCreateTableRPCTimeOut(), where hbaseadmin.createTable() is called.

bq. Is there a way to go about splitting the entire table without having specific start and end keys?

I don't think so.

On Wed, Mar 20, 2013 at 3:32 PM, Cole cole.skov...@cerner.com wrote: I was wondering how I can go about evenly splitting an entire table in HBase during table creation.
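For readers who do want explicit boundaries, the arithmetic behind even pre-splitting can be illustrated without HBase at all. The sketch below is self-contained and is not HBase's actual Bytes.split implementation: it interpolates split keys between two fixed-width byte-array boundaries treated as unsigned big-endian integers, and it also shows why empty start/end rows fail, since there is nothing to interpolate between.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class SplitKeys {
    // Interpolate numRegions - 1 split keys between start and end,
    // treating the (zero-padded) byte arrays as unsigned big-endian integers.
    static List<byte[]> split(byte[] start, byte[] end, int numRegions) {
        int width = Math.max(start.length, end.length);
        BigInteger lo = new BigInteger(1, pad(start, width));
        BigInteger hi = new BigInteger(1, pad(end, width));
        if (lo.compareTo(hi) >= 0) {
            // Mirrors the error Cole hit with EMPTY_START_ROW / EMPTY_END_ROW.
            throw new IllegalArgumentException("Start key must be smaller than end key");
        }
        BigInteger range = hi.subtract(lo);
        List<byte[]> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            BigInteger k = lo.add(range.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(numRegions)));
            splits.add(toBytes(k, width));
        }
        return splits;
    }

    // Right-pad with 0x00 so shorter keys keep their sort position.
    static byte[] pad(byte[] b, int width) {
        byte[] out = new byte[width];
        System.arraycopy(b, 0, out, 0, b.length);
        return out;
    }

    // Render a BigInteger back into a fixed-width unsigned byte array.
    static byte[] toBytes(BigInteger v, int width) {
        byte[] raw = v.toByteArray(); // may carry a leading sign byte or be short
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }

    public static void main(String[] args) {
        // Split the single-byte keyspace 0x00..0xFF into 4 regions -> 3 split keys.
        List<byte[]> splits = split(new byte[]{0x00}, new byte[]{(byte) 0xFF}, 4);
        for (byte[] s : splits) {
            System.out.println(String.format("%02x", s[0] & 0xff)); // 3f, 7f, bf
        }
    }
}
```

This only spreads load evenly if keys are uniformly distributed over the boundary range, which is exactly the caveat Aaron raises below about raw-text keys.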
Fwd: Questions about versions and timestamp
Hi, Please forgive me if my questions have already been asked and answered many times; I could not google any of them. If I do the following commands in the hbase shell:

hbase(main):048:0> create 'test_ts_ver', 'data'
0 row(s) in 1.0550 seconds

hbase(main):049:0> describe 'test_ts_ver'
DESCRIPTION                                            ENABLED
 {NAME => 'test_ts_ver', FAMILIES => [{NAME => 'data', true
 BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
 VERSIONS => '3', COMPRESSION => 'NONE',
 MIN_VERSIONS => '0', TTL => '2147483647',
 BLOCKSIZE => '65536', IN_MEMORY => 'false',
 BLOCKCACHE => 'true'}]}
1 row(s) in 0.0940 seconds

hbase(main):052:0> put 'test_ts_ver', 'row_1', 'data:name', 'benyi_w', 100
0 row(s) in 0.0040 seconds

hbase(main):053:0> put 'test_ts_ver', 'row_1', 'data:name', 'benyi_1', 110
0 row(s) in 0.0050 seconds

hbase(main):054:0> put 'test_ts_ver', 'row_1', 'data:name', 'benyi_2', 120
0 row(s) in 0.0040 seconds

hbase(main):055:0> put 'test_ts_ver', 'row_1', 'data:name', 'benyi_3', 130
0 row(s) in 0.0040 seconds

hbase(main):056:0> put 'test_ts_ver', 'row_1', 'data:name', 'benyi_4', 140
0 row(s) in 0.0040 seconds

hbase(main):057:0> get 'test_ts_ver', 'row_1', { TIMERANGE => [0,200] }
COLUMN        CELL
 data:name    timestamp=140, value=benyi_4
1 row(s) in 0.0140 seconds

hbase(main):058:0> get 'test_ts_ver', 'row_1', { TIMERANGE => [0,200], VERSIONS => 5 }
COLUMN        CELL
 data:name    timestamp=140, value=benyi_4
 data:name    timestamp=130, value=benyi_3
 data:name    timestamp=120, value=benyi_2
3 row(s) in 0.0050 seconds

So far so good. But if I try to get timestamp 100 or 110, I can still get them:

hbase(main):059:0> get 'test_ts_ver', 'row_1', { TIMESTAMP => 100 }
COLUMN        CELL
 data:name    timestamp=100, value=benyi_w
1 row(s) in 0.0120 seconds

hbase(main):060:0> get 'test_ts_ver', 'row_1', { TIMESTAMP => 110 }
COLUMN        CELL
 data:name    timestamp=110, value=benyi_1
1 row(s) in 0.0060 seconds

My questions:
1. When will all those old versions be removed?
2. Will compact or major_compact remove those old versions?
3. Is there a section/chapter talking about this behavior in the HBase Reference Guide?

Thanks. Ben
Re: Questions about versions and timestamp
A few pointers so that you can find the answer yourself: http://hbase.apache.org/book.html -- take a look at 2.5.2.8. Managed Compactions -- and http://hbase.apache.org/book.html#compaction. You can also use search-hadoop.com, e.g. 'Possible to delete a specific cell?' Cheers

On Wed, Mar 20, 2013 at 3:55 PM, Benyi Wang bewang.t...@gmail.com wrote: My questions: 1. When will all those old versions be removed? 2. Will compact or major_compact remove those old versions? 3. Is there a section/chapter talking about this behavior in the HBase Reference Guide?
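The behavior Ben observes matches the retention rule described in the book sections Ted points to: versions beyond VERSIONS remain readable until a major compaction rewrites the store files. Below is a toy model in plain Java (not HBase code) of what a major compaction does to one cell with VERSIONS => 3; the class and method names are illustrative only.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class VersionRetention {
    // Toy model of one cell's versions: timestamp -> value.
    // A major compaction with a VERSIONS limit drops all but the newest N.
    static NavigableMap<Long, String> majorCompact(NavigableMap<Long, String> versions,
                                                   int maxVersions) {
        NavigableMap<Long, String> kept = new TreeMap<>();
        int n = 0;
        for (Long ts : versions.descendingKeySet()) { // newest first
            if (n++ >= maxVersions) break;
            kept.put(ts, versions.get(ts));
        }
        return kept;
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> cell = new TreeMap<>();
        cell.put(100L, "benyi_w");
        cell.put(110L, "benyi_1");
        cell.put(120L, "benyi_2");
        cell.put(130L, "benyi_3");
        cell.put(140L, "benyi_4");

        // Before compaction, even the oldest version is still readable by
        // exact timestamp, just as Ben saw in the shell:
        System.out.println(cell.get(100L));          // benyi_w

        NavigableMap<Long, String> after = majorCompact(cell, 3);
        System.out.println(after.keySet());          // [120, 130, 140]
        System.out.println(after.containsKey(100L)); // false
    }
}
```

This is why the VERSIONS => 5 get above returned only three cells (the query respects the schema limit) while the exact-timestamp gets still succeeded (the excess cells physically exist until compaction removes them).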
Re: Evenly splitting the table
Hi Cole, How are your keys structured? In Kiji, we default to using hashed row keys where each key starts with two bytes of salt. This makes it a lot easier to pre-split the table, since you can make stronger guarantees about the key distribution. If your keys are raw text -- say, plaintext email addresses -- it is significantly more difficult to guess the right splits a priori. cheers, - Aaron

On Wed, Mar 20, 2013 at 3:43 PM, Ted Yu yuzhih...@gmail.com wrote: Take a look at TestAdmin#testCreateTableRPCTimeOut() where hbaseadmin.createTable() is called.
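Aaron's two-byte salt can be sketched in plain Java. The helper below is illustrative, not Kiji's actual API: the salt is derived from a hash of the logical key, so a given key always maps to the same bucket, and table split points can then be chosen on salt-byte boundaries regardless of what the raw keys look like.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SaltedKey {
    // Prefix each logical key with two hash-derived salt bytes so rows
    // spread evenly across pre-split regions. Hypothetical helper, not Kiji code.
    static byte[] saltedKey(String key) {
        byte[] raw = key.getBytes(StandardCharsets.UTF_8);
        byte[] md5;
        try {
            md5 = MessageDigest.getInstance("MD5").digest(raw);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always present in the JDK
        }
        byte[] out = new byte[2 + raw.length];
        out[0] = md5[0]; // two bytes of salt
        out[1] = md5[1];
        System.arraycopy(raw, 0, out, 2, raw.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] a = saltedKey("user@example.com");
        byte[] b = saltedKey("user@example.com");
        // Deterministic: the same logical key always gets the same salted key,
        // so point lookups still work; only full-table ordering is sacrificed.
        System.out.println(java.util.Arrays.equals(a, b)); // true
    }
}
```

The trade-off is that rows are no longer globally sorted by logical key, so range scans over the raw key space require scanning every salt bucket.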