Re: Change data capture tool for hbase
Hi Asaf, This CDC pattern will be used for directing changes to another system. Assume I have a table hbase_alarms in HBase with columns Severity, Source, Time, and I am tracking changes with this CDC tool. Some external system is putting alarms with their severity and source into the hbase_alarms table. Now I have a source system and I need to take some action by tracking changes. For example, one action may be inserting critical alarms into another table in an RDBMS as well. So using such a CDC tool, I can write rules such that if severity=critical and source=router, insert the record into psql_alarms. This is just an example; as I wrote, I am planning to implement this tool as a flume source so I can take any action on any system using flume sinks (calling a webservice, doing an HTTP request, writing to a file, etc.). In the RDBMS world the CDC pattern works like a triggering mechanism, but it is much more efficient than triggers (CDC tools extract change information from logs asynchronously, therefore they do not lengthen transactions). regards.. On 4 June 2013 06:57, Asaf Mesika asaf.mes...@gmail.com wrote: What's wrong with HBase native master-slave replication, or am I missing something here? On Mon, Jun 3, 2013 at 12:16 PM, yavuz gokirmak ygokir...@gmail.com wrote: Hi all, Currently we are working on an HBase change data capture (CDC) tool. I want to share our ideas and continue development according to your feedback. As you know, CDC tools are used for tracking data changes and taking actions according to these changes[1]. For example, in relational databases, CDC tools are mainly used for replication. You can replicate your source system continuously to another location or db using a CDC tool. So whenever an insert/update/delete is done on the source system, you can reflect the same operation to the replicated environment. As I've said, we are working on a CDC tool that can track changes on an HBase table and reflect those changes to any other system in real-time. We are trying to implement the tool in a way that it will behave as a slave cluster. So if we enable master-master replication in the source system, we expect to get all changes and act accordingly. Once the proof-of-concept CDC tool is implemented (we need one week) we will convert it to a flume source. So using it as a flume source we can direct data changes to any destination (sink). This is just a summary. Please write your feedback and comments. Do you know any tool similar to this proposal? regards. 1- http://en.wikipedia.org/wiki/Change_data_capture
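For illustration, a minimal Java sketch (0.94-era client API) of the kind of rule check described above; the column family name "cf", the class names, and the wiring to the downstream sink are all assumptions:

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch of a CDC rule evaluated against an incoming replicated Put on
    // hbase_alarms; forwarding a match to psql_alarms (via a flume sink,
    // JDBC, etc.) is left out.
    public class CriticalAlarmRule {
      private static final byte[] CF = Bytes.toBytes("cf"); // assumed family name
      private static final byte[] SEVERITY = Bytes.toBytes("Severity");
      private static final byte[] SOURCE = Bytes.toBytes("Source");

      /** True when the edit matches severity=critical and source=router. */
      public boolean matches(Put put) {
        return hasValue(put, SEVERITY, "critical") && hasValue(put, SOURCE, "router");
      }

      private boolean hasValue(Put put, byte[] qualifier, String expected) {
        for (KeyValue kv : put.get(CF, qualifier)) {
          if (expected.equals(Bytes.toString(kv.getValue()))) {
            return true;
          }
        }
        return false;
      }
    }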
RPC Replication Compression
Hi, Just wanted to make sure I read correctly on the internet: 0.96 will support HBase RPC compression, thus replication between master and slave will enjoy it as well (important since bandwidth between geographically distant data centers is scarce and more expensive)
Re: RPC Replication Compression
0.96 will support HBase RPC compression Yes. Replication between master and slave will enjoy it as well (important since bandwidth between geographically distant data centers is scarce and more expensive) But I cannot see it being utilized in replication. Maybe we can make improvements in this area. I can see possibilities. -Anoop- On Tue, Jun 4, 2013 at 1:51 PM, Asaf Mesika asaf.mes...@gmail.com wrote: Hi, Just wanted to make sure I read correctly on the internet: 0.96 will support HBase RPC compression, thus replication between master and slave will enjoy it as well (important since bandwidth between geographically distant data centers is scarce and more expensive)
Re: what's the typical scan latency?
What's your blockCacheHitCachingRatio? It would tell you the ratio of scans requested from cache (default) to the scans actually served from the block cache. You can get that from the RS web UI. What you are seeing can map to almost anything, for example: is scanner caching (client side) enabled? If so, how many rows are cached (how many rows are returned by the scanner.next RPC call)? What's your HFile block size, block cache % of total RS heap, max number of RPCs per RS for client connections, tcpnodelay, your network topology and jitter, number of NICs? Are you using an HTableInterface connection pool? The HBase client is synchronous, so how do you achieve concurrency? What about your percentiles? Is 5ms the mean? Median? Is 20ms only the 99th percentile? etc. etc. etc... I am far from considering myself an expert on the general topic of HBase, so take my tips with a pinch of salt - these are just factors I've considered when trying to optimize my read latency. Hope that helps. On Tue, Jun 4, 2013 at 4:02 AM, Liu, Raymond raymond@intel.com wrote: Thanks Amit. In my environment, I run dozens of clients to read about 5-20K of data per scan concurrently, and the average read latency for cached data is around 5-20ms. So it seems there must be something wrong with my cluster env or application. Or did you run that with multiple clients? Depends on so many environment-related variables and on the data as well. But to give you a number after all: one of our clusters is on EC2, 6 RS, on m1.xlarge machines (network performance 'high' according to AWS), where 90% of the time we do reads; our avg data size is 2K, block cache at 20K, 100 rows per scan avg, bloom filters 'on' at the 'ROW' level, 40% of heap dedicated to block cache (note that it contains several other bits and pieces), and I would say our average latency for cached data (~97% blockCacheHitCachingRatio) is 3-4ms. File system access is much, much more painful, especially on EC2 m1.xlarge where you really can't tell what's going on, as far as I can tell. To tell you the truth, as I see it this is an abuse (for our use case) of the HBase store, and for cache-like behavior I would recommend going to something like Redis. On Mon, Jun 3, 2013 at 12:13 PM, ramkrishna vasudevan ramkrishna.s.vasude...@gmail.com wrote: What is it that you are observing now? Regards Ram On Mon, Jun 3, 2013 at 2:00 PM, Liu, Raymond raymond@intel.com wrote: Hi If all the data is already in the RS blockcache, then what's the typical scan latency for scanning a few rows from a, say, several-GB table (with dozens of regions) on a small cluster with, say, 4 RS? A few ms? Tens of ms? Or more? Best Regards, Raymond Liu
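As a concrete illustration of the client-side knobs mentioned above, a minimal 0.94-era Java sketch (table and row names are made up) that sets scanner caching and block-cache usage and times a scan:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanLatencyCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // table name is illustrative
        Scan scan = new Scan(Bytes.toBytes("row-000"), Bytes.toBytes("row-100"));
        scan.setCaching(100);       // rows fetched per scanner.next() RPC round trip
        scan.setCacheBlocks(true);  // let repeated scans be served from the block cache
        long start = System.nanoTime();
        ResultScanner scanner = table.getScanner(scan);
        int rows = 0;
        try {
          for (Result r : scanner) {
            rows++;  // process r here
          }
        } finally {
          scanner.close();
          table.close();
        }
        System.out.printf("%d rows in %.2f ms%n", rows, (System.nanoTime() - start) / 1e6);
      }
    }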
Re: RPC Replication Compression
If RPC has compression abilities, how come replication, which also works over RPC, does not get it automatically? On Tue, Jun 4, 2013 at 12:34 PM, Anoop John anoop.hb...@gmail.com wrote: 0.96 will support HBase RPC compression Yes. Replication between master and slave will enjoy it as well (important since bandwidth between geographically distant data centers is scarce and more expensive) But I cannot see it being utilized in replication. Maybe we can make improvements in this area. I can see possibilities. -Anoop- On Tue, Jun 4, 2013 at 1:51 PM, Asaf Mesika asaf.mes...@gmail.com wrote: Hi, Just wanted to make sure I read correctly on the internet: 0.96 will support HBase RPC compression, thus replication between master and slave will enjoy it as well (important since bandwidth between geographically distant data centers is scarce and more expensive)
Using thrift2 interface but getting : 400 Bad Request
Hello, I am using the thrift and thrift2 interfaces (thrift for DDL, thrift2 for the rest); my requests work with thrift but with thrift2 I get a 400 error. Here is my code (coffeescript): colValue = new types2.TColumnValue family: 'cf', qualifier: 'col', value: 'yoo' put = new types2.TPut(row: 'row1', columnValues: [ colValue ]) client2.put 'test', put, (err, res) -> console.log 'put', err, res Here is what is sent by the put method: { row: 'row1', columnValues: [ { family: 'cf', qualifier: 'col', value: 'yoo', timestamp: null } ], timestamp: null, writeToWal: true } And here is the reply from the thrift2 daemon: receive HTTP/1.1 400 Bad Request Connection: close Server: Jetty(6.1.26) There are no logs in thrift2.log when I do my request. Anyone have any clue? Simon
Re: Using thrift2 interface but getting : 400 Bad Request
Can you check the region server log around that time? Thanks On Jun 4, 2013, at 8:37 AM, Simon Majou si...@majou.org wrote: Hello, I am using the thrift and thrift2 interfaces (thrift for DDL, thrift2 for the rest); my requests work with thrift but with thrift2 I get a 400 error. Here is my code (coffeescript): colValue = new types2.TColumnValue family: 'cf', qualifier: 'col', value: 'yoo' put = new types2.TPut(row: 'row1', columnValues: [ colValue ]) client2.put 'test', put, (err, res) -> console.log 'put', err, res Here is what is sent by the put method: { row: 'row1', columnValues: [ { family: 'cf', qualifier: 'col', value: 'yoo', timestamp: null } ], timestamp: null, writeToWal: true } And here is the reply from the thrift2 daemon: receive HTTP/1.1 400 Bad Request Connection: close Server: Jetty(6.1.26) There are no logs in thrift2.log when I do my request. Anyone have any clue? Simon
Re: Using thrift2 interface but getting : 400 Bad Request
No logs there either (in fact no logs are written to any log file when I execute the request). Simon On Tue, Jun 4, 2013 at 5:42 PM, Ted Yu yuzhih...@gmail.com wrote: Can you check the region server log around that time? Thanks On Jun 4, 2013, at 8:37 AM, Simon Majou si...@majou.org wrote: Hello, I am using the thrift and thrift2 interfaces (thrift for DDL, thrift2 for the rest); my requests work with thrift but with thrift2 I get a 400 error. Here is my code (coffeescript): colValue = new types2.TColumnValue family: 'cf', qualifier: 'col', value: 'yoo' put = new types2.TPut(row: 'row1', columnValues: [ colValue ]) client2.put 'test', put, (err, res) -> console.log 'put', err, res Here is what is sent by the put method: { row: 'row1', columnValues: [ { family: 'cf', qualifier: 'col', value: 'yoo', timestamp: null } ], timestamp: null, writeToWal: true } And here is the reply from the thrift2 daemon: receive HTTP/1.1 400 Bad Request Connection: close Server: Jetty(6.1.26) There are no logs in thrift2.log when I do my request. Anyone have any clue? Simon
Regarding Indexing columns in HBASE
Hi, In an HBase table, there are 200 columns and the read pattern for different systems involves 70 columns... In the above case, we cannot have 70 columns in the rowkey, which would not be a good design... Can you please suggest how to handle this problem? Also, can we do indexing in HBase apart from the rowkey (something called a secondary index)? regards, Rams
Re: RPC Replication Compression
Replication doesn't need to know about compression at the RPC level so it won't refer to it, and as far as I can tell you need to set compression only on the master cluster and the slave will figure it out. Looking at the code tho, I'm not sure it works the same way it used to work before everything went protobuf. I would give 2 internets to whoever tests 0.95.1 with RPC compression turned on and compares results with non-compressed RPC. See http://hbase.apache.org/book.html#rpc.configs J-D On Tue, Jun 4, 2013 at 5:22 AM, Asaf Mesika asaf.mes...@gmail.com wrote: If RPC has compression abilities, how come replication, which also works over RPC, does not get it automatically? On Tue, Jun 4, 2013 at 12:34 PM, Anoop John anoop.hb...@gmail.com wrote: 0.96 will support HBase RPC compression Yes. Replication between master and slave will enjoy it as well (important since bandwidth between geographically distant data centers is scarce and more expensive) But I cannot see it being utilized in replication. Maybe we can make improvements in this area. I can see possibilities. -Anoop- On Tue, Jun 4, 2013 at 1:51 PM, Asaf Mesika asaf.mes...@gmail.com wrote: Hi, Just wanted to make sure I read correctly on the internet: 0.96 will support HBase RPC compression, thus replication between master and slave will enjoy it as well (important since bandwidth between geographically distant data centers is scarce and more expensive)
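For anyone wanting to try that comparison, a minimal sketch of turning on client-side RPC compression, using the property name documented in the rpc.configs section linked above (verify the name against your 0.95.x build; the codec choice is an assumption):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class RpcCompressionConf {
      public static Configuration withRpcCompression() {
        Configuration conf = HBaseConfiguration.create();
        // Property name as documented in the book's rpc.configs section;
        // verify it against your release before relying on it.
        conf.set("hbase.client.rpc.compressor",
            "org.apache.hadoop.io.compress.GzipCodec");
        return conf;
      }
    }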
Re: Regarding Indexing columns in HBASE
Just a quick thought: why don't you create different tables and duplicate data, i.e. go for denormalization and data redundancy? Are all your read access patterns that require the 70 columns incorporated into one application/client? Or will it be a bunch of different clients/applications? If that is not the case, then I think why not take advantage of more storage. Regards, Shahab On Tue, Jun 4, 2013 at 12:43 PM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi, In an HBase table, there are 200 columns and the read pattern for different systems involves 70 columns... In the above case, we cannot have 70 columns in the rowkey, which would not be a good design... Can you please suggest how to handle this problem? Also, can we do indexing in HBase apart from the rowkey (something called a secondary index)? regards, Rams
Re: Poor HBase map-reduce scan performance
Thanks Enis, I'll see if I can backport this patch - it is exactly what I was going to try. This should solve my scan performance problems if I can get it to work. On May 29, 2013, at 1:29 PM, Enis Söztutar e...@hortonworks.com wrote: Hi, Regarding running raw scans on top of HFiles, you can try a version of the patch attached at https://issues.apache.org/jira/browse/HBASE-8369, which enables exactly this. However, the patch is for trunk. In that, we open one region from snapshot files in each record reader, and run a scan through using an internal region scanner. Since this bypasses the client + rpc + server daemon layers, it should be able to give optimum scan performance. There is also a tool called HFilePerformanceBenchmark that intends to measure raw performance for HFiles. I've had to do a lot of changes to make it workable, but it might be worth taking a look to see whether there is any perf difference between scanning a sequence file from hdfs vs scanning an hfile. Enis On Fri, May 24, 2013 at 10:50 PM, lars hofhansl la...@apache.org wrote: Sorry. Haven't gotten to this, yet. Scanning in HBase being about 3x slower than straight HDFS is in the right ballpark, though. It has to do a bit more work. Generally, HBase is great at honing in on a subset (some 10-100m rows) of the data. Raw scan performance is not (yet) a strength of HBase. So with HDFS you get to 75% of the theoretical maximum read throughput; hence with HBase you get to 25% of the theoretical cluster-wide maximum disk throughput? -- Lars - Original Message - From: Bryan Keller brya...@gmail.com To: user@hbase.apache.org Cc: Sent: Friday, May 10, 2013 8:46 AM Subject: Re: Poor HBase map-reduce scan performance FYI, I ran tests with compression on and off. With a plain HDFS sequence file and compression off, I am getting very good I/O numbers, roughly 75% of theoretical max for reads. With snappy compression on with a sequence file, I/O speed is about 3x slower. However the file size is 3x smaller so it takes about the same time to scan. With HBase, the results are equivalent (just much slower than a sequence file). Scanning a compressed table is about 3x slower I/O than an uncompressed table, but the table is 3x smaller, so the time to scan is about the same. Scanning an HBase table takes about 3x as long as scanning the sequence file export of the table, either compressed or uncompressed. The sequence file export file size ends up being just barely larger than the table, either compressed or uncompressed. So in sum, compression slows down I/O 3x, but the file is 3x smaller so the time to scan is about the same. Adding in HBase slows things down another 3x. So I'm seeing 9x faster I/O scanning an uncompressed sequence file vs scanning a compressed table. On May 8, 2013, at 10:15 AM, Bryan Keller brya...@gmail.com wrote: Thanks for the offer Lars! I haven't made much progress speeding things up. I finally put together a test program that populates a table that is similar to my production dataset. I have a readme that should describe things, hopefully enough to make it useable. There is code to populate a test table, code to scan the table, and code to scan sequence files from an export (to compare HBase w/ raw HDFS). I use a gradle build script.
You can find the code here: https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip On May 4, 2013, at 6:33 PM, lars hofhansl la...@apache.org wrote: The block buffers are not reused, but that by itself should not be a problem as they are all the same size (at least I have never identified that as one in my profiling sessions). My offer still stands to do some profiling myself if there is an easy way to generate data of similar shape. -- Lars From: Bryan Keller brya...@gmail.com To: user@hbase.apache.org Sent: Friday, May 3, 2013 3:44 AM Subject: Re: Poor HBase map-reduce scan performance Actually I'm not too confident in my results re block size, they may have been related to major compaction. I'm going to rerun before drawing any conclusions. On May 3, 2013, at 12:17 AM, Bryan Keller brya...@gmail.com wrote: I finally made some progress. I tried a very large HBase block size (16mb), and it significantly improved scan performance. I went from 45-50 min to 24 min. Not great but much better. Before, I had it set to 128k. Scanning an equivalent sequence file takes 10 min. My random read performance will probably suffer with such a large block size (theoretically), so I probably can't keep it this big. I care about random read performance too. I've read that having a block size this big is not recommended; is that correct? I haven't dug too deeply into the code; are the block buffers reused or is each new block read a new allocation? Perhaps a buffer pool could help here if there isn't one already. When doing a
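For reference, a minimal 0.94-era Java sketch of creating a table with a large HFile block size like the experiment above (table and family names are made up; whether 16MB is wise for a given workload is exactly the trade-off being discussed):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class BigBlockTable {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        HColumnDescriptor cf = new HColumnDescriptor("f");  // family name illustrative
        cf.setBlocksize(16 * 1024 * 1024);  // default is 64KB; big blocks favor scans
        HTableDescriptor table = new HTableDescriptor("scan_heavy");
        table.addFamily(cf);
        admin.createTable(table);
        admin.close();
      }
    }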
Re: Regarding Indexing columns in HBASE
Hi, The read pattern differs for each application... Is the below approach fine? Create one HBase table with a unique rowkey and put all 200 columns into it... Create multiple small HBase tables that hold the read-access-pattern columns, with their rowkey mapped to the master table... e.g. *Master Table:* MasterRowkey Field1 .. .. Field200 *Link Table1:* Link1Rowkey Field1 Field13 Field16 Field67 MasterRowkey (value) *Link Table2:* Link2Rowkey Field5 Field23 Field56 Field167 MasterRowkey (value) regards, Rams On Tue, Jun 4, 2013 at 12:51 PM, Shahab Yunus shahab.yu...@gmail.com wrote: Just a quick thought: why don't you create different tables and duplicate data, i.e. go for denormalization and data redundancy? Are all your read access patterns that require the 70 columns incorporated into one application/client? Or will it be a bunch of different clients/applications? If that is not the case, then I think why not take advantage of more storage. Regards, Shahab On Tue, Jun 4, 2013 at 12:43 PM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi, In an HBase table, there are 200 columns and the read pattern for different systems involves 70 columns... In the above case, we cannot have 70 columns in the rowkey, which would not be a good design... Can you please suggest how to handle this problem? Also, can we do indexing in HBase apart from the rowkey (something called a secondary index)? regards, Rams
Re: Regarding Indexing columns in HBASE
Quick and dirty... Create an inverted table for each index. Then you can take the intersection of the result set(s) to get your list of rows for further filtering. There is obviously more to this, but it's the core idea... Sent from a remote device. Please excuse any typos... Mike Segel On Jun 4, 2013, at 11:51 AM, Shahab Yunus shahab.yu...@gmail.com wrote: Just a quick thought: why don't you create different tables and duplicate data, i.e. go for denormalization and data redundancy? Are all your read access patterns that require the 70 columns incorporated into one application/client? Or will it be a bunch of different clients/applications? If that is not the case, then I think why not take advantage of more storage. Regards, Shahab On Tue, Jun 4, 2013 at 12:43 PM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi, In an HBase table, there are 200 columns and the read pattern for different systems involves 70 columns... In the above case, we cannot have 70 columns in the rowkey, which would not be a good design... Can you please suggest how to handle this problem? Also, can we do indexing in HBase apart from the rowkey (something called a secondary index)? regards, Rams
Re: Regarding Indexing columns in HBASE
Hi Michel, If you don't mind, can you please explain in detail... Also, can you please let me know whether we have secondary indexes in HBase? regards, Rams On Tue, Jun 4, 2013 at 1:13 PM, Michel Segel michael_se...@hotmail.com wrote: Quick and dirty... Create an inverted table for each index. Then you can take the intersection of the result set(s) to get your list of rows for further filtering. There is obviously more to this, but it's the core idea... Sent from a remote device. Please excuse any typos... Mike Segel On Jun 4, 2013, at 11:51 AM, Shahab Yunus shahab.yu...@gmail.com wrote: Just a quick thought: why don't you create different tables and duplicate data, i.e. go for denormalization and data redundancy? Are all your read access patterns that require the 70 columns incorporated into one application/client? Or will it be a bunch of different clients/applications? If that is not the case, then I think why not take advantage of more storage. Regards, Shahab On Tue, Jun 4, 2013 at 12:43 PM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi, In an HBase table, there are 200 columns and the read pattern for different systems involves 70 columns... In the above case, we cannot have 70 columns in the rowkey, which would not be a good design... Can you please suggest how to handle this problem? Also, can we do indexing in HBase apart from the rowkey (something called a secondary index)? regards, Rams
Re: Regarding Indexing columns in HBASE
Rams - you might enjoy this blog post from HBase committer Jesse Yates (from last summer): http://jyates.github.io/2012/07/09/consistent-enough-secondary-indexes.html Secondary indexing doesn't exist in HBase core today, but there are various proposals and early implementations of it in flight. In the meantime, as Mike and others have said, if you don't need them to be immediately consistent in a real-time write scenario, you can simply write the same data into multiple tables in different sort orders. (This is hard in a real-time write scenario because, without cross-table transactions, you'd have to handle all the cases where the record was written but the index wasn't, or vice versa.) Ian On Jun 4, 2013, at 12:22 PM, Ramasubramanian Narayanan wrote: Hi Michel, If you don't mind, can you please explain in detail... Also, can you please let me know whether we have secondary indexes in HBase? regards, Rams On Tue, Jun 4, 2013 at 1:13 PM, Michel Segel michael_se...@hotmail.com wrote: Quick and dirty... Create an inverted table for each index. Then you can take the intersection of the result set(s) to get your list of rows for further filtering. There is obviously more to this, but it's the core idea... Sent from a remote device. Please excuse any typos... Mike Segel On Jun 4, 2013, at 11:51 AM, Shahab Yunus shahab.yu...@gmail.com wrote: Just a quick thought: why don't you create different tables and duplicate data, i.e. go for denormalization and data redundancy? Are all your read access patterns that require the 70 columns incorporated into one application/client? Or will it be a bunch of different clients/applications? If that is not the case, then I think why not take advantage of more storage. Regards, Shahab On Tue, Jun 4, 2013 at 12:43 PM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi, In an HBase table, there are 200 columns and the read pattern for different systems involves 70 columns... In the above case, we cannot have 70 columns in the rowkey, which would not be a good design... Can you please suggest how to handle this problem? Also, can we do indexing in HBase apart from the rowkey (something called a secondary index)? regards, Rams
Re: Regarding Indexing columns in HBASE
Ok... A little bit more detail... First, it's possible to store your data in multiple tables, each with a different key. Not a good idea for some very obvious reasons. You could, however, create a secondary table which is an inverted table, where the rowkey of the index is the value in the base table, the column name is the rowkey in the base table, and the value is the base table rowkey. This will work well as long as you're not indexing a column that has a small finite set of values, like a binary index. (Male/Female as an example...) (It will create a very wide row...) But in the general case it should work ok. Note too that you can also still create a compound key for the index. As an example... you could create an index on manufacturer, model, year, color where the value is the VIN, which would be the rowkey for the base table. Then if you want to find all of the 2005 Volvo S80's on the road, you can do a partial scan of the index setting up start and stop rows. Then filter the result set based on the state listed on the vehicle's registration. The idea is that you would fetch the rows from the index query's result set and that would be the list you would use for your next query. Again, there is more to this... like if you have multiple indexes on the data, you'd take the intersection of the result set(s) and then apply the filters that are not indexed. The initial key lookups should normally be a simple fetch of a single row, yielding you a list of rows in the base table. PLEASE NOTE THE FOLLOWING: 1) This is a general use case example. 2) YMMV based on the use case 3) YMMV based on the data contained in your underlying table 4) This is one simple way that can work with or without coprocessors 5) There is more to the solution, I'm painting a very high level solution. And of course I'm waiting for someone to mention that you look at Phoenix, which can implement this or a variation on this to do indexing. And of course you have other indexing options. HTH... -Mike On Jun 4, 2013, at 12:30 PM, Ian Varley ivar...@salesforce.com wrote: Rams - you might enjoy this blog post from HBase committer Jesse Yates (from last summer): http://jyates.github.io/2012/07/09/consistent-enough-secondary-indexes.html Secondary indexing doesn't exist in HBase core today, but there are various proposals and early implementations of it in flight. In the meantime, as Mike and others have said, if you don't need them to be immediately consistent in a real-time write scenario, you can simply write the same data into multiple tables in different sort orders. (This is hard in a real-time write scenario because, without cross-table transactions, you'd have to handle all the cases where the record was written but the index wasn't, or vice versa.) Ian On Jun 4, 2013, at 12:22 PM, Ramasubramanian Narayanan wrote: Hi Michel, If you don't mind, can you please explain in detail... Also, can you please let me know whether we have secondary indexes in HBase? regards, Rams On Tue, Jun 4, 2013 at 1:13 PM, Michel Segel michael_se...@hotmail.com wrote: Quick and dirty... Create an inverted table for each index. Then you can take the intersection of the result set(s) to get your list of rows for further filtering. There is obviously more to this, but it's the core idea... Sent from a remote device. Please excuse any typos...
Mike Segel On Jun 4, 2013, at 11:51 AM, Shahab Yunus shahab.yu...@gmail.com wrote: Just a quick thought: why don't you create different tables and duplicate data, i.e. go for denormalization and data redundancy? Are all your read access patterns that require the 70 columns incorporated into one application/client? Or will it be a bunch of different clients/applications? If that is not the case, then I think why not take advantage of more storage. Regards, Shahab On Tue, Jun 4, 2013 at 12:43 PM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi, In an HBase table, there are 200 columns and the read pattern for different systems involves 70 columns... In the above case, we cannot have 70 columns in the rowkey, which would not be a good design... Can you please suggest how to handle this problem? Also, can we do indexing in HBase apart from the rowkey (something called a secondary index)? regards, Rams
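To make the partial-scan step concrete, a hedged Java sketch of querying such a compound-key index table; the table name, the '|' separator, and the key layout are all illustrative assumptions:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class IndexLookup {
      public static void main(String[] args) throws Exception {
        HTable index = new HTable(HBaseConfiguration.create(), "vehicle_idx");
        // Partial scan over the compound key: everything under volvo|s80|2005.
        Scan scan = new Scan(Bytes.toBytes("volvo|s80|2005"),
                             Bytes.toBytes("volvo|s80|2006"));  // stop row is exclusive
        ResultScanner scanner = index.getScanner(scan);
        try {
          for (Result row : scanner) {
            // Each column qualifier in the index row is a VIN, i.e. a base-table
            // rowkey; collect these, then fetch and filter from the base table.
          }
        } finally {
          scanner.close();
          index.close();
        }
      }
    }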
Scan + Gets are disk bound
Hi, We are relatively new to HBase, and we are hitting a roadblock on our scan performance. I searched through the email archives and applied a bunch of the recommendations there, but they did not improve much. So, I am hoping I am missing something which you could guide me towards. Thanks in advance. We are currently writing data and reading in an almost continuous mode (a stream of data written into an HBase table, and then we run a time-based MR on top of this table). We were backed up, and about 1.5 TB of data was loaded into the table, and we began performing time-based scan MRs in 10-minute time intervals (startTime and endTime interval is 10 minutes). Most of the 10-minute intervals had about 100 GB of data to process. Our workflow was primarily to eliminate duplicates from this table. We have maxVersions = 5 for the table. We use TableInputFormat to perform the time-based scan to ensure data locality. In the mapper, we check if there exists a previous version of the row in a time period earlier than the timestamp of the input row. If not, we emit that row. We looked at https://issues.apache.org/jira/browse/HBASE-4683 and hence turned off block cache for this table, with the expectation that the block index and bloom filter will be cached in the block cache. We expect duplicates to be rare and hence hope for most of these checks to be fulfilled by the bloom filter. Unfortunately, we notice very slow performance on account of being disk bound. Looking at jstack, we notice that most of the time, we appear to be hitting disk for the block index. We performed a major compaction and retried, and performance improved some, but not by much. We are processing data at about 2 MB per second. We are using CDH 4.2.1 HBase 0.94.2 and HDFS 2.0.0 running with 8 datanodes/regionservers (each with 32 cores, 4x1TB disks and 60 GB RAM). HBase is running with a 30 GB heap size, memstore values being capped at 3 GB and flush thresholds being 0.15 and 0.2. Blockcache is at 0.5 of total heap size (15 GB). We are using SNAPPY for our tables. A couple of questions: * Is the performance of the time-based scan bad after a major compaction? * What can we do to help alleviate being disk bound?
The typical answer of adding more RAM does not seem to have helped, or we are missing some other config. Below are some of the metrics from a Regionserver web UI: requestsPerSecond=5895, numberOfOnlineRegions=60, numberOfStores=60, numberOfStorefiles=209, storefileIndexSizeMB=6, rootIndexSizeKB=7131, totalStaticIndexSizeKB=415995, totalStaticBloomSizeKB=2514675, memstoreSizeMB=0, mbInMemoryWithoutWAL=0, numberOfPutsWithoutWAL=0, readRequestsCount=30589690, writeRequestsCount=0, compactionQueueSize=0, flushQueueSize=0, usedHeapMB=2688, maxHeapMB=30672, blockCacheSizeMB=1604.86, blockCacheFreeMB=13731.24, blockCacheCount=11817, blockCacheHitCount=2759, blockCacheMissCount=25373411, blockCacheEvictedCount=7112, blockCacheHitRatio=52%, blockCacheHitCachingRatio=72%, hdfsBlocksLocalityIndex=91, slowHLogAppendCount=0, fsReadLatencyHistogramMean=15409428.56, fsReadLatencyHistogramCount=1559927, fsReadLatencyHistogramMedian=230609.5, fsReadLatencyHistogram75th=280094.75, fsReadLatencyHistogram95th=9574280.4, fsReadLatencyHistogram99th=100981301.2, fsReadLatencyHistogram999th=511591146.03, fsPreadLatencyHistogramMean=3895616.6, fsPreadLatencyHistogramCount=42, fsPreadLatencyHistogramMedian=954552, fsPreadLatencyHistogram75th=8723662.5, fsPreadLatencyHistogram95th=11159637.65, fsPreadLatencyHistogram99th=37763281.57, fsPreadLatencyHistogram999th=273192813.91, fsWriteLatencyHistogramMean=6124343.91, fsWriteLatencyHistogramCount=114, fsWriteLatencyHistogramMedian=374379, fsWriteLatencyHistogram75th=431395.75, fsWriteLatencyHistogram95th=576853.8, fsWriteLatencyHistogram99th=1034159.75, fsWriteLatencyHistogram999th=5687910.29 key size: 20 bytes Table description: {NAME => 'foo', FAMILIES => [{NAME => 'f', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '5', TTL => '2592000', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', ENCODE_ON_DISK => 'true', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}
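For reference, a minimal sketch of how such a 10-minute time-window scan MR is typically wired up; DedupMapper is a stand-in for the duplicate-elimination logic described above, and the table name 'foo' comes from the description:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;

    public class TimeWindowScanJob {
      // Stand-in for the dedup logic from the post.
      static class DedupMapper extends TableMapper<ImmutableBytesWritable, Result> {
      }

      public static Job create(long windowStart) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "dedup-10min-window");
        Scan scan = new Scan();
        scan.setTimeRange(windowStart, windowStart + 10L * 60 * 1000);
        scan.setCaching(500);        // batch more rows per RPC for a scan-heavy MR
        scan.setCacheBlocks(false);  // don't let the MR scan churn the block cache
        TableMapReduceUtil.initTableMapperJob("foo", scan, DedupMapper.class,
            ImmutableBytesWritable.class, Result.class, job);
        return job;
      }
    }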
Re: Explosion in datasize using HBase as a MR sink
Finally fixed this, my code was at fault. Protobufs require a builder object, which was a (non-static) protected object in an abstract class that all parsers extend. The mapper calls a parser factory depending on the input record. Because we designed the parser instances as singletons, the builder object in the abstract class got reused and all data got appended to the same builder. Doh! This only shows up in a job, not in single tests. Ah well, I've learned a lot :) @Asaf we will be moving to LoadIncrementalHFiles asap. I had the code ready, but obviously it showed the same size problems before the fix. Thnx for the thoughts! On May 31, 2013, at 22:02, Asaf Mesika asaf.mes...@gmail.com wrote: At your data set size, I would go with HFileOutputFormat and then bulk load it into HBase. Why go through the Put flow anyway (memstore, flush, WAL), especially if you have the input ready at your disposal for re-try if something fails? Sounds faster to me anyway. On May 30, 2013, at 10:52 PM, Rob Verkuylen r...@verkuylen.net wrote: On May 30, 2013, at 4:51, Stack st...@duboce.net wrote: Triggering a major compaction does not alter the overall 217.5GB size? A major compaction reduces the size from the original 219GB to 217.5GB, so barely a reduction. 80% of the region sizes are 1.4GB before and after. I haven't merged the smaller regions, but that still would not bring the size down to the 2.5-5 or so GB I would expect given T2's size. You have speculative execution turned on in your MR job, so it's possible you write many versions? I've turned off speculative execution (through conf.set) just for the mappers, since we're not using reducers; should we? I will triple-check the actual job settings in the job tracker, since I need to make the settings at a job level. Does your MR job fail many tasks (and though it fails, until it fails, it will have written some subset of the task, hence bloating your versions?). We've had problems with failing mappers because of zookeeper timeouts on large inserts; we increased the zookeeper timeout and blockingstorefiles to accommodate. Now we don't get failures. This job writes to a cleanly made table, versions set to 1, so there shouldn't be extra versions I assume(?). You are putting everything into protobufs? Could that be bloating your data? Can you take a smaller subset and dump a string version of the pb to the log? Use TextFormat https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/TextFormat#shortDebugString(com.google.protobuf.MessageOrBuilder) The protobufs reduce the size to roughly 40% of the original XML data in T1. The MR parser is a port of the Python parsing code we use going from T1 to T2. I've done manual comparisons on 20-30 records from T2.1 and T2 and they are identical, with only minute differences because of slightly different parsing. I've done these in hbase shell; I will try log dumping them too. It can be informative looking at hfile content. It could give you a clue as to the bloat. See http://hbase.apache.org/book.html#hfile_tool I will give this a go and report back. Any other debugging suggestions are more than welcome :) Thnx, Rob
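A hedged reconstruction of the bug pattern described above, with AlarmProto standing in for the generated protobuf message class and Record for the job's input type (both are assumptions for illustration):

    // AlarmProto stands in for the real generated protobuf message class and
    // Record for the job's input type; both are hypothetical.
    abstract class RecordParser {
      // The bug: a builder kept as instance state in a singleton parser, so each
      // record's fields pile up on top of the previous record's.
      protected final AlarmProto.Builder shared = AlarmProto.newBuilder();

      AlarmProto parseBuggy(Record in) {
        populate(shared, in);       // appends to leftovers from earlier records
        return shared.build();
      }

      // The fix: a fresh builder per record (or call shared.clear() before reuse).
      AlarmProto parseFixed(Record in) {
        AlarmProto.Builder b = AlarmProto.newBuilder();
        populate(b, in);
        return b.build();
      }

      abstract void populate(AlarmProto.Builder b, Record in);
    }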
Replication is on columnfamily level or table level?
hi, folks, hbase 0.94.3 By reading several documents, I always have the impression that *replication* works at the table-*column*-*family level*. However, when I set up a table with two columnfamilies and replicate them to two different slaves, the whole table is replicated. Is this a bug? Thanks. Here are the simple steps to recreate it. *Environment:* Replication Master: hdtest014 Replication Slave 1: hdtest017 Replication Slave 2: hdtest009 *Create Table*: on Master and the two slaves: create 't2_dn','cf1','cf2' *Setup replication on Master* (hdtest014), so that Master list_peers PEER_ID CLUSTER_KEY STATE 1 hdtest017.svl.ibm.com:2181:/hbase ENABLED 2 hdtest009.svl.ibm.com:2181:/hbase ENABLED Master describe 't2_dn' DESCRIPTION ENABLED {NAME => 't2_dn', FAMILIES => [{*NAME => 'cf1', REPLICATION_SCOPE => '1'*, KEEP_DELETED_CELLS => 'false', COMPRESSION => 'NONE', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true', MIN_VERSIONS => '0', DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false', BLOOMFILTER => 'NONE', TTL => '2147483647', VERSIONS => '3', BLOCKSIZE => '65536'}, {*NAME => 'cf2', REPLICATION_SCOPE => '2'*, KEEP_DELETED_CELLS => 'false', COMPRESSION => 'NONE', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true', MIN_VERSIONS => '0', DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false', BLOOMFILTER => 'NONE', TTL => '2147483647', VERSIONS => '3', BLOCKSIZE => '65536'}]} true 1 row(s) in 0.0250 seconds *Put rows into t2_dn on Master* put 't2_dn','row1','cf1:q1','val1cf1fromMaster' put 't2_dn','row1','cf2:q1','val1cf2fromMaster' put 't2_dn','row2','cf1:q1','val2cf1fromMaster' put 't2_dn','row3','cf2:q1','val3cf2fromMaster' *Expecting cf1 replicated to slave1 and cf2 replicated to slave2. Whereas all three clusters got:* scan 't2_dn' ROW COLUMN+CELL row1 column=cf1:q1, timestamp=1370382328358, value=val1cf1fromMaster row1 column=cf2:q1, timestamp=1370382334303, value=val1cf2fromMaster row2 column=cf1:q1, timestamp=1370382351716, value=val2cf1fromMaster row3 column=cf2:q1, timestamp=1370382367724, value=val3cf2fromMaster 3 row(s) in 0.0160 seconds Many thanks Demai
Re: Explosion in datasize using HBase as a MR sink
On Tue, Jun 4, 2013 at 9:58 PM, Rob Verkuylen r...@verkuylen.net wrote: Finally fixed this, my code was at fault. Protobufs require a builder object, which was a (non-static) protected object in an abstract class that all parsers extend. The mapper calls a parser factory depending on the input record. Because we designed the parser instances as singletons, the builder object in the abstract class got reused and all data got appended to the same builder. Doh! This only shows up in a job, not in single tests. Ah well, I've learned a lot :) Thanks for updating the list Rob. Yours is a classic, except it is the first time I've heard of someone protobufing it. Usually it is a reuse of a Hadoop Writable instance accumulating. St.Ack
Re: RPC Replication Compression
On Tue, Jun 4, 2013 at 6:48 PM, Jean-Daniel Cryans jdcry...@apache.orgwrote: Replication doesn't need to know about compression at the RPC level so it won't refer to it and as far as I can tell you need to set compression only on the master cluster and the slave will figure it out. Looking at the code tho, I'm not sure it works the same way it used to work before everything went protobuf. I would give 2 internets to whoever tests 0.95.1 with RPC compression turned on and compares results with non-compressed RPC. See http://hbase.apache.org/book.html#rpc.configs What are you looking for JD? Faster replication or just less network used? Looks like we have not had the ability to do compressed rpc before (We almost did, the original rpc compression attempt almost got committed to trunk -- see HBASE-5355 and the referenced follow-on issue -- but was put aside after the pb stuff went in). St.Ack
Re: Poor HBase map-reduce scan performance
Haven't had a chance to write a JIRA yet, but I thought I'd pop in here with an update in the meantime. I tried a number of different approaches to eliminate latency and bubbles in the scan pipeline, and eventually arrived at adding a streaming scan API to the region server, along with refactoring the scan interface into an event-driven message receiver interface. In so doing, I was able to take scan speed on my cluster from 59,537 records/sec with the classic scanner to 222,703 records/sec with my new scan API. Needless to say, I'm pleased ;) More details forthcoming when I get a chance. Thanks, Sandy On 5/23/13 3:47 PM, Ted Yu yuzhih...@gmail.com wrote: Thanks for the update, Sandy. If you can open a JIRA and attach your producer / consumer scanner there, that would be great. On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt prat...@adobe.com wrote: I wrote myself a Scanner wrapper that uses a producer/consumer queue to keep the client fed with a full buffer as much as possible. When scanning my table with scanner caching at 100 records, I see about a 24% uplift in performance (~35k records/sec with the ClientScanner and ~44k records/sec with my P/C scanner). However, when I set scanner caching to 5000, it's more of a wash compared to the standard ClientScanner: ~53k records/sec with the ClientScanner and ~60k records/sec with the P/C scanner. I'm not sure what to make of those results. I think next I'll shut down HBase and read the HFiles directly, to see if there's a drop-off in performance between reading them directly vs. via the RegionServer. I still think that to really solve this there needs to be a sliding window of records in flight between disk and RS, and between RS and client. I'm thinking there's probably a single batch of records in flight between RS and client at the moment. Sandy On 5/23/13 8:45 AM, Bryan Keller brya...@gmail.com wrote: I am considering scanning a snapshot instead of the table. I believe this is what the ExportSnapshot class does. If I could use the scanning code from ExportSnapshot then I will be able to scan the HDFS files directly and bypass the regionservers. This could potentially give me a huge boost in performance for full table scans. However, it doesn't really address the poor scan performance against a table.
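For reference, a minimal sketch of the producer/consumer wrapper idea Sandy describes; error handling and the end-of-scan signal are elided, and the queue capacity is an arbitrary choice:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;

    // A background thread drains the ResultScanner into a bounded queue so the
    // consumer always has a full buffer waiting.
    class PrefetchingScanner {
      private final BlockingQueue<Result> queue = new ArrayBlockingQueue<Result>(1000);

      PrefetchingScanner(final ResultScanner scanner) {
        Thread producer = new Thread(new Runnable() {
          public void run() {
            try {
              for (Result r : scanner) {
                queue.put(r);          // blocks when the consumer falls behind
              }
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
          }
        });
        producer.setDaemon(true);
        producer.start();
      }

      Result next() throws InterruptedException {
        return queue.take();           // real code needs an end-of-scan sentinel
      }
    }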
Questions about HBase
Hi, I have a few small questions regarding HBase. I've searched the forum but couldn't find clear answers, hence asking them here: 1. Does minor compaction remove HFiles in which all entries are out of TTL, or does only major compaction do that? I found this jira: https://issues.apache.org/jira/browse/HBASE-5199 but I don't know if the compaction being talked about there is minor or major. 2. Is there a way of configuring major compaction to compact only files older than a certain time, or to compact all the files except the latest few? We basically want to use the time-based filtering optimization in HBase to get the latest additions to the table, and since major compaction bunches everything into one file, it would defeat the optimization. 3. Is there a way to warm up the bloom filter and block index cache for a table? This is for a case where I always want the bloom filters and index to be all in memory, but not the data blocks themselves. 4. This one is related to what I read in the HBase Definitive Guide bloom filter section: "Given a random row key you are looking for, it is very likely that this key will fall in between two block start keys. The only way for HBase to figure out if the key actually exists is by loading the block and scanning it to find the key." The above excerpt seems to imply to me that the search for a key inside a block is linear, and I feel I must be reading it wrong. I would expect the scan to be a binary search. Thanks in Advance, Pankaj -- *P* | (415) 677-9222 ext. 205 *F* | (415) 677-0895 | pan...@brightroll.com Pankaj Gupta | Software Engineer *BrightRoll, Inc.* | Smart Video Advertising | www.brightroll.com United States | Canada | United Kingdom | Germany We're hiring! http://newton.newtonsoftware.com/career/CareerHome.action?clientId=8a42a12b3580e2060135837631485aa7
Re: Questions about HBase
Does Minor compaction remove HFiles in which all entries are out of TTL or does only Major compaction do that? Yes, it applies to minor compactions as well. Is there a way of configuring major compaction to compact only files older than a certain time or to compact all the files except the latest few? In the latest trunk version the compaction algorithm itself can be plugged in. There are some coprocessor hooks that give control over the scanner that gets created for compaction, with which we can control the KVs being selected. But I am not very sure if we can control the files getting selected for compaction in the older versions. The above excerpt seems to imply to me that the search for a key inside a block is linear, and I feel I must be reading it wrong. I would expect the scan to be a binary search. Once the data block is identified for a key, we seek to the beginning of the block and then do a linear search until we reach the exact key that we are looking for, because internally the data (KVs) are stored as byte buffers per block and follow this pattern: <key length><value length><key byte array><value byte array>. Is there a way to warm up the bloom filter and block index cache for a table? You always want the bloom and block index to be in cache? On Wed, Jun 5, 2013 at 7:45 AM, Pankaj Gupta pan...@brightroll.com wrote: Hi, I have a few small questions regarding HBase. I've searched the forum but couldn't find clear answers, hence asking them here: 1. Does minor compaction remove HFiles in which all entries are out of TTL, or does only major compaction do that? I found this jira: https://issues.apache.org/jira/browse/HBASE-5199 but I don't know if the compaction being talked about there is minor or major. 2. Is there a way of configuring major compaction to compact only files older than a certain time, or to compact all the files except the latest few? We basically want to use the time-based filtering optimization in HBase to get the latest additions to the table, and since major compaction bunches everything into one file, it would defeat the optimization. 3. Is there a way to warm up the bloom filter and block index cache for a table? This is for a case where I always want the bloom filters and index to be all in memory, but not the data blocks themselves. 4. This one is related to what I read in the HBase Definitive Guide bloom filter section: "Given a random row key you are looking for, it is very likely that this key will fall in between two block start keys. The only way for HBase to figure out if the key actually exists is by loading the block and scanning it to find the key." The above excerpt seems to imply to me that the search for a key inside a block is linear, and I feel I must be reading it wrong. I would expect the scan to be a binary search. Thanks in Advance, Pankaj
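To illustrate the layout Ram describes, a small Java sketch of walking the entries in a (decompressed) data block buffer; this is schematic - real HFile blocks also carry headers, and newer versions may add extra per-cell fields:

    import java.nio.ByteBuffer;

    public class BlockWalk {
      // Per entry: 4-byte key length, 4-byte value length, key bytes, value
      // bytes, back to back - which is why the in-block search is linear.
      static void walk(ByteBuffer block) {
        while (block.remaining() > 0) {
          int keyLen = block.getInt();
          int valLen = block.getInt();
          byte[] key = new byte[keyLen];
          block.get(key);
          byte[] value = new byte[valLen];
          block.get(value);
          // compare 'key' with the sought key; stop when found or passed
        }
      }
    }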
Re: Questions about HBase
bq. I found this jira: https://issues.apache.org/jira/browse/HBASE-5199 but I don't know if the compaction being talked about there is minor or major. The optimization above applies to minor compaction selection. Cheers On Tue, Jun 4, 2013 at 7:15 PM, Pankaj Gupta pan...@brightroll.com wrote: Hi, I have a few small questions regarding HBase. I've searched the forum but couldn't find clear answers, hence asking them here: 1. Does minor compaction remove HFiles in which all entries are out of TTL, or does only major compaction do that? I found this jira: https://issues.apache.org/jira/browse/HBASE-5199 but I don't know if the compaction being talked about there is minor or major. 2. Is there a way of configuring major compaction to compact only files older than a certain time, or to compact all the files except the latest few? We basically want to use the time-based filtering optimization in HBase to get the latest additions to the table, and since major compaction bunches everything into one file, it would defeat the optimization. 3. Is there a way to warm up the bloom filter and block index cache for a table? This is for a case where I always want the bloom filters and index to be all in memory, but not the data blocks themselves. 4. This one is related to what I read in the HBase Definitive Guide bloom filter section: "Given a random row key you are looking for, it is very likely that this key will fall in between two block start keys. The only way for HBase to figure out if the key actually exists is by loading the block and scanning it to find the key." The above excerpt seems to imply to me that the search for a key inside a block is linear, and I feel I must be reading it wrong. I would expect the scan to be a binary search. Thanks in Advance, Pankaj
Re: Questions about HBase
bq. But I am not very sure if we can control the files getting selected for compaction in the older versions. The same mechanism is available in 0.94. Take a look at src/main/java/org/apache/hadoop/hbase/coprocessor/BaseRegionObserver.java where you would find the following methods (and more): public void preCompactSelection(final ObserverContext<RegionCoprocessorEnvironment> c, final Store store, final List<StoreFile> candidates, final CompactionRequest request) public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> e, final Store store, final InternalScanner scanner) throws IOException { Cheers On Tue, Jun 4, 2013 at 8:14 PM, ramkrishna vasudevan ramkrishna.s.vasude...@gmail.com wrote: Does Minor compaction remove HFiles in which all entries are out of TTL or does only Major compaction do that? Yes, it applies to minor compactions as well. Is there a way of configuring major compaction to compact only files older than a certain time or to compact all the files except the latest few? In the latest trunk version the compaction algorithm itself can be plugged in. There are some coprocessor hooks that give control over the scanner that gets created for compaction, with which we can control the KVs being selected. But I am not very sure if we can control the files getting selected for compaction in the older versions. The above excerpt seems to imply to me that the search for a key inside a block is linear, and I feel I must be reading it wrong. I would expect the scan to be a binary search. Once the data block is identified for a key, we seek to the beginning of the block and then do a linear search until we reach the exact key that we are looking for, because internally the data (KVs) are stored as byte buffers per block and follow this pattern: <key length><value length><key byte array><value byte array>. Is there a way to warm up the bloom filter and block index cache for a table? You always want the bloom and block index to be in cache? On Wed, Jun 5, 2013 at 7:45 AM, Pankaj Gupta pan...@brightroll.com wrote: Hi, I have a few small questions regarding HBase. I've searched the forum but couldn't find clear answers, hence asking them here: 1. Does minor compaction remove HFiles in which all entries are out of TTL, or does only major compaction do that? I found this jira: https://issues.apache.org/jira/browse/HBASE-5199 but I don't know if the compaction being talked about there is minor or major. 2. Is there a way of configuring major compaction to compact only files older than a certain time, or to compact all the files except the latest few? We basically want to use the time-based filtering optimization in HBase to get the latest additions to the table, and since major compaction bunches everything into one file, it would defeat the optimization. 3. Is there a way to warm up the bloom filter and block index cache for a table? This is for a case where I always want the bloom filters and index to be all in memory, but not the data blocks themselves. 4. This one is related to what I read in the HBase Definitive Guide bloom filter section: "Given a random row key you are looking for, it is very likely that this key will fall in between two block start keys. The only way for HBase to figure out if the key actually exists is by loading the block and scanning it to find the key." The above excerpt seems to imply to me that the search for a key inside a block is linear, and I feel I must be reading it wrong. I would expect the scan to be a binary search. Thanks in Advance, Pankaj
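Building on those hooks, a hedged 0.94-style sketch of using preCompactSelection to keep the newest store files out of a compaction, in the spirit of question 2; the oldest-first ordering assumption and the number of files to keep are illustrative, not verified:

    import java.util.List;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.Store;
    import org.apache.hadoop.hbase.regionserver.StoreFile;
    import org.apache.hadoop.hbase.regionserver.compactions.CompactionRequest;

    public class SkipNewestFilesObserver extends BaseRegionObserver {
      private static final int KEEP_NEWEST = 2;  // illustrative; tune for your load

      @Override
      public void preCompactSelection(ObserverContext<RegionCoprocessorEnvironment> c,
          Store store, List<StoreFile> candidates, CompactionRequest request) {
        // Candidates arrive ordered oldest-first (by sequence id) in 0.94 -- an
        // assumption worth verifying -- so trimming the tail excludes the newest
        // files and leaves their time ranges intact for time-based scans.
        for (int i = 0; i < KEEP_NEWEST && !candidates.isEmpty(); i++) {
          candidates.remove(candidates.size() - 1);
        }
      }
    }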
Re: Scan + Gets are disk bound
On Tue, Jun 4, 2013 at 11:48 AM, Rahul Ravindran rahu...@yahoo.com wrote: Hi, We are relatively new to HBase, and we are hitting a roadblock on our scan performance. I searched through the email archives and applied a bunch of the recommendations there, but they did not improve much. So, I am hoping I am missing something which you could guide me towards. Thanks in advance. We are currently writing data and reading in an almost continuous mode (a stream of data written into an HBase table, and then we run a time-based MR on top of this table). We were backed up, and about 1.5 TB of data was loaded into the table, and we began performing time-based scan MRs in 10-minute time intervals (startTime and endTime interval is 10 minutes). Most of the 10-minute intervals had about 100 GB of data to process. Our workflow was primarily to eliminate duplicates from this table. We have maxVersions = 5 for the table. We use TableInputFormat to perform the time-based scan to ensure data locality. In the mapper, we check if there exists a previous version of the row in a time period earlier than the timestamp of the input row. If not, we emit that row. We looked at https://issues.apache.org/jira/browse/HBASE-4683 and hence turned off block cache for this table, with the expectation that the block index and bloom filter will be cached in the block cache. We expect duplicates to be rare and hence hope for most of these checks to be fulfilled by the bloom filter. Unfortunately, we notice very slow performance on account of being disk bound. Looking at jstack, we notice that most of the time, we appear to be hitting disk for the block index. We performed a major compaction and retried, and performance improved some, but not by much. We are processing data at about 2 MB per second. We are using CDH 4.2.1 HBase 0.94.2 and HDFS 2.0.0 running with 8 datanodes/regionservers (each with 32 cores, 4x1TB disks and 60 GB RAM). Anil: You don't have the right balance between disk, CPU and RAM. You have too much CPU and RAM but too few disks. Usually, it's better to have a disk/CPU-core ratio near 0.6-0.8. Yours is around 0.13. This seems to be the biggest reason for your problem. HBase is running with a 30 GB heap size, memstore values being capped at 3 GB and flush thresholds being 0.15 and 0.2. Blockcache is at 0.5 of total heap size (15 GB). We are using SNAPPY for our tables. A couple of questions: * Is the performance of the time-based scan bad after a major compaction? Anil: In general, time-based scans (I am assuming you have built your rowkey on timestamp) are not good for HBase because of region hot-spotting. Have you tried setting the scanner caching to a higher number? * What can we do to help alleviate being disk bound? The typical answer of adding more RAM does not seem to have helped, or we are missing some other config. Anil: Try adding more disks to your machines.
Below are some of the metrics from a Regionserver web UI: requestsPerSecond=5895, numberOfOnlineRegions=60, numberOfStores=60, numberOfStorefiles=209, storefileIndexSizeMB=6, rootIndexSizeKB=7131, totalStaticIndexSizeKB=415995, totalStaticBloomSizeKB=2514675, memstoreSizeMB=0, mbInMemoryWithoutWAL=0, numberOfPutsWithoutWAL=0, readRequestsCount=30589690, writeRequestsCount=0, compactionQueueSize=0, flushQueueSize=0, usedHeapMB=2688, maxHeapMB=30672, blockCacheSizeMB=1604.86, blockCacheFreeMB=13731.24, blockCacheCount=11817, blockCacheHitCount=2759, blockCacheMissCount=25373411, blockCacheEvictedCount=7112, blockCacheHitRatio=52%, blockCacheHitCachingRatio=72%, hdfsBlocksLocalityIndex=91, slowHLogAppendCount=0, fsReadLatencyHistogramMean=15409428.56, fsReadLatencyHistogramCount=1559927, fsReadLatencyHistogramMedian=230609.5, fsReadLatencyHistogram75th=280094.75, fsReadLatencyHistogram95th=9574280.4, fsReadLatencyHistogram99th=100981301.2, fsReadLatencyHistogram999th=511591146.03, fsPreadLatencyHistogramMean=3895616.6, fsPreadLatencyHistogramCount=42, fsPreadLatencyHistogramMedian=954552, fsPreadLatencyHistogram75th=8723662.5, fsPreadLatencyHistogram95th=11159637.65, fsPreadLatencyHistogram99th=37763281.57, fsPreadLatencyHistogram999th=273192813.91, fsWriteLatencyHistogramMean=6124343.91, fsWriteLatencyHistogramCount=114, fsWriteLatencyHistogramMedian=374379, fsWriteLatencyHistogram75th=431395.75, fsWriteLatencyHistogram95th=576853.8, fsWriteLatencyHistogram99th=1034159.75, fsWriteLatencyHistogram999th=5687910.29 key size: 20 bytes Table description: {NAME => 'foo', FAMILIES => [{NAME => 'f', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '5', TTL => '2592000', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', ENCODE_ON_DISK => 'true', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]} -- Thanks & Regards, Anil Gupta
Re: Replication is on columnfamily level or table level?
Yes, replication can be specified at the CF level. You have used HCD#setScope(), right?

S => '3', BLOCKSIZE => '65536'}, {*NAME => 'cf2', REPLICATION_SCOPE => '2'*,

You set the scope as 2?? You want one CF to be replicated to one cluster and another CF to another cluster; I don't think that is supported even now. You can see in the HCD code that there are two constants for scope, 0 and 1, where 1 means replicate and 0 means do not replicate.

-Anoop-

On Wed, Jun 5, 2013 at 3:31 AM, N Dm nid...@gmail.com wrote:

hi, folks, HBase 0.94.3. By reading several documents, I always had the impression that *replication* works at the table-*column-family* level. However, when I set up a table with two column families and replicate them to two different slaves, the whole table is replicated. Is this a bug? Thanks. Here are the simple steps to recreate.

*Environment:* Replication Master: hdtest014; Replication Slave 1: hdtest017; Replication Slave 2: hdtest009

*Create table* on the Master and the two slaves: create 't2_dn','cf1','cf2'

*Set up replication on the Master* (hdtest014), so that:

Master> list_peers
PEER_ID CLUSTER_KEY STATE
1 hdtest017.svl.ibm.com:2181:/hbase ENABLED
2 hdtest009.svl.ibm.com:2181:/hbase ENABLED

Master> describe 't2_dn'
{NAME => 't2_dn', FAMILIES => [{*NAME => 'cf1', REPLICATION_SCOPE => '1'*, KEEP_DELETED_CELLS => 'false', COMPRESSION => 'NONE', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true', MIN_VERSIONS => '0', DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false', BLOOMFILTER => 'NONE', TTL => '2147483647', VERSIONS => '3', BLOCKSIZE => '65536'}, {*NAME => 'cf2', REPLICATION_SCOPE => '2'*, KEEP_DELETED_CELLS => 'false', COMPRESSION => 'NONE', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true', MIN_VERSIONS => '0', DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false', BLOOMFILTER => 'NONE', TTL => '2147483647', VERSIONS => '3', BLOCKSIZE => '65536'}]}
1 row(s) in 0.0250 seconds

*Put rows into t2_dn on the Master:*
put 't2_dn','row1','cf1:q1','val1cf1fromMaster'
put 't2_dn','row1','cf2:q1','val1cf2fromMaster'
put 't2_dn','row2','cf1:q1','val2cf1fromMaster'
put 't2_dn','row3','cf2:q1','val3cf2fromMaster'

*Expecting cf1 to be replicated to slave 1 and cf2 to slave 2. Instead, all three clusters got:*
scan 't2_dn'
ROW COLUMN+CELL
row1 column=cf1:q1, timestamp=1370382328358, value=val1cf1fromMaster
row1 column=cf2:q1, timestamp=1370382334303, value=val1cf2fromMaster
row2 column=cf1:q1, timestamp=1370382351716, value=val2cf1fromMaster
row3 column=cf2:q1, timestamp=1370382367724, value=val3cf2fromMaster
3 row(s) in 0.0160 seconds

Many thanks, Demai
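For reference, a minimal sketch of setting the scope through the Java API (table creation shown for brevity; per the point above, in 0.94 the only meaningful scopes are 0 and 1, and scope selects which families are replicated at all, not which peer they go to):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ReplicationScopeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTableDescriptor desc = new HTableDescriptor("t2_dn");
    HColumnDescriptor cf1 = new HColumnDescriptor("cf1");
    cf1.setScope(1); // 1 = ship edits of cf1 to every enabled peer
    HColumnDescriptor cf2 = new HColumnDescriptor("cf2");
    cf2.setScope(0); // 0 = keep cf2 local; there is no per-peer scope like '2'
    desc.addFamily(cf1);
    desc.addFamily(cf2);
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc);
    admin.close();
  }
}

The shell equivalent is alter 't2_dn', {NAME => 'cf1', REPLICATION_SCOPE => '1'}, with the table disabled first on 0.94.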
Re: Scan + Gets are disk bound
Our row keys do not contain time. By time-based scans, I mean an MR over the HBase table where the scan object has no startRow or endRow but has a startTime and endTime. Our row key format is MD5 of UUID + UUID, so we expect good distribution. We pre-split the table initially to prevent any initial hotspotting. ~Rahul.
Re: Questions about HBase
Thanks for the replies. I'll take a look at src/main/java/org/apache/hadoop/hbase/coprocessor/BaseRegionObserver.java.

@ramkrishna: I do want the bloom filter and block index in memory all the time. For good read performance they're critical in my workflow. The worry is that when HBase is restarted it will take a long time for them to get populated again, and performance will suffer. If there were a way of loading them quickly and warming up the table, then we'd be able to restart HBase without causing a slowdown in processing.

On Tue, Jun 4, 2013 at 9:29 PM, Ted Yu yuzhih...@gmail.com wrote:

bq. But i am not very sure if we can control the files getting selected for compaction in the older versions.

The same mechanism is available in 0.94. Take a look at src/main/java/org/apache/hadoop/hbase/coprocessor/BaseRegionObserver.java, where you will find the following methods (and more):

public void preCompactSelection(final ObserverContext<RegionCoprocessorEnvironment> c, final Store store, final List<StoreFile> candidates, final CompactionRequest request)

public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> e, final Store store, final InternalScanner scanner) throws IOException

Cheers

On Tue, Jun 4, 2013 at 8:14 PM, ramkrishna vasudevan ramkrishna.s.vasude...@gmail.com wrote:

Does minor compaction remove HFiles in which all entries are out of TTL, or does only major compaction do that?
Yes, it applies to minor compactions too.

Is there a way of configuring major compaction to compact only files older than a certain time, or to compact all the files except the latest few?
In the latest trunk version the compaction algorithm itself can be plugged. There are some coprocessor hooks that give control over the scanner that gets created for compaction, with which we can control the KVs being selected. But I am not very sure if we can control the files getting selected for compaction in the older versions.

The above excerpt seems to imply to me that the search for a key inside a block is linear, and I feel I must be reading it wrong. I would expect the scan to be a binary search.
Once the data block is identified for a key, we seek to the beginning of the block and then do a linear search until we reach the exact key that we are looking for. That is because internally the data (KVs) are stored as byte buffers per block, following the pattern <keylength><valuelength><key bytes><value bytes>.

Is there a way to warm up the bloom filter and block index cache for a table?
You always want the bloom and block index to be in cache?

On Wed, Jun 5, 2013 at 7:45 AM, Pankaj Gupta pan...@brightroll.com wrote:

Hi, I have a few small questions regarding HBase. I've searched the forum but couldn't find clear answers, hence asking them here:

1. Does minor compaction remove HFiles in which all entries are out of TTL, or does only major compaction do that? I found this JIRA: https://issues.apache.org/jira/browse/HBASE-5199 but I don't know if the compaction being talked about there is minor or major.

2. Is there a way of configuring major compaction to compact only files older than a certain time, or to compact all the files except the latest few? We basically want to use the time-based filtering optimization in HBase to get the latest additions to the table, and since major compaction bunches everything into one file, it would defeat the optimization.

3. Is there a way to warm up the bloom filter and block index cache for a table?
This is for a case where I always want the bloom filters and index to be in memory, but not the data blocks themselves.

4. This one is related to what I read in the bloom filter section of the HBase Definitive Guide: "Given a random row key you are looking for, it is very likely that this key will fall in between two block start keys. The only way for HBase to figure out if the key actually exists is by loading the block and scanning it to find the key." The above excerpt seems to imply to me that the search for a key inside a block is linear, and I feel I must be reading it wrong. I would expect the scan to be a binary search.

Thanks in advance, Pankaj

-- Pankaj Gupta | Software Engineer, BrightRoll, Inc. | pan...@brightroll.com | (415) 677-9222 ext. 205
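To tie Ted's pointer to question 2, here is a rough sketch of a hook that leaves the newest files out of every compaction. The class name and keep-count are made up, and it assumes 0.94's oldest-first ordering of the candidate list:

import java.util.List;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.Store;
import org.apache.hadoop.hbase.regionserver.StoreFile;
import org.apache.hadoop.hbase.regionserver.compactions.CompactionRequest;

public class KeepLatestFilesObserver extends BaseRegionObserver {
  private static final int FILES_TO_KEEP = 2; // illustrative assumption

  @Override
  public void preCompactSelection(ObserverContext<RegionCoprocessorEnvironment> c,
      Store store, List<StoreFile> candidates, CompactionRequest request) {
    // Candidates arrive ordered oldest first; dropping from the tail leaves
    // the newest files, and their narrow timestamp ranges, untouched.
    for (int i = 0; i < FILES_TO_KEEP && !candidates.isEmpty(); i++) {
      candidates.remove(candidates.size() - 1);
    }
  }
}

Such an observer would be wired in per table via the table's COPROCESSOR attribute, or cluster-wide via hbase.coprocessor.region.classes.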
Re: Questions about HBase
4. This one is related to what I read in the HBase Definitive Guide bloom filter section: "Given a random row key you are looking for, it is very likely that this key will fall in between two block start keys. The only way for HBase to figure out if the key actually exists is by loading the block and scanning it to find the key." The above excerpt seems to imply to me that the search for a key inside a block is linear, and I feel I must be reading it wrong. I would expect the scan to be a binary search.

Yes, as Ram said, using the row key the HFile data block where this key *might* be present can be found; that block is loaded, and then we seek to the exact row key. That seek is a linear read. You can take a look at the Prefix Tree encoder, which is available in 0.95. It tries to avoid this linear read within a block.
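A sketch of opting a family into that encoder; PREFIX_TREE only exists from 0.95 on, while on 0.94 the available encodings (PREFIX, DIFF, FAST_DIFF) compact keys but still scan linearly:

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

public class PrefixTreeExample {
  public static HColumnDescriptor buildFamily() {
    HColumnDescriptor cf = new HColumnDescriptor("f"); // family name is illustrative
    // PREFIX_TREE lays cells out as a trie, so a lookup inside a block can
    // navigate by key instead of walking every cell linearly.
    cf.setDataBlockEncoding(DataBlockEncoding.PREFIX_TREE);
    return cf;
  }
}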
Re: Questions about HBase
If you read the HFile v2 document on the HBase site, you will understand completely how the search for a record works, and why there is a linear search within the block but a binary search to get to the right block. Also bear in mind that the number of keys in a block is not big, since an HFile block is 64 KB (65,536 bytes) by default; thus from a 10 GB HFile you are only scanning one 64 KB block linearly.
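As a toy model of the lookup shape (deliberately not the real HFile classes): binary-search the block index for the last block whose start key is at or before the target, then walk that one block linearly:

import java.util.Arrays;

public class BlockLookupModel {
  // Toy model: blockStartKeys[i] is the first key of blocks[i]; both sorted.
  static int findBlock(String[] blockStartKeys, String key) {
    int pos = Arrays.binarySearch(blockStartKeys, key);
    return pos >= 0 ? pos : Math.max(0, -pos - 2); // last start key <= key
  }

  static boolean containsKey(String[][] blocks, String[] blockStartKeys, String key) {
    for (String k : blocks[findBlock(blockStartKeys, key)]) {
      if (k.equals(key)) return true;         // found within the block
      if (k.compareTo(key) > 0) return false; // keys are sorted; we've passed it
    }
    return false;
  }
}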
Re: Questions about HBase
For the question of whether you will be able to do a warm-up for the bloom filters and block cache: I don't think it is possible now.

Regards,
Ram
Re: Scan + Gets are disk bound
When you set a time range on a Scan, some files can get skipped based on the max/min timestamp values in each file. That said, if you major compact and then scan based on a time range, I don't think you will get much advantage.

-Anoop-
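The pruning Anoop describes boils down to an interval check per store file; a conceptual sketch, not the actual HBase internals:

public class TimeRangePruning {
  // A store file whose [minTs, maxTs] range misses the scan's
  // [startTs, endTs) range can be skipped without reading its blocks.
  // After a major compaction there is a single file spanning all
  // timestamps, so nothing can be skipped.
  static boolean canSkipFile(long fileMinTs, long fileMaxTs,
                             long scanStartTs, long scanEndTs) {
    return fileMaxTs < scanStartTs || fileMinTs >= scanEndTs;
  }
}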
Re: Scan + Gets are disk bound
On Tuesday, June 4, 2013, Rahul Ravindran wrote:

Our workflow was primarily to eliminate duplicates from this table. We have maxVersions = 5 for the table. We use TableInputFormat to perform the time-based scan to ensure data locality. In the mapper, we check whether a previous version of the row exists at a timestamp earlier than that of the input row. If not, we emit that row.

If I understand correctly, for a row key R, column family F and column qualifier C, if you have two values with timestamps 13:00 and 13:02, you want to remove the value associated with 13:02. The best way to do this is to write a simple RegionObserver coprocessor which hooks into the compaction process (preCompact, for instance). In there, for any given R, F, C, simply emit only the earliest-timestamp value (the last one, since timestamps are ordered descending), and that's it. It's a very effective approach, since you are riding on top of an existing process which reads the values anyway, so you are not paying the price of reading the data again in your MR job.

Also, in between major compactions, you can implement the preScan hook in the region observer, so you'll pick up only the earliest-timestamp value, achieving the same result for your client even though you haven't physically removed those values yet. I've implemented this for delayed aggregation of counters, and it works great in production.
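A rough sketch of the compaction-hook approach Asaf describes, against the 0.94-era interfaces. The class name is made up, the exact set of InternalScanner.next() overloads varies across versions, delete markers are passed through untouched, and the metric argument is ignored:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.regionserver.Store;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupCompactionObserver extends BaseRegionObserver {

  @Override
  public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> e,
      Store store, final InternalScanner scanner) {
    return new InternalScanner() {
      // Versions of one row/family/qualifier arrive newest-first, so the
      // last KeyValue of each run is the earliest version, the one to keep.
      private KeyValue pending;

      public boolean next(List<KeyValue> results) throws IOException {
        return next(results, -1, null);
      }
      public boolean next(List<KeyValue> results, String metric) throws IOException {
        return next(results, -1, metric);
      }
      public boolean next(List<KeyValue> results, int limit) throws IOException {
        return next(results, limit, null);
      }
      public boolean next(List<KeyValue> results, int limit, String metric)
          throws IOException {
        List<KeyValue> batch = new ArrayList<KeyValue>();
        boolean more = scanner.next(batch, limit);
        for (KeyValue kv : batch) {
          if (pending != null && !sameCell(pending, kv)) {
            results.add(pending); // run ended: pending was its earliest version
          }
          pending = kv; // an older-timestamp duplicate supersedes the newer one
        }
        if (!more && pending != null) {
          results.add(pending); // flush the final cell
          pending = null;
        }
        return more;
      }
      public void close() throws IOException {
        scanner.close();
      }
    };
  }

  private static boolean sameCell(KeyValue a, KeyValue b) {
    return Bytes.equals(a.getRow(), b.getRow())
        && Bytes.equals(a.getFamily(), b.getFamily())
        && Bytes.equals(a.getQualifier(), b.getQualifier());
  }
}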
Re: Questions about HBase
When you do the first read of this region, wouldn't that load all the bloom filters?

On Wed, Jun 5, 2013 at 8:43 AM, ramkrishna vasudevan ramkrishna.s.vasude...@gmail.com wrote: For the question of whether you will be able to do a warm-up for the bloom filters and block cache: I don't think it is possible now. Regards, Ram
Re: Scan + Gets are disk bound
Thanks for that confirmation. This is what we hypothesized as well. So, if we are dependent on time-range scans, do we need to completely avoid major compaction and depend only on minor compactions? Is there any downside? We do have a TTL set on all the rows in the table. ~Rahul.
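If that route is taken, time-triggered major compactions are usually switched off in hbase-site.xml; a programmatic sketch of the same setting:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class DisableMajorCompactions {
  public static Configuration build() {
    Configuration conf = HBaseConfiguration.create();
    // 0 disables the periodic major compaction; minor compactions still run,
    // and an explicit major_compact from the shell still works.
    conf.setLong("hbase.hregion.majorcompaction", 0);
    return conf;
  }
}

The usual caveat is that delete markers are only purged during major compactions, so a delete-heavy table would accumulate tombstones; a write-once, TTL-expired workload like the one described is less affected, since, per the earlier thread, minor compactions can drop files whose entries are all out of TTL.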