Re: Hbase random read performance
Moving to HBase user mailing list. Can you upgrade to newer release such as 0.94.8 ? Cheers On Jul 8, 2013, at 4:36 AM, Boris Emelyanov emelya...@post.km.ru wrote: I'm trying to configure hbase for fully random read performance, my cluster parameters are: 9 servers as slaves, each has two 1TB HDD as hadoop volumes; data: 800 millions 6-10 KB objects in hbase; HBase Version - 0.90.6-cdh3u5 HBASE_HEAPSIZE=12288; hfile.block.cache.size = 0.4 hfile.min.blocksize.size = 16384 hbase.regionserver.handler.count = 100 I used several recommended tuning solutions, such as * tunrning on bloom filters; * decreasing hbase block size to 16384. However, read performance is still poor, about 800-1000rps. What would you recommend to solve the problem? -- Best regards, Boris.
Re: optimizing block cache requests + eviction
For suggestion #3 below, take a look at: HBASE-7509 Enable RS to query a secondary datanode in parallel, if the primary takes too long Cheers On Mon, Jul 8, 2013 at 3:04 AM, Viral Bajaria viral.baja...@gmail.comwrote: Hi, TL;DR; Trying to make a case for making the block eviction strategy smart and to not evict remote blocks more frequently and make the requests more smarter. The question here comes after I debugged the issue that I was having with random region servers hitting high load averages. I initially thought the problem was hardware related i.e. bad disk or network since the wait I/O was too high but it was a combination of things. I figured with SCR (short circuit read) ON the datanode should almost never show high amount of block requests from the local regionservers. So my starting point for debugging was the datanode since it was doing a ton of I/O. The clienttrace logs helped me figure out which RS nodes were making block requests. I hacked up a script to report which blocks are being requested and how many times per minute. I found that some blocks were being requested 10+ times in a minute and over 2000 times in an hour from the same regionserver. This was causing the server to do 40+MB/s on reads alone. That was on the higher side, the average was closer to 100 or so per hour. Now why did I end up in such a situation. It happened due to the fact that I added servers to the cluster and rebalanced the cluster. At the same time I added some drives and also removed the offending server in my setup. This caused some of the data to not be co-located with the regionservers. Given that major_compaction was disabled and it would not have run for a while (atleast on some tables) these block requests would not go away. One of my regionservers was totally overwhelmed. I made the situation worse when I removed the server that was under heavy load with the assumption that it's a hardware problem with the box without doing a deep dive (doh!). Given that regionservers will be added in the future, I expect block locality to go down till major_compaction runs. Also nodes can go down and cause this problem. So I started thinking of probable solutions, but first some observations. *Observations/Comments* - The surprising part was the regionservers were trying to make so many requests for the same block in the same minute (let alone hour). Could this happen because the original request took a few seconds and so the regionserver re-requested ? I didn't see any fetch errors in the regionserver logs for blocks. - Even more strange; my heap size was at 11G and the time when this was happening, the used heap was at 2-4G. I would have expected the heap to grow higher than that since the blockCache should be using atleast 40% of the available heap space. - Another strange thing that I observed was, the block was being requested from the same datanode every single time. *Possible Solution/Changes* - Would it make sense to give remote blocks higher priority over the local blocks that can be read via SCR and not let them get evicted if there is a tie in which block to evict ? - Should we throttle the number of outgoing requests for a block ? I am not sure if my firewall caused some issue but I wouldn't expect multiple block fetch requests in the same minute. I did see a few RST packets getting dropped at the firewall but I wasn't able to trace the problem was due to this. - We have 3 replicas available, shouldn't we request from the other datanode if one might take a lot of time ? The amount of time it took to read a block went up when the box was under heavy load, yet the re-requests were going to the same one. Is this something that is available on the DFSClient and can we exploit it ? - Is it possible to migrate a region to a server which has higher number of blocks available for it ? We don't need to make this automatic, but we could provide a command that could be invoked manually to assign a region to a specific regionserver. Thoughts ? Thanks, Viral
Using separator/delimiter in HBase rowkey?
Hello, I am trying to get some advice on pros/cons of using separator/delimiter as part of HBase row key. Currently one of our user activity tables has a rowkey design of UserID^TimeStamp with a separator of ^. (UserID is a string that won't include '^'). This is designed for the two common use cases in our system: (1) If we come from a context where the UserID is known, we can do a scan easily for all the user activities with a startRowKey and stopRowKey. (2) If we come from a external networked table where the row key of this user activity table is stored and can be retrieved as activityRowKey, then we can use the following code to parse out the UserID and do the same scan as in (1): String activityRowKeyStr = Bytes.toString(activityRowKey); String userId = activityRowKeyStr.subString(activityRowKeyStr.indexOf(^)+1) Then I can set startRowKey and stopRowKey for the scan based on userId. Here we get benefit of having the User ID as part of the row key with the separator (comparing to another solution that stores the userID as one of the columns in the user activity table). The reason I pick a separator after UserID is that sometimes we may not get a fixed length string of the UserID value. At one point I actually thought of using MD5 to hash the UserID and make it a fixed length, however, the possibility of collision and possible overhead of applying the hash function makes me pick the separator ^. My question: (1) I kind of make the argument that using a separator is kind of better than using a MD5 hash value. Does that seem reasonable? Could you comments on other pros and cons that I might miss (as the bases for my argument)? (2) On using a separator/delimiter, besides the requirements that this separator/delimiter shouldn't appear elsewhere in the rowkey, are there any other requirements? Are there any special separator/delimiters that are better/worse than the average ones? thanks! Jason
Re: Bulk loading HFiles via LoadIncrementalHFiles fails at a region that is being compacted, a bug?
Hello Michael, looking in the code, it seems to me that the 60s is hardcoded, however it retries for, on default, 10 times, so in total 10 minutes wait time, I upped that to 20 times, so now it is 20 minutes for me, but still, we have some pretty big regions whose compaction (which was the case in particular) can take more than 40 minutes. I have split the big regions to alleviate this, so getting a thread dump now will be difficult (this is in production so no problems is the point). Anyway, looking on the code, for me its hard to figure out which actions will block the lock from succeeding on the region at the place I indicated, so was hoping for an answer from an expert. If the compaction blocks the lock, it might be that at unit testing the compactions are faster than 10 minutes so the problem never exhibits. Is this the case? Stan
Re: Using separator/delimiter in HBase rowkey?
Hello Jason, Have you considered the following rowkey? murmur_128(userId) + timestamp + userId ? This handles both of your cases as (1) murmur 128 is much faster than md5 so will have very low overhead and (2) the userid at the end of the key will ensure that no murmur collisions will cause issues. This key also handle incrementing userIds well because close userIds will likely be in separate regions. Cheers, Mike On Mon, Jul 8, 2013 at 10:19 AM, Jason Huang jason.hu...@icare.com wrote: Hello, I am trying to get some advice on pros/cons of using separator/delimiter as part of HBase row key. Currently one of our user activity tables has a rowkey design of UserID^TimeStamp with a separator of ^. (UserID is a string that won't include '^'). This is designed for the two common use cases in our system: (1) If we come from a context where the UserID is known, we can do a scan easily for all the user activities with a startRowKey and stopRowKey. (2) If we come from a external networked table where the row key of this user activity table is stored and can be retrieved as activityRowKey, then we can use the following code to parse out the UserID and do the same scan as in (1): String activityRowKeyStr = Bytes.toString(activityRowKey); String userId = activityRowKeyStr.subString(activityRowKeyStr.indexOf(^)+1) Then I can set startRowKey and stopRowKey for the scan based on userId. Here we get benefit of having the User ID as part of the row key with the separator (comparing to another solution that stores the userID as one of the columns in the user activity table). The reason I pick a separator after UserID is that sometimes we may not get a fixed length string of the UserID value. At one point I actually thought of using MD5 to hash the UserID and make it a fixed length, however, the possibility of collision and possible overhead of applying the hash function makes me pick the separator ^. My question: (1) I kind of make the argument that using a separator is kind of better than using a MD5 hash value. Does that seem reasonable? Could you comments on other pros and cons that I might miss (as the bases for my argument)? (2) On using a separator/delimiter, besides the requirements that this separator/delimiter shouldn't appear elsewhere in the rowkey, are there any other requirements? Are there any special separator/delimiters that are better/worse than the average ones? thanks! Jason
Re: Using separator/delimiter in HBase rowkey?
Not saying this is a solution or better in anyway but just more food for thought. Is there any maximum size limit for UserIds? You can pad also for Users Ids of smaller length. You are using more space in this way though. It can help in sorting as well. Regards, Shahab On Mon, Jul 8, 2013 at 10:19 AM, Jason Huang jason.hu...@icare.com wrote: Hello, I am trying to get some advice on pros/cons of using separator/delimiter as part of HBase row key. Currently one of our user activity tables has a rowkey design of UserID^TimeStamp with a separator of ^. (UserID is a string that won't include '^'). This is designed for the two common use cases in our system: (1) If we come from a context where the UserID is known, we can do a scan easily for all the user activities with a startRowKey and stopRowKey. (2) If we come from a external networked table where the row key of this user activity table is stored and can be retrieved as activityRowKey, then we can use the following code to parse out the UserID and do the same scan as in (1): String activityRowKeyStr = Bytes.toString(activityRowKey); String userId = activityRowKeyStr.subString(activityRowKeyStr.indexOf(^)+1) Then I can set startRowKey and stopRowKey for the scan based on userId. Here we get benefit of having the User ID as part of the row key with the separator (comparing to another solution that stores the userID as one of the columns in the user activity table). The reason I pick a separator after UserID is that sometimes we may not get a fixed length string of the UserID value. At one point I actually thought of using MD5 to hash the UserID and make it a fixed length, however, the possibility of collision and possible overhead of applying the hash function makes me pick the separator ^. My question: (1) I kind of make the argument that using a separator is kind of better than using a MD5 hash value. Does that seem reasonable? Could you comments on other pros and cons that I might miss (as the bases for my argument)? (2) On using a separator/delimiter, besides the requirements that this separator/delimiter shouldn't appear elsewhere in the rowkey, are there any other requirements? Are there any special separator/delimiters that are better/worse than the average ones? thanks! Jason
Re: When to expand vertically vs. horizontally in Hbase
Ian, You still want to stick to your relational modeling. :-( You need to play around more with hierarchical models to get a better appreciation. If you model as if you're working with a RDBMS then you will end up with a poor HBase table design. In ERD models, you don't have the concept of a weak relationship. The weak relationship is that the model has no relationship between the entities. Its the application that manages that. Imagine a reference or look up table that in the model has no association. Using our example of an Order Entry system, its the application that hits the customer lookup table to capture relevant information for the order. That's why I refer to it as a weak association. On Jul 5, 2013, at 6:00 PM, Ian Varley ivar...@salesforce.com wrote: Sure. Maybe it's useful to talk about the functional aspect of relationships in models. In an RDBMS, explicit relationship play a couple roles: - foreign key constraints: don't allow a tuple in relation A to point to a row in relation B that doesn't exist - join optimization - knowledge of how two relations are logically connected can help perform joins in a more optimal way HBase, of course, provides neither of these features out of the box, so there is no difference between an implied (weakly coupled, to use your term) relationship and something stronger. Where it gets interesting is in the kind of denormalization you're talking about, where information that properly belongs to one entity is copied into another one for efficiency's sake, or to get some kind of atomicity protection. Your scenario below is doing this (duplicating customer info in the order records). To be fair, relational DBs also force this kind of behavior sometimes, again for efficiency reasons (we've all done it). HBase just starts there. :) Ian On Jul 5, 2013, at 4:22 PM, Michael Segel michael_se...@hotmail.com wrote: An entity is an entity. When you couple them you are saying that there's a relationship to them in the model. What I am saying is that you can have an HBase model which is not a single table, however when you look at your use case, you are querying data from a single table at a time. Going back to the order entry system. You may have a customer table which maintains all of the information about your customer yet you will also duplicate portions of the data in to the order system. You still have other entities such as your orders, pick slips, shipping and invoices. There won't be a hard or strong relationship between the customer table and the order table. When you go to your ERD tool, you wouldn't show a strong coupling of the data. Does that make sense? On Jul 5, 2013, at 1:56 PM, Ian Varley ivar...@salesforce.com wrote: Mike, what do you mean by you can have entities, except that they are not coupled? You mean, they have no relationship to each other? Or the relationship is defined elsewhere (e.g. application code)? The concept of coupling seems a little overloaded and not as concise here as relationship. Two tuples in a database can have a wide number of relationships to each other; the kinds of relationships that are actively supported differs between a traditional RDBMS and HBase, and proper HBase design requires understand these limitations precisely. I'm not trying to be an ERhttp://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model apologist, there are a lot of ways in which it sucks. :) But if we want to evolve, we can't just pretend there's no history here to build on. Ian On Jul 5, 2013, at 1:41 PM, Michael Segel wrote: LOL... Ian wrote: But, something just occurred to me: just because your physical implementation (HBase) doesn't support normalized entities and relationships doesn't mean your *problem* doesn't have entities and relationships. :) An Author is one entity, a Title is another, and a Genre is a third. Understanding how they interact is a prerequisite for translating into a physical model that works well in HBase. (ERD modeling is not categorically the only way to understand that, but I've yet to hear a credible alternative that doesn't boil down to either ERD or do it in your head). You can have entities, except that they are not coupled. If you have a common key, then you may have a use for column families, it just depends on your data and how you access your data. Its not rocket science, but its a non-trivial matter. Not doing it right may mean that you are not going to get the most out of your system. On Jul 5, 2013, at 1:26 PM, Ian Varley ivar...@salesforce.commailto:ivar...@salesforce.com wrote: But, something just occurred to me: just because your physical implementation (HBase) doesn't support normalized entities and relationships doesn't mean your *problem* doesn't have entities and relationships. :) An Author is one entity, a Title is
Re: Using separator/delimiter in HBase rowkey?
Is murmur part of the standard java libraries? If not, you end up having to do a bit more maintenance of your cluster and that's going to be part of your tradeoff. On Jul 8, 2013, at 10:14 AM, Mike Axiak m...@axiak.net wrote: Hello Jason, Have you considered the following rowkey? murmur_128(userId) + timestamp + userId ? This handles both of your cases as (1) murmur 128 is much faster than md5 so will have very low overhead and (2) the userid at the end of the key will ensure that no murmur collisions will cause issues. This key also handle incrementing userIds well because close userIds will likely be in separate regions. Cheers, Mike On Mon, Jul 8, 2013 at 10:19 AM, Jason Huang jason.hu...@icare.com wrote: Hello, I am trying to get some advice on pros/cons of using separator/delimiter as part of HBase row key. Currently one of our user activity tables has a rowkey design of UserID^TimeStamp with a separator of ^. (UserID is a string that won't include '^'). This is designed for the two common use cases in our system: (1) If we come from a context where the UserID is known, we can do a scan easily for all the user activities with a startRowKey and stopRowKey. (2) If we come from a external networked table where the row key of this user activity table is stored and can be retrieved as activityRowKey, then we can use the following code to parse out the UserID and do the same scan as in (1): String activityRowKeyStr = Bytes.toString(activityRowKey); String userId = activityRowKeyStr.subString(activityRowKeyStr.indexOf(^)+1) Then I can set startRowKey and stopRowKey for the scan based on userId. Here we get benefit of having the User ID as part of the row key with the separator (comparing to another solution that stores the userID as one of the columns in the user activity table). The reason I pick a separator after UserID is that sometimes we may not get a fixed length string of the UserID value. At one point I actually thought of using MD5 to hash the UserID and make it a fixed length, however, the possibility of collision and possible overhead of applying the hash function makes me pick the separator ^. My question: (1) I kind of make the argument that using a separator is kind of better than using a MD5 hash value. Does that seem reasonable? Could you comments on other pros and cons that I might miss (as the bases for my argument)? (2) On using a separator/delimiter, besides the requirements that this separator/delimiter shouldn't appear elsewhere in the rowkey, are there any other requirements? Are there any special separator/delimiters that are better/worse than the average ones? thanks! Jason
Re: Using separator/delimiter in HBase rowkey?
On Mon, Jul 8, 2013 at 11:29 AM, Michael Segel michael_se...@hotmail.com wrote: If not, you end up having to do a bit more maintenance of your cluster and that's going to be part of your tradeoff. How so? -Mike
Re: Using separator/delimiter in HBase rowkey?
In 0.94, we have src/main/java/org/apache/hadoop/hbase/util/MurmurHash.java For hadoop 1, there is src/core/org/apache/hadoop/util/hash/MurmurHash.java Cheers On Mon, Jul 8, 2013 at 8:29 AM, Michael Segel michael_se...@hotmail.comwrote: Is murmur part of the standard java libraries? If not, you end up having to do a bit more maintenance of your cluster and that's going to be part of your tradeoff. On Jul 8, 2013, at 10:14 AM, Mike Axiak m...@axiak.net wrote: Hello Jason, Have you considered the following rowkey? murmur_128(userId) + timestamp + userId ? This handles both of your cases as (1) murmur 128 is much faster than md5 so will have very low overhead and (2) the userid at the end of the key will ensure that no murmur collisions will cause issues. This key also handle incrementing userIds well because close userIds will likely be in separate regions. Cheers, Mike On Mon, Jul 8, 2013 at 10:19 AM, Jason Huang jason.hu...@icare.com wrote: Hello, I am trying to get some advice on pros/cons of using separator/delimiter as part of HBase row key. Currently one of our user activity tables has a rowkey design of UserID^TimeStamp with a separator of ^. (UserID is a string that won't include '^'). This is designed for the two common use cases in our system: (1) If we come from a context where the UserID is known, we can do a scan easily for all the user activities with a startRowKey and stopRowKey. (2) If we come from a external networked table where the row key of this user activity table is stored and can be retrieved as activityRowKey, then we can use the following code to parse out the UserID and do the same scan as in (1): String activityRowKeyStr = Bytes.toString(activityRowKey); String userId = activityRowKeyStr.subString(activityRowKeyStr.indexOf(^)+1) Then I can set startRowKey and stopRowKey for the scan based on userId. Here we get benefit of having the User ID as part of the row key with the separator (comparing to another solution that stores the userID as one of the columns in the user activity table). The reason I pick a separator after UserID is that sometimes we may not get a fixed length string of the UserID value. At one point I actually thought of using MD5 to hash the UserID and make it a fixed length, however, the possibility of collision and possible overhead of applying the hash function makes me pick the separator ^. My question: (1) I kind of make the argument that using a separator is kind of better than using a MD5 hash value. Does that seem reasonable? Could you comments on other pros and cons that I might miss (as the bases for my argument)? (2) On using a separator/delimiter, besides the requirements that this separator/delimiter shouldn't appear elsewhere in the rowkey, are there any other requirements? Are there any special separator/delimiters that are better/worse than the average ones? thanks! Jason
Re: optimizing block cache requests + eviction
Would it make sense to give remote blocks higher priority over the local blocks that can be read via SCR and not let them get evicted if there is a tie in which block to evict ? That sounds like a reasonable idea. As are the others. But first, could this be a bug? What version of HBase? Were you able to take and save stack traces from the offending RegionServers while they were engaged in this behavior? (If not, maybe next time?) What were the HBase clients doing at the time that might correlate? Are there a set of steps that can be distilled that tend to trigger what you are seeing? On Monday, July 8, 2013, Viral Bajaria wrote: Hi, TL;DR; Trying to make a case for making the block eviction strategy smart and to not evict remote blocks more frequently and make the requests more smarter. The question here comes after I debugged the issue that I was having with random region servers hitting high load averages. I initially thought the problem was hardware related i.e. bad disk or network since the wait I/O was too high but it was a combination of things. I figured with SCR (short circuit read) ON the datanode should almost never show high amount of block requests from the local regionservers. So my starting point for debugging was the datanode since it was doing a ton of I/O. The clienttrace logs helped me figure out which RS nodes were making block requests. I hacked up a script to report which blocks are being requested and how many times per minute. I found that some blocks were being requested 10+ times in a minute and over 2000 times in an hour from the same regionserver. This was causing the server to do 40+MB/s on reads alone. That was on the higher side, the average was closer to 100 or so per hour. Now why did I end up in such a situation. It happened due to the fact that I added servers to the cluster and rebalanced the cluster. At the same time I added some drives and also removed the offending server in my setup. This caused some of the data to not be co-located with the regionservers. Given that major_compaction was disabled and it would not have run for a while (atleast on some tables) these block requests would not go away. One of my regionservers was totally overwhelmed. I made the situation worse when I removed the server that was under heavy load with the assumption that it's a hardware problem with the box without doing a deep dive (doh!). Given that regionservers will be added in the future, I expect block locality to go down till major_compaction runs. Also nodes can go down and cause this problem. So I started thinking of probable solutions, but first some observations. *Observations/Comments* - The surprising part was the regionservers were trying to make so many requests for the same block in the same minute (let alone hour). Could this happen because the original request took a few seconds and so the regionserver re-requested ? I didn't see any fetch errors in the regionserver logs for blocks. - Even more strange; my heap size was at 11G and the time when this was happening, the used heap was at 2-4G. I would have expected the heap to grow higher than that since the blockCache should be using atleast 40% of the available heap space. - Another strange thing that I observed was, the block was being requested from the same datanode every single time. *Possible Solution/Changes* - Would it make sense to give remote blocks higher priority over the local blocks that can be read via SCR and not let them get evicted if there is a tie in which block to evict ? - Should we throttle the number of outgoing requests for a block ? I am not sure if my firewall caused some issue but I wouldn't expect multiple block fetch requests in the same minute. I did see a few RST packets getting dropped at the firewall but I wasn't able to trace the problem was due to this. - We have 3 replicas available, shouldn't we request from the other datanode if one might take a lot of time ? The amount of time it took to read a block went up when the box was under heavy load, yet the re-requests were going to the same one. Is this something that is available on the DFSClient and can we exploit it ? - Is it possible to migrate a region to a server which has higher number of blocks available for it ? We don't need to make this automatic, but we could provide a command that could be invoked manually to assign a region to a specific regionserver. Thoughts ? Thanks, Viral -- Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
Re: Using separator/delimiter in HBase rowkey?
You will need to put the jar into either every app that runs, or you will need to put it on every node. Every upgrade, you will need to make sure its still in your class path. More work for the admins. So how much faster is it over MD5? MD5 and SHA-1 are part of the Java libraries that ship w Sun/Oracle so you have them already installed and in your class path. Just saying... ;-) On Jul 8, 2013, at 10:36 AM, Mike Axiak m...@axiak.net wrote: On Mon, Jul 8, 2013 at 11:29 AM, Michael Segel michael_se...@hotmail.com wrote: If not, you end up having to do a bit more maintenance of your cluster and that's going to be part of your tradeoff. How so? -Mike
Re: Using separator/delimiter in HBase rowkey?
http://jsperf.com/murmur3-performance might be related. On Mon, Jul 8, 2013 at 8:54 AM, Michael Segel michael_se...@hotmail.comwrote: You will need to put the jar into either every app that runs, or you will need to put it on every node. Every upgrade, you will need to make sure its still in your class path. More work for the admins. So how much faster is it over MD5? MD5 and SHA-1 are part of the Java libraries that ship w Sun/Oracle so you have them already installed and in your class path. Just saying... ;-) On Jul 8, 2013, at 10:36 AM, Mike Axiak m...@axiak.net wrote: On Mon, Jul 8, 2013 at 11:29 AM, Michael Segel michael_se...@hotmail.com wrote: If not, you end up having to do a bit more maintenance of your cluster and that's going to be part of your tradeoff. How so? -Mike
Re: Using separator/delimiter in HBase rowkey?
I just don't understand.. every creation/interpretation of the key is going to require code to be used. The murmur implementation is with that code. How is there any extra burden? On Mon, Jul 8, 2013 at 11:54 AM, Michael Segel michael_se...@hotmail.com wrote: You will need to put the jar into either every app that runs, or you will need to put it on every node. Every upgrade, you will need to make sure its still in your class path. More work for the admins. So how much faster is it over MD5? MD5 and SHA-1 are part of the Java libraries that ship w Sun/Oracle so you have them already installed and in your class path. Just saying... ;-) On Jul 8, 2013, at 10:36 AM, Mike Axiak m...@axiak.net wrote: On Mon, Jul 8, 2013 at 11:29 AM, Michael Segel michael_se...@hotmail.com wrote: If not, you end up having to do a bit more maintenance of your cluster and that's going to be part of your tradeoff. How so? -Mike
Re: Using separator/delimiter in HBase rowkey?
Where is murmur? In your app? So then every app that wants to fetch that row must now use murmur. Added to Hadoop/HBase? Then when you do upgrades you have to make sure that the package is still in your class path. Note that different vendor's release management will mean YMMV as to what happens to your class paths and set up or if the jar gets blown out of the directory. Is the added cost in maintenance worth it? I don't know but I seriously doubt it. On Jul 8, 2013, at 11:00 AM, Mike Axiak m...@axiak.net wrote: I just don't understand.. every creation/interpretation of the key is going to require code to be used. The murmur implementation is with that code. How is there any extra burden? On Mon, Jul 8, 2013 at 11:54 AM, Michael Segel michael_se...@hotmail.com wrote: You will need to put the jar into either every app that runs, or you will need to put it on every node. Every upgrade, you will need to make sure its still in your class path. More work for the admins. So how much faster is it over MD5? MD5 and SHA-1 are part of the Java libraries that ship w Sun/Oracle so you have them already installed and in your class path. Just saying... ;-) On Jul 8, 2013, at 10:36 AM, Mike Axiak m...@axiak.net wrote: On Mon, Jul 8, 2013 at 11:29 AM, Michael Segel michael_se...@hotmail.com wrote: If not, you end up having to do a bit more maintenance of your cluster and that's going to be part of your tradeoff. How so? -Mike
RE: optimizing block cache requests + eviction
Viral, From what you described here I can conclude that either: 1. your data access pattern is random and uniform (no data locality at all) 2. or you trash block cache with scan operations with block cache enabled. or both. but the idea of treating local and remote blocks differently is very good. +1. Best regards, Vladimir Rodionov Principal Platform Engineer Carrier IQ, www.carrieriq.com e-mail: vrodio...@carrieriq.com From: Viral Bajaria [viral.baja...@gmail.com] Sent: Monday, July 08, 2013 3:04 AM To: user@hbase.apache.org Subject: optimizing block cache requests + eviction Hi, TL;DR; Trying to make a case for making the block eviction strategy smart and to not evict remote blocks more frequently and make the requests more smarter. The question here comes after I debugged the issue that I was having with random region servers hitting high load averages. I initially thought the problem was hardware related i.e. bad disk or network since the wait I/O was too high but it was a combination of things. I figured with SCR (short circuit read) ON the datanode should almost never show high amount of block requests from the local regionservers. So my starting point for debugging was the datanode since it was doing a ton of I/O. The clienttrace logs helped me figure out which RS nodes were making block requests. I hacked up a script to report which blocks are being requested and how many times per minute. I found that some blocks were being requested 10+ times in a minute and over 2000 times in an hour from the same regionserver. This was causing the server to do 40+MB/s on reads alone. That was on the higher side, the average was closer to 100 or so per hour. Now why did I end up in such a situation. It happened due to the fact that I added servers to the cluster and rebalanced the cluster. At the same time I added some drives and also removed the offending server in my setup. This caused some of the data to not be co-located with the regionservers. Given that major_compaction was disabled and it would not have run for a while (atleast on some tables) these block requests would not go away. One of my regionservers was totally overwhelmed. I made the situation worse when I removed the server that was under heavy load with the assumption that it's a hardware problem with the box without doing a deep dive (doh!). Given that regionservers will be added in the future, I expect block locality to go down till major_compaction runs. Also nodes can go down and cause this problem. So I started thinking of probable solutions, but first some observations. *Observations/Comments* - The surprising part was the regionservers were trying to make so many requests for the same block in the same minute (let alone hour). Could this happen because the original request took a few seconds and so the regionserver re-requested ? I didn't see any fetch errors in the regionserver logs for blocks. - Even more strange; my heap size was at 11G and the time when this was happening, the used heap was at 2-4G. I would have expected the heap to grow higher than that since the blockCache should be using atleast 40% of the available heap space. - Another strange thing that I observed was, the block was being requested from the same datanode every single time. *Possible Solution/Changes* - Would it make sense to give remote blocks higher priority over the local blocks that can be read via SCR and not let them get evicted if there is a tie in which block to evict ? - Should we throttle the number of outgoing requests for a block ? I am not sure if my firewall caused some issue but I wouldn't expect multiple block fetch requests in the same minute. I did see a few RST packets getting dropped at the firewall but I wasn't able to trace the problem was due to this. - We have 3 replicas available, shouldn't we request from the other datanode if one might take a lot of time ? The amount of time it took to read a block went up when the box was under heavy load, yet the re-requests were going to the same one. Is this something that is available on the DFSClient and can we exploit it ? - Is it possible to migrate a region to a server which has higher number of blocks available for it ? We don't need to make this automatic, but we could provide a command that could be invoked manually to assign a region to a specific regionserver. Thoughts ? Thanks, Viral Confidentiality Notice: The information contained in this message, including any attachments hereto, may be confidential and is intended to be read only by the individual or entity to whom this message is addressed. If the reader of this message is not the intended recipient or an agent or designee of the intended recipient, please note that any review, use, disclosure or distribution of this message or its attachments, in any form, is strictly prohibited. If you have received this message in error, please immediately notify the sender and/or
GC recommendations for large Region Server heaps
Hello: We have an HBase cluster with region servers running on 8GB heap size with a 0.6 block cache (it is a read heavy cluster, with bursty write traffic via MR jobs). (version: hbase-0.94.6.1) During HBaseCon, while speaking to a few attendees, I heard some folks were running region servers as high as 24GB and some others in the 16GB range. So - question: Are there any special GC recommendations (tuning parameters, flags, etc) that folks who run at these large heaps can recommend while moving up from an 8GB heap? i.e. for 16GB and for 24GB RS heaps ... ? I'm especially concerned about long pauses causing zk session timeouts and consequent RS shutdowns. Our boxes do have a lot of RAM and we are exploring how we can use more of it for the cluster while maintaining overall stability. Also - if there are clusters running multiple region servers per host, I'd be very interested to know what RS heap sizes those are being run at ... and whether this was chosen as an alternative to running a single RS with large heap. (I know I'll have to test the GC stuff out on my cluster and for my workloads anyway ... but just trying to get a feel of what sort of tuning options had to be used to have a stable HBase cluster with 16 or 24GB RS heaps). Thanks in advance, --Suraj
Re: GC recommendations for large Region Server heaps
Hey Suraj, I would recommend turning on MSlab and using the following settings: -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled On Mon, Jul 8, 2013 at 2:09 PM, Suraj Varma svarma...@gmail.com wrote: Hello: We have an HBase cluster with region servers running on 8GB heap size with a 0.6 block cache (it is a read heavy cluster, with bursty write traffic via MR jobs). (version: hbase-0.94.6.1) During HBaseCon, while speaking to a few attendees, I heard some folks were running region servers as high as 24GB and some others in the 16GB range. So - question: Are there any special GC recommendations (tuning parameters, flags, etc) that folks who run at these large heaps can recommend while moving up from an 8GB heap? i.e. for 16GB and for 24GB RS heaps ... ? I'm especially concerned about long pauses causing zk session timeouts and consequent RS shutdowns. Our boxes do have a lot of RAM and we are exploring how we can use more of it for the cluster while maintaining overall stability. Also - if there are clusters running multiple region servers per host, I'd be very interested to know what RS heap sizes those are being run at ... and whether this was chosen as an alternative to running a single RS with large heap. (I know I'll have to test the GC stuff out on my cluster and for my workloads anyway ... but just trying to get a feel of what sort of tuning options had to be used to have a stable HBase cluster with 16 or 24GB RS heaps). Thanks in advance, --Suraj -- Kevin O'Dell Systems Engineer, Cloudera
RE: GC recommendations for large Region Server heaps
I'm especially concerned about long pauses causing zk session timeouts and consequent RS shutdowns. Our boxes do have a lot of RAM and we are exploring how we can use more of it for the cluster while maintaining overall stability. Suraj, what is the maximum GC pause you have observed in your cluster so far? I do not think s-t-w GC pause on 8GB heap will lasts longer than 5-8 secs. Best regards, Vladimir Rodionov Principal Platform Engineer Carrier IQ, www.carrieriq.com e-mail: vrodio...@carrieriq.com From: Suraj Varma [svarma...@gmail.com] Sent: Monday, July 08, 2013 11:09 AM To: user@hbase.apache.org Subject: GC recommendations for large Region Server heaps Hello: We have an HBase cluster with region servers running on 8GB heap size with a 0.6 block cache (it is a read heavy cluster, with bursty write traffic via MR jobs). (version: hbase-0.94.6.1) During HBaseCon, while speaking to a few attendees, I heard some folks were running region servers as high as 24GB and some others in the 16GB range. So - question: Are there any special GC recommendations (tuning parameters, flags, etc) that folks who run at these large heaps can recommend while moving up from an 8GB heap? i.e. for 16GB and for 24GB RS heaps ... ? I'm especially concerned about long pauses causing zk session timeouts and consequent RS shutdowns. Our boxes do have a lot of RAM and we are exploring how we can use more of it for the cluster while maintaining overall stability. Also - if there are clusters running multiple region servers per host, I'd be very interested to know what RS heap sizes those are being run at ... and whether this was chosen as an alternative to running a single RS with large heap. (I know I'll have to test the GC stuff out on my cluster and for my workloads anyway ... but just trying to get a feel of what sort of tuning options had to be used to have a stable HBase cluster with 16 or 24GB RS heaps). Thanks in advance, --Suraj Confidentiality Notice: The information contained in this message, including any attachments hereto, may be confidential and is intended to be read only by the individual or entity to whom this message is addressed. If the reader of this message is not the intended recipient or an agent or designee of the intended recipient, please note that any review, use, disclosure or distribution of this message or its attachments, in any form, is strictly prohibited. If you have received this message in error, please immediately notify the sender and/or notificati...@carrieriq.com and delete or destroy any copy of this message and its attachments.
Re: optimizing block cache requests + eviction
Thanks guys for going through that never-ending email! I will create the JIRA for block cache eviction and the regionserver assignment command. Ted already pointed to the JIRA which tries to go a different datanode if the primary is busy (I will add comments to that one). To answer Andrews' questions: - I am using HBase 0.94.4 - I tried taking a stack trace using jstack but after the dump it crashed the regionserver. I also did not take the dump on the offending regionserver, rather took it on the regionservers that were making the block count. I will take a stack trace on the offending server. Is there any other tool besides jstack ? I don't want to crash my regionserver. - The HBase clients workload is fairly random and I write to a table every 4-5 seconds. I have varying workloads for different tables. But I do a lot of batching on the client side and group similar rowkeys together before doing a GET/PUT. For example: best case I end up doing ~100 puts every second to a region or in the worst case it's ~5K puts every second. But again since the workload is fairly random. Currently the clients for the table which had the most amount of data has been disabled and yet I see the heavy loads. To answer Vladimir's points: - Data access pattern definitely turns out to be uniform over a period of time. - I just did a sweep of my code base and found that there are a few places where Scanner are using block cache. I will disable that and see how it goes. Thanks, Viral
Re: optimizing block cache requests + eviction
I was able to reproduce the same regionserver asking for the same local block over 300 times within the same 2 minute window by running one of my heavy workloads. Let me try and gather some stack dumps. I agree that jstack crashing the jvm is concerning but there is nothing in the errors to know why it happened. I will keep that conversation out of here. As an addendum, I am using asynchbase as my client. Not sure if the arrival of multiple requests for rowkeys that could be in the same non-cached block causes hbase to queue up a non-cached block read via SCR and since the box is under load, it queues up multiple of these and makes the problem worse. Thanks, Viral On Mon, Jul 8, 2013 at 3:53 PM, Andrew Purtell apurt...@apache.org wrote: but unless the behavior you see is the _same_ regionserver asking for the _same_ block many times consecutively, it's probably workload related.
Re: optimizing block cache requests + eviction
Do you know if it's a data or meta block? J-D On Mon, Jul 8, 2013 at 4:28 PM, Viral Bajaria viral.baja...@gmail.com wrote: I was able to reproduce the same regionserver asking for the same local block over 300 times within the same 2 minute window by running one of my heavy workloads. Let me try and gather some stack dumps. I agree that jstack crashing the jvm is concerning but there is nothing in the errors to know why it happened. I will keep that conversation out of here. As an addendum, I am using asynchbase as my client. Not sure if the arrival of multiple requests for rowkeys that could be in the same non-cached block causes hbase to queue up a non-cached block read via SCR and since the box is under load, it queues up multiple of these and makes the problem worse. Thanks, Viral On Mon, Jul 8, 2013 at 3:53 PM, Andrew Purtell apurt...@apache.org wrote: but unless the behavior you see is the _same_ regionserver asking for the _same_ block many times consecutively, it's probably workload related.
Re: optimizing block cache requests + eviction
Good question. When I looked at the logs, it's not clear from it whether it's reading a meta or data block. Is there any kind of log line that indicates that ? Given that it's saying that it's ready from a startOffset I would assume this is a data block. A question that comes to mind, is this read doing a seek to that position directly or is it going to cache the block ? Looks like it is not caching the block if it's reading directly from a given offset. Or am I wrong ? Following is a sample line that I used while debugging: 2013-07-08 22:58:55,221 DEBUG org.apache.hadoop.hdfs.DFSClient: New BlockReaderLocal for file /mnt/data/current/subdir34/subdir26/blk_-448970697931783518 of size 67108864 startOffset 13006577 length 54102287 short circuit checksum true On Mon, Jul 8, 2013 at 4:37 PM, Jean-Daniel Cryans jdcry...@apache.orgwrote: Do you know if it's a data or meta block?
Re: optimizing block cache requests + eviction
FYI, if u disable your block cache - you will ask for Index blocks for every single request. So such a high rate of request is plausible for Index blocks even when your requests are totally random on your data. Varun On Mon, Jul 8, 2013 at 4:45 PM, Viral Bajaria viral.baja...@gmail.comwrote: Good question. When I looked at the logs, it's not clear from it whether it's reading a meta or data block. Is there any kind of log line that indicates that ? Given that it's saying that it's ready from a startOffset I would assume this is a data block. A question that comes to mind, is this read doing a seek to that position directly or is it going to cache the block ? Looks like it is not caching the block if it's reading directly from a given offset. Or am I wrong ? Following is a sample line that I used while debugging: 2013-07-08 22:58:55,221 DEBUG org.apache.hadoop.hdfs.DFSClient: New BlockReaderLocal for file /mnt/data/current/subdir34/subdir26/blk_-448970697931783518 of size 67108864 startOffset 13006577 length 54102287 short circuit checksum true On Mon, Jul 8, 2013 at 4:37 PM, Jean-Daniel Cryans jdcry...@apache.org wrote: Do you know if it's a data or meta block?
Re: optimizing block cache requests + eviction
meta blocks are at the end: http://hbase.apache.org/book.html#d2617e12979, a way to tell would be by logging from the HBase side but then I guess it's hard to reconcile with which file we're actually reading from... Regarding your second question, you are asking if we block HDFS blocks? We don't, since we don't even know about HDFS blocks. The BlockReader seeks into the file and return whatever data is asked. J-D On Mon, Jul 8, 2013 at 4:45 PM, Viral Bajaria viral.baja...@gmail.com wrote: Good question. When I looked at the logs, it's not clear from it whether it's reading a meta or data block. Is there any kind of log line that indicates that ? Given that it's saying that it's ready from a startOffset I would assume this is a data block. A question that comes to mind, is this read doing a seek to that position directly or is it going to cache the block ? Looks like it is not caching the block if it's reading directly from a given offset. Or am I wrong ? Following is a sample line that I used while debugging: 2013-07-08 22:58:55,221 DEBUG org.apache.hadoop.hdfs.DFSClient: New BlockReaderLocal for file /mnt/data/current/subdir34/subdir26/blk_-448970697931783518 of size 67108864 startOffset 13006577 length 54102287 short circuit checksum true On Mon, Jul 8, 2013 at 4:37 PM, Jean-Daniel Cryans jdcry...@apache.orgwrote: Do you know if it's a data or meta block?
Re: optimizing block cache requests + eviction
We haven't disable block cache. So I doubt that's the problem. On Mon, Jul 8, 2013 at 4:50 PM, Varun Sharma va...@pinterest.com wrote: FYI, if u disable your block cache - you will ask for Index blocks for every single request. So such a high rate of request is plausible for Index blocks even when your requests are totally random on your data. Varun
Re: Using separator/delimiter in HBase rowkey?
thanks for all these valuable comments. Jason On Mon, Jul 8, 2013 at 12:25 PM, Michael Segel michael_se...@hotmail.comwrote: Where is murmur? In your app? So then every app that wants to fetch that row must now use murmur. Added to Hadoop/HBase? Then when you do upgrades you have to make sure that the package is still in your class path. Note that different vendor's release management will mean YMMV as to what happens to your class paths and set up or if the jar gets blown out of the directory. Is the added cost in maintenance worth it? I don't know but I seriously doubt it. On Jul 8, 2013, at 11:00 AM, Mike Axiak m...@axiak.net wrote: I just don't understand.. every creation/interpretation of the key is going to require code to be used. The murmur implementation is with that code. How is there any extra burden? On Mon, Jul 8, 2013 at 11:54 AM, Michael Segel michael_se...@hotmail.com wrote: You will need to put the jar into either every app that runs, or you will need to put it on every node. Every upgrade, you will need to make sure its still in your class path. More work for the admins. So how much faster is it over MD5? MD5 and SHA-1 are part of the Java libraries that ship w Sun/Oracle so you have them already installed and in your class path. Just saying... ;-) On Jul 8, 2013, at 10:36 AM, Mike Axiak m...@axiak.net wrote: On Mon, Jul 8, 2013 at 11:29 AM, Michael Segel michael_se...@hotmail.com wrote: If not, you end up having to do a bit more maintenance of your cluster and that's going to be part of your tradeoff. How so? -Mike
Re: GC recommendations for large Region Server heaps
Hi, Check http://blog.sematext.com/2013/06/24/g1-cms-java-garbage-collector/ Those graphs show RegionServer before and after switch to G1. The dashboard screenshot further below shows CMS (top row) vs. G1 (bottom row). After those tests we ended up switching to G1 across the whole cluster and haven't had issues or major pauses since knock on keyboard. Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Mon, Jul 8, 2013 at 2:56 PM, Stack st...@duboce.net wrote: On Mon, Jul 8, 2013 at 11:09 AM, Suraj Varma svarma...@gmail.com wrote: Hello: We have an HBase cluster with region servers running on 8GB heap size with a 0.6 block cache (it is a read heavy cluster, with bursty write traffic via MR jobs). (version: hbase-0.94.6.1) During HBaseCon, while speaking to a few attendees, I heard some folks were running region servers as high as 24GB and some others in the 16GB range. So - question: Are there any special GC recommendations (tuning parameters, flags, etc) that folks who run at these large heaps can recommend while moving up from an 8GB heap? i.e. for 16GB and for 24GB RS heaps ... ? I'm especially concerned about long pauses causing zk session timeouts and consequent RS shutdowns. Our boxes do have a lot of RAM and we are exploring how we can use more of it for the cluster while maintaining overall stability. Also - if there are clusters running multiple region servers per host, I'd be very interested to know what RS heap sizes those are being run at ... and whether this was chosen as an alternative to running a single RS with large heap. (I know I'll have to test the GC stuff out on my cluster and for my workloads anyway ... but just trying to get a feel of what sort of tuning options had to be used to have a stable HBase cluster with 16 or 24GB RS heaps). You hit full GC in this 8G heap Suraj? Can you try running one server at 24G to see how it does (with GC logging enabled so you can watch it over time)? On one hand, more heap may make it so you avoid full GC -- if you are hitting them now at 8G -- because application has more head room. On other hand, yes, if a full GC hits, it will be gone for proportionally longer than for your 8G heap. St.Ack
Re: GC recommendations for large Region Server heaps
This is my HBASE GC options of CMS, it does work well. XX:+DisableExplicitGC -XX:+UseCompressedOops -XX:PermSize=160m -XX:MaxPermSize=160m -XX:GCTimeRatio=19 -XX:SoftRefLRUPolicyMSPerMB=0 -XX:SurvivorRatio=2 -XX:MaxTenuringThreshold=1 -XX:+UseFastAccessorMethods -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSClassUnloadingEnabled -XX:CMSMaxAbortablePrecleanTime=300 -XX:+CMSScavengeBeforeRemark On Tue, Jul 9, 2013 at 1:12 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, Check http://blog.sematext.com/2013/06/24/g1-cms-java-garbage-collector/ Those graphs show RegionServer before and after switch to G1. The dashboard screenshot further below shows CMS (top row) vs. G1 (bottom row). After those tests we ended up switching to G1 across the whole cluster and haven't had issues or major pauses since knock on keyboard. Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Mon, Jul 8, 2013 at 2:56 PM, Stack st...@duboce.net wrote: On Mon, Jul 8, 2013 at 11:09 AM, Suraj Varma svarma...@gmail.com wrote: Hello: We have an HBase cluster with region servers running on 8GB heap size with a 0.6 block cache (it is a read heavy cluster, with bursty write traffic via MR jobs). (version: hbase-0.94.6.1) During HBaseCon, while speaking to a few attendees, I heard some folks were running region servers as high as 24GB and some others in the 16GB range. So - question: Are there any special GC recommendations (tuning parameters, flags, etc) that folks who run at these large heaps can recommend while moving up from an 8GB heap? i.e. for 16GB and for 24GB RS heaps ... ? I'm especially concerned about long pauses causing zk session timeouts and consequent RS shutdowns. Our boxes do have a lot of RAM and we are exploring how we can use more of it for the cluster while maintaining overall stability. Also - if there are clusters running multiple region servers per host, I'd be very interested to know what RS heap sizes those are being run at ... and whether this was chosen as an alternative to running a single RS with large heap. (I know I'll have to test the GC stuff out on my cluster and for my workloads anyway ... but just trying to get a feel of what sort of tuning options had to be used to have a stable HBase cluster with 16 or 24GB RS heaps). You hit full GC in this 8G heap Suraj? Can you try running one server at 24G to see how it does (with GC logging enabled so you can watch it over time)? On one hand, more heap may make it so you avoid full GC -- if you are hitting them now at 8G -- because application has more head room. On other hand, yes, if a full GC hits, it will be gone for proportionally longer than for your 8G heap. St.Ack