Specify the size of an HRegion when creating a table?
Can we specify the size of an HRegion when creating a table, and how?
Re: Any update for this issue HBASE-3529
Boris Lublinsky and I published an article on InfoQ about something we did a few years ago: a small project integrating Lucene and HBase for searching POI data. It would probably do what you want. Sent from a remote device. Please excuse any typos... Mike Segel

On May 30, 2013, at 9:35 PM, dong.yajun dongt...@gmail.com wrote: Hi Ted, not yet, nothing on Lily right now. I would like to use HBase to store product reviews; the system should support secondary indexes, full-text search, and faceting (Lucene), with paging and sorting.

On Fri, May 31, 2013 at 10:24 AM, Ted Yu yuzhih...@gmail.com wrote: Jason is no longer working on this issue. Can you tell us your use case? Have you looked at http://www.lilyproject.org/lily/index.html ? Thanks

On Thu, May 30, 2013 at 7:06 PM, dong.yajun dongt...@gmail.com wrote: Hello list, can anyone give me some follow-up information on HBASE-3529? I'm wondering why it has had no updates for more than two years. Best, -- Rick Dong
Re: Specify the size of an HRegion when creating a table?
Take a look at http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html#setMaxFileSize(long)

Cheers

On May 31, 2013, at 12:22 AM, fx_bull javac...@gmail.com wrote: Can we specify the size of an HRegion when creating a table, and how?
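A minimal sketch of what Ted is pointing at, against the 0.94-era Java client; the table name, column family, and 2 GB figure here are made up. MAX_FILESIZE is a per-table override of hbase.hregion.max.filesize, so regions of this table split once they grow past the given size rather than the cluster-wide default:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateTableWithRegionSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HTableDescriptor desc = new HTableDescriptor("mytable");
            desc.addFamily(new HColumnDescriptor("cf"));
            // Per-table override of hbase.hregion.max.filesize:
            // split regions of this table at ~2 GB.
            desc.setMaxFileSize(2L * 1024 * 1024 * 1024);

            admin.createTable(desc);
            admin.close();
        }
    }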
Re: Specify the size of an HRegion when creating a table?
Many thanks!

On 2013-5-31, at 5:21 PM, Ted Yu yuzhih...@gmail.com wrote: Take a look at http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html#setMaxFileSize(long) Cheers

On May 31, 2013, at 12:22 AM, fx_bull javac...@gmail.com wrote: Can we specify the size of an HRegion when creating a table, and how?
debugging coprocessors in Netbeans
Hello, do you have a tip for debugging coprocessors in NetBeans? I've got stuck on the line table.coprocessorExec( and I am not able to trace into the coprocessor. I tried to run HBase with HBASE_OPTS=-Xrunjdwp:transport=dt_socket,address=4530,server=y,suspend=n and then tried Debug > Attach Debugger. In the debugger console I have this: Attaching to golem:4530 / User program running. But all the buttons like Step Over are inactive, and I can't see any variables, only: No variables to display, because there is no current thread. What should I do? Thanks.
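A note for anyone hitting the same wall: coprocessor code executes inside the region server JVM, not in the client, so the debug agent must be attached to the region server process and the breakpoint set inside the coprocessor class itself; stepping over table.coprocessorExec( in the client will never enter it. A sketch of the hbase-env.sh setting, with suspend=y so the JVM waits for the debugger and early breakpoints are not missed (note this blocks startup until something attaches):

    export HBASE_OPTS="$HBASE_OPTS -Xdebug \
        -Xrunjdwp:transport=dt_socket,address=4530,server=y,suspend=y"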
Re: are ResultScanners valid after hTable.close()
This is the Hadoop users list. Please ask HBase questions on their own, vibrant user community at user@hbase.apache.org for best responses. I've moved your post there; please respond on that address instead of the hadoop lists.

On Fri, May 31, 2013 at 6:00 PM, Ted r6squee...@gmail.com wrote: I tried scouring the API docs as well as googling this, and I can't find a definitive answer. If I get an HTable instance and I close it, do I have to make sure I'm finished using the ResultScanner and the Results before I close the HTable (i.e. like JDBC connections/ResultSets)? It looks like my code runs even after I close the HTable, but I'm not sure it isn't just working because of prefetching / scannerCaching or something. -- Ted. -- Harsh J
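Whatever the scanner's actual behavior after the table is closed, the conservative pattern (and the one analogous to JDBC) is to finish with the scanner first. A minimal sketch against the 0.94-era API, with a made-up table name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class ScanThenClose {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "mytable");
            ResultScanner scanner = table.getScanner(new Scan());
            try {
                // Consume every Result while the table is still open; with
                // scanner caching, iteration after close may appear to work
                // only until the prefetched batch runs out.
                for (Result r : scanner) {
                    System.out.println(r);
                }
            } finally {
                scanner.close(); // release the server-side scanner first
                table.close();   // then the table
            }
        }
    }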
Re: Explosion in datasize using HBase as a MR sink
On your data set size, I would go with HFileOutputFormat and then bulk load into HBase. Why go through the Put flow anyway (memstore, flush, WAL), especially if you have the input ready at your disposal for a re-try if something fails? Sounds faster to me anyway.

On May 30, 2013, at 10:52 PM, Rob Verkuylen r...@verkuylen.net wrote: On May 30, 2013, at 4:51, Stack st...@duboce.net wrote:

bq. Triggering a major compaction does not alter the overall 217.5 GB size?

A major compaction reduces the size from the original 219 GB to 217.5 GB, so barely a reduction. 80% of the region sizes are 1.4 GB before and after. I haven't merged the smaller regions, but that still would not bring the size down to the 2.5-5 GB or so I would expect given T2's size.

bq. You have speculative execution turned on in your MR job, so it's possible you write many versions?

I've turned off speculative execution (through conf.set) just for the mappers, since we're not using reducers; should we turn it off for reducers too? I will triple-check the actual job settings in the job tracker, since I need to make the settings at the job level.

bq. Does your MR job fail many tasks (and though it fails, until it fails, it will have written some subset of the task, hence bloating your versions)?

We've had problems with failing mappers because of ZooKeeper timeouts on large inserts; we increased the ZooKeeper timeout and blockingStoreFiles to accommodate, and now we don't get failures. This job writes to a cleanly made table with versions set to 1, so there shouldn't be extra versions, I assume(?).

bq. You are putting everything into protobufs? Could that be bloating your data? Can you take a smaller subset and dump a string version of the pb to the log? Use TextFormat: https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/TextFormat#shortDebugString(com.google.protobuf.MessageOrBuilder)

The protobufs reduce the size to roughly 40% of the original XML data in T1. The MR parser is a port of the Python parse code we use going from T1 to T2. I've done manual comparisons on 20-30 records from T2.1 and T2 and they are identical, with only minute differences because of slightly different parsing. I've done these in the HBase shell; I will try log-dumping them too.

bq. It can be informative looking at HFile content. It could give you a clue as to the bloat. See http://hbase.apache.org/book.html#hfile_tool

I will give this a go and report back. Any other debugging suggestions are more than welcome :) Thnx, Rob
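A sketch of the pb dump Stack suggests; Review here is a hypothetical generated message class standing in for Rob's actual schema, and valueBytes the serialized record about to be written:

    import com.google.protobuf.TextFormat;

    // Hypothetical names: Review is the generated protobuf message class,
    // valueBytes the serialized record. shortDebugString renders the whole
    // message on one line, so a sample of records can be grepped out of the
    // task logs and compared against the source data.
    Review review = Review.parseFrom(valueBytes);
    System.err.println(TextFormat.shortDebugString(review));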
Re: HBASE install shell script on cluster
We have developed some custom scripts on top of Fabric (http://docs.fabfile.org/en/1.6/). I've asked the developer on our team to see if he can share some of it with the community. It's mainly used for development/QA/integration-test purposes. For production deployment we have an in-house Chef-like system we use, so I can't share much there :)

On May 30, 2013, at 5:40 AM, Stack st...@duboce.net wrote: On Wed, May 29, 2013 at 10:49 AM, Jay Vyas jayunit...@gmail.com wrote: Hi! I've been working on installing HBase using a shell script over some nodes.

Usually folks do chef, puppet, etc. installing nodes. Do you not want to go that route?

I can't help but think that someone else may have tried this before; if someone wants to share a gist or script from somewhere, I could potentially modify / update it. At this point I'm just doing shell, but was considering maybe Python/Ruby templating for the XML files.

There is stuff like this if you chef or puppet it: http://hstack.org/hstack-automated-deployment-using-puppet/ http://palominodb.com/blog/2012/11/01/chef-cookbooks-hbase-centos-released St.Ack
Best practices for loading data into hbase
Hi, We are still very new at all of this HBase/Hadoop/MapReduce stuff and are looking for the best practices that fit our requirements. We are currently using the latest Cloudera VMware image (single node) for our development tests.

The problem is as follows: we have multiple sources in different formats (XML, CSV, etc.), which are dumps of existing systems. As one might expect, there will be an initial import of the data into HBase, and afterwards the systems would most likely dump whatever data they have accumulated since the initial import or since the last data dump. We also require an intermediary step, so that we can ensure all of a source's data can be successfully processed, something which would look like: XML data file --(MR job)--> intermediate (HBase table or HFile?) --(MR job)--> production tables in HBase. We're guessing we can't use something like a transaction in HBase, so we thought about using an intermediate step: is that how things are normally done?

As we import data into HBase, we populate several tables that link data parts together (account X in system 1 == account Y in system 2) as tuples in 3 tables. Currently this is done by a MapReduce job which reads the XML source and uses MultiTableOutputFormat to put data into those 3 HBase tables. This method isn't that fast on our test sample (2 minutes for 5 MB), so we are looking at optimizing the loading of data. We have been researching bulk loading but are unsure of a couple of things:

1. Once we process an XML file and populate our 3 production HBase tables, could we bulk load another XML file and append the new data to our 3 tables, or would it overwrite what was written before?

2. In order to bulk load, we need to output a file using HFileOutputFormat. Since MultiHFileOutputFormat doesn't seem to officially exist yet (still in the works, right?), should we process our input XML file with 3 MapReduce jobs instead of 1 and output an HFile for each, which would then become our intermediate step (if all 3 HFiles were created without errors, the process was successful: bulk load into HBase)?

3. Can you experiment with bulk loading on a VMware image? We're experiencing problems with the partition file not being found, with the following exception:

java.lang.Exception: java.lang.IllegalArgumentException: Can't read partitions file
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
Caused by: java.lang.IllegalArgumentException: Can't read partitions file
    at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:108)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:70)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.init(MapTask.java:588)

We also tried another idea to speed things up: what if, instead of doing individual puts, we passed a list of Puts to put() (e.g. htable.put(putList))? Internally in HBase, would there be less overhead vs. multiple calls to put()? It seems to be faster; however, since we're not using context.write, I'm guessing this will lead to problems later on, right? Turning off the WAL on puts to speed things up isn't an option, since data loss would be unacceptable, even if the chances of a failure occurring are slim. Thanks, David
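On the list-of-Puts idea at the end of David's mail, a sketch of the pattern (table name, family, and row count made up): HTable.put(List<Put>) queues into the client-side write buffer, and the flush goes out as grouped RPCs, roughly one batch per region server, instead of one round trip per Put.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedPuts {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "mytable");
            // Buffer client-side and flush in bulk instead of per-Put RPCs.
            table.setAutoFlush(false);
            table.setWriteBufferSize(8 * 1024 * 1024); // 8 MB

            List<Put> puts = new ArrayList<Put>();
            for (int i = 0; i < 10000; i++) {
                Put p = new Put(Bytes.toBytes(String.format("row-%08d", i)));
                p.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v" + i));
                puts.add(p);
            }
            table.put(puts);      // queued into the write buffer
            table.flushCommits(); // grouped RPCs to the region servers
            table.close();
        }
    }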
Re: Best practices for loading data into hbase
You cannot use the local job tracker (that is, the one that gets started if you don't have one running) with the TotalOrderPartitioner. You'll need to fully install Hadoop on that VMware node. Google that error to find other relevant comments. J-D

On Fri, May 31, 2013 at 1:19 PM, David Poisson david.pois...@ca.fujitsu.com wrote: [quoted text elided]
Re: Best practices for loading data into hbase
bq. Once we process an xml file and we populate our 3 production hbase tables, could we bulk load another xml file and append this new data to our 3 tables or would it write over what was written before?

You can bulk load another XML file.

bq. should we process our input xml file with 3 MapReduce jobs instead of 1

You don't need to use 3 jobs. Looks like you were using CDH. Mind telling us the version numbers for HBase and Hadoop? Thanks

On Fri, May 31, 2013 at 1:19 PM, David Poisson david.pois...@ca.fujitsu.com wrote: [quoted text elided]
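For reference on the append question: bulk loads are additive. The new HFiles are adopted alongside whatever the table already holds, and overlapping cells resolve by timestamp like ordinary writes, so repeated imports append rather than replace. A sketch of the load step, assuming the MR job wrote its HFiles to /tmp/hfiles:

    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles mytable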
Re: Best practices for loading data into hbase
I am sorry to barge in when heavyweights are already involved here, but just out of curiosity: why don't you use Sqoop (http://sqoop.apache.org/) to import the data directly from your existing systems into HBase, instead of first taking the dump and then doing the import? Sqoop allows us to do incremental imports as well. Pardon me if this sounds childish. Warm Regards, Tariq cloudfront.blogspot.com

On Sat, Jun 1, 2013 at 1:56 AM, Ted Yu yuzhih...@gmail.com wrote: [quoted text elided]
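A sketch of what Tariq is suggesting; the JDBC URL, credentials, and column names are hypothetical. With --incremental append and a monotonically increasing check column, Sqoop re-imports only rows added since --last-value:

    sqoop import \
        --connect jdbc:mysql://dbhost/legacy --username app -P \
        --table reviews \
        --hbase-table reviews --column-family cf --hbase-row-key review_id \
        --incremental append --check-column review_id --last-value 0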
Re: querying hbase
On 05/24/2013 02:50 PM, Andrew Purtell wrote: On Thu, May 23, 2013 at 5:10 PM, James Taylor jtay...@salesforce.com wrote: Has there been any discussion of running the HBase server in an OSGi container?

I believe the only discussions have been on avoiding talk about coprocessor reloading, as it implies either a reimplementation of, or taking on, an OSGi runtime. Is there a benefit to restarting a region server in an OSGi container versus restarting a Java process? Or would it work otherwise, like updating the coprocessor and filters in the container and then triggering the embedded region server to do a quick close and reopen of the regions?

My thinking was that an OSGi container would allow a new version of a coprocessor (and/or custom filter) jar to be loaded. Class conflicts between the old jar and the new jar would no longer be a problem: you'd never need to unload the old jar. Instead, future HBase operations that invoke the coprocessor would cause the newly loaded jar to be used instead of the older one. I'm not sure if this is possible or not. The whole idea would be to prevent a rolling restart or region close/reopen.
Re: HConnectionManager$HConnectionImplementation.locateRegionInMeta
Even if I initiate the call via a pooled HTable, the MetaScanner seems to use a concrete HTable instance. The constructor invoked seems to create a Java ThreadPoolExecutor. I am not 100% sure, but I think as long as nothing is submitted to the ThreadPoolExecutor it won't create any threads; I just wanted to confirm this is the case. I do see that the connection is shared. --Kireet

On 5/30/13 7:38 PM, Ted Yu wrote: HTablePool$PooledHTable is a wrapper around HTable. Here is how HTable obtains a connection:

    public HTable(Configuration conf, final byte[] tableName, final ExecutorService pool) throws IOException {
        this.connection = HConnectionManager.getConnection(conf);

Meaning the connection is a shared one, keyed on certain key/value pairs from conf. bq. So every call to batch will create a new thread? I don't think so.

On Thu, May 30, 2013 at 11:28 AM, Kireet wrote: Thanks, will give it a shot. So I should download 0.94.7 (latest stable) and run the patch tool on top with the backport? This is a little new to me. Also, I was looking at the stack below. From my reading of the code, the HTable.batch() call will always cause the prefetch call to occur, which will cause a new HTable object to be created, and the constructor used creates a new thread pool. So every call to batch will create a new thread? Or does the HTable's thread pool never get used, since the pool is only used for writes? I think I am missing something but just want to confirm. Thanks, Kireet

On 5/30/13 12:48 PM, Himanshu Vashishtha wrote: bq. Anoop attached backported patch in HBASE-8655. It should go into 0.94.9, the next release (current is 0.94.8). In case you want it sooner, you can apply the 8655 patch and test/verify it. Thanks, Himanshu

On Thu, May 30, 2013 at 7:26 AM, Ted Yu wrote: Anoop attached a backported patch in HBASE-8655. It should go into 0.94.9, the next release (current is 0.94.8). Cheers

On Thu, May 30, 2013 at 7:01 AM, Kireet wrote: How long do backports typically take? We have to go live in a month, ready or not. Thanks for the quick replies, Anoop and Ted. --Kireet

On 5/30/13 9:20 AM, Ted Yu wrote: A 0.95 client is not compatible with a 0.94 cluster, so you cannot use a 0.95 client. Cheers

On May 30, 2013, at 6:12 AM, Kireet wrote: Would there be a problem if our cluster is 0.94 and we use a 0.95 client? I am not familiar with the HBase code base, but I did a dump of the thread that is actually running (below). It seems related to the issue you mentioned, as the running thread is doing the prefetch logic. Would pre-splitting tables help here? We are doing some performance tests and essentially starting from an empty instance.

java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:503)
    at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309)
    - locked <0xe10cf830> (a org.apache.zookeeper.ClientCnxn$Packet)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1036)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:172)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:450)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.checkIfBaseNodeAvailable(ZooKeeperNodeTracker.java:208)
    at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.waitRootRegionLocation(RootRegionTracker.java:77)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:874)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:987)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:885)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:846)
    at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
    at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:160)
    at ...
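Kireet's suspicion about the executor is easy to check in isolation. A small standalone demo, independent of HBase: java.util.concurrent.ThreadPoolExecutor creates worker threads lazily, on task submission, so a pool that never receives work holds no threads.

    import java.util.concurrent.SynchronousQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class LazyPoolDemo {
        public static void main(String[] args) {
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    1, 10, 60, TimeUnit.SECONDS, new SynchronousQueue<Runnable>());
            System.out.println(pool.getPoolSize()); // 0: nothing submitted yet
            pool.execute(new Runnable() { public void run() { } });
            System.out.println(pool.getPoolSize()); // 1: first task spawned a worker
            pool.shutdown();
        }
    }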
Re: HConnectionManager$HConnectionImplementation.locateRegionInMeta
Indeed, that is bad. I cannot see a clean fix immediately, but we need to look at this. Mind filing a ticket, Kireet? -- Lars

From: Kireet kir...@feedly.com To: public-user-50Pas4EWwPEyzMRdD/i...@plane.gmane.org Sent: Friday, May 31, 2013 11:58 AM Subject: Re: HConnectionManager$HConnectionImplementation.locateRegionInMeta

[quoted text elided]