Specify the size of a hregion when creating a table?

2013-05-31 Thread fx_bull
Can we specify the size of an HRegion when creating a table? And if so, how?





Re: Any update for this issue HBASE-3529

2013-05-31 Thread Michel Segel
Boris Lublinsky and I published an article on something we did a few years ago.
(It's on InfoQ.)

We did a small project of integrating Lucene and HBase for searching POI data.

It would probably do what you want. 

Sent from a remote device. Please excuse any typos...

Mike Segel

On May 30, 2013, at 9:35 PM, dong.yajun dongt...@gmail.com wrote:

 Hi Ted,
 
 I haven't looked at Lily yet.
 
 I would like to use HBase to store product reviews. The system should
 support secondary indexes, full-text search, and faceting (Lucene), with
 support for paging and sorting.
 
 
 On Fri, May 31, 2013 at 10:24 AM, Ted Yu yuzhih...@gmail.com wrote:
 
 Jason is no longer working on this issue.
 
 Can you tell us your use case ?
 
 Have you looked at http://www.lilyproject.org/lily/index.html ?
 
 Thanks
 
 On Thu, May 30, 2013 at 7:06 PM, dong.yajun dongt...@gmail.com wrote:
 
 Hello list,
 
 Can anyone give me some follow-up information about this issue,
 HBASE-3529? I'm wondering why it has had no update for more than 2 years.
 
 Best,
 --
 *Rick Dong*
 
 
 
 -- 
 *Ric Dong*


Re: Specify the size of a hregion when creating a table?

2013-05-31 Thread Ted Yu
Take a look at 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html#setMaxFileSize(long)
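
For illustration, a minimal sketch of setting it at table-creation time (the
table name, column family, and the 10 GB limit below are made-up values). This
is the per-table equivalent of hbase.hregion.max.filesize:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTableWithRegionSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("mytable");  // hypothetical table name
    desc.addFamily(new HColumnDescriptor("cf"));              // hypothetical column family
    // Regions of this table are split once a store file grows beyond ~10 GB.
    desc.setMaxFileSize(10L * 1024 * 1024 * 1024);
    admin.createTable(desc);
    admin.close();
  }
}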

Cheers

On May 31, 2013, at 12:22 AM, fx_bull javac...@gmail.com wrote:

 Can we specify the size of a hregion when creating a table?  and  how ?
 
 
 


Re: Specify the size of a hregion when creating a table?

2013-05-31 Thread fx_bull
many thanks !


On May 31, 2013, at 5:21 PM, Ted Yu yuzhih...@gmail.com wrote:

 Take a look at 
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html#setMaxFileSize(long)
 
 Cheers
 
 On May 31, 2013, at 12:22 AM, fx_bull javac...@gmail.com wrote:
 
 Can we specify the size of a hregion when creating a table?  and  how ?
 
 
 



debugging coprocessors in Netbeans

2013-05-31 Thread Pavel Hančar
 Hello,
please, do you have a tip for debugging coprocessors in NetBeans?
I've got stuck on the line
table.coprocessorExec(...)
and I am not able to trace the coprocessor. I tried to run HBase with
HBASE_OPTS=-Xrunjdwp:transport=dt_socket,address=4530,server=y,suspend=n
Then I tried Debug > Attach Debugger.

In the debugger console I have this:
Attaching to golem:4530
User program running

But all the buttons like Step Over are inactive, and I can't see any
variables; I only get: No variables to display, because there is no current
thread.

What should I do?
 Thanks.
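
For context, a 0.94-style client-side endpoint invocation looks roughly like
the sketch below (MyProtocol, rowCount(), and the table name are hypothetical
stand-ins for the real coprocessor classes). The Batch.Call runs in the client,
but the endpoint method itself executes inside the region server JVM, so that
is the process a debugger has to attach to in order to step into the
coprocessor code.

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.coprocessor.Batch;
import org.apache.hadoop.hbase.ipc.CoprocessorProtocol;

public class EndpointCallSketch {
  // Hypothetical endpoint protocol implemented by the coprocessor under test.
  public interface MyProtocol extends CoprocessorProtocol {
    long rowCount() throws IOException;
  }

  public static void main(String[] args) throws Throwable {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");  // hypothetical table name
    // Invoke the endpoint on every region (null start/end keys).
    Map<byte[], Long> results = table.coprocessorExec(
        MyProtocol.class, null, null,
        new Batch.Call<MyProtocol, Long>() {
          public Long call(MyProtocol instance) throws IOException {
            return instance.rowCount();  // runs on the region server, not here
          }
        });
    System.out.println("regions answered: " + results.size());
    table.close();
  }
}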


Re: are ResultScanners valid after hTable.close()

2013-05-31 Thread Harsh J
This is the Hadoop users list. Please ask HBase questions on HBase's
own, vibrant user community at user@hbase.apache.org for the best
responses. I've moved your post there. Please respond on that
address instead of the Hadoop lists.

On Fri, May 31, 2013 at 6:00 PM, Ted r6squee...@gmail.com wrote:
 I tried scouring the API docs as well as googling this and I can't
 find a definitive answer.

 If I get an HTable instance and I close it, do I have to make sure I'm
 finished using the ResultScanner and the Results before I close the
 hTable? (i.e. like JDBC connection/resultSets?)

 It looks like my code runs even after I close the hTable, but I'm not
 sure that it's not just working due to prefetching / scannerCaching
 or something.

 --
 Ted.



-- 
Harsh J
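
For reference, the conservative pattern (finish with the scanner and its
Results before closing the table) looks something like this sketch; the table
name below is made up:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanThenClose {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");  // hypothetical table name
    ResultScanner scanner = table.getScanner(new Scan());
    try {
      for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()));  // consume Results here
      }
    } finally {
      scanner.close();  // finish with the scanner first...
      table.close();    // ...then close the table
    }
  }
}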


Re: Explosion in datasize using HBase as a MR sink

2013-05-31 Thread Asaf Mesika
At your data set size, I would go with HFileOutputFormat and then bulk load
into HBase. Why go through the Put flow (memstore, flush, WAL) anyway,
especially if you have the input at your disposal for a retry if something
fails?
Sounds faster to me anyway.
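
For illustration, a rough sketch of that flow with the 0.94-era APIs
(HFileOutputFormat.configureIncrementalLoad plus LoadIncrementalHFiles); the
table name, column family, and the line-parsing mapper are made-up placeholders:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HFileBulkLoadSketch {

  // Hypothetical mapper: turns "rowkey<TAB>value" lines into Puts.
  static class LineToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      Put put = new Put(Bytes.toBytes(parts[0]));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(parts[1]));
      ctx.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hfile-generation");
    job.setJarByClass(HFileBulkLoadSketch.class);
    job.setMapperClass(LineToPutMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    HTable table = new HTable(conf, "mytable");  // hypothetical target table
    // Sets the reducer, partitioner and output format so the generated HFiles
    // line up with the table's current region boundaries.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Move the generated HFiles into the table's regions.
      new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
    }
    table.close();
  }
}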

On May 30, 2013, at 10:52 PM, Rob Verkuylen r...@verkuylen.net wrote:

 
 On May 30, 2013, at 4:51, Stack st...@duboce.net wrote:
 
 Triggering a major compaction does not alter the overall 217.5GB size?
 
 A major compaction reduces the size from the original 219GB to 217.5GB,
 so barely a reduction.
 80% of the region sizes are 1.4GB before and after. I haven't merged the
 smaller regions, but that still would not bring the size down to the 2.5-5GB
 or so I would expect given T2's size.
 
 You have speculative execution turned on in your MR job, so it's possible you
 write many versions?
 
 I've turned off speculative execution (through conf.set) just for the
 mappers; since we're not using reducers, should we turn it off for those too?
 I will triple-check the actual job settings in the job tracker, since I need
 to make the settings at the job level.
 
 Does your MR job fail many tasks (and though it fails, until it fails, it
 will have written some subset of the task hence bloating your versions?).
 
 We've had problems with failing mappers because of ZooKeeper timeouts on
 large inserts; we increased the ZooKeeper timeout and blockingStoreFiles to
 accommodate, and now we don't get failures. This job writes to a cleanly
 created table with versions set to 1, so there shouldn't be extra versions,
 I assume(?).
 
 You are putting everything into protobufs?  Could that be bloating your
 data?  Can you take a smaller subset and dump to the log a string version
 of the pb.  Use TextFormat
 https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/TextFormat#shortDebugString(com.google.protobuf.MessageOrBuilder)
 
 The protobufs reduce the size to roughly 40% of the original XML data in T1.
 The MR parser is a port of the Python parsing code we use going from T1 to T2.
 I've done manual comparisons on 20-30 records from T2.1 and T2 and they are
 identical, with only minute differences because of slightly different parsing.
 I've done these in the HBase shell; I will try log-dumping them too.
 
 It can be informative looking at hfile content.  It could give you a clue
 as to the bloat.  See http://hbase.apache.org/book.html#hfile_tool
 
 I will give this a go and report back. Any other debugging suggestions are
 more than welcome :)
 
 Thnx, Rob
 



Re: HBASE install shell script on cluster

2013-05-31 Thread Asaf Mesika
We have developed some custom scripts on top of Fabric
(http://docs.fabfile.org/en/1.6/).
I've asked the developer on our team to see if he can share some of it with the
community.
It's mainly used for development/QA/integration-test purposes.

For production deployment we have an in-house, Chef-like system we use, so I
can't share much there :)

On May 30, 2013, at 5:40 AM, Stack st...@duboce.net wrote:

 On Wed, May 29, 2013 at 10:49 AM, Jay Vyas jayunit...@gmail.com wrote:
 
 Hi !
 
 I've been working on installing HBASE using a shell script over some
 nodes.
 
 
 Usually folks use Chef, Puppet, etc. for installing nodes.  Do you not want to
 go that route?
 
 
 
 I can't help but think that someone else may have tried this before; if
 someone wants to share a gist or script from somewhere, I could potentially
 modify / update it.
 
 At this point, I'm just doing shell but was considering maybe Python/Ruby
 templating for the XML files.
 
 
 
 There is stuff like this if you Chef or Puppet it:
 
 http://hstack.org/hstack-automated-deployment-using-puppet/
 http://palominodb.com/blog/2012/11/01/chef-cookbooks-hbase-centos-released
 
 St.Ack



Best practices for loading data into hbase

2013-05-31 Thread David Poisson
Hi,
 We are still very new to all of this HBase/Hadoop/MapReduce stuff. We are
looking for the best practices that will fit our requirements. We are currently
using the latest Cloudera VMware image (single node) for our development tests.

The problem is as follows:

We have multiple sources in different formats (XML, CSV, etc.), which are dumps
of existing systems. As one might expect, there will be an initial import of
the data into HBase, and afterwards the systems would most likely dump whatever
data they have accumulated since the initial import or since the last data
dump. We would also require an intermediary step, so that we can ensure all of
a source's data can be successfully processed; it would look something like:

XML data file --(MR JOB)--> intermediate (HBase table or HFile?) --(MR JOB)-->
production tables in HBase

We're guessing we can't use something like a transaction in HBase, so we
thought about using an intermediate step: is that how things are normally done?

As we import data into HBase, we will be populating several tables that link
data parts together (account X in System 1 == account Y in System 2) as tuples
in 3 tables. Currently this is being done by a MapReduce job which reads the
XML source and uses MultiTableOutputFormat to put data into those 3 HBase
tables. This method isn't that fast on our test sample (2 minutes for 5 MB), so
we are looking at optimizing the loading of data.

We have been researching bulk loading, but we are unsure about a couple of
things:
Once we process an XML file and populate our 3 production HBase tables, could
we bulk load another XML file and append the new data to our 3 tables, or would
it overwrite what was written before?
In order to bulk load, we need to output a file using HFileOutputFormat. Since
MultiHFileOutputFormat doesn't seem to officially exist yet (still in the
works, right?), should we process our input XML file with 3 MapReduce jobs
instead of 1 and output an HFile for each, which would then become our
intermediate step (if all 3 HFiles were created without errors, then the
process was successful: bulk load into HBase)? Can you experiment with bulk
loading on a VMware image? We're experiencing problems with the partition file
not being found, with the following exception:

java.lang.Exception: java.lang.IllegalArgumentException: Can't read partitions file
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
Caused by: java.lang.IllegalArgumentException: Can't read partitions file
        at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:108)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:70)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:588)

We also tried another idea to speed things up: what if, instead of doing
individual Puts, we passed a list of Puts to put() (e.g. htable.put(putList))?
Internally in HBase, would there be less overhead versus multiple calls to
put()? It seems to be faster; however, since we're not using context.write, I'm
guessing this will lead to problems later on, right?
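
(For what it's worth, a minimal sketch of that list-of-Puts batching from a
plain client, outside the MapReduce context; the table name, column family,
and row contents are made up:)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedPuts {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "links");  // hypothetical table name
    table.setAutoFlush(false);                 // buffer puts in the client-side write buffer
    List<Put> putList = new ArrayList<Put>();
    for (int i = 0; i < 1000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
      putList.add(put);
    }
    table.put(putList);    // one client call; puts are grouped per region server
    table.flushCommits();  // send anything still sitting in the write buffer
    table.close();
  }
}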

Turning off WAL on puts to speed things up isn't an option, since data loss 
would be unacceptable, even if the chances of a failure occurring are slim.

Thanks, David

Re: Best practices for loading data into hbase

2013-05-31 Thread Jean-Daniel Cryans
You cannot use the local job tracker (that is, the one that gets
started if you don't have one running) with the TotalOrderPartitioner.

You'll need to fully install Hadoop on that VMware node.

Google that error to find other relevant comments.

J-D

On Fri, May 31, 2013 at 1:19 PM, David Poisson
david.pois...@ca.fujitsu.com wrote:
 Hi,
  We are still very new at all of this hbase/hadoop/mapreduce stuff. We 
 are looking for the best practices that will fit our requirements. We are 
 currently using the latest cloudera vmware's (single node) for our 
 development tests.

 The problem is as follows:

 We have multiple sources in different format (xml, csv, etc), which are dumps 
 of existing systems. As one might think, there will be an initial import of 
 the data into hbase
 and afterwards, the systems would most likely dump whatever data they have 
 accumulated since the initial import into hbase or since the last data dump. 
 Another thing, we would require to have an
 intermediary step, so that we can ensure all of a source's data can be 
 successfully processed, something which would look like:

 XML data file --(MR JOB)-- Intermediate (hbase table or hfile?) --(MR 
 JOB)-- production tables in hbase

 We're guessing we can't use something like a transaction in hbase, so we 
 thought about using a intermediate step: Is that how things are normally done?

 As we import data into hbase, we will be populating several tables that links 
 data parts together (account X in System 1 == account Y in System 2) as 
 tuples in 3 tables. Currently,
 this is being done by a mapreduce job which reads the XML source and uses 
 multiTableOutputFormat to put data into those 3 hbase tables. This method
 isn't that fast using our test sample (2 minutes for 5Mb), so we are looking 
 at optimizing the loading of data.

 We have been researching bulk loading but we are unsure of a couple of things:
 Once we process an xml file and we populate our 3 production hbase tables, 
 could we bulk load another xml file and append this new data to our 3 tables 
 or would it write over what was written before?
 In order to bulk load, we need to output a file using HFileOutputFormat. 
 Since MultiHFileOutputFormat doesn't seem to officially exist yet (still in 
 the works, right?), should we process our input xml file
 with 3 MapReduce jobs instead of 1 and output an hfile for each, which we 
 could then become our intermediate step (if all 3 hfiles were created without 
 errors, then process was successful: bulk load
 in hbase)? Can you experiment with bulk loading on a vmware? We're 
 experiencing problems with partition file not being found with the following 
 exception:

 java.lang.Exception: java.lang.IllegalArgumentException: Can't read 
 partitions file
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
 Caused by: java.lang.IllegalArgumentException: Can't read partitions file
 at 
 org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:108)
 at 
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:70)
 at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
 at 
 org.apache.hadoop.mapred.MapTask$NewOutputCollector.init(MapTask.java:588)

 We also tried another idea on how to speed things up: What if instead of 
 doing individual puts, we passed a list of puts to put() (eg: 
 htable.put(putList) ). Internally in hbase, would there be less overhead vs 
 multiple
 calls to put()? It seems to be faster, however since we're not using 
 context.write, I'm guessing this will lead to problems later on, right?

 Turning off WAL on puts to speed things up isn't an option, since data loss 
 would be unacceptable, even if the chances of a failure occurring are slim.

 Thanks, David


Re: Best practices for loading data into hbase

2013-05-31 Thread Ted Yu
bq. Once we process an xml file and we populate our 3 production hbase
tables, could we bulk load another xml file and append this new data to our
3 tables or would it write over what was written before?

You can bulk load another XML file.

bq. should we process our input xml file with 3 MapReduce jobs instead of 1

You don't need to use 3 jobs.

Looks like you were using CDH. Mind telling us the version numbers for HBase
and Hadoop?

Thanks

On Fri, May 31, 2013 at 1:19 PM, David Poisson david.pois...@ca.fujitsu.com
 wrote:

 Hi,
  We are still very new at all of this hbase/hadoop/mapreduce stuff. We
 are looking for the best practices that will fit our requirements. We are
 currently using the latest cloudera vmware's (single node) for our
 development tests.

 The problem is as follows:

 We have multiple sources in different format (xml, csv, etc), which are
 dumps of existing systems. As one might think, there will be an initial
 import of the data into hbase
 and afterwards, the systems would most likely dump whatever data they have
 accumulated since the initial import into hbase or since the last data
 dump. Another thing, we would require to have an
 intermediary step, so that we can ensure all of a source's data can be
 successfully processed, something which would look like:

 XML data file --(MR JOB)-- Intermediate (hbase table or hfile?) --(MR
 JOB)-- production tables in hbase

 We're guessing we can't use something like a transaction in hbase, so we
 thought about using a intermediate step: Is that how things are normally
 done?

 As we import data into hbase, we will be populating several tables that
 links data parts together (account X in System 1 == account Y in System 2)
 as tuples in 3 tables. Currently,
 this is being done by a mapreduce job which reads the XML source and uses
 multiTableOutputFormat to put data into those 3 hbase tables. This method
 isn't that fast using our test sample (2 minutes for 5Mb), so we are
 looking at optimizing the loading of data.

 We have been researching bulk loading but we are unsure of a couple of
 things:
 Once we process an xml file and we populate our 3 production hbase
 tables, could we bulk load another xml file and append this new data to our
 3 tables or would it write over what was written before?
 In order to bulk load, we need to output a file using HFileOutputFormat.
 Since MultiHFileOutputFormat doesn't seem to officially exist yet (still in
 the works, right?), should we process our input xml file
 with 3 MapReduce jobs instead of 1 and output an hfile for each, which we
 could then become our intermediate step (if all 3 hfiles were created
 without errors, then process was successful: bulk load
 in hbase)? Can you experiment with bulk loading on a vmware? We're
 experiencing problems with partition file not being found with the
 following exception:

 java.lang.Exception: java.lang.IllegalArgumentException: Can't read
 partitions file
 at
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
 Caused by: java.lang.IllegalArgumentException: Can't read partitions file
 at
 org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:108)
 at
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:70)
 at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
 at
 org.apache.hadoop.mapred.MapTask$NewOutputCollector.init(MapTask.java:588)

 We also tried another idea on how to speed things up: What if instead of
 doing individual puts, we passed a list of puts to put() (eg:
 htable.put(putList) ). Internally in hbase, would there be less overhead vs
 multiple
 calls to put()? It seems to be faster, however since we're not using
 context.write, I'm guessing this will lead to problems later on, right?

 Turning off WAL on puts to speed things up isn't an option, since data
 loss would be unacceptable, even if the chances of a failure occurring are
 slim.

 Thanks, David


Re: Best practices for loading data into hbase

2013-05-31 Thread Mohammad Tariq
I am sorry to barge in when heavyweights are already involved here. But,
just out of curiosity, why don't you use Sqoop (http://sqoop.apache.org/) to
import the data directly from your existing systems into HBase, instead of
first taking the dump and then doing the import? Sqoop allows us to do
incremental imports as well.

Pardon me if this sounds childish.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sat, Jun 1, 2013 at 1:56 AM, Ted Yu yuzhih...@gmail.com wrote:

 bq. Once we process an xml file and we populate our 3 production hbase
 tables, could we bulk load another xml file and append this new data to our
 3 tables or would it write over what was written before?

 You can bulk load another XML file.

 bq. should we process our input xml file with 3 MapReduce jobs instead of 1

 You don't need to use 3 jobs.

 Looks like you were using CDH. Mind telling us the version number for HBase
 and hadoop ?

 Thanks

 On Fri, May 31, 2013 at 1:19 PM, David Poisson 
 david.pois...@ca.fujitsu.com
  wrote:

  Hi,
   We are still very new at all of this hbase/hadoop/mapreduce stuff.
 We
  are looking for the best practices that will fit our requirements. We are
  currently using the latest cloudera vmware's (single node) for our
  development tests.
 
  The problem is as follows:
 
  We have multiple sources in different format (xml, csv, etc), which are
  dumps of existing systems. As one might think, there will be an initial
  import of the data into hbase
  and afterwards, the systems would most likely dump whatever data they
 have
  accumulated since the initial import into hbase or since the last data
  dump. Another thing, we would require to have an
  intermediary step, so that we can ensure all of a source's data can be
  successfully processed, something which would look like:
 
  XML data file --(MR JOB)-- Intermediate (hbase table or hfile?) --(MR
  JOB)-- production tables in hbase
 
  We're guessing we can't use something like a transaction in hbase, so we
  thought about using a intermediate step: Is that how things are normally
  done?
 
  As we import data into hbase, we will be populating several tables that
  links data parts together (account X in System 1 == account Y in System
 2)
  as tuples in 3 tables. Currently,
  this is being done by a mapreduce job which reads the XML source and uses
  multiTableOutputFormat to put data into those 3 hbase tables. This
 method
  isn't that fast using our test sample (2 minutes for 5Mb), so we are
  looking at optimizing the loading of data.
 
  We have been researching bulk loading but we are unsure of a couple of
  things:
  Once we process an xml file and we populate our 3 production hbase
  tables, could we bulk load another xml file and append this new data to
 our
  3 tables or would it write over what was written before?
  In order to bulk load, we need to output a file using HFileOutputFormat.
  Since MultiHFileOutputFormat doesn't seem to officially exist yet (still
 in
  the works, right?), should we process our input xml file
  with 3 MapReduce jobs instead of 1 and output an hfile for each, which we
  could then become our intermediate step (if all 3 hfiles were created
  without errors, then process was successful: bulk load
  in hbase)? Can you experiment with bulk loading on a vmware? We're
  experiencing problems with partition file not being found with the
  following exception:
 
  java.lang.Exception: java.lang.IllegalArgumentException: Can't read
  partitions file
  at
  org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
  Caused by: java.lang.IllegalArgumentException: Can't read partitions file
  at
 
 org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:108)
  at
  org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:70)
  at
 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
  at
 
 org.apache.hadoop.mapred.MapTask$NewOutputCollector.init(MapTask.java:588)
 
  We also tried another idea on how to speed things up: What if instead of
  doing individual puts, we passed a list of puts to put() (eg:
  htable.put(putList) ). Internally in hbase, would there be less overhead
 vs
  multiple
  calls to put()? It seems to be faster, however since we're not using
  context.write, I'm guessing this will lead to problems later on, right?
 
  Turning off WAL on puts to speed things up isn't an option, since data
  loss would be unacceptable, even if the chances of a failure occurring
 are
  slim.
 
  Thanks, David



Re: querying hbase

2013-05-31 Thread James Taylor

On 05/24/2013 02:50 PM, Andrew Purtell wrote:

On Thu, May 23, 2013 at 5:10 PM, James Taylor jtay...@salesforce.com wrote:


Have there been any discussions on running the HBase server in an OSGi
container?


I believe the only discussions have been about avoiding talk of coprocessor
reloading, as it implies either a reimplementation of, or taking on, an OSGi
runtime.

Is there a benefit to restarting a regionserver in an OSGi container versus
restarting a Java process?

Or would that otherwise work like: update the coprocessor and filters in
the container, then trigger the embedded regionserver to do a quick close
and reopen of the regions?

My thinking was that an OSGi container would allow a new version of a 
coprocessor (and/or custom filter) jar to be loaded. Class conflicts 
between the old jar and the new jar would no longer be a problem - you'd 
never need to unload the old jar. Instead, future HBase operations that 
invoke the coprocessor would cause the newly loaded jar to be used 
instead of the older one. I'm not sure if this is possible or not. The 
whole idea would be to prevent a rolling restart or region close/reopen.


Re: HConnectionManager$HConnectionImplementation.locateRegionInMeta

2013-05-31 Thread Kireet



Even if I initiate the call via a pooled HTable, the MetaScanner seems
to use a concrete HTable instance, and the constructor invoked seems to
create a Java ThreadPoolExecutor. I am not 100% sure, but I think that as long
as nothing is submitted to the ThreadPoolExecutor it won't create any
threads. I just wanted to confirm this is the case. I do see that the
connection is shared.
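
(As an aside, a minimal sketch of the shared-pool HTable constructor quoted
below, which keeps each HTable from building its own executor; the table names
and pool size are made up:)

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class SharedPoolTables {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // One application-wide pool instead of a ThreadPoolExecutor per HTable.
    ExecutorService pool = Executors.newFixedThreadPool(10);
    HTable t1 = new HTable(conf, Bytes.toBytes("table1"), pool);  // hypothetical tables
    HTable t2 = new HTable(conf, Bytes.toBytes("table2"), pool);
    try {
      // ... issue gets/puts/batches against t1 and t2 here ...
    } finally {
      t1.close();
      t2.close();
      pool.shutdown();
    }
  }
}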


--Kireet



On 5/30/13 7:38 PM, Ted Yu wrote:

HTablePool$PooledHTable is a wrapper around HTable.

Here is how HTable obtains a connection:

  public HTable(Configuration conf, final byte[] tableName, final ExecutorService pool)
      throws IOException {
    this.connection = HConnectionManager.getConnection(conf);

Meaning the connection is a shared one based on certain key/value pairs
from conf.

bq. So every call to batch will create a new thread?

I don't think so.

On Thu, May 30, 2013 at 11:28 AM, Kireet 
kireet-teh5dpvpl8nqt0dzr+a...@public.gmane.org wrote:




Thanks, will give it a shot. So I should download 0.94.7 (latest stable)
and run the patch tool on top with the backport? This is a little new to me.

Also, I was looking at the stack below. From my reading of the code, the
HTable.batch() call will always cause the prefetch call to occur, which
will cause a new HTable object to get created, and the constructor used
creates a new thread pool. So every call to batch will create a new
thread? Or does the HTable's thread pool never get used, as the pool is only
used for writes? I think I am missing something but just want to confirm.

Thanks
Kireet

On 5/30/13 12:48 PM, Himanshu Vashishtha wrote:


bq. Anoop attached backported patch in HBASE-8655. It should go into

0.94.9, the next release - current is 0.94.8

In case you want it sooner, you can apply 8655 patch and test/verify it.

Thanks,
Himanshu



On Thu, May 30, 2013 at 7:26 AM, Ted Yu yuzhihong-**
Re5JQEeQqe8AvxtiuMwx3w@public.**gmane.orgyuzhihong-re5jqeeqqe8avxtiumwx3w-xmd5yjdbdmrexy1tmh2...@public.gmane.org
wrote:

  Anoop attached backported patch in HBASE-8655


It should go into 0.94.9, the next release - current is 0.94.8

Cheers

On Thu, May 30, 2013 at 7:01 AM, Kireet kireet-Teh5dPVPL8nQT0dZR+**
alfa-xmd5yjdbdmrexy1tmh2...@public.gmane.org 
kireet-teh5dpvpl8nqt0dzr%2balfa-xmd5yjdbdmrexy1tmh2...@public.gmane.org
wrote:




How long do backports typically take? We have to go live in a month
ready
or not. Thanks for the quick replies Anoop and Ted.

--Kireet


On 5/30/13 9:20 AM, Ted Yu wrote:

  0.95 client is not compatible with 0.94 cluster. So you cannot use 0.95
client.

Cheers

On May 30, 2013, at 6:12 AM, Kireet kireet-Teh5dPVPL8nQT0dZR+**
AlfA-XMD5yJDbdMReXY1tMh2IBg@**public.gmane.orgalfa-xmd5yjdbdmrexy1tmh2ibg-xmd5yjdbdmrexy1tmh2...@public.gmane.org
kireet-Teh5dPVPL8nQT0dZR%**2BAlfA-XMD5yJDbdMReXY1tMh2IBg@**
public.gmane.orgkireet-teh5dpvpl8nqt0dzr%252balfa-xmd5yjdbdmrexy1tmh2ibg-xmd5yjdbdmrexy1tmh2...@public.gmane.org


wrote:




Would there be a problem if our cluster is 0.94 and we use a 0.95 client?





I am not familiar with the HBase code base, but I did a dump of the
thread that is actually running (below). It seems like it is related
to the issue you mentioned, as the running thread is doing the prefetch
logic. Would pre-splitting tables help here? We are doing some performance
tests and essentially starting from an empty instance.


java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:503)
        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309)
        - locked 0xe10cf830 (a org.apache.zookeeper.ClientCnxn$Packet)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1036)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:172)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:450)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.checkIfBaseNodeAvailable(ZooKeeperNodeTracker.java:208)
        at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.waitRootRegionLocation(RootRegionTracker.java:77)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:874)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:987)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:885)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:846)
        at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
        at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
        at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:160)
        at 

Re: HConnectionManager$HConnectionImplementation.locateRegionInMeta

2013-05-31 Thread lars hofhansl
Indeed. That is bad.
I cannot see a clean fix immediately, but we need to look at this.

Mind filing a ticket, Kireet?

-- Lars




 From: Kireet kir...@feedly.com
To: public-user-50Pas4EWwPEyzMRdD/i...@plane.gmane.org 
Sent: Friday, May 31, 2013 11:58 AM
Subject: Re: HConnectionManager$HConnectionImplementation.locateRegionInMeta
 



Even if I initiate the call via a pooled htable, the MetaScanner seems 
to use a concrete HTable instance. The constructor invoked seems to 
create a java ThreadPoolExecutor. I am not 100% sure but I think as long 
as nothing is submitted to the ThreadPoolExecutor it won't create any 
threads. I just wanted to confirm this was the case. I do see the 
connection is shared.

--Kireet



On 5/30/13 7:38 PM, Ted Yu wrote:
 HTablePool$**PooledHTable is a wrapper around HTable.

 Here is how HTable obtains a connection:

    public HTable(Configuration conf, final byte[] tableName, final
 ExecutorService pool)
        throws IOException {
      this.connection = HConnectionManager.getConnection(conf);

 Meaning the connection is a shared one based on certain key/value pairs
 from conf.

 bq. So every call to batch will create a new thread?

 I don't think so.

 On Thu, May 30, 2013 at 11:28 AM, Kireet 
 kireet-teh5dpvpl8nqt0dzr+a...@public.gmane.org wrote:



 Thanks, will give it a shot. So I should download 0.94.7 (latest stable)
 and run the patch tool on top with the backport? This is a little new to me.

 Also, I was looking at the stack below. From my reading of the code, the
 HTable.batch() call will always cause the prefetch call to occur, which
 will cause a new HTable object to get created. The constructor used in
 creating a new thread pool. So every call to batch will create a new
 thread? Or the HTable's thread pool never gets used as the pool is only
 used for writes? I think I am missing something but just want to confirm.

 Thanks
 Kireet

 On 5/30/13 12:48 PM, Himanshu Vashishtha wrote:

 bq. Anoop attached backported patch in HBASE-8655. It should go into

 0.94.9, the next release - current is 0.94.8

 In case you want it sooner, you can apply 8655 patch and test/verify it.

 Thanks,
 Himanshu



 On Thu, May 30, 2013 at 7:26 AM, Ted Yu yuzhihong-**
 Re5JQEeQqe8AvxtiuMwx3w@public.**gmane.orgyuzhihong-re5jqeeqqe8avxtiumwx3w-xmd5yjdbdmrexy1tmh2...@public.gmane.org
 wrote:

   Anoop attached backported patch in HBASE-8655

 It should go into 0.94.9, the next release - current is 0.94.8

 Cheers

 On Thu, May 30, 2013 at 7:01 AM, Kireet kireet-Teh5dPVPL8nQT0dZR+**
 alfa-xmd5yjdbdmrexy1tmh2...@public.gmane.org 
 kireet-teh5dpvpl8nqt0dzr%2balfa-xmd5yjdbdmrexy1tmh2...@public.gmane.org
 wrote:



 How long do backports typically take? We have to go live in a month
 ready
 or not. Thanks for the quick replies Anoop and Ted.

 --Kireet


 On 5/30/13 9:20 AM, Ted Yu wrote:

   0.95 client is not compatible with 0.94 cluster. So you cannot use 0.95
 client.

 Cheers

 On May 30, 2013, at 6:12 AM, Kireet kireet-Teh5dPVPL8nQT0dZR+**
 AlfA-XMD5yJDbdMReXY1tMh2IBg@**public.gmane.orgalfa-xmd5yjdbdmrexy1tmh2ibg-xmd5yjdbdmrexy1tmh2...@public.gmane.org
 kireet-Teh5dPVPL8nQT0dZR%**2BAlfA-XMD5yJDbdMReXY1tMh2IBg@**
 public.gmane.orgkireet-teh5dpvpl8nqt0dzr%252balfa-xmd5yjdbdmrexy1tmh2ibg-xmd5yjdbdmrexy1tmh2...@public.gmane.org


 wrote:



 Would there be a problem if our cluster is 0.94 and we use a 0.95

 client?


 I am not familiar with the HBase code base, but I did a dump of the
 thread that is actually running (below). It seems like it is related

 to the

 issue you mentioned as the running thread is doing the prefetch logic.
 Would pre-splitting tables help here? We are doing some performance

 tests

 and essentially starting from an empty instance.

 java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 at java.lang.Object.wait(Object.java:503)
 at org.apache.zookeeper.ClientCnxn.submitRequest(**
 ClientCnxn.java:1309)
 - locked 0xe10cf830 (a org.apache.zookeeper.**
 ClientCnxn$Packet)
 at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1036)
 at org.apache.hadoop.hbase.zookeeper.
 RecoverableZooKeeper.exists(**
 RecoverableZooKeeper.java:172)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(
 ZKUtil.java:450)
 at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.**
 checkIfBaseNodeAvailable(ZooKeeperNodeTracker.java:208)
 at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.**
 waitRootRegionLocation(RootRegionTracker.java:77)
 at org.apache.hadoop.hbase.client.HConnectionManager$**
 HConnectionImplementation.locateRegion(
 HConnectionManager.java:874)
 at org.apache.hadoop.hbase.client.HConnectionManager$**
 HConnectionImplementation.locateRegionInMeta(**
 HConnectionManager.java:987)
 at org.apache.hadoop.hbase.client.HConnectionManager$**
 HConnectionImplementation.locateRegion(