Re: Loading data from Hive to HBase takes too long

2013-08-20 Thread Hao Ren

Hi, Lars

Thank you for your reply, and sorry for the lack of clarity.

Actually, the HBase daemon is running only on the master, just one server. It 
uses HDFS as its storage.
The input data is on EBS. It is written into HBase, which runs on HDFS 
backed by EBS.


The only tuning I did is:

<property>
  <name>hbase.client.scanner.caching</name>
  <value>1</value>
</property>

That makes count(*) fast.
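
For what it's worth, the same caching can also be set per scan from the Java client, which avoids changing a cluster-wide default. A minimal sketch, assuming the 0.92-era client API (the table name is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ScanCachingSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "test");   // example table name
    Scan scan = new Scan();
    // rows fetched per RPC; overrides hbase.client.scanner.caching for this scan only
    scan.setCaching(100);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // process each row here
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}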

When loading to HDFS directly, it finishes in less than 10 minutes.

In addition, when loading another data set with a different schema (about 
700 MB) into HBase, it takes only a few minutes.

Thank you again.

Hao.

On 20/08/2013 01:51, lars hofhansl wrote:

Hi Hao,

how do you run HBase in pseudo-distributed mode, yet with 3 slaves?
Where is the data written in EC2? EBS or local storage?
Did you do any other tuning at the HBase or HDFS level (server side)?

If your replication level is still set to 3 you're seeing somewhat of a worst 
case scenario, where each node gets 100% of all writes, and the speed is always 
dominated by your slowest machine.
How does Hive perform here when you write to HDFS directly?

Sorry, many questions :)

-- Lars


From: Hao Ren h@claravista.fr
To: user@hbase.apache.org
Sent: Monday, August 19, 2013 1:50 AM
Subject: Re: Loading data from Hive to HBase takes too long


Update:

There are 1 master and 3 slaves in my cluster.
They are all m1.medium instances.

Instance Family: General purpose
Instance Type: m1.medium
Processor Arch: 32-bit or 64-bit
vCPU: 1
ECU: 2
Memory (GiB): 3.75
Instance Storage (GB): 1 x 410
EBS-optimized Available: -
Network Performance: Moderate


On 19/08/2013 10:44, Hao Ren wrote:

Update:

I messed up some queries, here are the right ones:

CREATE TABLE hbase_table (
  material_id int,
  new_id_client int,
  last_purchase_date int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
  ":key,cf1:idclt,cf1:dt_last_purchase")
TBLPROPERTIES ("hbase.table.name" = "test");

INSERT OVERWRITE TABLE hbase_table
SELECT * FROM test;  -- takes a long time (about 8 hours)

# bin/hadoop dfs -dus /user/hive/warehouse/test
hdfs://ec2-54-234-17-36.compute-1.amazonaws.com:9010/user/hive/warehouse/test
1318012108

The table 'test' is just about 1.3 GB.



On 19/08/2013 10:40, Hao Ren wrote:

Hi,

I am running Hive and HBase on the same Amazon EC2 cluster, where
HBase is in pseudo-distributed mode.

After integrating HBase with Hive, I find that it takes a long time
to run an INSERT OVERWRITE query from Hive in order to load
data into a related HBase table.

In fact, the size of the data is about 1.3 GB. I don't think this is normal.

Maybe there is something wrong with my configuration.

Here are some queries:

CREATE TABLE hbase_table (
  material_id int,
  new_id_client int,
  last_purchase_date int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
  ":key,cf1:idclt,cf1:dt_last_purchase")
TBLPROPERTIES ("hbase.table.name" = "test");

INSERT OVERWRITE TABLE t_LIGNES_DERN_VENTES
SELECT * FROM test;  -- takes a long time (about 8 hours)


Here are some configuration files for my cluster:

# cat hive/conf/hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>ip-10-159-41-177.ec2.internal</value>
  </property>

  <property>
    <name>hive.aux.jars.path</name>
    <value>/root/hive/build/dist/lib/hive-hbase-handler-0.9.0-amplab-4.jar,/root/hive/build/dist/lib/hbase-0.92.0.jar,/root/hive/build/dist/lib/zookeeper-3.4.3.jar,/root/hive/build/dist/lib/guava-r09.jar</value>
  </property>

  <property>
    <name>hbase.client.scanner.caching</name>
    <value>1</value>
  </property>

</configuration>

# cat hbase-0.92.0/conf/hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://ec2-54-234-17-36.compute-1.amazonaws.com:9010/hbase</value>
  </property>

  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>ip-10-159-41-177.ec2.internal</value>
  </property>

  <property>
    <name>hbase.client.scanner.caching</name>
    <value>1</value>
  </property>

</configuration>

Any help is highly appreciated!

Thank you.

Hao








--
Hao Ren
ClaraVista
www.claravista.fr


Replication queue?

2013-08-20 Thread Jean-Marc Spaggiari
Hi,

If I have a master -> slave replication setup, and the master goes down, replication
will start back where it was when the master comes back online. Fine.
If I have a master -> slave replication setup, and the slave goes down, is the data
queued until the slave comes back online and then sent? If so, how big can
this queue be, and how long can the slave be down?

Same questions for master -> master... I guess for this one, it's like
the first case above and it's fine, right?

Thanks,

JM


Does HBase support parallel table scan if I use MapReduce

2013-08-20 Thread yonghu
Hello,

I know that if I use the default scan API, HBase scans a table in a serial manner, as
it needs to guarantee the order of the returned tuples. My question is: if I
use MapReduce to read the HBase table and directly output the results to
HDFS, not returning them to the client, is the HBase scan still serial,
or can it run as a parallel scan in this situation?

Thanks!

Yong
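
As far as I know, when a table is read through TableInputFormat, each region becomes one map split, so the regions are scanned in parallel while every individual region is still read sequentially. A minimal sketch of a map-only job that writes straight to HDFS, assuming the 0.92-era API (the table, family, qualifier, and output path names are made-up examples):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ParallelScanToHdfs {

  static class ExportMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // write the row key and one column value straight to HDFS
      byte[] v = value.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("idclt"));
      context.write(new Text(Bytes.toString(row.get())),
                    new Text(v == null ? "" : Bytes.toString(v)));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "parallel-scan-to-hdfs");
    job.setJarByClass(ParallelScanToHdfs.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch more rows per RPC
    scan.setCacheBlocks(false);  // don't pollute the block cache during a full scan

    // one map task per region, so regions are scanned in parallel
    TableMapReduceUtil.initTableMapperJob("test", scan, ExportMapper.class,
        Text.class, Text.class, job);

    job.setNumReduceTasks(0);    // map-only: output goes straight to HDFS
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/hbase-export"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}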


Performance penalty: Custom Filter names serialization

2013-08-20 Thread Pablo Medina
Hi all,

I'm using custom filters to retrieve filtered data from HBase using the
native API. I noticed that the full class names of those custom filters are
being sent as the byte representation of the string using
Text.writeString(). This consumes a lot of network bandwidth in my case, due
to using 5 custom filters per Get and issuing 1.5 million Gets per minute.
I took a look at the code (org.apache.hadoop.hbase.io.HbaseObjectWritable)
and it seems that HBase registers its known classes (Get, Put, etc...) and
associates them with an integer (CODE_TO_CLASS and CLASS_TO_CODE). That
integer is sent instead of the full class name for those known classes. I
did a test reducing my custom filter class names to 2 or 3 letters and it
improved my performance by 25%.
Is there any way to register my custom filter classes so they behave the same
as HBase's classes? If not, does it make sense to introduce a change to do
that? Is there any other workaround for this issue?

Thanks!
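
To make the overhead concrete, here is a rough, self-contained sketch comparing the bytes for a full class name written with Text.writeString() against a small integer code. The filter class name is made up, and this is only an approximation, not HbaseObjectWritable's exact wire format:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.Text;

public class FilterNameOverhead {
  public static void main(String[] args) throws Exception {
    String filterClass = "com.example.filters.MyCustomRowFilter"; // hypothetical custom filter

    // What an unregistered class costs on the wire: the full class name as a string.
    ByteArrayOutputStream byName = new ByteArrayOutputStream();
    Text.writeString(new DataOutputStream(byName), filterClass);

    // Roughly what a registered class costs: a small integer code instead of the name.
    ByteArrayOutputStream byCode = new ByteArrayOutputStream();
    new DataOutputStream(byCode).writeInt(42);

    System.out.println("full class name: " + byName.size() + " bytes");
    System.out.println("class code:      " + byCode.size() + " bytes");
    // Multiplied by 5 filters per Get and ~1.5M Gets per minute, the name bytes add up.
  }
}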


Re: Performance penalty: Custom Filter names serialization

2013-08-20 Thread Ted Yu
Are you using HBase 0.92 or 0.94 ?

In 0.95 and later releases, HbaseObjectWritable doesn't exist. Protobuf is
used for communication.

Cheers


On Tue, Aug 20, 2013 at 8:56 AM, Pablo Medina pablomedin...@gmail.com wrote:

 Hi all,

 I'm using custom filters to retrieve filtered data from HBase using the
 native api. I noticed that the class full names of those custom filters is
 being sent as the bytes representation of the string using
 Text.writeString(). This consumes a lot of network bandwidth in my case due
 to using 5 custom filters per Get and issuing 1.5 million gets per minute.
 I took at look at the code (org.apache.hadoop.hbase.io.HbaseObjectWritable)
 and It seems that HBase registers its known classes (Get, Put, etc...) and
 associates them with an Integer (CODE_TO_CLASS and CLASS_TO_CODE). That
 integer is sent instead of the full class name for those known classes. I
 did a test reducing my custom filter class names to 2 or 3 letters and it
 improved my performance in 25%.
 Is there any way to register my custom filter classes to behave the same
 as HBase's classes? If not, does it make sense to introduce a change to do
 that? Is there any other workaround for this issue?

 Thanks!



Re: Performance penalty: Custom Filter names serialization

2013-08-20 Thread Jean-Marc Spaggiari
But even if we are using Protobuf, he is going to face the same issue,
right?

We should have a way to send the filter once, with a number, to tell the
region servers that, moving forward, this filter will be represented by that
number. There is some risk of re-using a number already assigned to another filter,
but I'm sure we can come up with some mechanism to avoid that.
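
Something along those lines could look like the sketch below. This is purely hypothetical, not an existing HBase API; both the client and the region servers would have to agree on the mapping, and codes are never re-issued, which sidesteps the reuse risk mentioned above:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical registry: assign each filter class a stable small code once,
// then put the code on the wire instead of the full class name.
public class FilterCodeRegistry {
  private final ConcurrentHashMap<String, Integer> classToCode = new ConcurrentHashMap<String, Integer>();
  private final ConcurrentHashMap<Integer, String> codeToClass = new ConcurrentHashMap<Integer, String>();
  private final AtomicInteger nextCode = new AtomicInteger(1000); // start above any built-in codes

  public synchronized int register(String filterClassName) {
    Integer existing = classToCode.get(filterClassName);
    if (existing != null) {
      return existing; // re-registering the same class returns the same code
    }
    int code = nextCode.getAndIncrement(); // codes are never re-issued
    classToCode.put(filterClassName, code);
    codeToClass.put(code, filterClassName);
    return code;
  }

  public String classNameFor(int code) {
    String name = codeToClass.get(code);
    if (name == null) {
      throw new IllegalArgumentException("Unknown filter code: " + code);
    }
    return name;
  }
}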

2013/8/20 Ted Yu yuzhih...@gmail.com

 Are you using HBase 0.92 or 0.94 ?

 In 0.95 and later releases, HbaseObjectWritable doesn't exist. Protobuf is
 used for communication.

 Cheers


 On Tue, Aug 20, 2013 at 8:56 AM, Pablo Medina pablomedin...@gmail.com
 wrote:

  Hi all,
 
  I'm using custom filters to retrieve filtered data from HBase using the
  native api. I noticed that the class full names of those custom filters
 is
  being sent as the bytes representation of the string using
  Text.writeString(). This consumes a lot of network bandwidth in my case
 due
  to using 5 custom filters per Get and issuing 1.5 million gets per
 minute.
  I took at look at the code
 (org.apache.hadoop.hbase.io.HbaseObjectWritable)
  and It seems that HBase registers its known classes (Get, Put, etc...)
 and
  associates them with an Integer (CODE_TO_CLASS and CLASS_TO_CODE). That
  integer is sent instead of the full class name for those known classes. I
  did a test reducing my custom filter class names to 2 or 3 letters and it
  improved my performance in 25%.
  Is there any way to register my custom filter classes to behave the
 same
  as HBase's classes? If not, does it make sense to introduce a change to
 do
  that? Is there any other workaround for this issue?
 
  Thanks!
 



Re: Replication queue?

2013-08-20 Thread Jean-Daniel Cryans
You can find a lot here: http://hbase.apache.org/replication.html

And how many logs you can queue depends on how much disk space you have :)


On Tue, Aug 20, 2013 at 7:23 AM, Jean-Marc Spaggiari 
jean-m...@spaggiari.org wrote:

 Hi,

 If I have a master - slave replication, and master went down, replication
 will start back where it was when master will come back online. Fine.
 If I have a master - slave replication, and slave went down, is the data
 queued until the slave come back online and then sent? If so, how big can
 be this queu, and how long can the slave be down?

 Same questions for master - master... I guess for this one, it's like for
 the 1 line above and it's fine, right?

 Thanks,

 JM



Re: Replication queue?

2013-08-20 Thread Jean-Marc Spaggiari
RTFM? ;)

Thanks for pointing me to this link! I have all the responses I need there.

JM

2013/8/20 Jean-Daniel Cryans jdcry...@apache.org

 You can find a lot here: http://hbase.apache.org/replication.html

 And how many logs you can queue is how much disk space you have :)


 On Tue, Aug 20, 2013 at 7:23 AM, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org wrote:

  Hi,
 
  If I have a master - slave replication, and master went down,
 replication
  will start back where it was when master will come back online. Fine.
  If I have a master - slave replication, and slave went down, is the data
  queued until the slave come back online and then sent? If so, how big can
  be this queu, and how long can the slave be down?
 
  Same questions for master - master... I guess for this one, it's like
 for
  the 1 line above and it's fine, right?
 
  Thanks,
 
  JM
 



Re: Major Compaction in 0.90.6

2013-08-20 Thread Jean-Daniel Cryans
On Mon, Aug 19, 2013 at 11:52 PM, Monish r monishs...@gmail.com wrote:

 Hi Jean,


s/Jean/Jean-Daniel ;)


 Thanks for the explanation.

 Just a clarification on the third answer,

 In our current cluster (0.90.6), I find that, irrespective of whether TTL
 is set or not, major compaction rewrites the HFile for the region (there
 is only one HFile for that region) on every manual major compaction
 trigger.



Can you enable DEBUG logs? You'd see why the major compaction is triggered.
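
For example, assuming the stock log4j setup, something like this in conf/log4j.properties on the region servers (followed by a restart) should surface the compaction decisions:

log4j.logger.org.apache.hadoop.hbase.regionserver=DEBUG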




 log :

 2013-08-19 14:15:29,926 INFO org.apache.hadoop.hbase.regionserver.Store:
 Completed major compaction of 1 file(s), new

 file=hdfs://x.x.x.x:9000/hbase/NOTIFICATION_HISTORY/b00086bca62ee55796a960002291aca4/n/4754838096619480671

 I find a new file is created for every major compaction trigger.

 Regards,
 R.Monish


 On Mon, Aug 19, 2013 at 11:52 PM, Jean-Daniel Cryans jdcry...@apache.org
 wrote:

  Inline.
 
  J-D
 
 
  On Mon, Aug 19, 2013 at 2:48 AM, Monish r monishs...@gmail.com wrote:
 
   Hi guys,
   I have the following questions in HBASE 0.90.6
  
   1. Does hbase use only one compaction thread to handle both major and
  minor
   compaction?
  
 
  Yes, look at CompactSplitThread
 
 
  
   2. If hbase uses multiple compaction threads, which configuration
  parameter
   defines the number of compaction threads?
  
 
  It doesn't in 0.90.6 but CompactSplitThread lists those for 0.92+
 
  hbase.regionserver.thread.compaction.large
  hbase.regionserver.thread.compaction.small
 
 
  
   3. After hbase.majorcompaction.interval from last major compaction ,if
   major compaction is executed on a table already major compacted Does
  hbase
   skip all the table regions from major compaction?
  
 
  Determining if something is major-compacted is definitely not at the
  table-level.
 
  In 0.90.6, MajorCompactionChecker will ask HRegion.isMajorCompaction() to
  check if it needs to major compact again, which in turn checks every
  Store. FWIW, if you have TTL turned on it will still major compact a major
  compacted file; HFiles don't have an index of what's deleted or TTL'd, and
  it doesn't do a full read of each file to check.
 
 
  
   Regards,
   R.Monish
  
 



Let's talk about joins...

2013-08-20 Thread Michael Segel

When you start looking at secondary indexes, they really become powerful when 
you want to join two tables. 
(Something I thought was already being discussed) 

So you can use an inverted table as a secondary index, with one small glitch... 
You then create a table of indexes, where each row represents an index and the 
columns are the rowkeys in that index. 
(Call it a foreign key table.) 

Now for the glitch... what happens when your row exceeds the width of your 
region. ;-) 
There's a solution for that. ;-) 
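
A minimal, hedged sketch of that inverted-table layout with the 0.92-era client API (the table, family, and key names are made up; the index row key is the indexed value, and each column qualifier is a row key from the main table):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SecondaryIndexWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable main = new HTable(conf, "orders");             // hypothetical main table
    HTable index = new HTable(conf, "orders_by_client");  // hypothetical inverted/index table

    byte[] orderKey = Bytes.toBytes("order-00042");
    byte[] clientId = Bytes.toBytes("client-123");

    // main table: rowkey = order id, data in family 'd'
    Put data = new Put(orderKey);
    data.add(Bytes.toBytes("d"), Bytes.toBytes("client"), clientId);
    main.put(data);

    // index table: rowkey = indexed value, one column per matching main-table rowkey
    Put idx = new Put(clientId);
    idx.add(Bytes.toBytes("rk"), orderKey, Bytes.toBytes("")); // qualifier = main rowkey
    index.put(idx);

    // Note: rows with many matching keys grow very wide -- the glitch mentioned above.
    main.close();
    index.close();
  }
}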

The other issue would be asynchronous writes. 

I figured that one should get the talk started now, rather than wait until 
later. 

This is why you want secondary indexes. The other issue... theta joins, but 
let's save that for later.

The opinions expressed here are mine, while they may reflect a cognitive 
thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com







Re: Performance penalty: Custom Filter names serialization

2013-08-20 Thread Federico Gaule

Hi everyone,

I'm facing the same issue as Pablo. Renaming the classes I use in an HBase 
context improved network usage by more than 20%. It would be really nice to 
have an improvement in this area.




On 08/20/2013 01:15 PM, Jean-Marc Spaggiari wrote:

But even if we are using Protobuf, he is going to face the same issue,
right?

We should have a way to send the filter once with a number to say to the
regions that this filter, moving forward, will be represented by this
number. There is some risk to re-use a number of a filter already using it,
but I'm sure we can come with some mechanism to avoid that.

2013/8/20 Ted Yu yuzhih...@gmail.com


Are you using HBase 0.92 or 0.94 ?

In 0.95 and later releases, HbaseObjectWritable doesn't exist. Protobuf is
used for communication.

Cheers


On Tue, Aug 20, 2013 at 8:56 AM, Pablo Medina pablomedin...@gmail.com

wrote:
Hi all,

I'm using custom filters to retrieve filtered data from HBase using the
native api. I noticed that the class full names of those custom filters

is

being sent as the bytes representation of the string using
Text.writeString(). This consumes a lot of network bandwidth in my case

due

to using 5 custom filters per Get and issuing 1.5 million gets per

minute.

I took at look at the code

(org.apache.hadoop.hbase.io.HbaseObjectWritable)

and It seems that HBase registers its known classes (Get, Put, etc...)

and

associates them with an Integer (CODE_TO_CLASS and CLASS_TO_CODE). That
integer is sent instead of the full class name for those known classes. I
did a test reducing my custom filter class names to 2 or 3 letters and it
improved my performance in 25%.
Is there any way to register my custom filter classes to behave the

same

as HBase's classes? If not, does it make sense to introduce a change to

do

that? Is there any other workaround for this issue?

Thanks!





Re: Chocolatey package for Windows

2013-08-20 Thread Nick Dimiduk
Hi Andrew,

I don't think the homebrew recipes are managed by an HBase developer.
Rather, someone in the community has taken it upon themselves to
provide the project through brew. Likewise, the Apache HBase project does
not provide RPM or DEB packages, but you're likely to find them if you look
around.

Maybe you can find a willing maintainer on the users@ list? (I don't run
Windows very often so I won't make a good volunteer)

Thanks,
Nick

On Tuesday, August 20, 2013, Andrew Pennebaker wrote:

 Could we automate the installation process for Windows with a
 Chocolatey (http://chocolatey.org/) package, the way we offer a
 Homebrew formula
 (https://github.com/mxcl/homebrew/blob/master/Library/Formula/hbase.rb)
 for Mac OS X?