Re: Loading data from Hive to HBase takes too long
Hi, Lars,

Thank you for your reply, and sorry for the lack of clarity. Actually, the HBase daemon is running only on the master, just one server. It uses HDFS as its storage. The input data is on EBS; it is written into HBase, which sits on HDFS backed by EBS. The only tuning I did is:

<property>
  <name>hbase.client.scanner.caching</name>
  <value>1</value>
</property>

That makes count(*) fast. When loading to HDFS directly, it finishes in less than 10 minutes. In addition, loading another data set with a different schema, about 700 MB, into HBase takes only a few minutes.

Thank you again.

Hao

On 20/08/2013 01:51, lars hofhansl wrote:

Hi Hao,

How do you run HBase in pseudo-distributed mode, yet with 3 slaves? Where is the data written in EC2? EBS or local storage? Did you do any other tuning at the HBase or HDFS level (server side)?

If your replication level is still set to 3, you're seeing somewhat of a worst-case scenario, where each node gets 100% of all writes, and the speed is always dominated by your slowest machine.

How does Hive perform here when you write to HDFS directly?

Sorry, many questions :)

-- Lars

From: Hao Ren h@claravista.fr
To: user@hbase.apache.org
Sent: Monday, August 19, 2013 1:50 AM
Subject: Re: Loading data from Hive to HBase takes too long

Update: There are 1 master and 3 slaves in my cluster. They are all m1.medium instances.
Instance Family: General purpose
Instance Type: m1.medium
Processor Arch: 32-bit or 64-bit
vCPU: 1
ECU: 2
Memory (GiB): 3.75
Instance Storage (GB): 1 x 410
EBS-optimized Available: -
Network Performance: Moderate

On 19/08/2013 10:44, Hao Ren wrote:

Update: I messed up some queries; here are the right ones:

CREATE TABLE hbase_table (
  material_id int,
  new_id_client int,
  last_purchase_date int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:idclt,cf1:dt_last_purchase")
TBLPROPERTIES ("hbase.table.name" = "test");

INSERT OVERWRITE TABLE hbase_table SELECT * FROM test; -- takes a long time (about 8 hours)

# bin/hadoop dfs -dus /user/hive/warehouse/test
hdfs://ec2-54-234-17-36.compute-1.amazonaws.com:9010/user/hive/warehouse/test  1318012108

The table 'test' is just about 1.3 GB.

On 19/08/2013 10:40, Hao Ren wrote:

Hi,

I am running Hive and HBase on the same Amazon EC2 cluster, where HBase is in pseudo-distributed mode. After integrating HBase with Hive, I find that it takes a long time to run an INSERT OVERWRITE query from Hive in order to load data into a related HBase table. In fact, the size of the data is about 1.3 GB. I don't think this is normal; maybe there is something wrong with my configuration.

Here are some queries:

CREATE TABLE hbase_table (
  material_id int,
  new_id_client int,
  last_purchase_date int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:idclt,cf1:dt_last_purchase")
TBLPROPERTIES ("hbase.table.name" = "test");

INSERT OVERWRITE TABLE t_LIGNES_DERN_VENTES SELECT * FROM test; -- takes a long time (about 8 hours)

Here are some configuration files for my cluster:

# cat hive/conf/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>ip-10-159-41-177.ec2.internal</value>
  </property>
  <property>
    <name>hive.aux.jars.path</name>
    <value>/root/hive/build/dist/lib/hive-hbase-handler-0.9.0-amplab-4.jar,/root/hive/build/dist/lib/hbase-0.92.0.jar,/root/hive/build/dist/lib/zookeeper-3.4.3.jar,/root/hive/build/dist/lib/guava-r09.jar</value>
  </property>
  <property>
    <name>hbase.client.scanner.caching</name>
    <value>1</value>
  </property>
</configuration>

# cat hbase-0.92.0/conf/hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://ec2-54-234-17-36.compute-1.amazonaws.com:9010/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>ip-10-159-41-177.ec2.internal</value>
  </property>
  <property>
    <name>hbase.client.scanner.caching</name>
    <value>1</value>
  </property>
</configuration>

Any help is highly appreciated!

Thank you.

Hao

--
Hao Ren
ClaraVista
www.claravista.fr
Replication queue?
Hi,

If I have master -> slave replication and the master goes down, replication will pick up where it left off when the master comes back online. Fine.

If I have master -> slave replication and the slave goes down, is the data queued until the slave comes back online and then sent? If so, how big can this queue get, and how long can the slave be down?

Same questions for master <-> master... I guess for this one it's like the first case above, and it's fine, right?

Thanks,

JM
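The queuing behavior being asked about can be pictured with a toy model (plain Python, not HBase code; the names and the capacity bound are made up for illustration — in HBase the queue holds WALs and the real limit is disk space, as the reply below in this thread notes):

```python
from collections import deque

# Toy model of a replication queue: while the slave is unreachable, the
# master keeps finished write-ahead logs (WALs) queued on disk; when the
# slave comes back, the queue is drained in order. The practical bound on
# how long the slave can stay down is how much disk the queue may consume.
queue = deque()
disk_capacity = 5  # hypothetical: pretend we can hold at most 5 WALs

def master_writes(wal, slave_up, shipped):
    """Ship the WAL if the slave is up, otherwise queue it (until disk runs out)."""
    if slave_up:
        shipped.append(wal)
    elif len(queue) < disk_capacity:
        queue.append(wal)
    else:
        raise RuntimeError("out of disk: replication queue overflow")

shipped = []
# Slave is down for three WALs' worth of writes...
for i in range(3):
    master_writes(f"wal-{i}", slave_up=False, shipped=shipped)
# ...then recovers: the queue is drained in order, so nothing is lost.
while queue:
    shipped.append(queue.popleft())
assert shipped == ["wal-0", "wal-1", "wal-2"]
```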
Does HBase support parallel table scans if I use MapReduce?
Hello,

I know that if I use the default scan API, HBase scans the table in a serial manner, as it needs to guarantee the order of the returned tuples. My question is: if I use MapReduce to read the HBase table and output the results directly to HDFS, rather than returning them to the client, is the HBase scan still serial, or can it run as a parallel scan in this situation?

Thanks!

Yong
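The intuition behind a parallel MapReduce read can be sketched with a toy model (plain Python, not the HBase API; the table contents and region boundaries are invented). The idea is that a MapReduce job over HBase typically gets one input split per region, so each map task scans only its own key range, and no global ordering is needed when the output goes straight to HDFS:

```python
# Toy "table": 100 rows with sortable keys.
table = {f"row{i:04d}": i for i in range(100)}

# Hypothetical region boundaries (start key inclusive, end key exclusive;
# None means "to the end of the table").
regions = [("row0000", "row0040"), ("row0040", "row0080"), ("row0080", None)]

def scan_range(start, stop):
    """One 'map task': scan a single region's key range."""
    return {k: v for k, v in table.items()
            if k >= start and (stop is None or k < stop)}

# Each region could be scanned by a separate map task running in parallel;
# here we run them one after another and merge, to show the union of the
# per-region scans covers exactly the same rows as one serial full scan.
parallel_result = {}
for start, stop in regions:
    parallel_result.update(scan_range(start, stop))

assert parallel_result == table
```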
Performance penalty: Custom Filter names serialization
Hi all,

I'm using custom filters to retrieve filtered data from HBase using the native API. I noticed that the full class names of those custom filters are being sent as the byte representation of the string using Text.writeString(). This consumes a lot of network bandwidth in my case, since I use 5 custom filters per Get and issue 1.5 million Gets per minute. I took a look at the code (org.apache.hadoop.hbase.io.HbaseObjectWritable), and it seems that HBase registers its known classes (Get, Put, etc.) and associates them with an integer (CODE_TO_CLASS and CLASS_TO_CODE). That integer is sent instead of the full class name for those known classes. In a test, reducing my custom filter class names to 2 or 3 letters improved my performance by 25%.

Is there any way to register my custom filter classes so they behave the same as HBase's classes? If not, does it make sense to introduce a change to do that? Is there any other workaround for this issue?

Thanks!
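A rough back-of-the-envelope model of why the class name matters (plain Python, not HBase's actual wire code; the class name and the integer code are made up, and the length-prefix format only approximates what Text.writeString() emits):

```python
import struct

# Hypothetical custom filter class name (not a real class).
full_name = "com.example.hbase.filters.MyCustomFilter"

# Writable-style string encoding: a length prefix followed by the UTF-8
# bytes of the class name -- roughly what gets sent per filter today.
name_bytes = struct.pack(">H", len(full_name)) + full_name.encode("utf-8")

# Registered-class encoding: a fixed-size integer code, as HBase does for
# its built-in classes via CLASS_TO_CODE / CODE_TO_CLASS.
code_bytes = struct.pack(">i", 42)

print(len(name_bytes), len(code_bytes))

# At 5 filters per Get and 1.5 million Gets per minute, the per-filter
# difference multiplies into substantial bandwidth.
per_minute_saving = (len(name_bytes) - len(code_bytes)) * 5 * 1_500_000
```

This is consistent with the observation above that shortening the class names to 2 or 3 letters noticeably improved throughput: the payload per filter shrinks from the full dotted name to a handful of bytes.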
Re: Performance penalty: Custom Filter names serialization
Are you using HBase 0.92 or 0.94? In 0.95 and later releases, HbaseObjectWritable doesn't exist; Protobuf is used for communication.

Cheers

On Tue, Aug 20, 2013 at 8:56 AM, Pablo Medina pablomedin...@gmail.com wrote: [...]
Re: Performance penalty: Custom Filter names serialization
But even if we are using Protobuf, he is going to face the same issue, right?

We should have a way to send the filter once, along with a number, to tell the regions that from then on this filter will be represented by that number. There is some risk of re-using a number already assigned to another filter, but I'm sure we can come up with some mechanism to avoid that.

2013/8/20 Ted Yu yuzhih...@gmail.com: [...]
Re: Replication queue?
You can find a lot here: http://hbase.apache.org/replication.html

And how many logs you can queue is bounded by how much disk space you have :)

On Tue, Aug 20, 2013 at 7:23 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: [...]
Re: Replication queue?
RTFM? ;) Thanks for pointing me to this link! I have all the responses I need there.

JM

2013/8/20 Jean-Daniel Cryans jdcry...@apache.org: [...]
Re: Major Compaction in 0.90.6
On Mon, Aug 19, 2013 at 11:52 PM, Monish r monishs...@gmail.com wrote:

Hi Jean,

s/Jean/Jean-Daniel ;)

Thanks for the explanation. Just a clarification on the third answer: in our current cluster (0.90.6), I find that irrespective of whether TTL is set or not, major compaction rewrites the HFile for the region (there is only one HFile for that region) on every manual major compaction trigger.

Can you enable DEBUG logs? You'd see why the major compaction is triggered.

log:
2013-08-19 14:15:29,926 INFO org.apache.hadoop.hbase.regionserver.Store: Completed major compaction of 1 file(s), new file=hdfs://x.x.x.x:9000/hbase/NOTIFICATION_HISTORY/b00086bca62ee55796a960002291aca4/n/4754838096619480671

I find a new file is created for every major compaction trigger.

Regards,
R.Monish

On Mon, Aug 19, 2013 at 11:52 PM, Jean-Daniel Cryans jdcry...@apache.org wrote:

Inline.

J-D

On Mon, Aug 19, 2013 at 2:48 AM, Monish r monishs...@gmail.com wrote:

Hi guys, I have the following questions about HBase 0.90.6:

1. Does HBase use only one compaction thread to handle both major and minor compactions?

Yes, look at CompactSplitThread.

2. If HBase uses multiple compaction threads, which configuration parameter defines the number of compaction threads?

It doesn't in 0.90.6, but CompactSplitThread lists those for 0.92+:
hbase.regionserver.thread.compaction.large
hbase.regionserver.thread.compaction.small

3. After hbase.majorcompaction.interval from the last major compaction, if major compaction is executed on a table that was already major compacted, does HBase skip all of that table's regions during major compaction?

Determining whether something is major-compacted is definitely not done at the table level. In 0.90.6, MajorCompactionChecker will ask HRegion.isMajorCompaction() to check if it needs to major compact again, which in turn checks every Store.
FWIW, if you have TTL turned on, it will still major compact an already major-compacted file; HFiles don't have an index of what's deleted or TTL'd, and it doesn't do a full read of each file to check.

Regards,
R.Monish
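The point about TTL forcing a rewrite can be illustrated with a toy model (plain Python, not the HBase compaction code; the cell layout and TTL value are invented). Since an HFile carries no index of which cells have expired, purging them means rewriting the whole file, even if it was already major-compacted:

```python
# Toy model: cells are (timestamp, value) pairs inside one "HFile".
# There is no per-file index of expired cells, so the only way to drop
# them is to rewrite the file, filtering as we go.
ttl_seconds = 60  # hypothetical TTL for this column family

def major_compact(cells, now):
    """Rewrite the 'file', keeping only cells still within their TTL."""
    return [(ts, v) for ts, v in cells if now - ts <= ttl_seconds]

now = 1000.0
cells = [(now - 10, "fresh"), (now - 300, "expired")]
compacted = major_compact(cells, now)
assert compacted == [(now - 10, "fresh")]
```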
Let's talk about joins...
When you start looking at secondary indexing, it really becomes powerful when you want to join two tables. (Something I thought was already being discussed.)

So you can use the inverted table as a secondary index, with one small glitch... And then create a table of indexes, where each row represents an index and the columns are the row keys in that index. (Call it a foreign key table.)

Now for the glitch... what happens when your row exceeds the width of your region? ;-) There's a solution for that. ;-)

The other issue would be asynchronous writes. I figured one should get the talk started now rather than wait until later. This is why you want secondary indexes. The other issue... theta joins, but let's save that for later.

The opinions expressed here are mine; while they may reflect a cognitive thought, that is purely accidental. Use at your own risk.

Michael Segel
michael_segel (AT) hotmail.com
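The inverted-table idea above can be sketched as a toy model (plain Python, not an HBase client; the table contents, column names, and key format are all made up for illustration). The index table maps an indexed value to the row keys that carry it, so a "join" becomes one index lookup followed by point gets instead of a full table scan:

```python
# Toy data table: row key -> columns.
orders = {
    "order#001": {"cust": "c1", "total": 10},
    "order#002": {"cust": "c2", "total": 25},
    "order#003": {"cust": "c1", "total": 7},
}

# Inverted "foreign key" table: one row per indexed value, whose columns
# are the matching row keys. (This is the row that can outgrow a region
# when one value is very common -- the "glitch" mentioned above.)
cust_index = {
    "c1": ["order#001", "order#003"],
    "c2": ["order#002"],
}

def join_by_customer(cust_id):
    """Index lookup + point gets, instead of scanning the whole orders table."""
    return [orders[row] for row in cust_index.get(cust_id, [])]

rows = join_by_customer("c1")
assert [r["total"] for r in rows] == [10, 7]
```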
Re: Performance penalty: Custom Filter names serialization
Hi everyone,

I'm facing the same issue as Pablo. Renaming my classes used in an HBase context improved network usage by more than 20%. It would be really nice to have an improvement around this.

On 08/20/2013 01:15 PM, Jean-Marc Spaggiari wrote: [...]
Re: Chocolatey package for Windows
Hi Andrew,

I don't think the Homebrew recipes are managed by an HBase developer. Rather, someone in the community has taken it upon themselves to provide the project through brew. Likewise, the Apache HBase project does not provide RPM or DEB packages, but you're likely to find them if you look around. Maybe you can find a willing maintainer on the users@ list? (I don't run Windows very often, so I won't make a good volunteer.)

Thanks,
Nick

On Tuesday, August 20, 2013, Andrew Pennebaker wrote:

Could we automate the installation process for Windows with a Chocolatey package (http://chocolatey.org/), the way we offer a Homebrew formula (https://github.com/mxcl/homebrew/blob/master/Library/Formula/hbase.rb) for Mac OS X?