Re: timeouts with lots of coprocessor puts on single row
On Mon, Aug 26, 2013 at 10:56 PM, Olle Mårtensson olle.martens...@gmail.com wrote:

> Thank you for the link Anil, it was a good explanation indeed.
>
> > It's not recommended to do puts/deletes across region servers like this.
>
> That was not my intention; I want to keep the region for the aggregates and the aggregated values on the same server. I read in the link that you gave me that I can achieve this by using a coprocessor on the master, so I will try that out.
>
> > Try to move this aggregation to the client side or at least outside the RS.
>
> This is what I try to avoid, since doing this would cause big data transfers between the client and the region server. The whole purpose of using the coprocessor is to push the aggregation work to the nodes where the data is local and to minimize data transfer between the nodes. Why do you think it's a bad idea to aggregate values inside the region server? Is it because it occupies RPC threads, or because it's not a good use case for coprocessors?

I got the impression that your code is doing inter-RS puts/gets from the coprocessor.

> Do you think it's a bad idea even if I keep the regions for the two rows involved on the same region server and bypass RPC as the link suggests?

In my opinion, then it should be fine. I am not aware of how heavy/complex your aggregations are. Obviously, the more complex your CP (coprocessor) is, the more load you are putting on the RS.

> Thanks // Olle

On Mon, Aug 26, 2013 at 5:43 PM, anil gupta anilgupt...@gmail.com wrote:

> On Mon, Aug 26, 2013 at 7:27 AM, Olle Mårtensson olle.martens...@gmail.com wrote:
>
> > Hi, I have developed a coprocessor that extends BaseRegionObserver and implements the postPut method. The postPut method scans the columns of the row that the put was issued on and calculates an aggregate based on these values; when this is done, a row in another table is updated with the aggregated value.
>
> This is an anti-pattern. It's not recommended to do puts/deletes across region servers like this. Try to move this aggregation to the client side or at least outside the RS. Here is the link to a much more detailed explanation of why this is not good: http://search-hadoop.com/m/XtAi5Fogw32
>
> > This works out fine until I put some stress on one row; then the threads on the region server hosting the table will freeze on flushing the put of the aggregated value. The client application basically does 100 concurrent puts on one row in a tight loop (on the table where the coprocessor is activated). After that the client sleeps for a while and tries to fetch the aggregated value, and here the client freezes and periodically burps out exceptions. It works if I don't run so many puts in parallel.
> >
> > The HBase environment is pseudo-distributed 0.94.11 with one region server. I have tried using a connection pool in the coprocessor, bumping up the heap size of the region server and also upping the number of RPC threads for the region server, but without luck.
> >
> > The pseudo code for postPut would be something like this:
> >
> >     vals = env.getRegion().get(get).getFamilyMap().values()
> >     agg_val = aggregate(vals)
> >     agg_table = env.getTable(aggregates)
> >     agg_table.setAutoFlush(false)
> >     put = new Put()
> >     put.add(agg_val)
> >     agg_table.put(put)
> >     agg_table.flushCommits()
> >     agg_table.close()
> >
> > And the real Clojure variant is: https://gist.github.com/ollez/d0450930a591912aea5d#file-gistfile1-clj
> > The hbase-site.xml: https://gist.github.com/ollez/d0450930a591912aea5d#file-hbase-site-xml
> > The regionserver stacktrace: https://gist.github.com/ollez/d0450930a591912aea5d#file-regionserver-stacktrace
> > The client exceptions: https://gist.github.com/ollez/d0450930a591912aea5d#file-client-exceptions
> >
> > Thanks // Olle

-- Thanks Regards, Anil Gupta
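The freeze described above is consistent with handler-pool exhaustion: a postPut that issues a put back into the same region server needs a second RPC handler while still occupying its own. Below is a toy, HBase-free simulation of that self-deadlock, using a plain Java executor as a stand-in for the RPC handler pool; all names are made up for illustration and none of this is HBase API.

```java
import java.util.concurrent.*;

public class HandlerDeadlockDemo {
    // Returns how many "outer puts" got stuck waiting for a free handler.
    static int timedOut() {
        ExecutorService handlers = Executors.newFixedThreadPool(2); // the "RPC handler pool"
        try {
            CountDownLatch saturated = new CountDownLatch(2);
            Callable<Boolean> outerPut = () -> {
                saturated.countDown();
                saturated.await(); // wait until both handlers are busy with client puts
                // The coprocessor now issues another put into the SAME server:
                // it needs a free handler, but none exists.
                Future<?> innerPut = handlers.submit(() -> { });
                try {
                    innerPut.get(200, TimeUnit.MILLISECONDS);
                    return false;
                } catch (TimeoutException e) {
                    return true; // stuck: every handler is waiting for a handler
                }
            };
            Future<Boolean> a = handlers.submit(outerPut);
            Future<Boolean> b = handlers.submit(outerPut);
            return (a.get() ? 1 : 0) + (b.get() ? 1 : 0);
        } catch (Exception e) {
            return -1;
        } finally {
            handlers.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(timedOut()); // 2: both outer puts time out
    }
}
```

With 100 concurrent client puts against a small handler pool, every handler can end up waiting for a handler, which matches the frozen flush threads in the stack trace; bumping the RPC thread count only raises the load at which this happens.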
HBase-Hive integration performance issues
Hi, I am running Hive and HBase on Amazon EC2. By following the tutorial https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration , I managed to create an HBase table from Hive and insert data into it. It works, but with low performance. To be specific, inserting 1.3 GB (50 M rows, 3 columns) takes 30 minutes; that is far from what I expected, say 100 s.

My EC2 cluster contains 3 slaves and 1 master whose instance type is medium (http://aws.amazon.com/ec2/instance-types/#instance-type). Hadoop 1.0.4 is installed on my cluster. HBase is in pseudo-distributed mode; a region server is running on the master. HDFS is used as storage. Here are some configuration files:

// hive-site.xml
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>ip-10-178-13-39.ec2.internal</value>
  </property>
  <property>
    <name>hive.aux.jars.path</name>
    <value>/root/hive/build/dist/lib/hive-hbase-handler-0.9.0-amplab-4.jar,/root/hive/build/dist/lib/hbase-0.92.0.jar,/root/hive/build/dist/lib/zookeeper-3.4.3.jar,/root/hive/build/dist/lib/guava-r09.jar</value>
  </property>
  <property>
    <name>hbase.client.scanner.caching</name>
    <value>1</value>
  </property>
</configuration>

// hbase-site.xml
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://ec2-54-226-206-28.compute-1.amazonaws.com:9010/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>ip-10-178-13-39.ec2.internal</value>
  </property>
  <property>
    <name>hbase.client.scanner.caching</name>
    <value>1</value>
  </property>
</configuration>

For understanding, I have some questions:

1) In order to improve read performance, I have set hbase.client.scanner.caching to 1. But I don't know how to improve write performance. Is there some basic config to do?
2) Does the distributed mode matter? Does fully-distributed mode have better write performance than pseudo-distributed mode?
3) If the number of region servers is increased, will the write performance be improved?
4) In pseudo-distributed mode (one HBase daemon on the master), when writing data from Hive to an HBase table, is the master the only entry point to HBase? I don't think having all data pass through the master is efficient. I wonder whether it is possible to write data in parallel from Hive to HBase directly using MapReduce?
5) Will HBase bulk loading help a lot?

I am new to HBase, but I really want to integrate HBase in production. Any help is highly appreciated! =)

Hao -- Hao Ren ClaraVista www.claravista.fr
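For question 1, client-side write settings are the usual first step. The fragment below is only a sketch with illustrative values, not tuned advice: `hbase.client.write.buffer` enlarges the client-side buffer used for batching puts, and the Hive HBase handler's `hive.hbase.wal.enabled` can skip the write-ahead log during bulk-style inserts, at the cost of possible data loss if a region server dies mid-load.

```xml
<!-- hive-site.xml (client side); values are illustrative only -->
<property>
  <name>hbase.client.write.buffer</name>
  <value>8388608</value> <!-- 8 MB instead of the 2 MB default -->
</property>
<property>
  <name>hive.hbase.wal.enabled</name>
  <value>false</value> <!-- faster inserts, but not crash-safe -->
</property>
```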
Data Deduplication in HBase
Hi, I have a use case in which I need to store segments of mp3 files in HBase. A song may come to the application in different overlapping segments. For example, a 5 min song can have the following segments: 0-1, 0.5-2, 2-4, 3-5. As seen, some of the data is duplicated (3-4 is present in the last 2 segments). What would be the ideal way of removing this duplicate storage? Will Snappy compression help here, or do I need to write some logic on top of HBase? Also, what if I store a single segment multiple times; will HBase do some sort of deduplication? Regards, Anand
Re: HBase-Hive integration performance issues
Hao, a couple of thoughts here. This could be related to many things.

1. Did you pre-split your regions? If not, you could be hot-spotting on a single server and then waiting for the region to split. If that is the case, you could actually be using only a single server for much of your load (if not all; it depends on the region size you have configured). While running, did you see one system take the full load (via top, Ganglia, or some other tool)?

2. The memory on each of these systems is quite low: 1.7 or 3.7 GB depending on whether it is the compute or memory variant. Either way it is very low, and I'd expect you to be doing a lot of swapping. You'll need 1 GB for each daemon, which leaves you very little room for the OS (at 3.7 GB). Do you see swapping? What are your JVM parameters?

3. Do these same 4 servers run your Hadoop infrastructure and the Hive query? If so, the system is woefully underpowered if you expect to see production-like speed. Running a Hive query on top of an HBase cluster with so few resources will just not work out well in the end ;)

-Matt

On Tue, Aug 27, 2013 at 7:51 AM, Hao Ren h@claravista.fr wrote:
> Hi, I am running Hive and HBase on Amazon EC2. …
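Matt's first point (pre-splitting) can be sketched as follows. This is a minimal stdlib-only illustration of computing evenly spaced split keys for row keys whose first byte is roughly uniform (for example, hashed keys); in real code the array would be passed to `HBaseAdmin.createTable(descriptor, splitKeys)` so writes spread across all regions from the start.

```java
public class SplitKeys {
    // Evenly spaced single-byte split points for an n-way pre-split,
    // assuming the first key byte is uniformly distributed.
    static byte[][] splitKeys(int regions) {
        byte[][] keys = new byte[regions - 1][];
        for (int i = 1; i < regions; i++)
            keys[i - 1] = new byte[] { (byte) (i * 256 / regions) };
        return keys;
    }

    public static void main(String[] args) {
        byte[][] k = splitKeys(4);
        // Pass k to HBaseAdmin.createTable(tableDescriptor, k) to start
        // the table with 4 regions instead of 1.
        System.out.println(k.length); // 3 split points
        System.out.println((k[0][0] & 0xFF) + " " + (k[1][0] & 0xFF) + " " + (k[2][0] & 0xFF)); // 64 128 192
    }
}
```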
Re: Data Deduplication in HBase
bq. Will hbase do some sort of deduplication?

I don't think so. What is the granularity of segment overlap? In the above example, it seems to be 0.5. Cheers

On Tue, Aug 27, 2013 at 7:12 AM, Anand Nalya anand.na...@gmail.com wrote:
> Hi, I have a use case in which I need to store segments of mp3 files in hbase. …
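One common way to get deduplication "for free" in HBase is to normalize overlapping segments to fixed-size chunks and put the chunk position in the row key, so re-storing an overlapping range just rewrites the same cells instead of storing new copies. A stdlib-only sketch at the 0.5-minute granularity from the example; the song id and key format are made up for illustration.

```java
import java.util.*;

public class ChunkDedup {
    // Chunk indices covered by a segment, at 0.5-minute granularity.
    static List<Integer> chunks(double start, double end) {
        List<Integer> out = new ArrayList<>();
        for (int i = (int) Math.round(start / 0.5); i < (int) Math.round(end / 0.5); i++)
            out.add(i);
        return out;
    }

    public static void main(String[] args) {
        double[][] segments = { {0, 1}, {0.5, 2}, {2, 4}, {3, 5} };
        Set<String> rowKeys = new LinkedHashSet<>(); // one HBase row per chunk
        int writes = 0;
        for (double[] s : segments)
            for (int c : chunks(s[0], s[1])) {
                rowKeys.add("song42/" + c); // same chunk -> same row key -> overwrite
                writes++;
            }
        System.out.println(writes + " " + rowKeys.size()); // 13 10
    }
}
```

Of the 13 chunk writes, only 10 distinct rows are stored: the 3 overlapping chunks land on existing row keys. Snappy compression works per block and will not deduplicate across rows.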
Re: HBase-Hive integration performance issues
Matt, Thank you for the lightning reply. I will try out what you have mentioned in these days, thus I could tell you some news on the issue in detail. Thank you again. Your suggestions show me the way. =) Hao Le 27/08/2013 16:13, Matt Davies a écrit : Hao, A couple thoughts here. This could be related to many things. 1. Did you pre-split your regions? If not, you could be hot-spotting on a single server and then waiting for the region to split. If that is the case, you could actually only be using a single server for much of your load (if not all - depends on the region size you have configured) While running did you see one system take the full load (via top, ganglia, or some other tool)? 2. The memory on each of these systems is quite low - 1.7 or 3.7 gb depending if it is compute or memory - either way, it is way low, and I'd expect you to be doing a lot of swapping. You'll need 1 GB for each daemon, which leaves you very little room for the OS (at 3.7 gb). Do you see swapping? What are your JVM parameters? 3. Do these same 4 servers run your Hadoop infrastructure and the hive query? If so, the system is woefully underpowered if you expect to see production-like speed. Running an Hive query on top of an HBase cluster with so few resources will just not work out well in the end ;) -Matt On Tue, Aug 27, 2013 at 7:51 AM, Hao Ren h@claravista.fr wrote: Hi, I am running Hive and HBase on Amazon EC2. By following the tutorial: https://cwiki.apache.org/**confluence/display/Hive/**HBaseIntegrationhttps://cwiki.apache.org/confluence/display/Hive/HBaseIntegration, I managed to create a HBase table from Hive and insert data into it. It works but with a low performance. To be specific, inserting 1.3 Gb (50 M rows, 3 columns) takes 30 mins. It is far from what I excepted, say 100 s. 
Actually, my EC2 cluster contains 3 slaves and 1 master whose instance type is medium(http://aws.amazon.com/**ec2/instance-types/#instance-**typehttp://aws.amazon.com/ec2/instance-types/#instance-type ). Hadoop 1.0.4 is installed on my cluster. HBase is in pseudo-distributed mode. A region server is running on the master. HDFS is used as storage. Here are some configuration files: *// hive-site.xml* configuration property namehbase.zookeeper.quorum/**name valueip-10-178-13-39.ec2.**internal/value /property property namehive.aux.jars.path/**name value/root/hive/build/dist/**lib/hive-hbase-handler-0.9.0-** amplab-4.jar,/root/hive/build/**dist/lib/hbase-0.92.0.jar,/** root/hive/build/dist/lib/**zookeeper-3.4.3.jar,/root/** hive/build/dist/lib/guava-r09.**jar/value /property property namehbase.client.scanner.**caching/name value1/value /property /configuration *// hbase-site.xml* configuration property namehbase.rootdir/name valuehdfs://ec2-54-226-206-**28.compute-1.amazonaws.com:**9010/hbasehttp://ec2-54-226-206-28.compute-1.amazonaws.com:9010/hbase /value /property property namehbase.cluster.**distributed/name valuetrue/value /property property namehbase.zookeeper.quorum/**name valueip-10-178-13-39.ec2.**internal/value /property property namehbase.client.scanner.**caching/name value1/value /property /configuration *For understanding, I have some questions:* 1) In order to improve read performance, I have set hbase.client.scanner.caching to 1. But I don't know how to improve write performance. Is there some basic config to do ? 2) Does the distributed mode matter ? Does fully-distributed mode have better write performance than pseudo-distributed mode ? 3) If the number of region server is increased, will the write performance be improved ? 4) In pseudo-distributed mode (one hbase daemon on master), when writing data from hive to a hbase table, is the master the only entry to HBase ? I don't think all data passes through the master is efficient. 
I wonder whether it is possible write data in parallel from hive to hbase directly in using mapReduce ? 5) Will the HBase bulk loading help a lot ? I am new to HBase, but I really want to integrate HBase in production. Any help is highly appreciated ! =) Hao -- Hao Ren ClaraVista www.claravista.fr -- Hao Ren ClaraVista www.claravista.fr
[Question: replication] why only one regionserver is used during replication? 0.94.9
hi, guys, I am using HBase 0.94.9, with replication set up from a 4-node master cluster (3 region servers) to a 3-node slave cluster (2 region servers). I can tell that all source region servers can successfully replicate data. However, it seems that for each particular table, only one region server handles its replication at any given time.

For example, I am using YCSB to load 1,000,000 rows with workloada, with 16 threads. During the load period, I looked at ageOfLastShippedOp and sizeOfLogQueue. I can tell one of the region servers on the master cluster is doing the replication; while the values of both the age and the log-queue size are growing, the other two region servers don't come in to help.

So does that mean that for each table and process, only one region server will do the replication, regardless of how long the queue is? Or did I miss some setup configuration? Thanks. Demai
Re: [Question: replication] why only one regionserver is used during replication? 0.94.9
Region servers replicate data written to them, so look at how your regions are distributed. J-D

On Tue, Aug 27, 2013 at 11:29 AM, Demai Ni nid...@gmail.com wrote:
> hi, guys, I am using hbase 0.94.9. And setup replication from a 4-nodes master(3 regserver) to a 3-nodes slave(2 regserver). …
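J-D's point implies that if nearly all writes land on regions hosted by one server, that server also carries the whole replication queue. A hedged sketch of key salting to spread sequential keys (such as YCSB's user0, user1, ...) across regions, and therefore across the region servers that ship their edits; the toy additive hash is chosen so the example is easy to verify by hand, and a real implementation would use a proper hash (MD5, Murmur, etc.).

```java
import java.util.*;

public class SaltedKeys {
    static final int BUCKETS = 3; // e.g. match the number of source region servers

    // Toy additive hash for illustration only.
    static int bucket(String key) {
        int s = 0;
        for (char c : key.toCharArray()) s += c;
        return s % BUCKETS;
    }

    // Salted row key "<bucket>|<key>": adjacent keys land in different
    // buckets, hence different regions, spreading both the write load and
    // the replication work that follows it.
    static String salt(String key) {
        return bucket(key) + "|" + key;
    }

    public static void main(String[] args) {
        Set<Integer> seen = new TreeSet<>();
        for (int i = 0; i < 10; i++) seen.add(bucket("user" + i));
        System.out.println(seen); // [0, 1, 2]
    }
}
```

The trade-off is that scans over the original key order now require one scan per bucket prefix.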
Fwd: Hbase 0.94.6 stargate can't use multi get
Hey all, I am trying to use a multi get to retrieve different versions of a row, but it always gives me only one. For example, I have a table log with column family data and qualifier get; I put a lot of versions of row/data:get. Now I try to get all versions of this key. As described in the manual, http://wiki.apache.org/hadoop/Hbase/Stargate (Cell or Row Query (Multiple Values)), I use the browser for the request http://myhost.com:8080/log/data:get/0,1377633354/?v=10 but the response contains only one version of the key instead of all (max 10) available versions. I've been breaking my brain on this problem. Please point me in the right direction. Thanks!
Re: [Question: replication] why only one regionserver is used during replication? 0.94.9
J-D, thanks for the tip.

On Tue, Aug 27, 2013 at 11:40 AM, Jean-Daniel Cryans jdcry...@apache.org wrote:
> Region servers replicate data written to them, so look at how your regions are distributed. J-D
Writing multiple tables from reducer
Hi, I am new to HBase and am trying to achieve the following. I am reading data from HDFS in the mapper and parsing it. So, in the reducer I want my output to go to HBase instead of HDFS. But here is the thing:

public static class MyTableReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int type = getType(values.toString());
        if (type == 1) {
            // put data to table 1
        }
        if (type == 2) {
            // put data to table 2
        }
    }
}

How do I do this? Thanks
Re: Writing multiple tables from reducer
You can use HBase's MultiTableOutputFormat: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.html

An example can be found in this blog post: http://www.wildnove.com/2011/07/19/tutorial-hadoop-and-hbase-multitableoutputformat/

On Wed, Aug 28, 2013 at 3:50 AM, jamal sasha jamalsha...@gmail.com wrote:
> Hi, I am new to hbase and am trying to achieve the following. …

-- Harsh J
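With MultiTableOutputFormat, the reducer emits the destination table name as the output key and the Put as the value. Below is a minimal stdlib-only sketch of just the routing part; table1/table2 are the hypothetical names from the question, and the actual `context.write` call is shown in a comment because it needs the Hadoop/HBase jars.

```java
import java.nio.charset.StandardCharsets;

public class TableRouter {
    // Pick the destination table for a record. MultiTableOutputFormat
    // interprets the reducer's output key as the table name.
    static byte[] tableFor(int type) {
        String table = (type == 1) ? "table1" : "table2";
        return table.getBytes(StandardCharsets.UTF_8);
        // In the real reducer, per record:
        //   context.write(new ImmutableBytesWritable(tableFor(type)), put);
    }

    public static void main(String[] args) {
        System.out.println(new String(tableFor(1), StandardCharsets.UTF_8)); // table1
        System.out.println(new String(tableFor(2), StandardCharsets.UTF_8)); // table2
    }
}
```

The job setup would then use `job.setOutputFormatClass(MultiTableOutputFormat.class)` instead of binding the reducer to a single table.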
Region locality and core/thread availability
I think my app wants to hit a particular region all the time, since my table is read-only (a lookup table). I have created more than one copy of the table and randomly pick one to use; this way I can load the whole cluster. Is there a feature in HBase which lets coprocessors run on another region server, against another copy of the region, if the originally assigned region server is busy (for read-only operations, or if we mark the table read-only)? I guess not. Anyway, any tools/debugging tips to confirm this hot-region behavior? Regards, - kiru Kiru Pakkirisamy | webcloudtech.wordpress.com
how to export data from hbase to mysql?
hi, all: any good ideas? thanks
Re: how to export data from hbase to mysql?
Take a look at Sqoop?

On 2013-08-27 23:08, ch huang justlo...@gmail.com wrote:
> hi,all: any good idea? thanks
Re: how to export data from hbase to mysql?
Or if you'd like to be able to use SQL directly on it, take a look at Phoenix (https://github.com/forcedotcom/phoenix). James

On Aug 27, 2013, at 8:14 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
> Take a look at sqoop?
>
> On 2013-08-27 23:08, ch huang justlo...@gmail.com wrote:
> > hi,all: any good idea? thanks
Re: Hbase 0.94.6 stargate can't use multi get
Hi, can you please query the schema of the table and show it here? I would like to know what value of VERSIONS you have set for the column family; I hope you have set it to 10. E.g.: http://myhost.com:8080/log/schema

Regards, Ravi Magham

On Wed, Aug 28, 2013 at 1:29 AM, Dmitriy Troyan troyan.dmit...@gmail.com wrote:
> Hey all, I try to use multi get for receiving different versions of row but it give me only one always. …
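If the schema shows the family still at the default of VERSIONS => 3, it can be raised from the HBase shell. A sketch for 0.94, where the table typically has to be disabled before altering unless online schema change is enabled; table and family names are the ones from the question.

```ruby
# HBase shell: inspect, then raise the retained-version count on 'data'
describe 'log'
disable 'log'
alter 'log', {NAME => 'data', VERSIONS => 10}
enable 'log'
```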
Re: how to export data from hbase to mysql?
Sqoop cannot export HBase data into MySQL directly.

On Wed, Aug 28, 2013 at 11:13 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
> Take a look at sqoop?
>
> On 2013-08-27 23:08, ch huang justlo...@gmail.com wrote:
> > hi,all: any good idea? thanks