Re: Regarding rowkey

2012-09-12 Thread Ramasubramanian
Hi, thanks! But for loading data into HBase, will adding a hash in the rowkey improve performance? Regards, Rams On 12-Sep-2012, at 8:38 AM, lars hofhansl lhofha...@yahoo.com wrote: It depends. If you do not need to perform range scans along (prefixes of) your row keys, you can prefix the row key

Re: Local debugging (possibly with Maven and HBaseTestingUtility?)

2012-09-12 Thread Jeroen Hoek
Thank you Ulrich, that looks like an interesting approach. I think I will give that a go. ~ Jeroen 2012/9/10 Ulrich Staudinger ustaudin...@activequant.com: my AQ Master Server might be of interest to you. I have an embedded HBase server in it; it's very straightforward to use:

Re: Regarding rowkey

2012-09-12 Thread Otis Gospodnetic
I think yes, because it will avoid hotspotting. I think we have a good post on that topic on the Sematext Blog. Otis -- Performance Monitoring - http://sematext.com/spm On Sep 12, 2012 3:08 AM, Ramasubramanian ramasubramanian.naraya...@gmail.com wrote: Hi thanks! But for loading data into hbase,

Re: Regarding rowkey

2012-09-12 Thread Michael Segel
I wouldn't 'prefix' the hash to the key, but actually replace the key with a hash and store the unhashed key in a column. But that's a different discussion. In a nutshell, the problem is that there are a lot of potential use cases where you want to store data in a sequence dependent fashion.
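
For illustration, a minimal sketch of the replace-the-key approach described above: the MD5 of the original key becomes the row key, and the unhashed key is kept in a column. The table name and the "d"/"key" family and qualifier are hypothetical, and this uses the 0.92-era client API.

    import java.io.IOException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HashedKeyPut {
      public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");             // hypothetical table name

        byte[] originalKey = Bytes.toBytes("customer-000042");  // a sequential key
        byte[] rowKey = MessageDigest.getInstance("MD5").digest(originalKey);

        Put put = new Put(rowKey);                              // the hash is the row key
        put.add(Bytes.toBytes("d"), Bytes.toBytes("key"), originalKey); // unhashed key kept in a column
        table.put(put);
        table.close();
      }
    }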

Optimizing table scans

2012-09-12 Thread Amit Sela
Hi all, I'm trying to find the sweet spot for the cache size and batch size Scan() parameters. I'm scanning one table using HTable.getScanner() and iterating over the ResultScanner retrieved. I did some testing and got the following results for scanning *100* rows: Cache Batch Total

Re: Optimizing table scans

2012-09-12 Thread Michael Segel
How much memory do you have? What's the size of the underlying row? What does your network look like? 1GbE or 10GbE? There's more to it, and I think you'll find that YMMV on what is an optimum scan size... HTH -Mike On Sep 12, 2012, at 7:57 AM, Amit Sela am...@infolinks.com wrote: Hi

Re: Optimizing table scans

2012-09-12 Thread Amit Sela
I allocate 10GB per RegionServer. The average row size is ~200 bytes. The network is 1GbE. It would be great if anyone could elaborate on the difference between the Cache and Batch parameters. Thanks. On Wed, Sep 12, 2012 at 4:04 PM, Michael Segel michael_se...@hotmail.com wrote: How much memory do

Re: Optimizing table scans

2012-09-12 Thread Doug Meil
Hi there, see this for info on the block cache in the RegionServer: http://hbase.apache.org/book.html (9.6.4. Block Cache), and see this for batching on the scan parameter: http://hbase.apache.org/book.html#perf.reading (11.8.1. Scan Caching). On 9/12/12 9:55 AM, Amit Sela
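
To make the distinction concrete, here is a sketch against the 0.92-era client, assuming an open HTable named table; the numbers are illustrative, not tuned recommendations:

    Scan scan = new Scan();
    scan.setCaching(500); // caching: rows fetched per RPC round trip (default was 1)
    scan.setBatch(100);   // batch: max columns returned per Result, for very wide rows
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // process r
      }
    } finally {
      scanner.close();
    }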

HBASE garbage collection problem

2012-09-12 Thread Amlan Roy
Hi, I was doing some load testing on my cluster. I am writing to HBase (version 0.92.0) from 20 threads simultaneously. After running the program for some time, one of my machines became unresponsive. I checked the GC log and found occurrences of both concurrent mode failure and promotion failed

Re: HBASE garbage collection problem

2012-09-12 Thread J Mohamed Zahoor
Try with less than 70% occupancy in CMS… for the perm failure, try increasing the permgen size. Another option is to try DoubleBuffer. ./zahoor On 12-Sep-2012, at 8:39 PM, Amlan Roy amlan@cleartrip.com wrote: UseConcMarkSweepGC
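
One common way that occupancy advice is expressed as JVM flags in hbase-env.sh (the threshold, log path, and flag selection here are illustrative assumptions, not settings from this thread):

    export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=60 \
      -XX:+UseCMSInitiatingOccupancyOnly \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -Xloggc:/var/log/hbase/gc-hbase.log"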

Re: HBASE garbage collection problem

2012-09-12 Thread J Mohamed Zahoor
Ignore the permgen advice. I mistook the promotion failure for a permgen failure. ./zahoor On 12-Sep-2012, at 8:39 PM, Amlan Roy amlan@cleartrip.com wrote: UseConcMarkSweepGC

Re: Regarding rowkey

2012-09-12 Thread lars hofhansl
If you use a collision-free hashing algorithm you're right. Otherwise you'd get KVs suddenly grouped into rows that weren't part of the same row. With hash prefixing you can use a fast and simple hashing algorithm, because you do not need the hash to be unique. Depends again on various aspects.

Re: Strata/Hadoop World HBase Meetup on October 25th

2012-09-12 Thread Jonathan Hsieh
We are looking for one more potential talk, so please email Otis and me (j...@cloudera.com) directly with any proposals. Thanks! Jon. On Tue, Sep 11, 2012 at 10:45 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, I don't think this was mentioned on the ML yet, but for those coming

Re: BigDecimalColumnInterpreter

2012-09-12 Thread Ted Yu
Thanks for digging, Julian. Looks like we need to support BigDecimal in HbaseObjectWritable. Actually, once a test is written for BigDecimalColumnInterpreter, it would become much easier for anyone to debug this issue. On Wed, Sep 12, 2012 at 9:27 AM, Julian Wissmann

Re: Regarding rowkey

2012-09-12 Thread Michael Segel
MD5 should work; SHA-1, while it may theoretically have collisions, none has been found. Then there's SHA-2... I don't disagree with your assertion; however, it causes the key to be longer than it should have to be. If you insist on doing this... then take the MD5 hash, truncate it to 4
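
A sketch of that truncated-hash variant: a 4-byte slice of the MD5 is prepended and the full original key follows, so the composite key stays shorter than a full 16-byte digest while still spreading sequential keys (the key value is hypothetical, and exception handling is elided):

    import java.security.MessageDigest;
    import org.apache.hadoop.hbase.util.Bytes;

    byte[] key = Bytes.toBytes("customer-000042");   // hypothetical sequential key
    byte[] md5 = MessageDigest.getInstance("MD5").digest(key);

    byte[] rowKey = new byte[4 + key.length];
    System.arraycopy(md5, 0, rowKey, 0, 4);          // 4-byte hash prefix
    System.arraycopy(key, 0, rowKey, 4, key.length); // original key preserved after the prefix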

Re: java.io.IOException: Pass a Delete or a Put

2012-09-12 Thread Jothikumar Ekanath
Any help on this one, please. On Tue, Sep 11, 2012 at 11:19 AM, Jothikumar Ekanath kbmku...@gmail.com wrote: Hi Stack, Thanks for the reply. I looked at the code and I am having some very basic confusion on how to use it correctly. The code I wrote earlier has the following input

Re: Regarding rowkey

2012-09-12 Thread lars hofhansl
Not insisting :) MD5 and SHA-1 would be reasonable and can be used to replace the key as you say. - Original Message - From: Michael Segel michael_se...@hotmail.com To: user@hbase.apache.org; lars hofhansl lhofha...@yahoo.com Cc: Sent: Wednesday, September 12, 2012 9:49 AM Subject: Re:

Re: Performance: hive+hbase integration query against the row_key

2012-09-12 Thread Jean-Daniel Cryans
On Tue, Sep 11, 2012 at 6:56 AM, Shengjie Min shengjie@gmail.com wrote: 1. if you do a hive query against the row key, like select * from hive_hbase_test where key='blabla', this would utilize the HBase row_key index, which gives you a very quick, nearly real-time response, just like HBase does.

Re: Tracking down coprocessor pauses

2012-09-12 Thread Tom Brown
I have captured some logs from what is happening during one of these pauses. http://pastebin.com/K162Einz Can someone help me figure out what's actually going on from these logs? --- My interpretation of the logs --- As you can see at the start of the logs, my coprocessor for updating the data

Re: Tracking down coprocessor pauses

2012-09-12 Thread Andrew Purtell
Inline On Wed, Sep 12, 2012 at 10:40 AM, Tom Brown tombrow...@gmail.com wrote: I have captured some logs from what is happening during one of these pauses. http://pastebin.com/K162Einz Can someone help me figure out what's actually going on from these logs? --- My interpretation of the

Re: Regarding rowkey

2012-09-12 Thread Ramasubramanian
Hi All, Can someone please explain to me, in layman's terms, what a rowkey is and how to get the rowkey (in the case of hashing) to load data faster into HBase. Regards, Rams On 12-Sep-2012, at 10:40 PM, lars hofhansl lhofha...@yahoo.com wrote: Not insisting :) MD5 and SHA-1 would be reasonable and can be used

Re: Regarding rowkey

2012-09-12 Thread lars hofhansl
I attempted to write this up here: http://hadoop-hbase.blogspot.com/2011/12/introduction-to-hbase.html - Original Message - From: Ramasubramanian ramasubramanian.naraya...@gmail.com To: user@hbase.apache.org user@hbase.apache.org Cc: Michael Segel michael_se...@hotmail.com;

Re: java.io.IOException: Pass a Delete or a Put

2012-09-12 Thread Jothikumar Ekanath
Hi Doug, That is where I took my code from initially; I was not able to notice anything different from there. I know there is something wrong with the key-in/key-out types in my code, but I am not able to figure it out. I have given below what I am using. Do you see anything wrong in there?
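
For reference, TableOutputFormat throws exactly this IOException when the reduce emits a value that is not a Put or a Delete, so the output value type is usually the culprit. A sketch of the expected shape, with hypothetical input types and column names:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    public class SumReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
          sum += v.get();
        }
        Put put = new Put(Bytes.toBytes(key.toString()));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("sum"), Bytes.toBytes(sum));
        // The value handed to TableOutputFormat must be a Put or a Delete.
        context.write(new ImmutableBytesWritable(put.getRow()), put);
      }
    }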

Re: BigDecimalColumnInterpreter

2012-09-12 Thread Julian Wissmann
Cool! I'm sure I'll find some time to dig into it early next week if nobody else lusts after it ;-) Cheers 2012/9/12 Ted Yu yuzhih...@gmail.com Thanks for digging, Julian. Looks like we need to support BigDecimal in HbaseObjectWritable Actually once a test is written for

Re: HBase 'Real-Time' reads?

2012-09-12 Thread Adrien Mogenet
The WAL is just there for recovery. Reads will meet the Memstore on their read path; that's how LSM trees work. On Wed, Sep 12, 2012 at 11:15 PM, Jason Huang jason.hu...@icare.com wrote: This might be a naive question but I am not able to find a good answer from searching online. The

Re: HBase 'Real-Time' reads?

2012-09-12 Thread Jason Huang
So - I guess at the time of the query we don't know if the data is in the Memstore or in the RegionServer. In order to ensure we get the most recent version of data, every HBase read query will first go to the Memstore and see if the data is there, and then go to RegionServers if it couldn't find that

Re: HBase 'Real-Time' reads?

2012-09-12 Thread Adrien Mogenet
I think you misunderstand the concept of the memstore. That's just the name of the temporary in-memory storage. Each region has its own memstore, and thus it's located on the regionserver itself. On Wed, Sep 12, 2012 at 11:24 PM, Jason Huang jason.hu...@icare.com wrote: So - I guess at the time of the
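
In other words, the client never picks a storage layer: a plain Get is answered from the merged view of memstore, block cache, and HFiles on the region server. A fragment, assuming an open HTable named table and a hypothetical row key:

    Get get = new Get(Bytes.toBytes("some-row"));
    Result result = table.get(get); // server merges memstore and HFiles transparently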

Re: HBase 'Real-Time' reads?

2012-09-12 Thread Jason Huang
I see now. Thanks for the quick response and clear explanation. Jason On Wed, Sep 12, 2012 at 5:28 PM, Adrien Mogenet adrien.moge...@gmail.com wrote: I think you misunderstand the concept of the memstore. That's just the name of the temporary in-memory storage. Each region has its own memstore, and

Re: Performance of scan setTimeRange VS manually doing it

2012-09-12 Thread n keywal
For each file, there is a time range. When you scan/search, the file is skipped if there is no overlap between the file's time range and the time range of the query. As there are other parameters as well (row distribution, compaction effects, cache, bloom filters, ...) it's difficult to know in

Re: Performance of scan setTimeRange VS manually doing it

2012-09-12 Thread Tom Brown
It seems like the internal logic for handling a time range is two-part: First, as you said, each file contains the minimum and maximum timestamps contained within. This provides a very rough filter for the data, but if your data is right, the effect can be huge. Second, a time range acts as a
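
The client side of this is a single scan attribute; the per-file skipping happens server-side. A fragment assuming an open HTable named table, with an illustrative one-hour window (timestamps in milliseconds since the epoch):

    long end = System.currentTimeMillis();
    long start = end - 3600 * 1000L;

    Scan scan = new Scan();
    scan.setTimeRange(start, end); // HFiles whose [min,max] timestamp range misses this window are skipped
    ResultScanner scanner = table.getScanner(scan);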

Re: No of rows

2012-09-12 Thread Mohit Anchlia
But when the ResultScanner executes, wouldn't it already query the servers for all the rows matching the start key? I am trying to avoid reading all the blocks from the file system that match the keys. On Wed, Sep 12, 2012 at 3:59 PM, Doug Meil doug.m...@explorysmedical.com wrote: Hi there, If

Re: No of rows

2012-09-12 Thread lars hofhansl
No. By default each call to ClientScanner.next(...) incurs an RPC call to the HBase server, which is why it is important to enable scanner caching (as opposed to batching) if you expect to scan many rows. By default scanner caching is set to 1. From: Mohit

Re: No of rows

2012-09-12 Thread Mohit Anchlia
On Wed, Sep 12, 2012 at 4:48 PM, lars hofhansl lhofha...@yahoo.com wrote: No. By default each call to ClientScanner.next(...) incurs an RPC call to the HBase server, which is why it is important to enable scanner caching (as opposed to batching) if you expect to scan many rows. By default

Re: No of rows

2012-09-12 Thread lars hofhansl
If we set caching to N, the region server will attempt to scan N rows before next() returns. So if you typically early-out of a scan at the client, the server will scan on average N/2 rows too many, which you have to trade off against the number of RPC requests without caching. Good numbers
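
With illustrative numbers, the trade-off looks like this:

    // With scan.setCaching(100), a client that stops after ~50 rows makes a
    // single RPC but causes ~50 rows of wasted server-side scanning; with the
    // default caching of 1, the same scan costs ~50 RPC round trips and
    // scans no extra rows.
    scan.setCaching(100);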

RE: Performance of scan setTimeRange VS manually doing it

2012-09-12 Thread Anoop Sam John
@Tom I think your guess is correct. When the HFile cannot be skipped because the max and min TS overlap with the given time range, that file will be scanned fully and certain rows will be filtered out. Those are read from HDFS. When you do the reseeks, many such reads can be avoided. Remember that

Re: Performance of scan setTimeRange VS manually doing it

2012-09-12 Thread Xiang Hua
Hi, do you have a script in Python for rack awareness configuration? Thanks! beatls On Thu, Sep 13, 2012 at 5:52 AM, Tom Brown tombrow...@gmail.com wrote: When I query HBase, I always include a time range. This has not been a problem when querying recent data, but it seems to be an issue

Re: HBASE garbage collection problem

2012-09-12 Thread Xiang Hua
hi, where can I find the GC log? I am a newcomer. Thanks! beatls On Wed, Sep 12, 2012 at 11:09 PM, Amlan Roy amlan@cleartrip.com wrote: Hi, I was doing some load testing on my cluster. I am writing to HBase (version 0.92.0) from 20 threads simultaneously. After running the