Re: Key Value collision

2013-05-17 Thread Stack
On Thu, May 16, 2013 at 11:49 AM, Varun Sharma va...@pinterest.com wrote: Hi, I am wondering what happens when we add the following: row, col, timestamp -- v1. A flush happens. Now, we add row, col, timestamp -- v2. A flush happens again. In this case, if MAX_VERSIONS == 1, how is the tie
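
A minimal sketch, not from the thread, that reproduces the scenario with the 0.94-era Java client; the table name "t", family "f" and qualifier "q" are placeholders, and the table is assumed to have MAX_VERSIONS = 1.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampTieExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTable table = new HTable(conf, "t");
    byte[] row = Bytes.toBytes("row");
    byte[] f = Bytes.toBytes("f");
    byte[] q = Bytes.toBytes("q");
    long ts = 1000L;                  // the same explicit timestamp for both puts

    Put p1 = new Put(row);
    p1.add(f, q, ts, Bytes.toBytes("v1"));
    table.put(p1);
    admin.flush("t");                 // flush #1; the flush request is asynchronous

    Put p2 = new Put(row);
    p2.add(f, q, ts, Bytes.toBytes("v2"));
    table.put(p2);
    admin.flush("t");                 // flush #2; two HFiles now hold the same (row, f:q, ts) key

    Result r = table.get(new Get(row));
    // Prints whichever value wins the tie; the thread discusses how HBase resolves it.
    System.out.println(Bytes.toString(r.getValue(f, q)));
    table.close();
    admin.close();
  }
}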

Re: Question about HFile seeking

2013-05-17 Thread lars hofhansl
We use the index blocks to find the right block (we have 64 KB blocks by default). Once we find the block, we do a linear search for the KV we're looking for. In your example, you'd find the first block that contains a KV for row1c and then seek into that block until you found your KV. We cannot do
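
The 64 KB figure is the per-column-family HFile block size; a minimal sketch of tuning it with the Java admin API (table "t" and family "f" are placeholders, and the new size only applies to HFiles written afterwards).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    byte[] table = Bytes.toBytes("t");
    admin.disableTable(table);                      // safest on 0.94, where online schema change is off by default
    HTableDescriptor htd = admin.getTableDescriptor(table);
    HColumnDescriptor cf = htd.getFamily(Bytes.toBytes("f"));
    cf.setBlocksize(64 * 1024);                     // the 64 KB default; smaller blocks mean a larger index but finer seeks
    admin.modifyColumn(table, cf);
    admin.enableTable(table);
    admin.close();
  }
}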

Re: GET performance degrades over time

2013-05-17 Thread lars hofhansl
Hi Viral, some questions: Are you adding new data or deleting data over time? Do you have bloom filters enabled? Which version of Hadoop? Anything funny in the Datanode logs? -- Lars - Original Message - From: Viral Bajaria viral.baja...@gmail.com To: user@hbase.apache.org

Re: GET performance degrades over time

2013-05-17 Thread Viral Bajaria
Thanks for all the help in advance! Answers inline. Hi Viral, some questions: Are you adding new data or deleting data over time? Yes, I am continuously adding new data. The puts have not slowed down, but that could also be an after-effect of deferred log flush. Do you have bloom
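
For context, deferred log flush mentioned above is a per-table flag; a minimal sketch of toggling it with the 0.94-era Java admin API (the table name "t" is a placeholder).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class DeferredLogFlushExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    byte[] table = Bytes.toBytes("t");
    admin.disableTable(table);
    HTableDescriptor htd = admin.getTableDescriptor(table);
    htd.setDeferredLogFlush(true);   // WAL edits are synced periodically instead of on every mutation
    admin.modifyTable(table, htd);
    admin.enableTable(table);
    admin.close();
  }
}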

Re: Question about HFile seeking

2013-05-17 Thread Stack
On Thu, May 16, 2013 at 3:26 PM, Varun Sharma va...@pinterest.com wrote: Referring to your comment above again: If you're doing a prefix scan w/ row1c, we should be starting the scan at row1c, not row1 (or more correctly, at the row that starts the block we believe has a row1c row in it...) I
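
A minimal sketch, not from the thread, of a client-side prefix scan over row1c; the table name "t" is a placeholder. The client only sets the start row and a prefix filter; the block-index seek described above happens on the server underneath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixScanExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t");
    byte[] prefix = Bytes.toBytes("row1c");
    Scan scan = new Scan();
    scan.setStartRow(prefix);                  // the seek begins at row1c, not at row1
    scan.setFilter(new PrefixFilter(prefix));  // stop returning rows once the prefix no longer matches
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()));
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}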

Re: Question about HFile seeking

2013-05-17 Thread Varun Sharma
Thanks Stack and Lars for the detailed answers - this question is not really motivated by performance problems... So the index indeed knows which part of the HFile key is the row and which part is the column qualifier. That's what I needed to know. I initially thought it saw it as an opaque

Re: Question about HFile seeking

2013-05-17 Thread ramkrishna vasudevan
Generally, we start by seeking on all the HFiles corresponding to the region and loading the blocks that correspond to the row key specified in the scan. If row1 and row1c are in the same block, then we may start with row1. If they are in different blocks, then we will start with the block

RE: Doubt Regarding HLogs

2013-05-17 Thread Rishabh Agrawal
Is it a bug or part of the design? It seems more like a design choice to me. Can someone guide me through the purpose of this feature? Thanks, Rishabh From: Rishabh Agrawal Sent: Friday, May 17, 2013 4:24 PM To: user@hbase.apache.org Subject: Doubt Regarding HLogs Hello, I am working with HLogs of HBase

Re: Doubt Regarding HLogs

2013-05-17 Thread Nicolas Liochon
That's HDFS. When a file is currently being written, its size is not known, as the write is in progress. So the namenode reports a size of zero (more exactly, it does not take the HDFS block being written into account when it calculates the size). When you read, you go to the datanode owning the data,
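
A minimal sketch illustrating the behaviour Nicolas describes: a directory listing shows the namenode's view of each WAL's length (zero, or only the completed blocks, for the file still being written), even though the data is readable from the datanode. The /hbase/.logs path is the 0.9x-era default WAL directory and is an assumption here.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WalLengthExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path logs = new Path("/hbase/.logs");           // default WAL root in 0.9x; adjust for your layout
    for (FileStatus server : fs.listStatus(logs)) {
      for (FileStatus wal : fs.listStatus(server.getPath())) {
        // For the log currently being written, getLen() typically shows 0 because
        // the namenode does not count the block that is still open for write.
        System.out.println(wal.getPath() + " namenode-reported length=" + wal.getLen());
      }
    }
    fs.close();
  }
}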

RE: Doubt Regarding HLogs

2013-05-17 Thread Rishabh Agrawal
Thanks Nicolas. When will this file be finalized? Is it time-bound? Or will it always be zero for the last one (even if it contains data)? -Original Message- From: Nicolas Liochon [mailto:nkey...@gmail.com] Sent: Friday, May 17, 2013 4:39 PM To: user Subject: Re: Doubt Regarding HLogs

Re: Doubt Regarding HLogs

2013-05-17 Thread yonghu
In this situation, you can set the property hbase.regionserver.logroll.period in hbase-site.xml to a short value, let's say 3000, and then you can see your log file with its current size after 3 seconds. To Nicolas: I guess he wants somehow to analyze the HLog.

Scanner returning keys out of order

2013-05-17 Thread Jan Lukavský
Hi all, we are seeing very strange behavior of HBase (version 0.90.6-cdh3u5) in the following scenario: 1) Open a scanner and start scanning. 2) Check the order of returned keys (a simple test of whether the next key is lexicographically greater than the previous one). 3) The check may occasionally fail.
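
A minimal sketch of the ordering check from step 2, comparing consecutive row keys with Bytes.compareTo (the table name "t" is a placeholder).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanOrderCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t");
    ResultScanner scanner = table.getScanner(new Scan());
    byte[] prev = null;
    try {
      for (Result r : scanner) {
        byte[] row = r.getRow();
        // Each returned row key should be strictly greater than the previous one.
        if (prev != null && Bytes.compareTo(prev, row) >= 0) {
          System.err.println("Out of order: " + Bytes.toStringBinary(prev)
              + " followed by " + Bytes.toStringBinary(row));
        }
        prev = row;
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}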

Re: Scanner returning keys out of order

2013-05-17 Thread Jean-Marc Spaggiari
Hi Jan, 0.90.6 is a very old version of HBase... Will you have a chance to migrate to a more recent one? Most of your issues have probably already been fixed. JM 2013/5/17 Jan Lukavský jan.lukav...@firma.seznam.cz Hi all, we are seeing very strange behavior of HBase (version 0.90.6-cdh3u5)

Re: Doubt Regarding HLogs

2013-05-17 Thread Nicolas Liochon
Yes, it's by design. The last log file is the one being written by HBase. The safe option is to wait for this file to be closed by HBase. As Yong said, you can change the roll parameter if you want it to be terminated sooner, but changing this parameter impacts the HDFS namenode load. 10 minutes is

bulk load skipping tsv files

2013-05-17 Thread Jinyuan Zhou
Hi, I wonder if there is a tool similar to org.apache.hadoop.hbase.mapreduce.ImportTsv. ImportTsv reads from a TSV file and creates HFiles which are ready to be loaded into the corresponding regions by another tool, org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles. What I want is to read from

Re: bulk load skipping tsv files

2013-05-17 Thread Ted Yu
bq. What I want is to read from some hbase table and create hfiles directly Can you describe your use case in more detail? Thanks On Fri, May 17, 2013 at 7:52 AM, Jinyuan Zhou zhou.jiny...@gmail.com wrote: Hi, I wonder if there is a tool similar to

Re: GET performance degrades over time

2013-05-17 Thread Jeremy Carroll
Look at how much hard disk utilization you have (IOPS / svctm). You may just be under-scaled for the QPS you desire for both read and write load. If you are performing random gets, you can expect around the low to mid 100s of IOPS per HDD. Use bonnie++ / IOzone / ioping to verify. Also you

Re: bulk load skipping tsv files

2013-05-17 Thread Jinyuan Zhou
Actually, I wanted to update each row of a table each day. No new data is needed; only some values will be changed by recalculation. It looks like every time I do this, the data in the table doubles, even though it is an update. I believe even an update will result in new HFiles, and the cluster is then very

Re: bulk load skipping tsv files

2013-05-17 Thread Shahab Yunus
If I understood your use case correctly: if you don't need to maintain older versions of the data, then why don't you set the 'max versions' parameter for your table to 1? I believe the increase in data, even in the case of updates, is due to that (?). Have you tried that? Regards, Shahab On Fri,
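
A minimal sketch of Shahab's suggestion using the Java admin API (table "t" and family "f" are placeholders). Note that space held by older versions is only reclaimed once a major compaction rewrites the files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class MaxVersionsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    byte[] table = Bytes.toBytes("t");
    admin.disableTable(table);
    HTableDescriptor htd = admin.getTableDescriptor(table);
    HColumnDescriptor cf = htd.getFamily(Bytes.toBytes("f"));
    cf.setMaxVersions(1);                   // keep only the latest version of each cell
    admin.modifyColumn(table, cf);
    admin.enableTable(table);
    admin.majorCompact(table);              // reclaim space held by older versions (asynchronous)
    admin.close();
  }
}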

Re: [ANNOUNCE] Phoenix 1.2 is now available

2013-05-17 Thread James Taylor
Anil, Yes, everything is in the Phoenix GitHub repo. I will give you more details on the specific packages and classes off-list. Thanks, James On 05/16/2013 05:33 PM, anil gupta wrote: Hi James, Is this implementation present in the GitHub repo of Phoenix? If yes, can you provide me the package

Re: bulk load skipping tsv files

2013-05-17 Thread Ted Yu
Jinyuan: bq. no new data needed, only some value will be changed by recalculation. Have you considered using a coprocessor to fulfill the above task? Cheers On Fri, May 17, 2013 at 8:57 AM, Shahab Yunus shahab.yu...@gmail.com wrote: If I understood your usecase correctly, then if you don't

Re: GET performance degrades over time

2013-05-17 Thread Anoop John
Yes bloom filters have been enabled: ROWCOL -- Can you try with a ROW bloom? -Anoop- On Fri, May 17, 2013 at 12:20 PM, Viral Bajaria viral.baja...@gmail.com wrote: Thanks for all the help in advance! Answers inline. Hi Viral, some questions: Are you adding new data or deleting data
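
A minimal sketch of switching the family from ROWCOL to ROW blooms with the Java admin API (table "t" and family "f" are placeholders; on 0.94 the BloomType enum is nested in StoreFile, in later versions it moved to its own class). New blooms only apply to newly written HFiles.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.regionserver.StoreFile;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomTypeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    byte[] table = Bytes.toBytes("t");
    admin.disableTable(table);
    HTableDescriptor htd = admin.getTableDescriptor(table);
    HColumnDescriptor cf = htd.getFamily(Bytes.toBytes("f"));
    cf.setBloomFilterType(StoreFile.BloomType.ROW);  // ROW blooms are smaller and still help whole-row gets
    admin.modifyColumn(table, cf);
    admin.enableTable(table);
    admin.close();
  }
}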

Re: GET performance degrades over time

2013-05-17 Thread Viral Bajaria
On Fri, May 17, 2013 at 8:23 AM, Jeremy Carroll phobos...@gmail.com wrote: Look at how much Hard Disk utilization you have (IOPS / Svctm). You may just be under scaled for the QPS you desire for both read + write load. If you are performing random gets, you could expect around the low to mid

Re: bulk load skipping tsv files

2013-05-17 Thread Jinyuan Zhou
I had thought about coprocessors, but I had the impression that a coprocessor is the last option one should try because it is so invasive to the JVM running HBase. Not sure about the current status though. However, what the coprocessor can give me in this case is less network load. My problem is HBase's
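
A minimal sketch, under assumptions, of one way to do what Jinyuan describes (read from an HBase table and create HFiles directly): a MapReduce job that scans the source table, recalculates values, writes HFiles via HFileOutputFormat, and then bulk-loads them so the updates skip the normal write path. All table names, paths, and the recalculation stub are made up for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TableToHFiles {

  static class RecalcMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      byte[] old = value.getValue(Bytes.toBytes("f"), Bytes.toBytes("q"));
      if (old == null) return;
      byte[] recalculated = old;          // stub: plug the real recalculation in here
      Put put = new Put(row.get());
      put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), recalculated);
      context.write(row, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "table-to-hfiles");
    job.setJarByClass(TableToHFiles.class);

    Scan scan = new Scan();
    scan.setCaching(500);
    scan.setCacheBlocks(false);           // don't pollute the block cache with a full-table MR scan
    TableMapReduceUtil.initTableMapperJob("source_table", scan, RecalcMapper.class,
        ImmutableBytesWritable.class, Put.class, job);

    HTable target = new HTable(conf, "target_table");
    HFileOutputFormat.configureIncrementalLoad(job, target);  // sets reducer, partitioner, output format
    Path out = new Path("/tmp/hfiles-out");                   // hypothetical staging directory
    FileOutputFormat.setOutputPath(job, out);

    if (job.waitForCompletion(true)) {
      // Move the finished HFiles into the target table's regions.
      new LoadIncrementalHFiles(conf).doBulkLoad(out, target);
    }
    target.close();
  }
}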

Re: bulk load skipping tsv files

2013-05-17 Thread Jinyuan Zhou
Will try that. Thanks. On Fri, May 17, 2013 at 8:57 AM, Shahab Yunus shahab.yu...@gmail.com wrote: If I understood your usecase correctly, then if you don't need to maintain older versions of data then why don't you set the 'max version' parameter for your table to 1? I believe that the

Later version of HBase Client has a problem with DNS

2013-05-17 Thread Heng Sok
Hi all, I have been trying to run a MapReduce job that uses HBase as both source and sink. I have HBase 0.94.2 and Hadoop 2.0 installed from the Cloudera repository, following their instructions. When I use HBase client package version 0.94.2 and above, it gives the following DNS-related

HEADS-UP: Upcoming bay area meetups and hbasecon

2013-05-17 Thread Stack
We have some meetups happening over the next few months. Sign up if you are interested in attending (or if you would like to present, write me off-list). First up, there is hbasecon2013 (http://hbasecon.com) on June 13th in SF. It is shaping up to be a great community day out with a bursting

Re: Later version of HBase Client has a problem with DNS

2013-05-17 Thread Stack
This has come up in the past: http://search-hadoop.com/m/mDn0i2kjGA32/NumberFormatException+dnssubj=unable+to+resolve+the+DNS+name Or check out this old thread: http://mail.openjdk.java.net/pipermail/jdk7-dev/2010-October/001605.html St.Ack On Fri, May 17, 2013 at 11:17 AM, Heng Sok