Re: Scan vs map-reduce
re: "my first version is using 20,000 Get" Just throwing this out there, but have you looked at multi-get? Multi-get will group the gets by RegionServer internally. You are doing a lot of IO for a web-app so this is going to be tough to make "fast", but there are ways to make it "faster." But since you only have 1,000,000 rows you might not have many regions, so this might wind up all going on the same RegionServer. On 4/14/14, 7:52 AM, Li Li fancye...@gmail.com wrote: I need to get about 20,000 rows from the table. the table is about 1,000,000 rows. my first version is using 20,000 Gets and I found it's very slow. So I modified it to a scan and filter unrelated rows in the client. maybe I should write a coprocessor. btw, is there any filter available for me? something like a sql statement where rowkey in ('abc', 'abd'). a very long in statement On Mon, Apr 14, 2014 at 7:46 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Li Li, If you have more than one region, MR might be useful. MR will scan all the regions in parallel. If you do a full scan from a client API with no parallelism, then the MR job might be faster. But it will take more resources on the cluster and might impact the SLA of the other clients, if any. JM 2014-04-14 2:42 GMT-04:00 Mohammad Tariq donta...@gmail.com: Well, it depends. Could you please provide some more details? It will help us in giving a proper answer. Warm Regards, Tariq cloudfront.blogspot.com On Mon, Apr 14, 2014 at 11:38 AM, Li Li fancye...@gmail.com wrote: I have a full table scan which costs about 10 minutes. it seems a bottleneck for our application. if I use map-reduce to rewrite it, will it be faster?
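A minimal sketch of the multi-get approach described above, using the 0.94-era client API; the method name and the assumption that an HTable and a list of rowkeys already exist are illustrative:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

// Fetch many rows in one batched call; the client groups the Gets by RegionServer.
public static Result[] multiGet(HTable table, List<byte[]> rowkeys) throws IOException {
  List<Get> gets = new ArrayList<Get>(rowkeys.size());
  for (byte[] rowkey : rowkeys) {
    gets.add(new Get(rowkey));
  }
  return table.get(gets);  // batched round trips per RegionServer instead of 20,000 single Gets
}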
Re: How to generate a large dataset quickly.
re: "So, I execute 3.2Mill of Puts in HBase." There will be 3.2 million Puts, but they won't be sent over 1 at a time if autoFlush on HTable is false. By default, HTable should be using a 2MB write buffer, and then it groups the Puts by RegionServer. On 4/14/14, 2:21 PM, Guillermo Ortiz konstt2...@gmail.com wrote: Are there some benchmarks about how long it could take to insert data in HBase, to have a reference? The output of my Mapper has 3.2 mill. outputs. So, I execute 3.2 mill. Puts in HBase. Well, data has to be copied and sent to the reducers, but with a network of 1Gb it shouldn't take too much time. I'll check Ganglia. 2014-04-14 18:16 GMT+02:00 Ted Yu yuzhih...@gmail.com: I looked at the revision history for HFileOutputFormat.java There was one patch, HBASE-8949, which went into 0.94.11 but it shouldn't affect throughput much. If you can use ganglia (or some similar tool) to pinpoint what caused the low ingest rate, that would give us more clues. BTW Is upgrading to a newer release, such as 0.98.1 (which contains HBASE-8755), an option for you? Cheers On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz konstt2...@gmail.com wrote: I'm using 0.94.6-cdh4.4.0, I use the bulkload: FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER)); FileOutputFormat.setOutputPath(job, hbasePath); HTable table = new HTable(jConf, HBASE_TABLE); HFileOutputFormat.configureIncrementalLoad(job, table); It seems that it takes a really long time when it starts to execute the Puts to HBase in the reduce phase. 2014-04-14 14:35 GMT+02:00 Ted Yu yuzhih...@gmail.com: Which hbase release did you run the mapreduce job on? Cheers On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz konstt2...@gmail.com wrote: I want to create a large dataset for HBase with different versions and numbers of rows. It's about 10M rows and 100 versions to do some benchmarks. What's the fastest way to create it?? I'm generating the dataset with a MapReduce of 100,000 rows and 10 versions. It takes 17 minutes and the size is around 7Gb. I don't know if I could do it more quickly. The bottleneck is when the MapReduce writes the output and when it transfers the output to the Reducers.
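A rough sketch of the buffered-write pattern described above, against the 0.94 client API; the table name, family, row count, and buffer size are placeholder choices:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Buffer Puts client-side so they are shipped in batches grouped by RegionServer.
public static void bufferedLoad(Configuration conf) throws IOException {
  HTable table = new HTable(conf, "my_table");   // "my_table" is a placeholder name
  table.setAutoFlush(false);                     // don't send one RPC per Put
  table.setWriteBufferSize(8 * 1024 * 1024);     // e.g. bump the default 2 MB write buffer to 8 MB
  try {
    for (int i = 0; i < 3200000; i++) {
      Put put = new Put(Bytes.toBytes(String.format("row-%09d", i)));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
      table.put(put);                            // queued in the client-side buffer
    }
    table.flushCommits();                        // push any remaining buffered Puts
  } finally {
    table.close();
  }
}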
Re: HFile size writeup in HBase Blog
Thanks Ted! I can add that to the to-do list. Also have plans for read/write performance numbers too in a follow-up blog. On 4/11/14, 6:00 PM, Ted Yu yuzhih...@gmail.com wrote: Nice writeup, Doug. Do you have plans to profile Prefix Tree data block encoding? Cheers On Fri, Apr 11, 2014 at 3:14 PM, Doug Meil doug.m...@explorysmedical.com wrote: Hey folks, Stack published a writeup I did on the HBase blog on the effects of rowkey size, column-name size, CF compression, data block encoding and KV storage approach on HFile size. For example, had large row keys vs. small row keys, used Snappy vs. LZO vs. etc., used prefix vs. fast-diff, used a KV per column vs. a single KV per row. We tried 'em all... and wrote it up. http://blogs.apache.org/hbase/ Doug Meil Chief Software Architect, Explorys doug.m...@explorysmedical.com
HFile size writeup in HBase Blog
Hey folks, Stack published a writeup I did on the HBase blog on the effects of rowkey size, column-name size, CF compression, data block encoding and KV storage approach on HFile size. For example, had large row keys vs. small row keys, used Snappy vs. LZO vs. etc., used prefix vs. fast-diff, used a KV per column vs. a single KV per row. We tried 'em all... and wrote it up. http://blogs.apache.org/hbase/ Doug Meil Chief Software Architect, Explorys doug.m...@explorysmedical.com
Re: How to get Last access time of a record
Hi there, On top of what Vladimir already said… re: "Table1: 80 m records say Author, Table2: 5k records say Category" Just 80 million records? HBase tends to be overkill for relatively low data volumes. But if you wish to proceed down this path, to extend what was already said, rather than thinking of it in terms of an RDBMS 2-table design, create a pre-joined table that has data from both tables as the query target. As for the LRU cache, "premature optimization is the root of all evil". :-) Best of luck! On 2/24/14, 4:38 PM, Vikram Singh Chandel vikramsinghchan...@gmail.com wrote: Hi Vladimir We are planning to have around 40Gb for L1 and 150Gb for L2 and when this size is breached then we have to start cleaning L1 and L2. Now for this cleaning (deletion of records) I need that LRU info at record level, i.e. delete all records which have not been used in the past 15 days or more. We will save this LRU info in a Metric column family. What we thought of was using a Post Get Observer to write the value to the Last Read column of the Metric column family. This info we will later use for deletion of records. Is there any other simpler way. As you said, block cache is at table level (if I am correct) but we need info at record level Thanks On Tue, Feb 25, 2014 at 1:42 AM, Vladimir Rodionov vrodio...@carrieriq.com wrote: I recommend you work a little bit more on design. NoSQL in general and HBase in particular are not very good at joining tables, but very good at point and range queries. Sure, you can do some optimizations in your current approach: create the CACHE table as IN_MEMORY, set TTL to say 1 day (or less, depending on the data volume you are able to store) and utilize the HBase internal block cache (which is LRU) for that table. Best regards, Vladimir Rodionov Principal Platform Engineer Carrier IQ, www.carrieriq.com e-mail: vrodio...@carrieriq.com From: Vikram Singh Chandel [vikramsinghchan...@gmail.com] Sent: Monday, February 24, 2014 11:38 AM To: user@hbase.apache.org Subject: Re: How to get Last access time of a record Hi Vladimir, We are going to implement a cache in HBase, let me give you an example We have two tables Table1: 80 m records say Author Table2: 5k records say Category query: Get details of all publications by Author XYZ broken down by Category We fire a get on Table 1 to get a list of publication ids (hashed) Then we do a scan on Table 2 to get the list of publications for each category and then we do an intersection of both lists and in the end get the details from the publication table. Now suppose the same query comes again; instead of doing all this computation again we are going to save the intersected results in a table we are calling L2 Cache (there's an L1 also) Hope you've got an idea of what we are trying to achieve. Now if you can help please On Tue, Feb 25, 2014 at 12:20 AM, Vladimir Rodionov vrodio...@carrieriq.com wrote: Interesting. You want to use HBase as a cache. What data are you going to cache? Is it some kind of a cold storage on tapes or Blu-Ray disks? Just curious. Best regards, Vladimir Rodionov Principal Platform Engineer Carrier IQ, www.carrieriq.com e-mail: vrodio...@carrieriq.com From: Vikram Singh Chandel [vikramsinghchan...@gmail.com] Sent: Monday, February 24, 2014 4:25 AM To: user@hbase.apache.org Subject: Re: How to get Last access time of a record Hi HBase provides cache on non-processed data, we are implementing a second level of caching on processed data, for eg on intersected data between two tables, or on post-processed data.
On Mon, Feb 24, 2014 at 5:02 PM, haosdent haosd...@gmail.com wrote: HBase already maintains a cache. re: "we can get last accessed time for a record" I think you could get this from your application level. On Mon, Feb 24, 2014 at 7:21 PM, Vikram Singh Chandel vikramsinghchan...@gmail.com wrote: Hi We are planning to implement a caching mechanism for our HBase data model; for that we have to remove the *LRU (least recently used) records* from the cached table. Is there any way by which we can get the last accessed time for a record, primarily the access will be using *Range Scan and Get * -- *Regards* *VIKRAM SINGH CHANDEL* Please do not print this email unless it is absolutely necessary. Reduce. Reuse. Recycle. Save our planet. -- Best Regards, Haosdent Huang -- *Regards* *VIKRAM SINGH CHANDEL* Please do not print this email unless it is absolutely necessary. Reduce. Reuse. Recycle. Save our planet.
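A hedged sketch of the IN_MEMORY plus TTL cache table Vladimir suggests, using the 0.94 admin API; the table name, family name, and 1-day TTL are placeholder choices:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Create the L2 cache table with an in-memory column family and a 1-day TTL,
// so expired cache entries are dropped by compaction without manual cleanup.
public static void createCacheTable(Configuration conf) throws IOException {
  HBaseAdmin admin = new HBaseAdmin(conf);
  HTableDescriptor desc = new HTableDescriptor("l2_cache");   // placeholder table name
  HColumnDescriptor cf = new HColumnDescriptor("c");          // placeholder family name
  cf.setInMemory(true);                                       // favor this family in the LRU block cache
  cf.setTimeToLive(24 * 60 * 60);                             // TTL in seconds: entries older than a day are purged
  desc.addFamily(cf);
  admin.createTable(desc);
  admin.close();
}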
Re: Question on efficient, ordered composite keys
Hey there, re: "efficient, correctly ordered, byte[] serialized composite row keys?" I was the guy behind 7221 and that patch had the first part and the last part, but not the middle part (correctly ordered) because this patch relied on the HBase built-in implementations which have the aforementioned order issue. James already threw out a good option, but you could also take the 7221 patch and use it yourself and change the conversions to use Orderly or something that has the type conversions that are suitable for your purposes. Once HBase fixes the type conversion issue, some form of built-in utility for creating composite keys is critical because building composite keys is one of the most asked questions on the dist-list (what 7221 was attempting to address). On 1/14/14 4:01 PM, James Taylor jtay...@salesforce.com wrote: Hi Henning, My favorite implementation of efficient composite row keys is Phoenix. We support composite row keys whose byte representation sorts according to the natural sort order of the values (inspired by Lily). You can use our type system independent of querying/inserting data with Phoenix, the advantage being that when you want to support adhoc querying through SQL using Phoenix, it'll just work. Thanks, James On Tue, Jan 14, 2014 at 7:02 AM, Ted Yu yuzhih...@gmail.com wrote: Please take a look at HBASE-8089 which is an umbrella JIRA. Some of its subtasks are in 0.96 bq. claiming that short keys (as well as short column names) are relevant bq. Is that also true in 0.94.x? That is true in 0.94.x Cheers On Tue, Jan 14, 2014 at 6:56 AM, Henning Blohm henning.bl...@zfabrik.de wrote: Hi, for an application still running on HBase 0.90.4 (but moving to 0.94.6) we are thinking about using more efficient composite row keys compared to what we use today (fixed length strings with / separator). I ran into http://hbase.apache.org/book/rowkey.design.html claiming that short keys (as well as short column names) are relevant also when using compression (as there is no compression in caches/indices). Is that also true in 0.94.x? If so, is there some support for efficient, correctly ordered, byte[] serialized composite row keys? I ran into HBASE-7221 https://issues.apache.org/jira/browse/HBASE-7221 and HBASE-7692. For some time it seemed Orderly (https://github.com/ndimiduk/orderly) was suggested but then abandoned again in favor of ... nothing really. So, in short, do you have any favorite / suggested implementation? Thanks, Henning
Re: hbase read performance tuning failed
In addition to what Lars just said about the blocksize, this is a similar question to another one that somebody asked, and it's always good to make sure that you understand where your data is. As a sanity check, make sure it's not all on one or two RSs (look at the hbase web pages or with tools like Hannibal). Also, you definitely want to turn HBase checksumming on - and when you do so you'll need to re-create the HFiles (e.g., you can't just change the config and bounce the HBase cluster). That's a significant reduction in I/O. Likewise, if you are doing a full-scan, make sure that you select only the attributes you need... See this for more: http://hbase.apache.org/book.html#perf.reading On 1/7/14 1:24 PM, lars hofhansl la...@apache.org wrote: If increasing hbase.client.scanner.caching makes no difference you have another issue. How many rows do you expect your scan to return? On contemporary hardware I manage to scan a few million KeyValues (i.e. columns) per second and per CPU core. Note that for scan performance you want to increase the BLOCKSIZE. -- Lars From: LEI Xiaofeng le...@ihep.ac.cn To: user@hbase.apache.org Sent: Monday, January 6, 2014 11:06 PM Subject: hbase read performance tuning failed Hi, I am running hbase-0.94.6-cdh4.5.0 and have set up a cluster of 5 nodes. The random read performance is ok, but the scan performance is poor. I tried to increase hbase.client.scanner.caching to 100 to improve the scan performance but it made no difference. And when I tried to make smaller blocks by setting BLOCKSIZE when creating tables to get better random read performance it made no difference either. So, I am wondering if anyone could give some advice to solve this problem. Thanks
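A small sketch of the scanner-caching and attribute-selection advice above, against the 0.94 client API; the table, family, and qualifier names are placeholders:

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Fetch batches of rows per scanner RPC instead of the low default, and only the columns needed.
public static void scanWithCaching(HTable table) throws IOException {
  Scan scan = new Scan();
  scan.setCaching(100);                                      // rows returned per scanner round trip
  scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"));   // select only the attribute you need
  ResultScanner scanner = table.getScanner(scan);
  try {
    for (Result result : scanner) {
      // process result
    }
  } finally {
    scanner.close();
  }
}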
Re: Hbase Performance Issue
In addition to what everybody else said, look at *where* the regions are for the target table. There may be 5 regions (for example), but look to see if they are all on the same RS. On 1/6/14 5:45 AM, Nicolas Liochon nkey...@gmail.com wrote: It's very strange that you don't see a perf improvement when you increase the number of nodes. Nothing in what you've done changed the performance at the end? You may want to check: - the number of regions for this table. Are all the region servers busy? Do you have some splits on the table? - How much data you actually write. Is compression enabled on this table? - Do you have compactions? You may want to change the max store file settings for infrequent write load (see http://gbif.blogspot.fr/2012/07/optimizing-writes-in-hbase.html). It would be interesting to test the 0.96 release as well. On Sun, Jan 5, 2014 at 2:12 AM, Vladimir Rodionov vrodio...@carrieriq.com wrote: I think in this case, writing data to HDFS or HFile directly (for subsequent bulk loading) is the best option. HBase will never compete in write speed with HDFS. Best regards, Vladimir Rodionov Principal Platform Engineer Carrier IQ, www.carrieriq.com e-mail: vrodio...@carrieriq.com From: Ted Yu [yuzhih...@gmail.com] Sent: Saturday, January 04, 2014 2:33 PM To: user@hbase.apache.org Subject: Re: Hbase Performance Issue There're 8 items under: http://hbase.apache.org/book.html#perf.writing I guess you have gone through all of them :-) On Sat, Jan 4, 2014 at 1:34 PM, Akhtar Muhammad Din akhtar.m...@gmail.com wrote: Thanks guys for your precious time. Vladimir, as Ted rightly said I want to improve write performance currently (of course I want to read data as fast as possible later on) Kevin, my current understanding of bulk load is that you generate StoreFiles and later load them through a command line program. I don't want to do any manual step. Our system is getting data after every 15 minutes, so the requirement is to automate it through the client API completely.
Re: Online/Realtime query with filter and join?
You are going to want to figure out a rowkey (or a set of tables with rowkeys) to restrict the number of I/O's. If you just slap Impala in front of HBase (or even Phoenix, for that matter) you could write SQL against it, but if it winds up doing a full scan of an HBase table underneath you won't get your 100ms response time. Note: I'm not saying you can't do this with Impala or Phoenix, I'm just saying start with the rowkeys first so that you limit the I/O. Then start adding frameworks as needed (and/or build a schema with Phoenix in the same rowkey exercise). Such response-time requirements make me think that this is for application support, so why the requirement for SQL? Might want to start writing it as a Java program first. On 11/29/13 4:32 PM, Mourad K mourad...@gmail.com wrote: You might want to consider something like Impala or Phoenix, I presume you are trying to do some report query for a dashboard or UI? MapReduce is certainly not adequate as there is too much latency on startup. If you want to give this a try, cdh4 and Impala are a good start. Mouradk On 29 Nov 2013, at 10:33, Ramon Wang ra...@appannie.com wrote: The general performance requirement for each query is less than 100 ms, that's the average level. Sounds crazy, but yes we need to find a way for it. Thanks Ramon On Fri, Nov 29, 2013 at 5:01 PM, yonghu yongyong...@gmail.com wrote: The question is what you mean by real-time. What is your performance requirement? In my opinion, I don't think MapReduce is suitable for real-time data processing. On Fri, Nov 29, 2013 at 9:55 AM, Azuryy Yu azury...@gmail.com wrote: you can try Phoenix. On 2013-11-29 3:44 PM, Ramon Wang ra...@appannie.com wrote: Hi Folks It seems to be impossible, but I still want to check if there is a way we can do complex queries on HBase with Order By, JOIN.. etc like we have with a normal RDBMS, we are asked to provide such a solution for it, any ideas? Thanks for your help. BTW, I think maybe Impala from CDH would be a way to go, but haven't got time to check it yet. Thanks Ramon
Re: hbase schema design
Don't forget to look at this section for HBase schema design examples. http://hbase.apache.org/book.html#schema.casestudies On 9/17/13 1:52 PM, Adrian CAPDEFIER chivas314...@gmail.com wrote: Thanks for the tip. In the data warehousing world I used to call them surrogate keys - I wonder if there's any difference between the two. On Tue, Sep 17, 2013 at 6:41 PM, Vladimir Rodionov vrodio...@carrieriq.com wrote: Is there a built-in functionality to generate (integer) surrogate values in hbase that can be used on the rowkey or does it need to be hand coded from scratch? There is no such functionality in HBase. What you are asking for is known as dictionary compression: a unique 1-1 association between arbitrary strings and numeric values. Best regards, Vladimir Rodionov Principal Platform Engineer Carrier IQ, www.carrieriq.com e-mail: vrodio...@carrieriq.com From: Ted Yu [yuzhih...@gmail.com] Sent: Tuesday, September 17, 2013 9:53 AM To: user@hbase.apache.org Subject: Re: hbase schema design I guess you were referring to section 6.3.2 bq. rowkey is stored and/or read for every cell value The above is true. bq. the event description is a string of 0.1 to 2Kb You can enable Data Block encoding to reduce storage. Cheers On Tue, Sep 17, 2013 at 9:44 AM, Adrian CAPDEFIER chivas314...@gmail.com wrote: Howdy all, I'm trying to use hbase for the first time (plenty of other experience with RDBMS databases though), and I have a couple of questions after reading The Book. I am a bit confused by the advice to reduce the row size in the hbase book. It states that every cell value is accompanied by its coordinates (row, column and timestamp). I'm just trying to be thorough, so am I to understand that the rowkey is stored and/or read for every cell value in a record or just once per column family in a record? I am intrigued by the rows-as-columns design as described in the book at http://hbase.apache.org/book.html#rowkey.design. To make a long story short, I will end up with a table to store event types and the number of occurrences in each day. I would prefer to have the event description as the row key and the dates when it happened as columns - up to 7300 for roughly 20 years. However, the event description is a string of 0.1 to 2Kb and if it is stored for each cell value, I will need to use a surrogate (shorter) value. Is there a built-in functionality to generate (integer) surrogate values in hbase that can be used on the rowkey or does it need to be hand coded from scratch?
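A hedged sketch of enabling data block encoding as Ted suggests, via the 0.94 admin API; the table and family names are placeholders and FAST_DIFF is only one of the available encodings:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

// Create a table whose column family uses FAST_DIFF encoding, so repeated rowkey and
// column prefixes are not stored in full for every cell within a block.
public static void createEventTable(Configuration conf) throws IOException {
  HBaseAdmin admin = new HBaseAdmin(conf);
  HTableDescriptor desc = new HTableDescriptor("events");     // placeholder table name
  HColumnDescriptor cf = new HColumnDescriptor("e");          // short family name keeps KVs small
  cf.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);
  desc.addFamily(cf);
  admin.createTable(desc);
  admin.close();
}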
Re: Scan all the rows of a table with Column Family only.
I might be misunderstanding your question, but if you just call addFamily on the Scan instance then all column qualifiers will be returned in the scan. Note: this does go against one of the performance recommendations (Scan Attribute Selection) in.. http://hbase.apache.org/book.html#perf.reading … but if it works for your app, go for it. On 9/11/13 7:37 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Pavan, Have you taken a look at the already existing HBase filters? http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/package-summary.html Maybe FamilyFilter is what you are looking for? JM 2013/9/11 Pavan Sudheendra pavan0...@gmail.com Hi all, How do I scan all the rows of HBase with only the Column Family? Column Family -- cf Column Qualifier -- \x00\x00\x06T,\x00\x00\x05d,\x00\x00\x00\x00 etc., The Column Qualifier would be random so I won't know it beforehand.. Any idea of how I can do this in the Java API? -- Regards- Pavan
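A minimal sketch of the addFamily approach, assuming a 0.94-era HTable and a family named "cf":

import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Scan every row, returning all qualifiers under the family without naming them.
public static void scanWholeFamily(HTable table) throws IOException {
  Scan scan = new Scan();
  scan.addFamily(Bytes.toBytes("cf"));          // all qualifiers in "cf", other families are skipped
  ResultScanner scanner = table.getScanner(scan);
  try {
    for (Result result : scanner) {
      for (KeyValue kv : result.raw()) {        // one KeyValue per returned cell (0.94 API)
        byte[] qualifier = kv.getQualifier();   // the random qualifier, e.g. \x00\x00\x06T...
        byte[] value = kv.getValue();
        // process qualifier/value
      }
    }
  } finally {
    scanner.close();
  }
}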
Re: Programming practices for implementing composite row keys
Greetings, Other food for thought on some case studies on composite rowkey design is in the refguide: http://hbase.apache.org/book.html#schema.casestudies On 9/5/13 12:15 PM, Anoop John anoop.hb...@gmail.com wrote: Hi Have a look at Phoenix[1]. There you can define a composite RK model and it handles the negative number ordering. Also the scan model you mentioned will be well supported with start/stop RK on entity1 and using SkipScanFilter for the others. -Anoop- [1] https://github.com/forcedotcom/phoenix On Thu, Sep 5, 2013 at 8:58 PM, Shahab Yunus shahab.yu...@gmail.com wrote: Ah! I didn't know about HBASE-8693. Good information. Thanks Ted. Regards, Shahab On Thu, Sep 5, 2013 at 10:53 AM, Ted Yu yuzhih...@gmail.com wrote: For #2 and #4, see HBASE-8693 'DataType: provide extensible type API' which has been integrated into 0.96 Cheers On Thu, Sep 5, 2013 at 7:14 AM, Shahab Yunus shahab.yu...@gmail.com wrote: My 2 cents: 1- Yes, that is one way to do it. You can also use a fixed length for every attribute participating in the composite key. HBase scans would be more fitting to this pattern as well, I believe (?) It's a trade-off basically between space (all that padding increasing the key size) versus the complexities involved in deciding and handling a delimiter and consequent parsing of keys etc. 2- I personally have not heard about this. As far as I understand, this goes against the whole idea of HBase scanning, and prefix and fuzzy filters will not be possible this way. This should not be followed. 3- See replies to 1 and 2 4- The sorting of the keys, by default, is by binary comparator. It is a bit tricky as far as I know and the last I checked. Some tips here: http://stackoverflow.com/questions/17248510/hbase-filters-not-working-for-negative-integers Can you normalize them (or take an absolute) before reading and writing (of course at the cost of performance) if it is possible, i.e. keys with the same amount but different magnitude cannot exist, as well as different entities. This depends on your business logic and the type/nature of the data. Regards, Shahab On Thu, Sep 5, 2013 at 10:03 AM, praveenesh kumar praveen...@gmail.com wrote: Hello people, I have a scenario which requires creating composite row keys for my hbase table. Basically it would be entity1,entity2,entity3. Search would be based on entity1 first and then entity2 and 3.. I know I can do a row start-stop scan on entity1 first and then put row filters on entity2 and entity3. My question is what are the best programming principles to implement these keys. 1. Just use simple delimiters entity1:entity2:entity3. 2. Create complex datatypes like java structures. I don't know if anyone uses structures as keys and if they do, can someone please enlighten me on the scenarios for which they would be a good fit. Do they fit well for this scenario? 3. What are the pros and cons of both 1 and 2, when it comes to data retrieval. 4. My entity1 can be negative also. Does it make any special difference where hbase ordering is concerned? How can I tackle this scenario? Any help on how to implement composite row keys would be highly helpful. I want to understand how the community deals with implementing composite row keys. Regards Praveenesh
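A rough sketch of the fixed-width alternative from point 1 combined with a sign-bit flip for the negative-value concern in point 4; the field types, 8-byte/8-char widths, and space padding are assumptions for illustration, not a recommendation for any particular schema:

import org.apache.hadoop.hbase.util.Bytes;

// Fixed-width composite rowkey: entity1 is a signed long whose sign bit is flipped so negative
// values sort lexicographically before positive ones; entity2 and entity3 are padded strings.
public static byte[] compositeKey(long entity1, String entity2, String entity3) {
  byte[] e1 = Bytes.toBytes(entity1 ^ Long.MIN_VALUE);          // order-preserving encoding for signed longs
  byte[] e2 = Bytes.toBytes(String.format("%-8s", entity2));    // pad to a fixed 8 chars (truncation not handled)
  byte[] e3 = Bytes.toBytes(String.format("%-8s", entity3));
  return Bytes.add(e1, e2, e3);                                  // fixed offsets, no delimiter parsing needed
}

A start/stop row for "all rows with a given entity1" can then be built from just the first 8 bytes, with row filters applied to the remaining fixed-offset fields.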
Re: Suggestion need on designing Flatten table for HBase given scenario
Greetings, The refguide has some case studies on composite rowkey design that might be helpful. http://hbase.apache.org/book.html#schema.casestudies From: Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com Reply-To: user@hbase.apache.org Date: Thursday, September 5, 2013 1:05 AM To: user@hbase.apache.org Subject: Suggestion need on designing Flatten table for HBase given scenario Dear All, For the below 1 to many relationship column sets, require suggestion on how to design a Flatten HBase table... Kindly refer to the attached image for the scenario... Pls let me know if my scenario is not clearly explained... regards, Rams
Re: HBase - stable versions
It's a very good point. Most people will go to 0.96 when CDH and Hortonworks support it. On 9/4/13 2:55 PM, Shahab Yunus shahab.yu...@gmail.com wrote: This may be a newbie or dumb question but I believe this does not affect or apply to HBase distributions by other vendors like HortonWorks or Cloudera. If someone is using one of the versions of distributions provided by them then it is up to them (and not the people and community here) what and till when they are going to support it. Regards, Shahab On Wed, Sep 4, 2013 at 1:33 PM, James Taylor jtay...@salesforce.com wrote: +1 to what Nicolas said. That goes for Phoenix as well. It's open source too. We do plan to port to 0.96 when our user community (Salesforce.com, of course, being one of them) demands it. Thanks, James On Wed, Sep 4, 2013 at 10:11 AM, Nicolas Liochon nkey...@gmail.com wrote: It's open source. My personal point of view is that if someone is willing to spend time on the backport, there should be no issue, if the regression risk is clearly acceptable and a rolling restart is possible. If it's necessary (i.e. there is no agreement on the risk level), then we could as well go for a 94.12.1 solution. I don't think we need to create this branch now: this branch should be created when and if we cannot find an agreement on a specific jira. Nicolas On Wed, Sep 4, 2013 at 6:53 PM, lars hofhansl la...@apache.org wrote: I should also explicitly state that we (Salesforce) will stay with 0.94 for the foreseeable future. We will continue to backport fixes that we need. If those are not acceptable or accepted into the open source 0.94 branch, they will have to go into a Salesforce internal repository. I would really like to avoid that (essentially a fork), so I would offer to start having stable tags, i.e. we keep making changes in 0.94.x, and declare (say) 0.94.12 stable and have 0.94.12.1, etc, releases (much like what is done in Linux) We also currently have no resources to port Phoenix over to 0.96 (but if somebody wanted to step up, that would be greatly appreciated, of course). Thoughts? Comments? Concerns? -- Lars - Original Message - From: lars hofhansl la...@apache.org To: hbase-dev d...@hbase.apache.org; hbase-user user@hbase.apache.org Cc: Sent: Tuesday, September 3, 2013 5:30 PM Subject: HBase - stable versions With 0.96 being imminent we should start a discussion about continuing support for 0.94. 0.92 became stale pretty soon after 0.94 was released. The relationship between 0.94 and 0.96 is slightly different, though: 1. 0.92.x could be upgraded to 0.94.x without downtime 2. 0.92 clients and servers are mutually compatible with 0.94 clients and servers 3. the user facing API stayed backward compatible None of the above is true when moving from 0.94 to 0.96+. Upgrade from 0.94 to 0.96 will require a one-way upgrade process including downtime, and client and server need to be upgraded in lockstep. I would like to have an informal poll about who's using 0.94 and is planning to continue to use it; and who is planning to upgrade from 0.94 to 0.96. Should we officially continue support for 0.94? How long? Thanks. -- Lars
Re: Writing data to hbase from reducer
A MapReduce job reading in your data in HDFS and then emitting Puts against the target table in the Mapper should do it, since it looks like there isn't any transform happening... http://hbase.apache.org/book/mapreduce.example.html Likewise, what Harsh said a few days ago. On 8/27/13 6:33 PM, Harsh J ha...@cloudera.com wrote: You can use HBase's MultiTableOutputFormat: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.html An example can be found in this blog post: http://www.wildnove.com/2011/07/19/tutorial-hadoop-and-hbase-multitableoutputformat/ On 8/28/13 12:49 PM, jamal sasha jamalsha...@gmail.com wrote: Hi, I have data in the form: source, destination, connection This data is saved in hdfs I want to read this data and put it in an hbase table, something like: Column1 (source) | Column2 (Destination) | Column3 (Connection Type); Row: vertex A | vertex B | connection. How do I do this? Thanks
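A hedged sketch of such a map-only job body for the single-table case above; the field order, the "cf" family, and the qualifier names are assumptions taken from the question:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each input line "source,destination,connection" becomes one Put keyed on the source vertex.
public class GraphEdgeMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
  private static final byte[] CF = Bytes.toBytes("cf");

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    if (fields.length < 3) {
      return;                                            // skip malformed lines
    }
    byte[] rowkey = Bytes.toBytes(fields[0].trim());     // source vertex as the rowkey
    Put put = new Put(rowkey);
    put.add(CF, Bytes.toBytes("destination"), Bytes.toBytes(fields[1].trim()));
    put.add(CF, Bytes.toBytes("connection"), Bytes.toBytes(fields[2].trim()));
    context.write(new ImmutableBytesWritable(rowkey), put);
  }
}

In the driver this would typically be paired with TableMapReduceUtil.initTableReducerJob(targetTable, null, job) and zero reduce tasks, as in the mapreduce example page linked above.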
Re: Newbie in hbase Trying to run an example
cf in this example is a column family, and this needs to exist in the tables (both input and output) before the job is submitted. On 8/26/13 3:01 PM, jamal sasha jamalsha...@gmail.com wrote: Hi, I am new to hbase, so a few noob questions. So, I created a table in hbase: A quick scan gives me the following: hbase(main):001:0> scan 'test' ROW COLUMN+CELL row1 column=cf:word, timestamp=1377298314160, value=foo row2 column=cf:word, timestamp=1377298326124, value=bar row3 column=cf:word, timestamp=1377298332856, value=bar foo row4 column=cf:word, timestamp=1377298347602, value=bar world foo Now, I want to do the word count and write the result back to another table in hbase So I followed the code given below: http://hbase.apache.org/book.html#mapreduce Snapshot in the end: Now, I am getting an error java.lang.NullPointerException at java.lang.String.<init>(String.java:601) at org.rdf.HBaseExperiment$MyMapper.map(HBaseExperiment.java:42) at org.rdf.HBaseExperiment$MyMapper.map(HBaseExperiment.java:1) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093) at org.apache.hadoop.mapred.Child.main(Child.java:249) Line 42 points to *public static final byte[] ATTR1 = "attr1".getBytes();* Now I think attr1 is a family qualifier. I am wondering, what exactly is a family qualifier? Do I need to set something while creating the table, just like I did "cf" when I was creating the table? Similarly, what do I need to do on the output table as well? So, what I am saying is.. what do I need to do in the hbase shell so that I can run this word count example?
Thanks import java.io.IOException; import java.util.Date; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.hbase.HBaseConfiguration; import org.apache.hadoop.hbase.KeyValue; import org.apache.hadoop.hbase.client.Put; import org.apache.hadoop.hbase.client.Result; import org.apache.hadoop.hbase.client.Scan; import org.apache.hadoop.hbase.io.ImmutableBytesWritable; import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil; import org.apache.hadoop.hbase.mapreduce.TableMapper; import org.apache.hadoop.hbase.mapreduce.TableReducer; import org.apache.hadoop.hbase.util.Bytes; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.io.Writable; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.Reducer.Context; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; import org.co_occurance.Pair; import org.co_occurance.PairsMethod; import org.co_occurance.PairsMethod.MeanReducer; import org.co_occurance.PairsMethod.PairsMapper; public class HBaseExperiment { public static class MyMapper extends TableMapper<Text, IntWritable> { public static final byte[] CF = "cf".getBytes(); *public static final byte[] ATTR1 = "attr1".getBytes();* private final IntWritable ONE = new IntWritable(1); private Text text = new Text(); public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException { String val = new String(value.getValue(CF, ATTR1)); //text.set(val); // we can only emit Writables... text.set(value.toString()); context.write(text, ONE); } } public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> { public static final byte[] CF = "cf".getBytes(); public static final byte[] COUNT = "count".getBytes(); public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int i = 0; for (IntWritable val : values) { i += val.get(); } Put put = new Put(Bytes.toBytes(key.toString())); put.add(CF, COUNT, Bytes.toBytes(i)); context.write(null, put); } } public static void main(String[] args) throws Exception { Configuration config = HBaseConfiguration.create(); Job job = new Job(config, "ExampleSummary"); job.setJarByClass(HBaseExperiment.class); // class that contains mapper and reducer Scan scan = new Scan(); scan.setCaching(500); // 1 is the default in Scan, which will be
Re: Kudos for Phoenix
You still have to register the view to phoenix and define which CF's and columns you are accessing, so this isn't entirely free form... create view myTable (cf VARCHAR primary key, cf.attr1 VARCHAR, cf.attr2 VARCHAR); … however, myTable in the above example is the HBase table you created outside Phoenix, so Phoenix doesn't need to copy any data, etc.. On 7/10/13 10:13 PM, Bing Jiang jiangbinglo...@gmail.com wrote: Hi, Doug. If build view upon Phoenix uncontrolled tables, whether it can be used to column family or qualifier? I want to know your design details. 2013/7/11 Doug Meil doug.m...@explorysmedical.com Hi folks, I just wanted to give a shout out to the Phoenix framework, and specifically for the ability to create view against an HBase table whose schema was not being managed by Phoenix. That's a really nice feature and I'm not sure how many folks realize this. I was initially nervous that this was only for data created with Phoenix, but that's not the case, so if you're looking for a lightweight framework for SQL-on-HBase I'd check it out. For this particular scenario it's probably better for ad-hoc data exploration, but often that's what people are looking to do. Doug Meil Chief Software Architect, Explorys doug.m...@explorysmedical.com -- Bing Jiang Tel:(86)134-2619-1361 weibo: http://weibo.com/jiangbinglover BLOG: http://blog.sina.com.cn/jiangbinglover National Research Center for Intelligent Computing Systems Institute of Computing technology Graduate University of Chinese Academy of Science
Re: small hbase doubt
Compression only applies to data on disk. Over the wire (i.e., RS to client) it is uncompressed. On 7/11/13 9:24 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Alok, What do you mean by query? Gets are done based on the key. And Snappy and LZO are used to compress the value. So only when a row fits your needs will HBase decompress the value and send it back to you... Does that answer your question? JM 2013/7/11 Alok Singh Mahor alokma...@gmail.com Hello everyone, could anyone clear up a small query? Does HBase decompress data before executing a query, or does it execute queries on compressed data? And how do Snappy and LZO actually behave? thanks
Re: Kudos for Phoenix
This particular use case is effectively a full scan on the table, but with server-side filters. Internally, Hbase still has to scan all the data - there's no magic. On 7/11/13 9:59 PM, Bing Jiang jiangbinglo...@gmail.com wrote: Could you give us the test performance, especially use the view of table? 2013/7/11 Doug Meil doug.m...@explorysmedical.com You still have to register the view to phoenix and define which CF's and columns you are accessing, so this isn't entirely free form... create view myTable (cf VARCHAR primary key, cf.attr1 VARCHAR, cf.attr2 VARCHAR); … however, myTable in the above example is the HBase table you created outside Phoenix, so Phoenix doesn't need to copy any data, etc.. On 7/10/13 10:13 PM, Bing Jiang jiangbinglo...@gmail.com wrote: Hi, Doug. If build view upon Phoenix uncontrolled tables, whether it can be used to column family or qualifier? I want to know your design details. 2013/7/11 Doug Meil doug.m...@explorysmedical.com Hi folks, I just wanted to give a shout out to the Phoenix framework, and specifically for the ability to create view against an HBase table whose schema was not being managed by Phoenix. That's a really nice feature and I'm not sure how many folks realize this. I was initially nervous that this was only for data created with Phoenix, but that's not the case, so if you're looking for a lightweight framework for SQL-on-HBase I'd check it out. For this particular scenario it's probably better for ad-hoc data exploration, but often that's what people are looking to do. Doug Meil Chief Software Architect, Explorys doug.m...@explorysmedical.com -- Bing Jiang Tel:(86)134-2619-1361 weibo: http://weibo.com/jiangbinglover BLOG: http://blog.sina.com.cn/jiangbinglover National Research Center for Intelligent Computing Systems Institute of Computing technology Graduate University of Chinese Academy of Science -- Bing Jiang Tel:(86)134-2619-1361 weibo: http://weibo.com/jiangbinglover BLOG: http://blog.sina.com.cn/jiangbinglover National Research Center for Intelligent Computing Systems Institute of Computing technology Graduate University of Chinese Academy of Science
Kudos for Phoenix
Hi folks, I just wanted to give a shout out to the Phoenix framework, and specifically for the ability to create view against an HBase table whose schema was not being managed by Phoenix. That's a really nice feature and I'm not sure how many folks realize this. I was initially nervous that this was only for data created with Phoenix, but that's not the case, so if you're looking for a lightweight framework for SQL-on-HBase I'd check it out. For this particular scenario it's probably better for ad-hoc data exploration, but often that's what people are looking to do. Doug Meil Chief Software Architect, Explorys doug.m...@explorysmedical.com
Re: RefGuide schema design examples
Thanks everybody, much appreciated! On 4/20/13 5:40 AM, varun kumar varun@gmail.com wrote: +1 On Sat, Apr 20, 2013 at 1:23 PM, Ravindranath Akila ravindranathak...@gmail.com wrote: +1 R. A. On 20 Apr 2013 12:07, Viral Bajaria viral.baja...@gmail.com wrote: +1! On Fri, Apr 19, 2013 at 4:09 PM, Marcos Luis Ortiz Valmaseda marcosluis2...@gmail.com wrote: Wow, great work, Doug. 2013/4/19 Doug Meil doug.m...@explorysmedical.com Hi folks, I reorganized the Schema Design case studies 2 weeks ago and consolidated them into here, plus added several cases common on the dist-list. http://hbase.apache.org/book.html#schema.casestudies Comments/suggestions welcome. Thanks! Doug Meil Chief Software Architect, Explorys doug.m...@explorysmedical.com -- Marcos Ortiz Valmaseda, *Data-Driven Product Manager* at PDVSA *Blog*: http://dataddict.wordpress.com/ *LinkedIn: *http://www.linkedin.com/in/marcosluis2186 *Twitter*: @marcosluis2186 http://twitter.com/marcosluis2186 -- Regards, Varun Kumar.P
RefGuide schema design examples
Hi folks, I reorganized the Schema Design case studies 2 weeks ago and consolidated them into here, plus added several cases common on the dist-list. http://hbase.apache.org/book.html#schema.casestudies Comments/suggestions welcome. Thanks! Doug Meil Chief Software Architect, Explorys doug.m...@explorysmedical.com
Re: Re: HBase random read performance
Hi there, regarding this... "We are passing random 10,000 row-keys as input, while HBase is taking around 17 secs to return 10,000 records." …. Given that you are generating 10,000 random keys, your multi-get is very likely hitting all 5 nodes of your cluster. Historically, multi-Get used to first sort the requests by RS and then *serially* go to the RS to process the multi-Get. I'm not sure of the current (0.94.x) behavior, whether it multi-threads or not. One thing you might want to consider is confirming that client behavior, and if it's not multi-threading then perform a test that does the same RS sorting via... http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#getRegionLocation%28byte[]%29 …. and then spin up your own threads (one per target RS) and see what happens. On 4/15/13 9:04 AM, Ankit Jain ankitjainc...@gmail.com wrote: Hi Liang, Thanks Liang for the reply.. Ans1: I tried using an HFile block size of 32 KB and the bloom filter is enabled. The random read performance is 10,000 records in 23 secs. Ans2: We are retrieving all the 10,000 rows in one call. Ans3: Disk detail: Model Number: ST2000DM001-1CH164 Serial Number: Z1E276YF Please suggest some more optimization Thanks, Ankit Jain On Mon, Apr 15, 2013 at 5:11 PM, 谢良 xieli...@xiaomi.com wrote: First, it probably won't help to set the block size to 4KB, please refer to the beginning of HFile.java: Smaller blocks are good * for random access, but require more memory to hold the block index, and may * be slower to create (because we must flush the compressor stream at the * conclusion of each data block, which leads to an FS I/O flush). Further, due * to the internal caching in Compression codec, the smallest possible block * size would be around 20KB-30KB. Second, is it a single-thread test client or multi-threaded? we couldn't expect too much if the requests are one by one. Third, could you provide more info about your DN disk numbers and IO utils? Thanks, Liang From: Ankit Jain [ankitjainc...@gmail.com] Sent: April 15, 2013 18:53 To: user@hbase.apache.org Subject: Re: HBase random read performance Hi Anoop, Thanks for the reply.. I tried setting the HFile block size to 4KB and also enabled the bloom filter (ROW). The maximum read performance that I was able to achieve is 10,000 records in 14 secs (size of record is 1.6KB). Please suggest some tuning.. Thanks, Ankit Jain On Mon, Apr 15, 2013 at 4:12 PM, Rishabh Agrawal rishabh.agra...@impetus.co.in wrote: Interesting. Can you explain why this happens? -Original Message- From: Anoop Sam John [mailto:anoo...@huawei.com] Sent: Monday, April 15, 2013 3:47 PM To: user@hbase.apache.org Subject: RE: HBase random read performance Ankit I guess you might be having the default HFile block size which is 64KB. For random gets a lower value will be better. Try with something like 8KB and check the latency? Ya of course blooms can help (if major compaction was not done at the time of testing) -Anoop- From: Ankit Jain [ankitjainc...@gmail.com] Sent: Saturday, April 13, 2013 11:01 AM To: user@hbase.apache.org Subject: HBase random read performance Hi All, We are using HBase 0.94.5 and Hadoop 1.0.4. We have an HBase cluster of 5 nodes (5 regionservers and 1 master node). Each regionserver has 8 GB RAM. We have loaded 25 million records into an HBase table, the regions are pre-split into 16 regions and all the regions are equally loaded. We are getting very low random read performance while performing multi-get from HBase. We are passing random 10,000 row-keys as input, while HBase is taking around 17 secs to return 10,000 records.
Please suggest some tuning to increase HBase read performance. Thanks, Ankit Jain iLabs -- Thanks, Ankit Jain
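A rough sketch of the RS-sorting idea Doug describes, using getRegionLocation from the 0.94 client API; bucketing by hostname:port is an illustrative choice, and the actual threading (one HTable instance per thread, since HTable is not thread-safe) is left out:

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;

// Bucket the 10,000 random keys by the RegionServer hosting them, so each bucket can be
// fetched with its own multi-get on its own thread (one thread per target RS).
public static Map<String, List<Get>> groupByRegionServer(HTable table, List<byte[]> rowkeys)
    throws IOException {
  Map<String, List<Get>> byServer = new HashMap<String, List<Get>>();
  for (byte[] rowkey : rowkeys) {
    HRegionLocation location = table.getRegionLocation(rowkey);
    String server = location.getHostname() + ":" + location.getPort();
    List<Get> bucket = byServer.get(server);
    if (bucket == null) {
      bucket = new ArrayList<Get>();
      byServer.put(server, bucket);
    }
    bucket.add(new Get(rowkey));
  }
  return byServer;   // each value can then be passed to table.get(List<Get>) in its own thread
}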
Re: ANN: HBase Refcard available
You beat me to it! :-) I just realized that right when I hit enter on my previous email. On 4/9/13 2:05 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Stack (cleaning your inbox? ;)) Looks like Doug did it a while back - https://issues.apache.org/jira/browse/HBASE-6574 ? Otis -- HBASE Performance Monitoring - http://sematext.com/spm/index.html On Tue, Apr 9, 2013 at 2:00 PM, Stack st...@duboce.net wrote: Make a patch for the reference guide that points to this Otis? Or just tell me where to insert? Thanks, St.Ack On Wed, Aug 8, 2012 at 4:14 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, We wrote an HBase Refcard and published it via DZone. Here is our very brief announcement: http://blog.sematext.com/2012/08/06/announcing-hbase-refcard/ . The PDF refcard can be had from http://refcardz.dzone.com/refcardz/hbase . Otis Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm
Re: schema design: rows vs wide columns
For the record, the refGuide mentions potential issues of CF lumpiness that you mentioned: http://hbase.apache.org/book.html#number.of.cfs 6.2.1. Cardinality of ColumnFamilies Where multiple ColumnFamilies exist in a single table, be aware of the cardinality (i.e., number of rows). If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion rows, ColumnFamilyA's data will likely be spread across many, many regions (and RegionServers). This makes mass scans for ColumnFamilyA less efficient. … anything that needs to be updated/added for this? On 4/8/13 12:39 AM, lars hofhansl la...@apache.org wrote: I think the main problem is that all CFs have to be flushed if one gets large enough to require a flush. (Does anyone remember why exactly that is? And do we still need that now that the memstoreTS is stored in the HFiles?) So things are fine as long as all CFs have roughly the same size. But if you have one that gets a lot of data and many others that are smaller, we'd end up with a lot of unnecessary and small store files from the smaller CFs. Anything else known that is bad about many column families? -- Lars From: Andrew Purtell apurt...@apache.org To: user@hbase.apache.org user@hbase.apache.org Sent: Sunday, April 7, 2013 3:52 PM Subject: Re: schema design: rows vs wide columns Is there a pointer to evidence/experiment backed analysis of this question? I'm sure there is some basis for this text in the book but I recommend we strike it. We could replace it with YCSB or LoadTestTool driven latency graphs for different workloads maybe. Although that would also be a big simplification of 'schema design' considerations, it would not be so starkly lacking background. On Sunday, April 7, 2013, Ted Yu wrote: From http://hbase.apache.org/book.html#number.of.cfs : HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low. Cheers On Sun, Apr 7, 2013 at 3:04 PM, Stack st...@duboce.net wrote: On Sun, Apr 7, 2013 at 11:58 AM, Ted yuzhih...@gmail.com wrote: With regard to number of column families, 3 is the recommended maximum. How did you come up w/ the number '3'? Is it a 'hard' 3? Or does it depend? If the latter, on what does it depend? Thanks, St.Ack -- Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
Re: HBase Types: Explicit Null Support
Hmmm… good question. I think that fixed width support is important for a great many rowkey construct cases, so I'd rather see something like losing MIN_VALUE and keeping fixed width. On 4/1/13 2:00 PM, Nick Dimiduk ndimi...@gmail.com wrote: Heya, Thinking about data types and serialization. I think null support is an important characteristic for the serialized representations, especially when considering the compound type. However, doing so is directly incompatible with fixed-width representations for numerics. For instance, if we want to have a fixed-width signed long stored on 8-bytes, where do you put null? float and double types can cheat a little by folding negative and positive NaN's into a single representation (this isn't strictly correct!), leaving a place to represent null. In the long example case, the obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This will allocate an additional encoding which can be used for null. My experience working with scientific data, however, makes me wince at the idea. The variable-width encodings have it a little easier. There's already enough going on that it's simpler to make room. Remember, the final goal is to support order-preserving serialization. This imposes some limitations on our encoding strategies. For instance, it's not enough to simply encode null, it really needs to be encoded as 0x00 so as to sort lexicographically earlier than any other value. What do you think? Any ideas, experiences, etc? Thanks, Nick
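One way to picture the trade-off being discussed, sketched as code; this is only an illustration of the design space (spend one prefix byte on null-ness and keep the 8-byte sign-flipped long intact), not any existing HBase or Phoenix format:

import org.apache.hadoop.hbase.util.Bytes;

// NULL encodes as a single 0x00 byte, so it sorts lexicographically before every non-null value;
// non-null longs get a 0x01 marker followed by the sign-bit-flipped 8-byte big-endian value.
public static byte[] encodeNullableLong(Long value) {
  if (value == null) {
    return new byte[] { 0x00 };                                     // NULL sorts first
  }
  byte[] encoded = new byte[9];
  encoded[0] = 0x01;                                                // "value present" marker
  Bytes.putLong(encoded, 1, value.longValue() ^ Long.MIN_VALUE);    // flip sign bit to preserve signed order
  return encoded;   // 9 bytes: the cost of null support here is giving up pure fixed 8-byte width
}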
Re: HBase type support
Sorry I'm late to this thread, but I was the guy behind HBASE-7221 and the algorithms specifically mentioned were MD5 and Murmur (not SHA-1). An implementation of Murmur already exists in HBase, and the MD5 implementation was the one that ships with Java. The intent was to include hashing appropriate for use with key distribution of rowkeys in tables, as is often suggested on the dist-lists. SHA-1 is probably overkill for the rowkey case, but I wouldn't want to stop anybody from using SHA-1 if it was appropriate for their needs. On 3/18/13 8:02 AM, Michel Segel michael_se...@hotmail.com wrote: Andrew, I was aware of your employer, which I am pretty sure has already dealt with the issue of exporting encryption software and probably hardware too. Neither of us are lawyers, and from what I do know of dealing with government bureaucracies, it's not always as simple as just filing the correct paperwork. (Sometimes it is, sometimes not so much, YMMV...) Putting in the hooks for encryption is probably a good idea. Shipping the encryption with the release or making it part of the official release, not so much. Sorry, I'm being a bit conservative here. IMHO I think fixing other issues would be of a higher priority, but that's just me ;-) Sent from a remote device. Please excuse any typos... Mike Segel On Mar 17, 2013, at 12:12 PM, Andrew Purtell apurt...@apache.org wrote: This then leads to another question... suppose Apache does add encryption to Hadoop. While the Apache organization does have the proper paperwork in place, what then happens to Cloudera, Hortonworks, EMC, IBM, Intel, etc ? Well I can't put that question aside since you've brought it up now twice and encryption feature candidates for Apache Hadoop and Apache HBase are something I have been working on. It's a valid question but since, as you admit, you don't know what you are talking about, perhaps stating uninformed opinions can be avoided. Only the latter is what I object to. I think the short answer is as an Apache contributor I'm concerned about the Apache product. Downstream repackagers can take whatever action needed including changes, since it is open source, or feedback about it representing a hardship. At this point I have heard nothing like that. I work for Intel and can say we are good with it. On Sunday, March 17, 2013, Michael Segel wrote: It's not a question of FUD, but that certain types of encryption/decryption code fall under the munitions act. See: http://www.fas.org/irp/offdocs/eo_crypt_9611_memo.htm Having said that, there is this: http://www.bis.doc.gov/encryption/encfaqs6_17_02.html In short, I don't as a habit export/import encryption technology so I am not up to speed on the current state of the laws. Which is why I have to question the current state of the US encryption laws. This then leads to another question... suppose Apache does add encryption to Hadoop. While the Apache organization does have the proper paperwork in place, what then happens to Cloudera, Hortonworks, EMC, IBM, Intel, etc ? But let's put that question aside. The point I was trying to make was that the core Sun JVM does support MD5 and SHA-1 out of the box, so that anyone running Hadoop and using the 1.6_xx or the 1.7_xx versions of the JVM will have these packages. Adding hooks that use these classes is a no brainer. However, beyond this... you tell me.
-Mike On Mar 16, 2013, at 7:59 AM, Andrew Purtell apurt...@apache.org wrote: The ASF avails itself of an exception to crypto export which only requires a bit of PMC housekeeping at release time. So is not [ok] is FUD. I humbly request we refrain from FUD here. See http://www.apache.org/dev/crypto.html. To the best of our knowledge we expect this to continue, though the ASF has not updated this policy yet for recent regulation updates. On Saturday, March 16, 2013, Michel Segel wrote: I also want to add that you could add MD5 and SHA-1, but I'd check on us laws... I think these are ok, however other encryption/decryption code is not. They are part of the std sun java libraries ... Sent from a remote device. Please excuse any typos... Mike Segel On Mar 16, 2013, at 7:18 AM, Michel Segel michael_se...@hotmail.com wrote: Isn't that what you get through add on frameworks like TSDB and Kiji ? Maybe not on the client side, but frameworks that extend HBase... Sent from a remote device. Please excuse any typos... Mike Segel On Mar 16, 2013, at 12:45 AM, lars hofhansl la...@apache.org wrote: I think generally we should keep HBase a byte[] based key value store. What we should add to HBase are tools that would allow client side apps (or libraries) to built functionality on top of plain HBase. Serialization that maintains a correct semantic sort order is important as a building block, so is code that can build up correctly serialized and sortable compound keys, as well as hashing
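A minimal sketch of the hash-prefix rowkey distribution Doug refers to above; the 4-byte prefix length and the choice of MD5 (rather than Murmur) are illustrative, and the JDK MessageDigest is used rather than any particular HBase utility:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.hbase.util.Bytes;

// Prefix a few bytes of the key's MD5 hash so sequential keys spread across regions,
// while the original key is kept readable after the prefix.
public static byte[] saltedKey(byte[] originalKey) throws NoSuchAlgorithmException {
  byte[] digest = MessageDigest.getInstance("MD5").digest(originalKey);
  byte[] prefix = new byte[4];
  System.arraycopy(digest, 0, prefix, 0, 4);
  return Bytes.add(prefix, originalKey);
}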
Re: question about pre-splitting regions
Good to hear! Given your experience, I'd appreciate your feedback on the section 6.3.6. Relationship Between RowKeys and Region Splits in... http://hbase.apache.org/book.html#schema.creation … because it's on that same topic. Any other points to add to this? Thanks! On 2/14/13 11:08 PM, Viral Bajaria viral.baja...@gmail.com wrote: I was able to figure it out. I had to use the createTable api which takes splitKeys instead of the startKey, endKey and numPartitions. If anyone comes across this issue and needs more feedback feel free to ping me. Thanks, Viral On Thu, Feb 14, 2013 at 7:30 PM, Viral Bajaria viral.baja...@gmail.com wrote: Hi, I am creating a new table and want to pre-split the regions and am seeing some weird behavior. My table is designed as a composite of multiple fixed length byte arrays separated by a control character (for simplicity's sake we can say the separator is _underscore_). The prefix of this rowkey is deterministic (i.e. length of 8 bytes) and I know beforehand how many different prefixes I will see in the near future. The values after the prefix are not deterministic. I wanted to create a pre-split table based on the number of prefix combinations that I know. I ended up doing something like this: hbaseAdmin.createTable(tableName, Bytes.toBytes(1L), Bytes.toBytes(maxCombinationPrefixValue), maxCombinationPrefixValue) The create table worked fine and as expected it created the number of partitions. But when I write data to the table, I still see all the writes hitting a single region instead of hitting different regions based on the prefix. Is my thinking of splitting by prefix values flawed? Do I have to split by some real rowkeys (though it's impossible for me to know what rowkeys will show up, except the row prefix which is much more deterministic). For some reason I think I have a flawed understanding of the createTable API and that is causing the issue for me? Should I use the byte[][] prefixes method and not the one that I am using right now? Any suggestions/pointers? Thanks, Viral
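A hedged sketch of the byte[][]-based createTable Viral ended up using; the table and family names are placeholders, and the split keys would be the known 8-byte prefixes themselves:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Pre-split on the known rowkey prefixes so each prefix starts in its own region.
public static void createPresplitTable(Configuration conf, byte[][] knownPrefixes)
    throws IOException {
  HBaseAdmin admin = new HBaseAdmin(conf);
  HTableDescriptor desc = new HTableDescriptor("my_table");    // placeholder table name
  desc.addFamily(new HColumnDescriptor("cf"));                 // placeholder family
  admin.createTable(desc, knownPrefixes);                      // each prefix becomes a region boundary
  admin.close();
}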
Re: Join Using MapReduce and Hbase
Hi there- Here is a comment in the RefGuide on joins in the HBase data model. http://hbase.apache.org/book.html#joins Short answer, you need to do it yourself (e.g., either with an in-memory hashmap or instantiating an HTable of the other table, depending on your situation). For other MR examples, see this... http://hbase.apache.org/book.html#mapreduce.example On 1/24/13 8:19 AM, Vikas Jadhav vikascjadha...@gmail.com wrote: Hi I am working join operation using MapReduce So if anyone has useful information plz share it. Example Code or New Technique along with existing one. Thank You. -- * * * Thanx and Regards* * Vikas Jadhav*
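If it helps, a hand-rolled map-side join along the lines described above might look like the sketch below: scan one table with a TableMapper and look up the matching row in the other table with a Get per record. Table, family, and qualifier names are made up for illustration; for a small lookup table, an in-memory HashMap loaded in setup() would avoid the per-row RPC.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class JoinMapper extends TableMapper<Text, Text> {
  private HTable customers;   // the "other" table being joined against

  @Override
  protected void setup(Context ctx) throws IOException {
    customers = new HTable(ctx.getConfiguration(), "customers");
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result order, Context ctx)
      throws IOException, InterruptedException {
    byte[] custId = order.getValue(Bytes.toBytes("cf"), Bytes.toBytes("customer_id"));
    if (custId == null) return;                        // no join key on this row
    Result customer = customers.get(new Get(custId));  // random read against the lookup table
    byte[] name = customer.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
    ctx.write(new Text(Bytes.toString(row.get())),
              new Text(name == null ? "" : Bytes.toString(name)));
  }

  @Override
  protected void cleanup(Context ctx) throws IOException {
    customers.close();
  }
}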
Re: Loading data, hbase slower than Hive?
Hi there- On top of what everybody else said, for more info on rowkey design and pre-splitting see http://hbase.apache.org/book.html#schema (as well as other threads in this dist-list on that topic). On 1/19/13 4:12 PM, Mohammad Tariq donta...@gmail.com wrote: Hello Austin, I am sorry for the late response. Asaf has made a very valid point. Rowkwey design is very crucial. Specially if the data is gonna be sequential(timeseries kinda thing). You may end up with hotspotting problem. Use pre-splitted tables or hash the keys to avoid that. It'll also allow you to fetch the results faster. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Sun, Jan 20, 2013 at 1:20 AM, Asaf Mesika asaf.mes...@gmail.com wrote: Start by telling us your row key design. Check for pre splitting your table regions. I managed to get to 25mb/sec write throughput in Hbase using 1 region server. If your data is evenly spread you can get around 7 times that in a 10 regions server environment. Should mean that 1 gig should take 4 sec. On Friday, January 18, 2013, praveenesh kumar wrote: Hey, Can someone throw some pointers on what would be the best practice for bulk imports in hbase ? That would be really helpful. Regards, Praveenesh On Thu, Jan 17, 2013 at 11:16 PM, Mohammad Tariq donta...@gmail.com javascript:; wrote: Just to add to whatever all the heavyweights have said above, your MR job may not be as efficient as the MR job corresponding to your Hive query. You can enhance the performance by setting the mapred config parameters wisely and by tuning your MR job. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Thu, Jan 17, 2013 at 10:39 PM, ramkrishna vasudevan ramkrishna.s.vasude...@gmail.com javascript:; wrote: Hive is more for batch and HBase is for more of real time data. Regards Ram On Thu, Jan 17, 2013 at 10:30 PM, Anoop John anoop.hb...@gmail.com javascript:; wrote: In case of Hive data insertion means placing the file under table path in HDFS. HBase need to read the data and convert it into its format. (HFiles) MR is doing this work.. So this makes it clear that HBase will be slower. :) As Michael said the read operation... -Anoop- On Thu, Jan 17, 2013 at 10:14 PM, Austin Chungath austi...@gmail.com javascript:; wrote: Hi, Problem: hive took 6 mins to load a data set, hbase took 1 hr 14 mins. It's a 20 gb data set approx 230 million records. The data is in hdfs, single text file. The cluster is 11 nodes, 8 cores. I loaded this in hive, partitioned by date and bucketed into 32 and sorted. Time taken is 6 mins. I loaded the same data into hbase, in the same cluster by writing a map reduce code. It took 1hr 14 mins. The cluster wasn't running anything else and assuming that the code that i wrote is good enough, what is it that makes hbase slower than hive in loading the data? Thanks, Austin
Re: How to de-nomarlize for this situation in HBASE Table
Hi there, I'd recommend reading the Schema Design chapter in the RefGuide because there are some good tips and hard-learned lessons. http://hbase.apache.org/book.html#schema Also, all your examples use composite row keys (not a surprise, a very common pattern) and one thing I would like to draw your attention to is this patch for composite row building. Feedback appreciated, because there isn't currently any utility support in Hbase for this. https://issues.apache.org/jira/browse/HBASE-7221 (Also, WibiData and Sematext have done good work in key-utility generation utilities too... ) On 1/18/13 12:18 AM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi, Is there any other way instead of using HOME/Work/etc? we expect some 10 such types may come in future.. hence asking regards, Rams On Fri, Jan 18, 2013 at 10:24 AM, Sonal Goyal sonalgoy...@gmail.com wrote: A rowkey is associated with the complete row. So you could have client id as the rowkey. Hbase allows different qualifiers within a column family, so you could potentially do the following: 1. You could have qualifiers like home address street 1, home address street 2, home address city, office address street 1 etc kind of qualifiers under physical address column family. 2. If you access entire address and not city, state individually, you could have the complete address concatenated and saved in one qualifier under physical address family using qualifiers like home, office, extra. A good link to get started is http://hbase.apache.org/book/datamodel.html#conceptual.view Best Regards, Sonal Real Time Analytics for BigData https://github.com/sonalgoyal/crux Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Jan 18, 2013 at 10:09 AM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi Sonal, In that case, the problem is how to store multiple physical address sets in the same column family.. what rowkey to be used for this scenario.. A Physical address will contain the following fields (need to store multiple physical address like this): Physical address type : Home/office/other/etc Address line1: .. .. Address line 4: State : City: Country: regards, Rams On Fri, Jan 18, 2013 at 10:00 AM, Sonal Goyal sonalgoy...@gmail.com wrote: How about client id as the rowkey, with column families as physical address, email address, telephone address? within each cf, you could have various qualifiers. For eg in physical address, you could have home Street, office street etc. Best Regards, Sonal Real Time Analytics for BigData https://github.com/sonalgoyal/crux Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Jan 18, 2013 at 9:46 AM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi Sonal, 1. will fetch all demographic details of customer based on client ID 2. Fetch the particular type of address along with other demographic for a client.. for example, HOME Physical address or HOME Telephone address or office Email address etc., regards, Rams On Fri, Jan 18, 2013 at 9:29 AM, Sonal Goyal sonalgoy...@gmail.com wrote: What are your data access patterns? Best Regards, Sonal Real Time Analytics for BigData https://github.com/sonalgoyal/crux Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Jan 18, 2013 at 9:04 AM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi, I have the following relational tables.. I want to denormalize and bring it all into single HBASE table...
Pls help how it could be done.. 1. Client Master Table 2. Physical Address Table (there might be 'n' number of address that can be captured against each client ID) 3. Email Address Table (there might be 'n' number of address that can be captured against each client ID) 4. Telephone Address Table (there might be 'n' number of address that can be captured against each client ID) For the tables 2 to 4, there are multiple fields like which is the Address type (home/office,etc), bad address, good address, communication address, time to call etc., Please help me to clarify the following : 1. Whether we can bring this to a single HBASE table? 2. Having fields like phone number1, phone number 2 etc. is not an good approach for this scenario... 3. Whether we can have in the same table by populating these multiple rows for the same customer with different rowkey? For e.g. For Client Records - Rowkey can be Client
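As a concrete illustration of the qualifier-based suggestion above, here is a hedged sketch of folding multiple address sets into one row under a single family, with the address type encoded in the qualifier name. The table, family, and qualifier names are assumptions for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientAddressExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "client");
    Put put = new Put(Bytes.toBytes("CLIENT0001"));   // client id as the rowkey
    // Address type is a prefix on the qualifier, so new types (HOME, WORK, ...) need no schema change.
    put.add(Bytes.toBytes("addr"), Bytes.toBytes("HOME:line1"), Bytes.toBytes("1 Main St"));
    put.add(Bytes.toBytes("addr"), Bytes.toBytes("HOME:city"), Bytes.toBytes("Springfield"));
    put.add(Bytes.toBytes("addr"), Bytes.toBytes("WORK:line1"), Bytes.toBytes("500 Office Park"));
    put.add(Bytes.toBytes("addr"), Bytes.toBytes("WORK:city"), Bytes.toBytes("Shelbyville"));
    table.put(put);
    table.close();
  }
}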
Re: Loading data, hbase slower than Hive?
Hi there, See this section of the HBase RefGuide for information about bulk loading. http://hbase.apache.org/book.html#arch.bulk.load On 1/18/13 12:57 PM, praveenesh kumar praveen...@gmail.com wrote: Hey, Can someone throw some pointers on what would be the best practice for bulk imports in hbase ? That would be really helpful. Regards, Praveenesh On Thu, Jan 17, 2013 at 11:16 PM, Mohammad Tariq donta...@gmail.com wrote: Just to add to whatever all the heavyweights have said above, your MR job may not be as efficient as the MR job corresponding to your Hive query. You can enhance the performance by setting the mapred config parameters wisely and by tuning your MR job. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Thu, Jan 17, 2013 at 10:39 PM, ramkrishna vasudevan ramkrishna.s.vasude...@gmail.com wrote: Hive is more for batch and HBase is for more of real time data. Regards Ram On Thu, Jan 17, 2013 at 10:30 PM, Anoop John anoop.hb...@gmail.com wrote: In case of Hive data insertion means placing the file under table path in HDFS. HBase need to read the data and convert it into its format. (HFiles) MR is doing this work.. So this makes it clear that HBase will be slower. :) As Michael said the read operation... -Anoop- On Thu, Jan 17, 2013 at 10:14 PM, Austin Chungath austi...@gmail.com wrote: Hi, Problem: hive took 6 mins to load a data set, hbase took 1 hr 14 mins. It's a 20 gb data set approx 230 million records. The data is in hdfs, single text file. The cluster is 11 nodes, 8 cores. I loaded this in hive, partitioned by date and bucketed into 32 and sorted. Time taken is 6 mins. I loaded the same data into hbase, in the same cluster by writing a map reduce code. It took 1hr 14 mins. The cluster wasn't running anything else and assuming that the code that i wrote is good enough, what is it that makes hbase slower than hive in loading the data? Thanks, Austin
Re: Reagrding HBase Hadoop multiple scan objects issue
Hi there- You probably want to review this section of the RefGuide: http://hbase.apache.org/book.html#mapreduce re: it's inefficient to have one scan object to scan everything. It is. But in the MapReduce case, there is a Map-task for each input split (see the RefGuide for details), and therefore a Scanner instance per Map-task. On 1/18/13 5:43 PM, Xu, Leon guodo...@amazon.com wrote: Hi HBase users, I am currently trying to set up a denormalization map-reduce job for my HBase Table. Since our table contains a large volume of data, it's inefficient to have one scan object to scan everything. We only need to process those records that have changed. I am planning to have multiple scan objects, each of which specifies a range, given that we keep track of which rows have been changed. Therefore I am trying to set up the map-reduce job with multiple scan objects, is this possible? I have seen some posts online suggesting extending the InputFormat object and changing getSplits, is this the most efficient way? Using a filter seems not very efficient in my case because it basically still scans the whole table, right? It just filters out certain records. Can you point me in the right direction? Thanks Leon
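If you are on a release that includes MultiTableInputFormat (it landed in the 0.94 line), there is also an initTableMapperJob overload that accepts a list of Scans, which avoids writing a custom getSplits. A rough sketch under those assumptions; the table name and the caller-supplied changed ranges and mapper class are placeholders for your own change-tracking logic.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class MultiScanJobSetup {
  // One Scan per changed rowkey range instead of a single full-table Scan.
  public static void configure(Job job, List<byte[][]> changedRanges,
                               Class<? extends TableMapper> mapperClass) throws Exception {
    List<Scan> scans = new ArrayList<Scan>();
    for (byte[][] range : changedRanges) {
      Scan scan = new Scan(range[0], range[1]);   // [startRow, stopRow)
      scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes("mytable"));
      scans.add(scan);
    }
    TableMapReduceUtil.initTableMapperJob(scans, mapperClass, Text.class, Text.class, job);
  }
}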
Re: Constructing rowkeys and HBASE-7221
Thanks Aaron! I will take a look at Kiji. And I think it underscores the need for some type of utility rowkey building/parsing being available in HBase, because one of the first things folks tend to do is start building their own keybuilder utility when they start using Hbase (same sentiment also expressed by others in the HBASE-7221 ticket comments). It's good that you have full control over the rowkey (i.e., byte[]) as a backstop, but HBase should also try to make things a bit easier for some common cases. I think it will help adoption. The general idea is a FixedLengthRowKey and a VariableLengthRowKey along with a RowKeySchema class, and I think that the variant you bring up is a great idea (e.g., prefix vs. hash). Let's keep this ball rolling! On 1/16/13 2:06 PM, Aaron Kimball akimbal...@gmail.com wrote: Hi Doug, This HBase feature is really interesting. It is quite related to some work we're doing on Kiji, our schema management project. In particular, we've also been focusing on building composite row keys correctly. One thing that jumped out at me in that ticket is that with a composition of md5hash and other (string, int, etc) components, you probably don't want the whole hash. If you're using that to shard your rows more efficiently across regions, you might want to just use a subset of the md5 bytes as a prefix. It might be a good idea to offer users control of this. Our own thoughts on this on the Kiji side are being tracked at https://jira.kiji.org/browse/schema-3 where we have a design doc that goes into a bit more detail. Cheers, - Aaron On Tue, Jan 15, 2013 at 2:01 PM, Doug Meil doug.m...@explorysmedical.com wrote: Hi there, well, this request for input fell like a thud. :-) But I think perhaps it has to do with the fact that I sent it to the dev-list instead of the user-list, as people that are actively writing HBase itself (devs) need less help with such keybuilding utilities. So one last request for feedback, but this time aimed at users of HBase: how has your key-building experience been? Thanks! On 1/7/13 11:04 AM, Doug Meil doug.m...@explorysmedical.com wrote: Greetings folks- I would like to restart the conversation on https://issues.apache.org/jira/browse/HBASE-7221 because there continue to be conversations on the dist-list about creating composite rowkeys, and while HBase makes just about anything possible, it doesn't make much easy in this respect. What I'm lobbying for is a utility class (see the v3 patch in HBASE-7221) that can both create and read rowkeys (so this isn't just a one-way builder pattern). This is currently stuck because it was noted that Bytes has an issue with sort-order of numbers specifically if you have both negative and positive values, which is really a different issue, but because this patch uses Bytes it's related. What are people's thoughts on this topic in general, and the v3 version of the patch specifically? (and the last set of comments). Thanks! One of the unit tests shows the example of usage. The last set of comments suggested that RowKey be renamed FixedLengthRowKey, which I think is a good idea. A follow-on patch could include VariableLengthRowKey for folks that use strings in the rowkeys.
public void testCreate() throws Exception {
  int elements[] = {RowKeySchema.SIZEOF_MD5_HASH, RowKeySchema.SIZEOF_INT, RowKeySchema.SIZEOF_LONG};
  RowKeySchema schema = new RowKeySchema(elements);
  RowKey rowkey = schema.createRowKey();
  rowkey.setHash(0, hashVal);
  rowkey.setInt(1, intVal);
  rowkey.setLong(2, longVal);
  byte bytes[] = rowkey.getBytes();
  Assert.assertEquals("key length", schema.getRowKeyLength(), bytes.length);
  Assert.assertEquals("e1", rowkey.getInt(1), intVal);
  Assert.assertEquals("e2", rowkey.getLong(2), longVal);
}
Doug Meil Chief Software Architect, Explorys doug.m...@explorys.com
Re: Just joined the user group and have a question
Hi there- If you're absolutely new to Hbase, you might want to check out the Hbase refGuide in the architecture, performance, and troubleshooting chapters first. http://hbase.apache.org/book.html In terms of determining why your region servers just die, I think you need to read the background information then provide more information on your cluster and what you're trying to do because although there are a lot of people on this dist-list that want to help, you're not giving folks a whole lot to go on. On 1/17/13 12:24 PM, Chalcy Raja chalcy.r...@careerbuilder.com wrote: Hi HBASE Gurus, I am Chalcy Raja and I joined the hbase group yesterday. I am already a member of hive and sqoop user groups. Looking forward to learn and share information about hbase here! Have a question: We have a cluster where we run hive jobs and also hbase. There are stability issues like region servers just die. We are looking into fine tuning. When I read about performance and also heard from another user is separate mapreduce from hbase. How do I do that? If I understand that as running tasktrackers on some and hbase region servers on some, then we will run into data locality issues and I believe it will perform poorly. Definitely I am not the only one running into this issue. Any thoughts on how to resolve this issue? Thanks, Chalcy
Re: Constructing rowkeys and HBASE-7221
Hi there, well, this request for input fell like a thud. :-) But I think perhaps it has to do with the fact that I sent it to the dev-list instead of the user-list, as people that are actively writing HBase itself (devs) need less help with such keybuilding utilities. So one last request for feedback, but this time aimed at users of HBase: how has your key-building experience been? Thanks! On 1/7/13 11:04 AM, Doug Meil doug.m...@explorysmedical.com wrote: Greetings folks- I would like to restart the conversation on https://issues.apache.org/jira/browse/HBASE-7221 because there continue to be conversations on the dist-list about creating composite rowkeys, and while HBase makes just about anything possible, it doesn't make much easy in this respect. What I'm lobbying for is a utility class (see the v3 patch in HBASE-7221) that can both create and read rowkeys (so this isn't just a one-way builder pattern). This is currently stuck because it was noted that Bytes has an issue with sort-order of numbers specifically if you have both negative and positive values, which is really a different issue, but because this patch uses Bytes it's related. What are people's thoughts on this topic in general, and the v3 version of the patch specifically? (and the last set of comments). Thanks! One of the unit tests shows the example of usage. The last set of comments suggested that RowKey be renamed FixedLengthRowKey, which I think is a good idea. A follow-on patch could include VariableLengthRowKey for folks that use strings in the rowkeys.
public void testCreate() throws Exception {
  int elements[] = {RowKeySchema.SIZEOF_MD5_HASH, RowKeySchema.SIZEOF_INT, RowKeySchema.SIZEOF_LONG};
  RowKeySchema schema = new RowKeySchema(elements);
  RowKey rowkey = schema.createRowKey();
  rowkey.setHash(0, hashVal);
  rowkey.setInt(1, intVal);
  rowkey.setLong(2, longVal);
  byte bytes[] = rowkey.getBytes();
  Assert.assertEquals("key length", schema.getRowKeyLength(), bytes.length);
  Assert.assertEquals("e1", rowkey.getInt(1), intVal);
  Assert.assertEquals("e2", rowkey.getLong(2), longVal);
}
Doug Meil Chief Software Architect, Explorys doug.m...@explorys.com
Re: One weird problem of my MR job upon hbase table.
Hi there, The HBase RefGuide has a comprehensive case study on such a case. This might not be the exact problem, but the diagnostic approach should help. http://hbase.apache.org/book.html#casestudies.slownode On 1/4/13 10:37 PM, Liu, Raymond raymond@intel.com wrote: Hi I encounter a weird lag behind map task issue here : I have a small hadoop/hbase cluster with 1 master node and 4 regionserver node all have 16 CPU with map and reduce slot set to 24. A few table is created with regions distributed on each region node evenly ( say 16 region for each region server). Also each region has almost the same number of kvs with very similar size. All table had major_compact done to ensure data locality I have a MR job which simply do local region scan in every map task ( so 16 map task for each regionserver node). By theory, every map task should finish within similar time. But the real case is that some regions on the same region server always lags behind a lot, say cost 150 ~250% of the other map tasks average times. If this is happen to a single region server for every table, I might doubt it is a disk issue or other reason that bring down the performance of this region server. But the weird thing is that, though with each single table, almost all the map task on the the same single regionserver is lag behind. But for different table, this lag behind regionserver is different! And the region and region size is distributed evenly which I double checked for a lot of times. ( I even try to set replica to 4 to ensure every node have a copy of local data) Say table 1, all map task on regionserver node 2 is slow. While for table 2, maybe all map task on regionserver node 3 is slow, and with table 1, it will always be regionserver node 2 which is slow regardless of cluster restart, and the slowest map task will always be the very same one. And it won't go away even I do major compact again. So, anyone could give me some clue on what reason might possible lead to this weird behavior? Any wild guess is welcome! (BTW. I don't encounter this issue a few days ago with the same table. While I do restart cluster and do a few changes upon config file during that period, But restore the config file don't help) Best Regards, Raymond Liu
Re: Is it necessary to set MD5 on rowkey?
Hi there- You don't want a filter for this; use a Scan with the lead portion of the key as the start row. http://hbase.apache.org/book.html#datamodel See 5.7.3. Scans On a related topic, this is a utility in progress to make composite key construction easier. https://issues.apache.org/jira/browse/HBASE-7221 On 12/18/12 4:20 AM, bigdata bigdatab...@outlook.com wrote: Many articles tell me that an MD5 rowkey, or an MD5-hashed part of it, is a good method to balance the records stored across different regions. But if I want to search some sequential rowkey records, such as rows keyed (fully or partially) by date, I cannot scan a range of date values in one pass once the date is hashed by MD5. How do I balance this trade-off? Thanks.
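For example, a range scan on the leading date portion of an unhashed key might look like the sketch below; the table name, yyyyMMdd key layout, and date bounds are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class DateRangeScanExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("20121201"));  // inclusive lower bound on the date prefix
    scan.setStopRow(Bytes.toBytes("20121219"));   // exclusive upper bound
    ResultScanner rs = table.getScanner(scan);
    try {
      for (Result r : rs) {
        System.out.println(Bytes.toString(r.getRow()));
      }
    } finally {
      rs.close();
      table.close();
    }
  }
}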
Re: 答复: 答复: what is the max size for one region and what is the max size of region for one server
Hi there, When sizing your data, don't forget to read this... http://hbase.apache.org/book.html#schema.creation and http://hbase.apache.org/book.html#regions.arch 9.7.5.4. KeyValue You need to understand how Hbase stores data internally at initial design time to avoid problems down the line. Keep the keys as small as reasonable, and likewise the CF and column names. On 12/17/12 6:07 AM, Nicolas Liochon nkey...@gmail.com wrote: I think it's safer to use a newer version (0.94): there are a lot of improvements around performance and data volumes between 0.92 and 0.94. As well, there are many more bug-fix releases on the 0.94 line. For the number of regions, there is no maximum written in stone. Having too many regions will essentially impact performance. As I said, having 60TB of data per machine is not standard today (the points are: that's a lot of disk for a single machine; what's the impact if you lose a node; what will the network load be, ...). I suppose all this is documented in the usual books on HBase. On Mon, Dec 17, 2012 at 11:26 AM, tgh guanhua.t...@ia.ac.cn wrote: number of region for ONE server?
Re: Reg:delete performance on HBase table
Hi there, You probably want to read this section on the RefGuide about deleting from HBase. http://hbase.apache.org/book.html#perf.deleting On 12/5/12 8:31 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Manoj, Delete in HBase is like a put. If you want to delete the entire table (drop) then it will be very fast. My test table has 100M rows and it's taking few seconds to delete (one CF and one C only). But if you want to delete the rows one by one (like 190M rows out of more) then it's like doing 190M puts. HTH. JM 2012/12/5, Manoj Babu manoj...@gmail.com: Hi All, I am having doubt on delete performance inHBase table. I have 190 million rows in oracle table it hardly took 4hours to delete it, If i am having the same 190 million rows in HBase table how much time by approx Hbase will take to delete the rows(based on row key range) and how internally HBase handles delete? Thanks in advance! Cheers! Manoj.
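Since each delete is effectively a put of a tombstone, the usual way to keep a large rowkey-range delete reasonable is to scan the range and batch the Deletes. A rough sketch; the batch size and the caller-supplied range are assumptions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class RangeDeleteExample {
  public static void deleteRange(HTable table, byte[] startRow, byte[] stopRow) throws IOException {
    ResultScanner rs = table.getScanner(new Scan(startRow, stopRow));
    List<Delete> batch = new ArrayList<Delete>();
    try {
      for (Result r : rs) {
        batch.add(new Delete(r.getRow()));
        if (batch.size() >= 1000) {      // batch size is just a tuning guess
          table.delete(batch);           // note: delete(List) removes succeeded entries from the list
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        table.delete(batch);
      }
    } finally {
      rs.close();
    }
  }
}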
Re: CopyTable utility fails on larger tables
I agree it shouldn't fail (slow is one thing, fail is something else), but regarding HBase Master Web UI showed only one region for the destination table., you probably want to pre-split your destination table. It's writing to one region, splitting, writing to those regions, splitting, etc. On 12/5/12 10:42 AM, David Koch ogd...@googlemail.com wrote: Hello, I can copy relatively small tables (10gb, 7million rows) using the built-in HBase (0.92.1-cdh4.0.1) CopyTable utility but copying larger tables, say 150gb, 100million rows does not work. The failed CopyTable job required 128 mappers according to the Job Tracker UI, all of these failed in the first attempt after 15 minutes, the job then ran another 1 hour while remaining at 0%. However, according to the counters many rows apparently had been mapped and emitted. Checking with HBase shell, I could not perform any action on the destination table (scan, get, count) and the HBase Master Web UI showed only one region for the destination table. I checked the log file on this region server and saw attached log record (extract). What precautions should I take when copying tables? Do certain settings need to be de-activated for the duration of the job? Thank you, /David 2012-12-05 15:50:40,406 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region _xxx_EH_xxx,{\xF0\xE4\xA2?!EQ\xB8\xC9tE\x19\x92 \x08,1354713876229.a75fba31d9883ed7be4ed4a7be0e592f. due to global heap pressure 2012-12-05 15:50:49,086 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs:// x-1.xx.net:8020/hbase/_xxx_EH_xxx/a75fba31d9883ed7be4ed4a7be0e 592f/t/1788b9f6f9594e2e9efe4ea5230d134c, entries=418152, sequenceid=1440152048, memsize=217.0m, filesize=145.9m 2012-12-05 15:50:49,088 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~217.8m/228416264, currentsize=33.0m/34555320 for region _xxx_EH_xxx,{\xF0\xE4\xA2?!EQ\xB8\xC9tE\x19\x92 \x08,1354713876229.a75fba31d9883ed7be4ed4a7be0e592f. in 8682ms, sequenceid=1440152048, compaction requested=true 2012-12-05 15:50:49,088 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region _xxx_EH_xxx,,1354713876229.37825c623850b16013ab0bf902d02746. has too many store files; delaying flush up to 9ms 2012-12-05 15:50:49,760 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@44848967), rpc version=1, client version=29, methodsFingerPrint=54742778 from 5.39.67.13:56290: output error 2012-12-05 15:50:49,760 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60020 caught: java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324) at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1663 ) at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseSer ver.java:934) at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.ja va:1013) at org.apache.hadoop.hbase.ipc.HBaseServer$Call.sendResponseIfReady(HBaseServ er.java:419) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1356) 2012-12-05 15:50:49,763 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60020: readAndProcess threw exception java.io.IOException: Connection reset by peer. 
Count of bytes read: 0 java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198) at sun.nio.ch.IOUtil.read(IOUtil.java:171) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243) at org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1686) at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseSer ver.java:1130) at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:7 13) at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.doRunLoop(HBaseSer ver.java:505) at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.ja va:480) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor. java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java :908) at java.lang.Thread.run(Thread.java:662) 2012-12-05 15:50:49,792 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60020: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0 java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198) at sun.nio.ch.IOUtil.read(IOUtil.java:171) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243) at
Re: Data Locality, HBase? Or Hadoop?
Hi there- This is also discussed in the Regions section in the RefGuide: http://hbase.apache.org/book.html#regions.arch 9.7.3. Region-RegionServer Locality On 12/3/12 10:08 AM, Kevin O'dell kevin.od...@cloudera.com wrote: JM, If you have disabled the balancer and are manually moving regions, you will need to run a compaction on those regions. That is the only(logical) way of bringing the data local. HDFS does not have a concept of HBase locality. HBase locality is all managed through major and minor compactions. On Mon, Dec 3, 2012 at 10:04 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi, I'm wondering who is taking care of the data locality. Is it hadoop? Or hbase? Let's say I have disabled the load balancer and I'm manually moving a region to a specific server. Who is going to take care that the data is going to be on the same datanode as the regionserver I moved the region to? Is hadoop going to see that my region is now on this region server and make sure my data is moved there too? Or is hbase going to ask hadoop to do it? Or, since I moved it manually, there is not any data locality guaranteed? Thanks, JM -- Kevin O'Dell Customer Operations Engineer, Cloudera
Re: Multiple regionservers on a single node
Hi there, Not tried multi-RS on a single node, but have you looked at the off-heap cache? It's a part of 0.92.x. From what I understand that feature was designed with this case in mind (I.e., trying to do a lot of caching, but don't want to introduce GC issues in RS). https://issues.apache.org/jira/browse/HBASE-4027 On 12/3/12 4:39 PM, Ishan Chhabra ichha...@rocketfuel.com wrote: Hi, Has anybody tried to run multiple RegionServers on a single physical node? Are there deep technical issues or minor impediments that would hinder this? We are trying to do this because we are facing a lot of GC pauses on the large heap sizes (~70G) that we are using, which leads to a lot of timeouts in our latency critical application. More processes with smaller heaps would help in mitigating this issue. Any experience or thoughts on this would help. Thanks! -- *Ishan Chhabra *| Rocket Scientist | Rocketfuel Inc. | *m *650 556 6803
Re: Connecting to standalone HBase from a remote client
Hi there- re: From what I have understood, these properties are not for Hbase but for the Hbase client which we write. They tell the client where to look for ZK. Yep. That's how it works. Then the client looks up ROOT/META and then the client talks directly to the RegionServers. http://hbase.apache.org/book.html#client On 11/27/12 8:52 AM, Mohammad Tariq donta...@gmail.com wrote: Hello Matan, From what I have understood, these properties are not for Hbase but for the Hbase client which we write. They tell the client where to look for ZK. Hmaster registers its address with ZK. And from there client will come to know where to look for Hmaster. And if the Hmaster registers its address as 'localhost', the client will take it as the 'localhost', which is client's 'localhost' and not the 'localhost' where Hmaster is running. So, if you have the IP and hostname of the Hmaster in your /etc/hosts file the client can reach that machine without any problem as there is proper DNS resolution available. But this just is what I think. I need approval from the heavyweights. Stack sir?? Regards, Mohammad Tariq On Tue, Nov 27, 2012 at 5:57 PM, matan ma...@cloudaloe.org wrote: Thanks guys, Excuse my ignorance, but having sort of agreed that the configuration that determines which-server-should-be-contacted-for-what is on the HBase server, I am not sure how any of the practical suggestions made should solve the issue, and enable connecting from a remote client. Let me delineate - setting /etc/hosts on my client side seems in this regard not relevant in that view. And the other suggestion for hbase-site.xml configuration I have already got covered, as my client code successfully connects to zookeeper (the configuration properties mentioned on this thread are zookeeper specific according to my interpretation of documentation, I don't directly see how they should solve the problem). Perhaps Mohammad you can explain why those zookeeper properties relate to how the master references itself towards zookeeper? Should I take it from St.Ack that there is currently no way to specify the master's remotely accessible server/ip in the HBase configuration? Anyway, my HBase server's /etc/hosts has just one line now, in case it got lost on the thread - 127.0.0.1 localhost 'server-name'. Everything works fine on the HBase server itself, the same client code runs perfectly there. Thanks again, Matan On Mon, Nov 26, 2012 at 10:15 PM, Tariq [via Apache HBase] ml-node+s679495n4034419...@n3.nabble.com wrote: Hello Nicolas, You are right. It has been deprecated. Thank you for updating my knowledge base..:) Regards, Mohammad Tariq On Tue, Nov 27, 2012 at 12:17 AM, Nicolas Liochon [hidden email] http://user/SendEmail.jtp?type=nodenode=4034419i=0 wrote: Hi Mohammad, Your answer was right, just that specifying the master address is not necessary (anymore I think). But it does no harm. Changing the /etc/hosts (as you did) is right too. Lastly, if the cluster is standalone and accessed locally, having localhost in ZK will not be an issue. However, it's perfectly possible to have a standalone cluster accessed remotely, so you don't want to have the master to write I'm on the server named localhost in this case. I expect it won't be an issue for communications between the region servers or hdfs as they would be all on the same localhost... 
Cheers, Nicolas On Mon, Nov 26, 2012 at 7:16 PM, Mohammad Tariq [hidden email] wrote: what
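For completeness, the client-side configuration this thread keeps circling around would look roughly like the sketch below; the hostname and table name are placeholders, and the server side still has to register itself in ZK under a resolvable name rather than localhost.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class RemoteClientExample {
  public static void main(String[] args) throws Exception {
    // Point the client at the cluster's ZooKeeper quorum; the client then
    // discovers the master and region servers from ZK.
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "hbase-server.example.com");      // hostname is illustrative
    conf.set("hbase.zookeeper.property.clientPort", "2181");
    HTable table = new HTable(conf, "mytable");
    System.out.println("Connected to " + table.getName());
    table.close();
  }
}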
Re: Expert suggestion needed to create table in Hbase - Banking
Hi there, somebody already wisely mentioned the link to the # of CF's entry, but here are a few other entries that can save you some heartburn if you read them ahead of time. http://hbase.apache.org/book.html#datamodel http://hbase.apache.org/book.html#schema http://hbase.apache.org/book.html#architecture On 11/26/12 5:28 AM, Mohammad Tariq donta...@gmail.com wrote: Hello sir, You might become a victim of RS hotspotting, since the cutomerIDs will be sequential(I assume). To keep things simple Hbase puts all the rows with similar keys to the same RS. But, it becomes a bottleneck in the long run as all the data keeps on going to the same region. HTH Regards, Mohammad Tariq On Mon, Nov 26, 2012 at 3:53 PM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi, Thanks! Can we have the customer number as the RowKey for the customer (client) master table? Please help in educating me on the advantage and disadvantage of having customer number as the Row key... Also SCD2 we may need to implement in that table.. will it work if I have like that? Or SCD2 is not needed instead we can achieve the same by increasing the version number that it will hold? pls suggest... regards, Rams On Mon, Nov 26, 2012 at 1:10 PM, Li, Min m...@microstrategy.com wrote: When 1 cf need to do split, other 599 cfs will split at the same time. So many fragments will be produced when you use so many column families. Actually, many cfs can be merge to only one cf with specific tags in rowkey. For example, rowkey of customer address can be uid+'AD', and customer profile can be uid+'PR'. Min -Original Message- From: Ramasubramanian Narayanan [mailto: ramasubramanian.naraya...@gmail.com] Sent: Monday, November 26, 2012 3:05 PM To: user@hbase.apache.org Subject: Expert suggestion needed to create table in Hbase - Banking Hi, I have a requirement of physicalising the logical model... I have a client model which has 600+ entities... Need suggestion how to go about physicalising it... I have few other doubts : 1) Whether is it good to create a single table for all the 600+ columns? 2) To have different column families for different groups or can it be under a single column family? For example, customer address can we have as a different column family? Please help on this.. regards, Rams
Re: Paging On HBASE like solr
Hi there- Then don't use an end-row and break out of the loop when you hit 100 rows. On 11/22/12 5:16 AM, Vajrakumar vajra.ku...@pointcross.com wrote: Hello Doug, First of all thanks for taking time to reply. As per my knowledge goes below two lines take the rowkey as a parameter for representing start and end. scan.setStartRow( Bytes.toBytes(row)); // start key is inclusive scan.setStopRow( Bytes.toBytes(row + (char)0)); // stop key is exclusive But, In my case irrespective of rowkey I need 100 rows always. If I go with this concept if 5 rows are deleted in between 1 to 100 then it will give me 95 but not 100. But for me always I need 100 (I mean rowCount whatever I pass) rows. And as after usage there may be deletions of rows or adding and all on DB, I can't keep track of rows for this paging.. Paging needs a fixed number of rows in each page always. -Original Message- From: Doug Meil [mailto:doug.m...@explorysmedical.com] Sent: 22 November 2012 00:21 To: user@hbase.apache.org Subject: Re: Paging On HBASE like solr Hi there, Pretty similar approach with Hbase. See the Scan class. http://hbase.apache.org/book.html#data_model_operations On 11/21/12 1:04 PM, Vajrakumar vajra.ku...@pointcross.com wrote: Hello all, As we do paging in solr using start and rowCount I need to implement same through hbase. In Detail: I have 1000 rows data which I need to display in 10 pages each page containing 100 rows. So on click of next page we will send current rowStart (1,101,201,301,401,501...) and rowCount (100 for all the pages) to a method which will query hbase and return me the result. One solution is to always query more than rowCount starting from th rowkey of last passed row, and in a for loop count depending on row key and return when it becomes 100 (i.e., rowCount) . But its poor solution i know. Thanks in advance. Sent from Samsung Mobile
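A sketch of that approach: page by start row and a counter, with no stop row, so deleted rows simply do not count against the page. The caller derives the next start row by appending a zero byte to the last row of the previous page, following the trailing-zero-byte convention from the earlier reply; the page size and method shapes are assumptions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PagingExample {
  // Returns up to pageSize rows starting at startRow (inclusive), with no stop row.
  public static List<Result> page(HTable table, byte[] startRow, int pageSize) throws IOException {
    Scan scan = new Scan(startRow);
    scan.setCaching(pageSize);                  // roughly one RPC per page
    List<Result> rows = new ArrayList<Result>(pageSize);
    ResultScanner rs = table.getScanner(scan);
    try {
      for (Result r : rs) {
        rows.add(r);
        if (rows.size() >= pageSize) break;     // stop once the page is full
      }
    } finally {
      rs.close();
    }
    return rows;
  }

  // Next page starts just past the last row of the previous page.
  public static byte[] nextStartRow(Result lastRowOfPage) {
    return Bytes.add(lastRowOfPage.getRow(), new byte[] { 0 });
  }
}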
Re: Paging On HBASE like solr
Hi there, Pretty similar approach with Hbase. See the Scan class. http://hbase.apache.org/book.html#data_model_operations On 11/21/12 1:04 PM, Vajrakumar vajra.ku...@pointcross.com wrote: Hello all, As we do paging in solr using start and rowCount I need to implement same through hbase. In Detail: I have 1000 rows data which I need to display in 10 pages each page containing 100 rows. So on click of next page we will send current rowStart (1,101,201,301,401,501...) and rowCount (100 for all the pages) to a method which will query hbase and return me the result. One solution is to always query more than rowCount starting from th rowkey of last passed row, and in a for loop count depending on row key and return when it becomes 100 (i.e., rowCount) . But its poor solution i know. Thanks in advance. Sent from Samsung Mobile
Re: Region hot spotting
Hi there- If he's using monotonically increasing keys the pre splits won't help because the same region is going to get all the writes. http://hbase.apache.org/book.html#rowkey.design On 11/21/12 12:33 PM, Suraj Varma svarma...@gmail.com wrote: Ajay: Why would you not want to specify splits while creating table? If your 0-10 prefix is at random ... why not pre-split with that? Without presplitting, as Ram says, you cannot avoid region hotspotting until table starts automatic splits. --S On Wed, Nov 21, 2012 at 3:46 AM, Ajay Bhosle ajay.bho...@relianceada.com wrote: Thanks for your comments, I am already prefixing the timestamp with integer in range of 1..10, also the hbase.hregion.max.filesize is defined as 256 MB. Still it is hot spotting. Thanks Ajay -Original Message- From: ramkrishna vasudevan [mailto:ramkrishna.s.vasude...@gmail.com] Sent: Wednesday, November 21, 2012 2:14 PM To: user@hbase.apache.org Subject: Re: Region hot spotting Hi This link is pretty much useful. But still there too it says if you dont pre split you need to wait for the salting to help you from hotspotting till the region gets splitted. Mohammad just pointing this to say the usefulness of presplitting definitely your's is a good pointer to Ajay. :) Regards Ram On Wed, Nov 21, 2012 at 1:59 PM, Mohammad Tariq donta...@gmail.com wrote: Hello Ajay, You can use 'salting' if you don't want to presplit your table. You might this link useful : http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/ HTH Regards, Mohammad Tariq On Wed, Nov 21, 2012 at 1:49 PM, ramkrishna vasudevan ramkrishna.s.vasude...@gmail.com wrote: Hotspotting is bound to happen until the region starts splitting and gets assigned to diff region servers. Regards Ram On Wed, Nov 21, 2012 at 12:49 PM, Ajay Bhosle ajay.bho...@relianceada.com wrote: Hi, I am inserting some data in hbase which is getting hot spotted in a particular server. The format of the row key is (0 or 1)|[timestamp]_[sequence]. Basically I want to add log information to hbase and search the records based on range of dates. Can someone suggest any configuration changes or any ideas on how the row key should be design. I do not want to specify the splits while creating table. Thanks Ajay
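A minimal sketch of the salting being referred to, assuming the existing (0 or 1)|[timestamp]_[sequence] key is widened to N buckets that match N pre-split regions. The bucket count and key layout are illustrative, and reads then have to fan out across all buckets for a given date range.

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKeyExample {
  // Deterministic salt: the same (timestamp, sequence) always maps to the same bucket.
  public static byte[] saltedKey(long timestamp, long sequence, int buckets) {
    int hash = (Long.toString(timestamp) + "_" + sequence).hashCode();
    byte prefix = (byte) ((hash & 0x7fffffff) % buckets);
    return Bytes.add(new byte[] { prefix }, Bytes.toBytes(timestamp), Bytes.toBytes(sequence));
  }
}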
Re: Development work focused on HFile v2
The Hbase RefGuide has a big entry in the appendix on Hfile v2. On 11/3/12 5:34 PM, Marcos Ortiz mlor...@uci.cu wrote: Regards to all HBase users. I'm looking for all available information about the current development of HFile version 2 to write a blog post talking about the main differences between HFile and HFile version 2. What I'm looking for? - JIRA issues - Research papers - General discussions about this topic Any help is welcome. Thanks -- Marcos Luis Ortíz Valmaseda about.me/marcosortiz http://about.me/marcosortiz @marcosluis2186 http://twitter.com/marcosluis2186
Re: how to copy oracle to HBASE, just like goldengate
Additionally, don't take it for granted that an RDBMS and HBase aren't the same thing. Check out these sections of the RefGuide if you haven't already. http://hbase.apache.org/book.html#datamodel http://hbase.apache.org/book.html#schema On 11/1/12 11:01 PM, Shumin Wu shumin...@gmail.com wrote: Have you taken a look at the Sqoop (http://sqoop.apache.org/) tool? Shumin On Thu, Nov 1, 2012 at 6:44 PM, Xiang Hua bea...@gmail.com wrote: Hi, IS there any tool to 'copy' whole oracle data of an instance into 'hbase'. Best R. huaxiang
Re: Does hbase.hregion.max.filesize have a limit?
Hi there- re: The max file size the whole cluster can store for one CF is 60G, right? No, the max file-size for a region, in your example, is 60GB. When the data exceeds that, the region will split - and then you'll have 2 regions with a 60GB limit. Check out this section of the RefGuide: http://hbase.apache.org/book.html#regions.arch which explains how regions, and therefore your data, are distributed across your cluster. The trick is that you don't want regions too small, but you also don't want them too big - because you'll wind up with what the ref guide describes in this chapter... 9.7.1. Region Size HBase scales by having regions across many servers. Thus if you have 2 regions for 16GB data, on a 20 node machine your data will be concentrated on just a few machines - nearly the entire cluster will be idle. This really can't be stressed enough, since a common problem is loading 200MB data into HBase then wondering why your awesome 10 node cluster isn't doing anything. On 11/1/12 4:09 AM, Cheng Su scarcer...@gmail.com wrote: Thank you for your answer. The max file size the whole cluster can store for one CF is 60G, right? Maybe the only way is to split the large table into small tables... On Thu, Nov 1, 2012 at 3:05 PM, ramkrishna vasudevan ramkrishna.s.vasude...@gmail.com wrote: Can multiple region servers run on one real machine? (I guess not though) No.. Every RS runs on a different physical machine. max.file.size actually applies per region. Suppose you create a table and then insert 20G of data; that will get explicitly split into further regions. Yes, all 60G of data can be stored on one physical machine, but that means the data is logically served by 3 regions. Does this help you? Regards Ram On Thu, Nov 1, 2012 at 12:15 PM, Cheng Su scarcer...@gmail.com wrote: Does that mean the max file size of 1 cf is 20G? If I have 3 region servers, then 60G total? I have a very large table; the size of one cf (which contains only one column) may exceed 60G. Is there any chance to store the data without adding machines? Can multiple region servers run on one real machine? (I guess not though) On Thu, Nov 1, 2012 at 1:35 PM, lars hofhansl lhofha...@yahoo.com wrote: The tribal knowledge would say about 20G is the max. The fellas from Facebook will have a more definite answer. -- Lars From: Cheng Su scarcer...@gmail.com To: user@hbase.apache.org Sent: Wednesday, October 31, 2012 10:22 PM Subject: Does hbase.hregion.max.filesize have a limit? Hi, all. I have a simple question: does hbase.hregion.max.filesize have a limit? May I specify a very large value to this? Like 40G or more? (don't consider the performance) I didn't find any description about this on the official site or google. Thanks. -- Regards, Cheng Su
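If you do want larger regions for one specific table rather than cluster-wide, one way (sketched here; the 20GB figure and table name are only examples, and the disable/modify/enable cycle is the conservative approach) is to raise the max file size on that table's descriptor instead of changing hbase.hregion.max.filesize globally.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes("mytable"));
    desc.setMaxFileSize(20L * 1024 * 1024 * 1024);   // ~20GB per region before splitting
    admin.disableTable("mytable");
    admin.modifyTable(Bytes.toBytes("mytable"), desc);
    admin.enableTable("mytable");
  }
}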
Re: Best technique for doing lookup with Secondary Index
Hey folks, for the record there are samples of using importsv for preparing Hfiles in here... http://hbase.apache.org/book.html#importtsv On 10/26/12 12:44 AM, anil gupta anilgupt...@gmail.com wrote: Hi Anoop, Yes i use bulk loading for loading table A. I wrote my own mapper as Importtsv wont suffice my requirements. :) No, i dont call HTable#put() from my mapper. I was thinking about trying out calling HTable#put() from my mapper and see the outcome. I meant to say that when we use MR job (ex. importtsv) then WAL is not used. Sorry, if i misunderstood someone. Thanks, Anil On Thu, Oct 25, 2012 at 9:06 PM, Anoop Sam John anoo...@huawei.com wrote: Hi Anil, Some confusion after seeing your reply. You use bulk loading? You created your own mapper? You call HTable#put() from mappers? I think confusion in another thread also.. I was refering to the HFileOutputReducer.. There is a TableOutputFormat also... In TableOutputFormat it will try put to the HTable... Here write to WAL is applicable... [HFileOutputReducer] : As we discussed in another thread, in case of bulk loading the aproach is like MR job create KVs and write to files and this file is written as an HFile. Yes this will contain all meta information, trailer etc... Finally only HBase cluster need to be contacted just to load this HFile(s) into HBase cluster.. Under corresponding regions. This will be the fastest way for bulk loading of huge data... -Anoop- From: anil gupta [anilgupt...@gmail.com] Sent: Friday, October 26, 2012 3:40 AM To: user@hbase.apache.org Subject: Re: Best technique for doing lookup with Secondary Index Anoop: In prePut hook u call HTable#put()? Anil: Yes i call HTable#put() in prePut. Is there better way of doing it? Anoop: Why use the network calls from server side here then? Anil: I thought this is a cleaner approach since i am using BulkLoader. I decided not to run two jobs since i am generating a UniqueIdentifier at runtime in bulkloader. Anoop: can not handle it from client alone? Anil: I cannot handle it from client since i am using BulkLoader. Is it a good idea to create Htable instance on B and do put in my mapper? I might try this idea. Anoop: You can have a look at Lily project. Anil: It's little late for us to evaluate Lily now and at present we dont need complex secondary index since our data is immutable. Ram: what is rowkey B here? Anil: Suppose i am storing customer events in table A. I have two requirement for data query: 1. Query customer events on basis of customer_Id and event_ID. 2. Query customer events on basis of event_timestamp and customer_ID. 70% of querying is done by query#1, so i will create customer_Idevent_ID as row key of Table A. Now, in order to support fast results for query#2, i need to create a secondary index on A. I store that secondary index in B, rowkey of B is event_timestampcustomer_ID .Every row stores the corresponding rowkey of A. Ram:How is the startRow determined for every query? Anil: Its determined by a very simple application logic. Thanks, Anil Gupta On Wed, Oct 24, 2012 at 10:16 PM, Ramkrishna.S.Vasudevan ramkrishna.vasude...@huawei.com wrote: Just out of curiosity, The secondary index is stored in table B as rowkey B -- family:rowkey A what is rowkey B here? 1. Scan the secondary table by using prefix filter and startRow. How is the startRow determined for every query ? 
Regards Ram -Original Message- From: Anoop Sam John [mailto:anoo...@huawei.com] Sent: Thursday, October 25, 2012 10:15 AM To: user@hbase.apache.org Subject: RE: Best technique for doing lookup with Secondary Index I build the secondary table B using a prePut RegionObserver. Anil, In prePut hook u call HTable#put()? Why use the network calls from server side here then? can not handle it from client alone? You can have a look at Lily project. Thoughts after seeing ur idea on put and scan.. -Anoop- From: anil gupta [anilgupt...@gmail.com] Sent: Thursday, October 25, 2012 3:10 AM To: user@hbase.apache.org Subject: Best technique for doing lookup with Secondary Index Hi All, I am using HBase 0.92.1. I have created a secondary index on table A. Table A stores immutable data. I build the secondary table B using a prePut RegionObserver. The secondary index is stored in table B as rowkey B -- family:rowkey A . rowkey A is the column qualifier. Every row in B will only on have one column and the name of that column is the rowkey of A. So the value is blank. As per my understanding, accessing column qualifier is faster than accessing value. Please correct me if i am wrong. HBase Querying approach: 1. Scan the secondary table by using prefix filter and startRow. 2. Do a batch get on
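A rough client-side sketch of the read path being described: range-scan the index table B, collect the table-A rowkeys stored as column qualifiers, then issue one batched get against A. The family name "f" and the method shape are assumptions for illustration.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SecondaryIndexLookup {
  public static List<Result> lookup(HTable indexB, HTable tableA,
                                    byte[] startRow, byte[] stopRow) throws IOException {
    List<Get> gets = new ArrayList<Get>();
    ResultScanner rs = indexB.getScanner(new Scan(startRow, stopRow));
    try {
      for (Result r : rs) {
        // Each index row stores the table-A rowkey as the (only) column qualifier.
        for (byte[] qualifier : r.getFamilyMap(Bytes.toBytes("f")).keySet()) {
          gets.add(new Get(qualifier));
        }
      }
    } finally {
      rs.close();
    }
    Result[] rows = tableA.get(gets);   // one batched lookup against the primary table
    return Arrays.asList(rows);
  }
}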
Re: Hbase sequential row merging in MapReduce job
As long as you know your keyspace, you should be able to create your own splits. See TableInputFormatBase for the default implementation (which is 1 input split per region) On 10/19/12 9:32 AM, Eric Czech eczec...@gmail.com wrote: Hi everyone, Is there any way to create an InputSplit for a MapReduce job (reading from an HBase table) that guarantees sequential rows with some shared key prefix will end up in the same mapper? For example, if I have sequential keys like this: metric1_2010, metric1_2011, metric1_2012, metric2_2011, metric2_2012, ... I want a mapper that will definitely see all the rows with keys that start with metric1. Is there a way to do this? Thank you!
Re: Coprocessor end point vs MapReduce?
To echo what Mike said about KISS, would you use triggers for a large time-sensitive batch job in an RDBMS? It's possible, but probably not. Then you might want to think twice about using co-processors for such a purpose with HBase. On 10/17/12 9:50 PM, Michael Segel michael_se...@hotmail.com wrote: Run your weekly job in a low priority fair scheduler/capacity scheduler queue. Maybe its just me, but I look at Coprocessors as a similar structure to RDBMS triggers and stored procedures. You need to restrain and use them sparingly otherwise you end up creating performance issues. Just IMHO. -Mike On Oct 17, 2012, at 8:44 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: I don't have any concern about the time it's taking. It's more about the load it's putting on the cluster. I have other jobs that I need to run (secondary index, data processing, etc.). So the more time this new job is taking, the less CPU the others will have. I tried the M/R and I really liked the way it's done. So my only concern will really be the performance of the delete part. That's why I'm wondering what's the best practice to move a row to another table. 2012/10/17, Michael Segel michael_se...@hotmail.com: If you're going to be running this weekly, I would suggest that you stick with the M/R job. Is there any reason why you need to be worried about the time it takes to do the deletes? On Oct 17, 2012, at 8:19 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Mike, I'm expecting to run the job weekly. I initially thought about using end points because I found HBASE-6942 which was a good example for my needs. I'm fine with the Put part for the Map/Reduce, but I'm not sure about the delete. That's why I look at coprocessors. Then I figure that I also can do the Put on the coprocessor side. On a M/R, can I delete the row I'm dealing with based on some criteria like timestamp? If I do that, I will not do bulk deletes, but I will delete the rows one by one, right? Which might be very slow. If in the future I want to run the job daily, might that be an issue? Or should I go with the initial idea of doing the Put with the M/R job and the delete with HBASE-6942? Thanks, JM 2012/10/17, Michael Segel michael_se...@hotmail.com: Hi, I'm a firm believer in KISS (Keep It Simple, Stupid) The Map/Reduce (map job only) is the simplest and least prone to failure. Not sure why you would want to do this using coprocessors. How often are you running this job? It sounds like its going to be sporadic. -Mike On Oct 17, 2012, at 7:11 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi, Can someone please help me to understand the pros and cons between those 2 options for the following usecase? I need to transfer all the rows between 2 timestamps to another table. My first idea was to run a MapReduce to map the rows and store them on another table, and then delete them using an end point coprocessor. But the more I look into it, the more I think the MapReduce is not a good idea and I should use a coprocessor instead. BUT... The MapReduce framework guarantee me that it will run against all the regions. I tried to stop a regionserver while the job was running. The region moved, and the MapReduce restarted the job from the new location. Will the coprocessor do the same thing? Also, I found the webconsole for the MapReduce with the number of jobs, the status, etc. Is there the same thing with the coprocessors? 
Are all coprocessors running at the same time on all regions, which mean we can have 100 of them running on a regionserver at a time? Or are they running like the MapReduce jobs based on some configured values? Thanks, JM
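To make the map-only option concrete, here is a minimal sketch of the kind of mapper being discussed: it re-Puts each row into a second table and buffers Deletes against the source, flushing them in cleanup(). The table names, column handling, and batch size are assumptions; the driver would restrict the scan to the two timestamps with Scan.setTimeRange() and wire the job up with TableMapReduceUtil.initTableMapperJob().

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;

// Copies each scanned row into a target table and batches deletes against the source.
class MoveRowsMapper extends TableMapper<NullWritable, NullWritable> {
  private HTable target;
  private HTable source;
  private List<Delete> pendingDeletes = new ArrayList<Delete>();

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    target = new HTable(conf, "archive_table");   // hypothetical table names
    source = new HTable(conf, "source_table");
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    Put put = new Put(row.get());
    for (KeyValue kv : value.raw()) {
      put.add(kv);                                // carry the cells over as-is
    }
    target.put(put);
    pendingDeletes.add(new Delete(row.get()));
    if (pendingDeletes.size() >= 1000) {          // flush deletes in batches, not one RPC each
      source.delete(pendingDeletes);
      pendingDeletes.clear();
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    if (!pendingDeletes.isEmpty()) {
      source.delete(pendingDeletes);              // send whatever is still buffered
    }
    target.close();
    source.close();
  }
}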
Re: Coprocessor end point vs MapReduce?
I agree with the concern and there isn't a ton of guidance on this area yet. On 10/18/12 2:01 PM, Michael Segel michael_se...@hotmail.com wrote: Doug, One thing that concerns me is that a lot of folks are gravitating to Coprocessors and may be using them for the wrong thing. Has anyone done any sort of research as to some of the limitations and negative impacts on using coprocessors? While I haven't really toyed with the idea of bulk deletes, periodic deletes is probably not a good use of coprocessors however using them to synchronize tables would be a valid use case. Thx -Mike On Oct 18, 2012, at 7:36 AM, Doug Meil doug.m...@explorysmedical.com wrote: To echo what Mike said about KISS, would you use triggers for a large time-sensitive batch job in an RDBMS? It's possible, but probably not. Then you might want to think twice about using co-processors for such a purpose with HBase. On 10/17/12 9:50 PM, Michael Segel michael_se...@hotmail.com wrote: Run your weekly job in a low priority fair scheduler/capacity scheduler queue. Maybe its just me, but I look at Coprocessors as a similar structure to RDBMS triggers and stored procedures. You need to restrain and use them sparingly otherwise you end up creating performance issues. Just IMHO. -Mike On Oct 17, 2012, at 8:44 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: I don't have any concern about the time it's taking. It's more about the load it's putting on the cluster. I have other jobs that I need to run (secondary index, data processing, etc.). So the more time this new job is taking, the less CPU the others will have. I tried the M/R and I really liked the way it's done. So my only concern will really be the performance of the delete part. That's why I'm wondering what's the best practice to move a row to another table. 2012/10/17, Michael Segel michael_se...@hotmail.com: If you're going to be running this weekly, I would suggest that you stick with the M/R job. Is there any reason why you need to be worried about the time it takes to do the deletes? On Oct 17, 2012, at 8:19 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Mike, I'm expecting to run the job weekly. I initially thought about using end points because I found HBASE-6942 which was a good example for my needs. I'm fine with the Put part for the Map/Reduce, but I'm not sure about the delete. That's why I look at coprocessors. Then I figure that I also can do the Put on the coprocessor side. On a M/R, can I delete the row I'm dealing with based on some criteria like timestamp? If I do that, I will not do bulk deletes, but I will delete the rows one by one, right? Which might be very slow. If in the future I want to run the job daily, might that be an issue? Or should I go with the initial idea of doing the Put with the M/R job and the delete with HBASE-6942? Thanks, JM 2012/10/17, Michael Segel michael_se...@hotmail.com: Hi, I'm a firm believer in KISS (Keep It Simple, Stupid) The Map/Reduce (map job only) is the simplest and least prone to failure. Not sure why you would want to do this using coprocessors. How often are you running this job? It sounds like its going to be sporadic. -Mike On Oct 17, 2012, at 7:11 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi, Can someone please help me to understand the pros and cons between those 2 options for the following usecase? I need to transfer all the rows between 2 timestamps to another table. 
My first idea was to run a MapReduce to map the rows and store them on another table, and then delete them using an end point coprocessor. But the more I look into it, the more I think the MapReduce is not a good idea and I should use a coprocessor instead. BUT... The MapReduce framework guarantee me that it will run against all the regions. I tried to stop a regionserver while the job was running. The region moved, and the MapReduce restarted the job from the new location. Will the coprocessor do the same thing? Also, I found the webconsole for the MapReduce with the number of jobs, the status, etc. Is there the same thing with the coprocessors? Are all coprocessors running at the same time on all regions, which mean we can have 100 of them running on a regionserver at a time? Or are they running like the MapReduce jobs based on some configured values? Thanks, JM
Re: Problems using unqualified hostname on hbase
Hi there. You generally don't want to run with 2 clusters like that (HBase on one, HDFS on the other) because your regions have 0% locality. For more information on this topic, see… http://hbase.apache.org/book.html#regions.arch.locality On 10/17/12 12:19 PM, Richard Tang tristartom.t...@gmail.com wrote: Hello, everyone, I have problems using hbase based on unqualified hostnames. My ``hbase`` runs in a cluster and ``hdfs`` on another cluster. While using fully qualified names on ``hbase``, for properties like ``hbase.rootdir`` and ``hbase.zookeeper.quorum``, there is no problem. But when I change them to be shorter unqualified names, like ``node4`` and ``node2`` (which are resolved to local IP addresses by ``/etc/hosts``, like ``10.0.0.8``), the hbase cluster begins to throw ``Connect refused`` messages. Anyone encounter the same problem here? What is the possible reason behind all these? Thanks. Regards, Richard
Re: bulk load
Yep. Bulk-loads are an extremely useful way of loading data. That would be 2 jobs since those are 2 tables. For more info on bulk loading, see… http://hbase.apache.org/book.html#arch.bulk.load On 10/14/12 10:58 AM, yutoo yanio yutoo.ya...@gmail.com wrote: hi i want to bulk load my data, but using map/reduce and HFileOutputFormat i have a problem: it can only bulk load into one table. but i want to load my data into more tables: one table for statistics, one table for all of the data, ... i am thinking about this approach: manually load data (locally or map/reduce) and create HFiles for all tables and append incrementally to the tables. is this approach good? please help me. thanks.
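A sketch of the two-job approach under discussion; the input and staging paths, table names, and the two mapper classes (which would emit ImmutableBytesWritable/Put pairs) are hypothetical placeholders, not working code from the thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoTableBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Job 1: generate HFiles for the statistics table.
    Job statsJob = new Job(conf, "hfiles-for-stats");
    statsJob.setJarByClass(TwoTableBulkLoad.class);
    statsJob.setMapperClass(StatsMapper.class);       // hypothetical mapper emitting (ImmutableBytesWritable, Put)
    FileInputFormat.addInputPath(statsJob, new Path("/input/raw"));
    FileOutputFormat.setOutputPath(statsJob, new Path("/staging/stats"));
    HFileOutputFormat.configureIncrementalLoad(statsJob, new HTable(conf, "stats"));
    statsJob.waitForCompletion(true);

    // Job 2: generate HFiles for the full-data table from the same input.
    Job dataJob = new Job(conf, "hfiles-for-data");
    dataJob.setJarByClass(TwoTableBulkLoad.class);
    dataJob.setMapperClass(DataMapper.class);         // hypothetical mapper
    FileInputFormat.addInputPath(dataJob, new Path("/input/raw"));
    FileOutputFormat.setOutputPath(dataJob, new Path("/staging/data"));
    HFileOutputFormat.configureIncrementalLoad(dataJob, new HTable(conf, "data"));
    dataJob.waitForCompletion(true);

    // Move the generated HFiles into each table.
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    loader.doBulkLoad(new Path("/staging/stats"), new HTable(conf, "stats"));
    loader.doBulkLoad(new Path("/staging/data"), new HTable(conf, "data"));
  }
}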
Re: Remove the row in MR job?
I'm not entirely sure of the use-case, but here are some thoughts on thisŠ re: should I take the table from the pool, and simply call the delete method? Yep, you can construct an HTable instance within a MR job. But use the delete that takes a list because the single-delete will invoke an RPC for each one (not great over an MR job). Construct the HTable instance at the Mapper level (not map-method level) and keep a buffer of deletes in a List. At the end of the job, send any un-processed deletes in the cleanup method. I'm not entirely sure why you'd want to delete every row in a table (as opposed to processing all the rows in Table1 and generating an entirely new Table2). And then drop Table1 when you're done with it. That seems like it would be less hassle than deleting every row (since the table is empty anyway). On 10/12/12 1:20 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi, I have a table which I want to parse over a MR job. Today, I'm using a scan to parse all the rows. Each row is retrieve, removed, and the parsed (feeding 2 other tables) The goal is to parse all the content while some process might still be adding some more. On the map method from the MR job, can I delete the row I'm working with? If so, how should I do? should I take the table from the pool, and simply call the delete method? The issue is, doing a delete for each line will take a while. I would prefer to batch them, but I don't know when will be the last line, so it's difficult to know when to send the batch. Is there a way to say to the MR job to delete this line? Also, what's the impact on the MR job if I delete the row it's working one? Or is the MR job not the best way to do that? Thanks, JM
Re: Remove the row in MR job?
Just throwing an idea out there, but if you rotate tables you could probably do what you want.. 1) Table1 is being written throughout the day 2) It's time to kick off the MR job, but before the job is submitted Table2 is now configured to be the 'write' table 3) MR job processes all the data in Table1. Table1 is dropped/truncated when finished. 4) Table2 continues to get writes 5) Now it's time to run the MR job again, Table1 is now configured to be the 'write' table and Table2 is processed by the MR job. 6) Continue rotating between the tables Something like this is probably going to be a lot easier to manage than doing deletes of what you've read. On 10/12/12 3:47 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Doug, Thanks for the suggestion. I like the idea of simply deleting the table, however, I'm not sure if I can implement it. Basically, I have one process which is constantly feeding the table, and, once a day, I want to run a MR job to proccess this table (Which will emtpy it). While I'm processing it, I still want to other process to have the ability to store data. Since I can't rename the table because this functionnaly doesn't exist, I need to have the 2 working on the same table. Maybe what I can do is working on the colum name Like I store on a different column every day based on the day number and I just run MR on all the columns except today. After that, I can delete all the columns except the one for the current day. Issue is if the MR is taking more than 24h... Also, is that fast to delete a column? JM 2012/10/12 Doug Meil doug.m...@explorysmedical.com: I'm not entirely sure of the use-case, but here are some thoughts on thisŠ re: should I take the table from the pool, and simply call the delete method? Yep, you can construct an HTable instance within a MR job. But use the delete that takes a list because the single-delete will invoke an RPC for each one (not great over an MR job). Construct the HTable instance at the Mapper level (not map-method level) and keep a buffer of deletes in a List. At the end of the job, send any un-processed deletes in the cleanup method. I'm not entirely sure why you'd want to delete every row in a table (as opposed to processing all the rows in Table1 and generating an entirely new Table2). And then drop Table1 when you're done with it. That seems like it would be less hassle than deleting every row (since the table is empty anyway). On 10/12/12 1:20 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi, I have a table which I want to parse over a MR job. Today, I'm using a scan to parse all the rows. Each row is retrieve, removed, and the parsed (feeding 2 other tables) The goal is to parse all the content while some process might still be adding some more. On the map method from the MR job, can I delete the row I'm working with? If so, how should I do? should I take the table from the pool, and simply call the delete method? The issue is, doing a delete for each line will take a while. I would prefer to batch them, but I don't know when will be the last line, so it's difficult to know when to send the batch. Is there a way to say to the MR job to delete this line? Also, what's the impact on the MR job if I delete the row it's working one? Or is the MR job not the best way to do that? Thanks, JM
Re: HBase table - distinct values
Typically this is something done as a MapReduce job. http://hbase.apache.org/book.html#mapreduce.example 7.2.4. HBase MapReduce Summary to HBase Example However, if this is an operation to be performed frequently by an application then doing frequent MapReduce jobs for summaries probably isn't the best idea. Either produce periodic summaries into another Hbase table, or denormalize and keep track of the required summaries upon data load. On 10/10/12 6:59 AM, raviprasa...@polarisft.com raviprasa...@polarisft.com wrote: Hi all, Is it possible to select distinct value from Hbase table. Example :- what is the equivalant code for the below Oracle code in Hbase ? Select count (distinct deptno) from emp ; Regards Raviprasad. T
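As an illustration of the summary-job idea, a minimal mapper/reducer pair for the "count(distinct deptno)" example; the column family and qualifier names are assumptions, and the job would be wired up with TableMapReduceUtil.initTableMapperJob().

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit each row's deptno value as the key.
class DistinctDeptMapper extends TableMapper<Text, IntWritable> {
  private static final byte[] CF = Bytes.toBytes("d");           // placeholder family
  private static final byte[] DEPTNO = Bytes.toBytes("deptno");  // placeholder qualifier
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    byte[] dept = value.getValue(CF, DEPTNO);
    if (dept != null) {
      context.write(new Text(Bytes.toString(dept)), ONE);
    }
  }
}

// Reducer: each key group is one distinct deptno, so bump a counter once per group.
class DistinctDeptReducer extends Reducer<Text, IntWritable, Text, NullWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    context.getCounter("summary", "distinct_deptno").increment(1);
    context.write(key, NullWritable.get());
  }
}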
Re: How to select distinct column from a Hbase table
Here's a Scan example from the Hbase ref guide… http://hbase.apache.org/book.html#scan … but this assumes you are asking about simply getting a distinct column from a table, as opposed to doing a distinct count query, which was in another email thread today. On 10/10/12 11:20 AM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi, Can someone help in providing the query to select a distinct column from an HBase table? If not, should we do it only in a Pig script? regards, Rams
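For completeness, a small client-side version of that Scan restricted to one column; the table, family, and qualifier names are placeholders.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

class SingleColumnScan {
  static void printColumn() throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");  // placeholder table
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("mycol"));        // only this column comes back
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        byte[] v = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("mycol"));
        System.out.println(Bytes.toString(r.getRow()) + " -> " + Bytes.toString(v));
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}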
Re: key design
Hi there- Given the fact that the userid is in the lead position of the key in both approaches, I'm not sure that he'd have a region hotspotting problem, because the userid should be able to offer some spread. On 10/10/12 12:55 PM, Jerry Lam chiling...@gmail.com wrote: Hi: So you are saying you have ~3TB of data stored per day? Using the second approach, all data for one day will go to only 1 regionserver no matter what you do, because HBase doesn't split a single row. Using the first approach, data will spread across regionservers, but each regionserver will be hotspotted during writes since this is a time-series problem. Best Regards, Jerry On Wed, Oct 10, 2012 at 11:24 AM, yutoo yanio yutoo.ya...@gmail.com wrote: hi i have a question about key column design. in my application we have 3,000,000,000 records every day. each record contains: user-id, time stamp, content (max 1KB). we need to store records for one year, this means we will have about 1,000,000,000,000 records after 1 year. we just search by a user-id over a range of time stamps. the table can be designed in two ways 1. key=userid-timestamp and column:=content 2. key=userid-MMdd and column:HHmmss=content in the first design we have a tall-narrow table but very, very many records, in the second design we have a flat-wide table. which of them has better performance? thanks.
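A short sketch of what the first (tall-narrow) layout looks like in code, including the user-id-plus-time-range read; names and values are placeholders.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

class TallNarrowExample {
  static void writeAndScanOneUser() throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "events");   // placeholder table
    String userId = "user-12345";                                       // placeholder values
    long ts = System.currentTimeMillis();

    // write: key = userid + 8-byte timestamp (big-endian, so keys sort by time within a user)
    byte[] rowKey = Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(ts));
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("d"), Bytes.toBytes("content"), Bytes.toBytes("payload"));
    table.put(put);

    // read: one user over a timestamp range (the stop row is exclusive)
    long startTs = ts - 3600L * 1000;
    Scan scan = new Scan();
    scan.setStartRow(Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(startTs)));
    scan.setStopRow(Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(ts + 1)));
    ResultScanner results = table.getScanner(scan);
    results.close();
    table.close();
  }
}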
Re: Regarding Hbase tuning - Configuration at table level
Re: JD's suggestion, this and more exciting and useful things can be found in these sections of the Hbase ref guide. http://hbase.apache.org/book.html#perf.reading http://hbase.apache.org/book.html#perf.writing Well, maybe not exciting, but certainly useful. :-) On 10/10/12 2:04 PM, Jean-Daniel Cryans jdcry...@apache.org wrote: At the table level there's only deferred log flush that would help, and only with writes at the cost of some durability. J-D On Wed, Oct 10, 2012 at 8:26 AM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi, What are all the configurations that can be done at Hbase table level to improve the performance of Hbase table for both read and right... regards, Rams
Re: HBase Key Design : Doubt
Correct. If you do 2 Puts for row key A-B-C-D on different days, the second Put logically replaces the first and the earlier Put becomes a previous version. Unless you specifically want older versions, you won't get them in either Gets or Scans. Definitely want to read thisŠ http://hbase.apache.org/book.html#datamodel See this for more information about they internal KeyValue structure. http://hbase.apache.org/book.html#regions.arch 9.7.5.4. KeyValue Older versions are kept around as long as the table descriptor says so (e.g., max versions). See the StoreFile and Compactions entries in the RefGuide for more information on the internals. On 10/10/12 3:24 PM, Jerry Lam chiling...@gmail.com wrote: correct me if I'm wrong. The version applies to the individual cell (ie. row key, column family and column qualifier) not (row key, column family). On Wed, Oct 10, 2012 at 3:13 PM, Narayanan K knarayana...@gmail.com wrote: Hi all, I have a usecase wherein I need to find the unique of some things in HBase across dates. Say, on 1st Oct, A-B-C-D appeared, hence I insert a row with rowkey : A-B-C-D. On 2nd Oct, I get the same value A-B-C-D and I don't want to redundantly store the row again with a new rowkey - A-B-C-D for 2nd Oct i.e I will not want to have 20121001-A-B-C-D and 20121002-A-B-C-D as 2 rowkeys in the table. Eg: If I have 1st Oct , 2nd Oct as 2 column families and if number of versions are set to 1, only 1 row will be present in for both the dates having rowkey A-B-C-D. Hence if I need to find unique number of times A-B-C-D appeared during Oct 1 and Oct 2, I just need to take rowcount of the row A-B-C-D by filtering over the 2 column families. Similarly, if we have 10 date column families, and I need to scan only for 2 dates, then it scans only those store files having the specified column families. This will make scanning faster. But here the design problem is that I cant add more column families to the table each day. I would need to store data every day and I read that HBase doesnt work well with more than 3 column families. The other option is to have one single column family and store dates as qualifiers : date:d1, date:d2 But here if there are 30 date qualifiers under date column family, to scan a single date qualifier or may be range of 2-3 dates will have to scan through the entire data of all d1 to d30 qualifiers in the date column family which would be slower compared to having separate column families for the each date.. Please share your thoughts on this. Also any alternate design suggestions you might have. Regards, Narayanan
Re: HBase client slows down
It's one of those it depends answers. See this first… http://hbase.apache.org/book.html#perf.writing … Additionally, one thing to understand is where you are writing data. Either keep track of the requests per RS over the period (e.g., the web interface), or you can also track it on the client side with... http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#getRegionLocation%28byte[],%20boolean%29 … to know if you are continually hitting the same RS or spreading the load. On 10/9/12 1:27 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I just have 5 stress client threads writing timeseries data. What I see is after a few mts HBaseClient slows down and starts to take 4 secs. Once I kill the client and restart it, it stays at a sustainable rate for about 2 mts and then again it slows down. I am wondering if there is something I should be doing on the HBase client side? All the requests are similar in terms of data.
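A rough sketch of that client-side tracking idea, assuming the 0.92-era HTable/HRegionLocation API; it just counts which host each pending Put would land on.

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

class WriteTracer {
  // A single dominant host in the returned map suggests the writes are hotspotting one RS.
  static Map<String, Integer> writesPerServer(HTable table, List<Put> puts) throws IOException {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (Put put : puts) {
      HRegionLocation loc = table.getRegionLocation(put.getRow(), false);
      String host = loc.getHostname();
      Integer n = counts.get(host);
      counts.put(host, n == null ? 1 : n + 1);
    }
    return counts;
  }
}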
Re: HBase client slows down
So you're running on a single regionserver? On 10/9/12 1:44 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am using HTableInterface as a pool but I don't see any setautoflush method. I am using 0.92.1 jar. Also, how can I see if RS is getting overloaded? I looked at the UI and I don't see anything obvious: equestsPerSecond=0, numberOfOnlineRegions=1, numberOfStores=1, numberOfStorefiles=1, storefileIndexSizeMB=0, rootIndexSizeKB=1, totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, memstoreSizeMB=27, readRequestsCount=126, writeRequestsCount=96157, compactionQueueSize=0, flushQueueSize=0, usedHeapMB=44, maxHeapMB=3976, blockCacheSizeMB=8.79, blockCacheFreeMB=985.34, blockCacheCount=11, blockCacheHitCount=23, blockCacheMissCount=28, blockCacheEvictedCount=0, blockCacheHitRatio=45%, blockCacheHitCachingRatio=67%, hdfsBlocksLocalityIndex=100 On Tue, Oct 9, 2012 at 10:32 AM, Doug Meil doug.m...@explorysmedical.comwrote: It's one of those it depends answers. See this firstŠ http://hbase.apache.org/book.html#perf.writing Š Additionally, one thing to understand is where you are writing data. Either keep track of the requests per RS over the period (e.g., the web interface), or you can also track it on the client side with... http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.htm l# getRegionLocation%28byte[],%20boolean%29 Š to know if you are continually hitting the same RS or spreading the load. On 10/9/12 1:27 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I just have 5 stress client threads writing timeseries data. What I see is after few mts HBaseClient slows down and starts to take 4 secs. Once I kill the client and restart it stays at sustainable rate for about 2 mts and then again it slows down. I am wondering if there is something I should be doing on the HBaseclient side? All the request are similar in terms of data.
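Regarding the missing setAutoFlush: with a plain HTable (rather than the pooled interface the poster mentions) the client-side buffering can be set up roughly like this; the table name is a placeholder.

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

class BufferedWriter {
  static void writeBuffered(Configuration conf, List<Put> puts) throws IOException {
    HTable table = new HTable(conf, "mytable");     // placeholder table name
    table.setAutoFlush(false);                      // buffer Puts on the client
    table.setWriteBufferSize(2L * 1024 * 1024);     // 2MB, the default buffer size
    table.put(puts);                                // queued in the write buffer, grouped by RS
    table.flushCommits();                           // send anything still buffered
    table.close();
  }
}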
Re: Column Qualifier space requirements
Hi there, there is a separate Store per ColumnFamily, which results in a separate StoreFile. http://hbase.apache.org/book.html#regions.arch This section has a description of what the files look like on HDFS… http://hbase.apache.org/book.html#trouble.namenode On 10/3/12 10:35 AM, Fuad Efendi f...@efendi.ca wrote: Hi Anoop, Thanks for the response! - I thought that each Column Family is associated with separate MapFile in Hadoop... but it was before 0.20... I found details at http://www.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/ -Fuad
Re: Bulk Loads and Updates
Hi there- re: All 20 versions will get loaded but the 10 oldest will be deleted during the next major compaction. Yep, that's what is expected to happen. For information on the KeyValue structure and the compaction algorithm, see… http://hbase.apache.org/book.html#regions.arch For info on bulk loading, see… http://hbase.apache.org/book.html#arch.bulk.load On 10/3/12 4:12 PM, Paul Mackles pmack...@adobe.com wrote: Keys in hbase are a combination of rowkey/column/timestamp. Two records with the same rowkey but a different column will result in two different cells with the same rowkey, which is probably what you expect. For two records with the same rowkey and same column, the timestamp will normally differentiate them, but in the case of a bulk load the timestamp could be the same, so it may actually be a tie and both will be stored. There are no updates in bulk loads. All 20 versions will get loaded but the 10 oldest will be deleted during the next major compaction. I would definitely recommend setting up small scale tests for all of the above scenarios to confirm. On 10/3/12 3:35 PM, Juan P. gordoslo...@gmail.com wrote: Hi guys, I've been reading up on bulk load using MapReduce jobs and I wanted to validate something. If the input I wanted to load into HBase produced the same key for several lines, how will HBase handle that? I understand the MapReduce job will create StoreFiles which the region servers just pick up and make available to the users. But is there a validation to treat the first as an insert and the rest as updates? What about the limit on the number of versions of a key HBase can have? If I want to have 10 versions, but the bulk load has 20 values for the same key, will it only keep the last 10? Thanks, Juan
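For reference, a small sketch of where the "10 versions" limit would come from, i.e. the column family descriptor; family and table names are placeholders.

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;

class VersionedTable {
  static HTableDescriptor describeTable() {
    HColumnDescriptor cf = new HColumnDescriptor("cf");        // placeholder family name
    cf.setMaxVersions(10);   // cells beyond the 10 newest versions are removed at major compaction
    HTableDescriptor desc = new HTableDescriptor("mytable");   // placeholder table name
    desc.addFamily(cf);
    return desc;             // pass to HBaseAdmin.createTable(desc)
  }
}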
Re: HBase vs. HDFS
Hi there, Another thing to consider on top of the scan-caching is that HBase is doing more in the process of scanning the table. See... http://hbase.apache.org/book.html#conceptual.view http://hbase.apache.org/book.html#regions.arch ... Specifically, processing the KeyValues, potentially merging rows between StoreFiles, checking for un-flushed updates in the MemStore per CF. On 10/1/12 8:54 PM, Doug Meil doug.m...@explorysmedical.com wrote: Hi there- Might want to start with this… http://hbase.apache.org/book.html#perf.reading … if you're using default scan caching (which is 1) that would explain a lot. On 10/1/12 7:01 PM, Juan P. gordoslo...@gmail.com wrote: Hi guys, I'm trying to get familiarized with HBase and one thing I noticed is that reads seem to very slow. I just tried doing a scan 'my_table' to get 120K records and it took about 50 seconds to print it all out. In contrast hadoop fs -cat my_file.csv where my_file.csv has 120K lines completed in under a second. Is that possible? Am I missing something about HBase reads? Thanks, Joni
Re: HBase vs. HDFS
If you take Hbase out of it and think of it from the standpoint of 2 programs, one of which opens a file and write the output to another file, and the other one which actually processes each row and then writes out results, the 2nd one is going to be slower because it's doing more, ceteris paribus. HBase is like the 2nd program in your test. On 10/2/12 8:46 AM, gordoslocos gordoslo...@gmail.com wrote: Thank you all! Setting a cache size helped a great deal. It's still slower though. I think it might be possible that the overhead of processing the data from the table might be the cause. I guess if HBase adds an indirection to the HDFS then it makes sense that it'd be slower, right? On 02/10/2012, at 09:28, Doug Meil doug.m...@explorysmedical.com wrote: Hi there, Another thing to consider on top of the scan-caching is that that HBase is doing more in the process of scanning the table. See... http://hbase.apache.org/book.html#conceptual.view http://hbase.apache.org/book.html#regions.arch ... Specifically, processing the KeyValues, potentially merging rows between StoreFiles, checking for un-flushed updates in the MemStore per CF. On 10/1/12 8:54 PM, Doug Meil doug.m...@explorysmedical.com wrote: Hi there- Might want to start with thisŠ http://hbase.apache.org/book.html#perf.reading Š if you're using default scan caching (which is 1) that would explain a lot. On 10/1/12 7:01 PM, Juan P. gordoslo...@gmail.com wrote: Hi guys, I'm trying to get familiarized with HBase and one thing I noticed is that reads seem to very slow. I just tried doing a scan 'my_table' to get 120K records and it took about 50 seconds to print it all out. In contrast hadoop fs -cat my_file.csv where my_file.csv has 120K lines completed in under a second. Is that possible? Am I missing something about HBase reads? Thanks, Joni
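For reference, the scan-caching change discussed in this thread is a one-liner on the client; the caching value and table name here are placeholders to be tuned.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

class CachedScan {
  static void scanWithCaching() throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "my_table");
    Scan scan = new Scan();
    scan.setCaching(500);                 // rows fetched per RPC instead of the default 1
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // process r
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}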
Re: HBase table row key design question.
Hi there, while this isn't an answer to some of the specific design questions, this chapter in the RefGuide can be helpful for general design.. http://hbase.apache.org/book.html#schema On 10/2/12 10:28 AM, Jason Huang jason.hu...@icare.com wrote: Hello, I am designing a HBase table for users and hope to get some suggestions for my row key design. Thanks... This user table will have columns which include user information such as names, birthday, gender, address, phone number, etc... The first time user comes to us we will ask all these information and we should generate a new row in the table with a unique row key. The next time the same user comes in again we will ask for his/her names and birthday and our application should quickly get the row(s) in the table which meets the name and birthday provided. Here is what I am thinking as row key: {first 6 digit of user's first name}_{first 6 digit of user's last name}_{birthday in MMDD}_{timestamp when user comes in for the first time} However, I see a few questions from this row key: (1) Although it is not very likely but there could be some small chances that two users with same name and birthday came in at the same day. And the two requests to generate new user came at the same time (the timestamps were defined in the HTable API and happened to be of the same value before calling the put method). This means the row key design above won't guarantee a unique row key. Any suggestions on how to modify it and ensure a unique ID? (2) Sometimes we will only have part of user's first name and/or last name. In that case, we will need to perform a scan and return multiple matches to the client. To avoid scanning the whole table, if we have user's first name, we can set start/stop row accordingly. But then if we only have user's last name, we can't set up a good start/stop row. What's even worse, if the user provides a sounds-like first or last name, then our scan won't be able to return good possible matches. Does anyone ever use names as part of the row key and encounter this type of issue? (3) The row key seems to be long (30+ chars), will this affect our read/write performance? Maybe it will increase the storage a bit (say we have 3 million rows per month)? In other words, does the length of the row key matter a lot? thanks! Jason
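One common way to handle concern (1), offered only as an illustration (it is not from the thread): append something unique, such as a random UUID, to the natural key, at the cost of an even longer key (concern 3).

import java.util.UUID;

class UserKeys {
  static String first6(String s) {                      // hypothetical helper
    return s.length() <= 6 ? s : s.substring(0, 6);
  }

  // Name fragments + birthday + a random UUID, so two identical users created at
  // the same instant still get different row keys.
  static String newUserRowKey(String firstName, String lastName, String birthday) {
    return first6(firstName) + "_" + first6(lastName) + "_" + birthday
        + "_" + UUID.randomUUID().toString();
  }
}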
Re: Column Qualifier space requirements
Hi there, take a look at the Hbase Refguide here... http://hbase.apache.org/book.html#regions.arch For this section... 9.7.5.4. KeyValue On 10/1/12 9:50 AM, Fuad Efendi f...@efendi.ca wrote: Hi, Is the column qualifier physically stored in a Cell? Or a pointer to it? Do we need to care about long sizes such as my_very_long_qualifier:1 (the size of a value is small in comparison to the size of the qualifier…) thanks
Re: HBase vs. HDFS
Hi there- Might want to start with this… http://hbase.apache.org/book.html#perf.reading … if you're using default scan caching (which is 1) that would explain a lot. On 10/1/12 7:01 PM, Juan P. gordoslo...@gmail.com wrote: Hi guys, I'm trying to get familiarized with HBase and one thing I noticed is that reads seem to very slow. I just tried doing a scan 'my_table' to get 120K records and it took about 50 seconds to print it all out. In contrast hadoop fs -cat my_file.csv where my_file.csv has 120K lines completed in under a second. Is that possible? Am I missing something about HBase reads? Thanks, Joni
Re: Clarification regarding major compaction logic
Hi there, for background on the file selection algorithm for compactions, see... http://hbase.apache.org/book.html#regions.arch 9.7.5.5. Compaction On 9/23/12 9:59 AM, Monish r monishs...@gmail.com wrote: Hi guys, i would like to clarify the following regarding Major Compaction 1) When TTL is set for a column family and major compaction is triggered by user - Does it act on the region only when *time since last major compaction is TTL.* * * 2) Does major compaction go through the index of a region to find out that there is data to be acted upon and then start the rewriting ( or ) does it rewrite without any pre checks about the data inside the region ? 3) If major compaction for a region results in a empty region , does the empty region get deleted or left as such ? Regards, R.Monish
Re: What is the best value to be used in rowkey
Hi there, you probably want to read this… http://hbase.apache.org/book.html#schema On 9/22/12 10:29 AM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi, Can anyone suggest what is the best value that can be used for a rowkey in an hbase table which will not produce duplicates at any point of time. For example, a timestamp with nanoseconds may get duplicated if we are loading in a batch file. regards, Rams
Re: HBase Multi-Threaded Writes
Hi there, You haven't described much about your environment, but there are two things you might want to consider for starters: 1) Is the table pre-split? (I.e., if it isn't, there is one region) 2) If it is, are all the writes hitting the same region? For other write tips, see this… http://hbase.apache.org/book.html#perf.writing On 9/19/12 2:53 PM, Pankaj Misra pankaj.mi...@impetus.co.in wrote: Dear All, I have created 2 clients with multi-threading support to perform concurrent writes to HBase with initial expectation that with multiple threads I should be able to write faster. The clients that I created are using the Native HBase API and Thrift API. To my surprise, the performance with multi-threaded clients dropped for the both the clients consistently when compared to single threaded ingestion. As I increase the number of threads the writes performance degrades consistently. With a single thread ingestion both the clients perform far better, but I intend to use HBase in a multi-threaded environment, wherein I am facing challenges with the performance. Since I am relatively new to HBase, please do excuse me if I am asking something very basic, but any suggestions around this would be extremely helpful. Thanks and Regards Pankaj Misra
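A small sketch of the pre-split creation Doug suggests; the table, family, and split boundaries are placeholders chosen for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

class PresplitCreate {
  static void createPresplitTable(Configuration conf) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("mytable");   // placeholder name
    desc.addFamily(new HColumnDescriptor("cf"));
    byte[][] splitKeys = new byte[][] {                        // placeholder boundaries
        Bytes.toBytes("2"), Bytes.toBytes("4"), Bytes.toBytes("6"), Bytes.toBytes("8")
    };
    admin.createTable(desc, splitKeys);                        // 5 regions from the start
    admin.close();
  }
}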
Re: HBase Multi-Threaded Writes
re: pseudo-distributed mode Ok, so you're doing a local test. The benefits you get with multiple regions per table that are spread across multiple RegionServers are that you can engage more of the cluster in your workload. You can't really do that on a local test. On 9/19/12 4:48 PM, Pankaj Misra pankaj.mi...@impetus.co.in wrote: Thank you so much Doug. You are right there is only one region to start with as I am not pre-splitting them. So for a given set of writes, all are hitting the same region. I will have the table pre-split as described, and test again. Will the number of region servers also impact the writes performance? My environment is HBase 0.94.1 with Hadoop 0.23.1, running on Oracle JVM 1.6. I am running hbase in a pseudo-distributed mode. Please find below my hbase-site.xml, which has very basic configurations. <configuration> <property> <name>hbase.rootdir</name> <value>hdfs://localhost:9000/hbase</value> </property> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>hbase.zookeeper.quorum</name> <value>localhost</value> </property> <property> <name>hbase.cluster.distributed</name> <value>true</value> </property> </configuration> Thanks and Regards Pankaj Misra From: Doug Meil [doug.m...@explorysmedical.com] Sent: Thursday, September 20, 2012 1:48 AM To: user@hbase.apache.org Subject: Re: HBase Multi-Threaded Writes Hi there, You haven't described much about your environment, but there are two things you might want to consider for starters: 1) Is the table pre-split? (I.e., if it isn't, there is one region) 2) If it is, are all the writes hitting the same region? For other write tips, see this… http://hbase.apache.org/book.html#perf.writing On 9/19/12 2:53 PM, Pankaj Misra pankaj.mi...@impetus.co.in wrote: Dear All, I have created 2 clients with multi-threading support to perform concurrent writes to HBase with initial expectation that with multiple threads I should be able to write faster. The clients that I created are using the Native HBase API and Thrift API. To my surprise, the performance with multi-threaded clients dropped for the both the clients consistently when compared to single threaded ingestion. As I increase the number of threads the writes performance degrades consistently. With a single thread ingestion both the clients perform far better, but I intend to use HBase in a multi-threaded environment, wherein I am facing challenges with the performance. Since I am relatively new to HBase, please do excuse me if I am asking something very basic, but any suggestions around this would be extremely helpful. Thanks and Regards Pankaj Misra
Re: HBase Multi-Threaded Writes
You probably want to do a review of these chapters too... http://hbase.apache.org/book.html#architecture http://hbase.apache.org/book.html#datamodel http://hbase.apache.org/book.html#schema On 9/19/12 4:48 PM, Pankaj Misra pankaj.mi...@impetus.co.in wrote: Thank you so much Doug. You are right there is only one region to start with as I am not pre-splitting them. So for a given set of writes, all are hitting the same region. I will have the table pre-split as described, and test again. Will the number of region servers also impact the writes performance? My environment is HBase 0.94.1 with Hadoop 0.23.1, running on Oracle JVM 1.6. I am running hbase in a pseudo-distributed mode. Please find below my hbase-site.xml, which has very basic configurations. <configuration> <property> <name>hbase.rootdir</name> <value>hdfs://localhost:9000/hbase</value> </property> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>hbase.zookeeper.quorum</name> <value>localhost</value> </property> <property> <name>hbase.cluster.distributed</name> <value>true</value> </property> </configuration> Thanks and Regards Pankaj Misra From: Doug Meil [doug.m...@explorysmedical.com] Sent: Thursday, September 20, 2012 1:48 AM To: user@hbase.apache.org Subject: Re: HBase Multi-Threaded Writes Hi there, You haven't described much about your environment, but there are two things you might want to consider for starters: 1) Is the table pre-split? (I.e., if it isn't, there is one region) 2) If it is, are all the writes hitting the same region? For other write tips, see this… http://hbase.apache.org/book.html#perf.writing On 9/19/12 2:53 PM, Pankaj Misra pankaj.mi...@impetus.co.in wrote: Dear All, I have created 2 clients with multi-threading support to perform concurrent writes to HBase with initial expectation that with multiple threads I should be able to write faster. The clients that I created are using the Native HBase API and Thrift API. To my surprise, the performance with multi-threaded clients dropped for the both the clients consistently when compared to single threaded ingestion. As I increase the number of threads the writes performance degrades consistently. With a single thread ingestion both the clients perform far better, but I intend to use HBase in a multi-threaded environment, wherein I am facing challenges with the performance. Since I am relatively new to HBase, please do excuse me if I am asking something very basic, but any suggestions around this would be extremely helpful. Thanks and Regards Pankaj Misra
Re: About hbase metadata
Hi there, Additionally, see this section in the RefGuide… http://hbase.apache.org/book.html#arch.catalog On 9/18/12 5:06 AM, Mohammad Tariq donta...@gmail.com wrote: Hello Ram, You can scan the '.META.' and '-ROOT-' tables. Alternatively you can also visit the 'hmaster web console' by pointing your web browser at hmaster_hostname:60010. You can use MetaUtils class to manipulate hbase meta tables. What kind of mapping you want to do and where do you want to specify it? Regards, Mohammad Tariq On Tue, Sep 18, 2012 at 1:46 PM, Ramasubramanian ramasubramanian.naraya...@gmail.com wrote: Hi, 1. Where can I see the metadata of hbase? 2. Can we able to edit it? 3. Can we specify column mapping for a table? Regards, Rams
Re: Hbase Scan - number of columns make the query performance way different
Hi there, I don't know the specifics of your environment, but ... http://hbase.apache.org/book.html#perf.reading 11.8.2. Scan Attribute Selection Š describes paying attention to the number of columns you are returning, particularly when using HBase as a MR source. In short, returning only the columns you need means you are reducing the data transferred between the RS and the client and the number of KV's evaluated in the RS, etc. On 9/13/12 10:12 AM, Shengjie Min kelvin@gmail.com wrote: Hi, I found an interesting difference between hbase scan query. I have a hbase table which has a lot of columns in a single column family. eg. let's say I have a users table, then userid, username, email etc etc 15 fields all together are in the single columnFamily. if you are familiar with RDBMS, query 1: select * from users vs query 2: select userid, username from users in mysql, these two has a difference, the query 2 will be obviously faster, but two queries won't give you a huge difference from performance perspective. In Hbase, I noticed that: query 3: scan 'users', // this is basically return me all 15 fields vs query 4: scan 'users', {COLUMNS=['cf:userid','cf:username']}// this is return me only two fields: userid , username query 3 here takes way longer than query 4, Given a big data set. In my test, I have around 1,000,000 user records. You are talking about query 3 - 100 secs VS query 4 - a few secs. Can anybody explain to me, why the width of the resultset in HBASE can impact the performance that much? Shengjie Min
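A hedged sketch of the narrow scan used as an MR source; the table, family, qualifiers, and the mapper are placeholders.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class NarrowScanJob {
  static class UserMapper extends TableMapper<Text, Text> {   // hypothetical mapper
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // only cf:userid and cf:username are present in 'value' here
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "narrow-scan");
    job.setJarByClass(NarrowScanJob.class);

    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("userid"));     // ship only what's needed
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("username"));
    scan.setCaching(500);
    scan.setCacheBlocks(false);        // commonly recommended for MR scans

    TableMapReduceUtil.initTableMapperJob("users", scan, UserMapper.class,
        Text.class, Text.class, job);
    job.setNumReduceTasks(0);
    job.waitForCompletion(true);
  }
}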
Re: Optimizing table scans
Hi there, See this for info on the block cache in the RegionServer.. http://hbase.apache.org/book.html 9.6.4. Block Cache Š and see this for batching on the scan parameter... http://hbase.apache.org/book.html#perf.reading 11.8.1. Scan Caching On 9/12/12 9:55 AM, Amit Sela am...@infolinks.com wrote: I allocate 10GB per RegionServer. An average row size is ~200 Bytes. The network is 1GB. It would be great if anyone could elaborate on the difference between Cache and Batch parameters. Thanks. On Wed, Sep 12, 2012 at 4:04 PM, Michael Segel michael_se...@hotmail.comwrote: How much memory do you have? What's the size of the underlying row? What does your network look like? 1GBe or 10GBe? There's more to it, and I think that you'll find that YMMV on what is an optimum scan size... HTH -Mike On Sep 12, 2012, at 7:57 AM, Amit Sela am...@infolinks.com wrote: Hi all, I'm trying to find the sweet spot for the cache size and batch size Scan() parameters. I'm scanning one table using HTable.getScanner() and iterating over the ResultScanner retrieved. I did some testing and got the following results: For scanning *100* rows. * Cache Batch Total execution time (sec) 1 -1 (default) 112 1 5000 110 1 1 110 1 2 110 Cache Batch Total execution time (sec) 1000 -1 (default) 116 1 -1 (default) 110 2 -1 (default) 115 Cache Batch Total execution time (sec) 5000 10 26 2 10 25 5 10 26 5000 5 15 2 5 14 5 5 14 1000 1 6 5000 1 5 1 1 4 2 1 4 5 1 4 * *I don't understand why a lower batch size gives such an improvement ?* Thanks, Amit. * *
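On the cache vs. batch question: roughly, setCaching controls how many rows come back per RPC, while setBatch caps how many columns of a (wide) row are returned in each Result. A minimal sketch with placeholder values; the table name is made up.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

class CacheVsBatch {
  static ResultScanner openScanner() throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");  // placeholder
    Scan scan = new Scan();
    scan.setCaching(1000);   // caching: rows shipped per RPC round-trip
    scan.setBatch(10);       // batch: max columns of a row per Result (wide rows get chunked)
    return table.getScanner(scan);
  }
}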
Re: HDFS footprint of a table
Hi there, see... http://hbase.apache.org/book.html#regions.arch … And in particular focus on… 9.7.5.4. KeyValue On 9/11/12 3:35 AM, Lin Ma lin...@gmail.com wrote: Hi guys, Supposing I have a table in HBase, how to estimate its storage footprint? Thanks. regards, Lin
Re: Regarding column family
Hi there, additionally, see… http://hbase.apache.org/book.html#regions.arch … and focus on 9.7.5.4. KeyValue because the CF name is actually a part of each KV. On 9/11/12 4:03 AM, n keywal nkey...@gmail.com wrote: Yes, because there is one store (hence set of files) per column family. See this: http://hbase.apache.org/book.html#number.of.cfs On Tue, Sep 11, 2012 at 9:52 AM, Ramasubramanian ramasubramanian.naraya...@gmail.com wrote: Hi, Does column family play any role during loading a file into hbase from hdfs in terms of performance? Regards, Rams
Re: More rows or less rows and more columns
re: You may want to update this section Good point. I will add. On 9/11/12 6:59 AM, Michel Segel michael_se...@hotmail.com wrote: Option c, depending on the use case, add a structure to you columns to store the data. You may want to update this section Sent from a remote device. Please excuse any typos... Mike Segel On Sep 10, 2012, at 12:30 PM, Harsh J ha...@cloudera.com wrote: Hey Mohit, See http://hbase.apache.org/book.html#schema.smackdown.rowscols On Mon, Sep 10, 2012 at 10:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Is there any recommendation on how many columns one should have per row. My columns are 200 bytes. This will help me to decide if I should shard my rows with id + some date/time value. -- Harsh J
Re: Regarding rowkey
Hi there, have you read this? http://hbase.apache.org/book.html#performance And especially this? http://hbase.apache.org/book.html#perf.writing How many nodes is the cluster? Is the target table pre-split? And if it is, are you sure that the rows aren't winding up on a single region? On 9/11/12 1:39 PM, Ramasubramanian ramasubramanian.naraya...@gmail.com wrote: Hi, What can be used as rowkey to improve performance while loading into hbase? Currently I am having sequence. It takes some 11 odd minutes to load 1 million record with 147 columns. Regards, Rams
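One way sequence-based keys are often spread across regions is a salt prefix derived from the sequence; this is an illustration, not something suggested in the thread, and it means reads have to fan out across the buckets.

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

class SaltedKeys {
  static Put saltedPut(long seq, byte[] value) {
    int buckets = 16;                                 // placeholder bucket count
    byte salt = (byte) (seq % buckets);               // small prefix derived from the sequence
    byte[] rowKey = Bytes.add(new byte[] { salt }, Bytes.toBytes(seq));
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), value);   // placeholder family/qualifier
    return put;
  }
}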
Re: Tomb Stone Marker
Hi there... In this chapter... http://hbase.apache.org/book.html#datamodel .. it explains that the updates are just a view. There is a merge happening across CFs and versions (and delete-markers).. In this... http://hbase.apache.org/book.html#regions.arch 9.7.5.5. Compaction ... it explains how and when the delete markers are removed in the compaction process. On 9/10/12 2:50 AM, Monish r monishs...@gmail.com wrote: Hi, Thanks for the link . If the meta data information for a delete is part of key value , then when does this update happen When the region is re written by minor compaction. ? or Is the region re written for a set of batched deletes ? On Sun, Sep 9, 2012 at 6:42 PM, Doug Meil doug.m...@explorysmedical.comwrote: Hi there, See 9.7.5.4. KeyValue... http://hbase.apache.org/book.html#regions.arch Š the tombstone is one of the keytypes. On 9/9/12 5:21 AM, Monish r monishs...@gmail.com wrote: Hi, I need some clarifications regarding the Tomb Stone Marker . I was wondering where exactly are the tomb stone markers stored when a row is deleted . Are they kept in some memory area and updated in the HFile during minor compaction ? If they are updated in the HFile , then what part of a HFile contains this information. Regards, R.Monish
Re: HBase aggregate query
Hi there, if there are common questions I'd suggest creating summary tables of the pre-aggregated results. http://hbase.apache.org/book.html#mapreduce.example 7.2.4. HBase MapReduce Summary to HBase Example On 9/10/12 10:03 AM, iwannaplay games funnlearnfork...@gmail.com wrote: Hi , I want to run query like select month(eventdate),scene,count(1),sum(timespent) from eventlog group by month(eventdate),scene in hbase.Through hive its taking a lot of time for 40 million records.Do we have any syntax in hbase to find its result?In sql server it takes around 9 minutes,How long it might take in hbase?? Regards Prabhjot
Re: scan a table with 2 column families.
Hi there, the scan will merge the results between the CFs… for more information see these two chapters in the HBase RefGuide. http://hbase.apache.org/book.html#datamodel http://hbase.apache.org/book.html#mapreduce On 9/9/12 6:41 AM, huaxiang huaxi...@asiainfo-linkage.com wrote: Hi, If a table has two column families, each stored in its own store file in a region in a regionserver, what will the process be for a scan of values in these two column families? For example, I want to scan jack's health1:height and health2:weight, each in a different column family. First scan height in one store file, then scan weight in another store file? Then merge these two scan results? Thanks! huaxiang
Re: Tomb Stone Marker
Hi there, See 9.7.5.4. KeyValue... http://hbase.apache.org/book.html#regions.arch … the tombstone is one of the keytypes. On 9/9/12 5:21 AM, Monish r monishs...@gmail.com wrote: Hi, I need some clarifications regarding the Tomb Stone Marker. I was wondering where exactly are the tomb stone markers stored when a row is deleted. Are they kept in some memory area and updated in the HFile during minor compaction? If they are updated in the HFile, then what part of a HFile contains this information. Regards, R.Monish
Re: batch update question
For the 2nd part of the question, if you have 10 Puts it's more efficient to send a single RS message with 10 Puts than send 10 RS messages with 1 Put apiece. There are 2 words to be careful with, and those are always and never, because there is an exception: if you are using the client writeBuffer and each of those 10 Puts are going to a different RegionServer, then you haven't really gained much. To answer the next question of how you know where the Puts are going, see this method… http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#getRegionLocation%28byte[],%20boolean%29 Because the Hbase client talks directly to each RS, it has to know the region boundaries. From: Lin Ma lin...@gmail.com Date: Thursday, September 6, 2012 11:54 AM To: user@hbase.apache.org, Doug Meil doug.m...@explorysmedical.com Cc: st...@duboce.net Subject: Re: batch update question Thank you Doug, Very effective reply. :-) - why batch update could resolve contention issue on the same row? Could you elaborate a bit more or show me an example? - Batch update always have good performance compared to single update (when we measure total throughput)? regards, Lin On Thu, Sep 6, 2012 at 12:59 AM, Doug Meil doug.m...@explorysmedical.com wrote: Hi there, if you look in the source code for HTable there is a list of Put objects. That's the buffer, and it's a client-side buffer. On 9/5/12 12:04 PM, Lin Ma lin...@gmail.com wrote: Thank you Stack for the details directions! 1. You are right, I have not met with any real row contention issues. My purpose is understanding the issue in advance, and also from this issue to understand HBase generals better; 2. For the comments from API Url page you referred -- If isAutoFlush (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableInterface.html#isAutoFlush%28%29) is false, the update is buffered until the internal buffer is full., I am confused what is the buffer? Buffer at client side or buffer in region server? Is there a way to configure its size to hold until flushing? 3. Why batch could resolve contention on the same raw issue in theory, compared to non-batch operation? Besides preparation the solution in my mind in advance, I want to learn a bit about why. :-) regards, Lin On Wed, Sep 5, 2012 at 4:00 AM, Stack st...@duboce.net wrote: On Sun, Sep 2, 2012 at 2:13 AM, Lin Ma lin...@gmail.com wrote: Hello guys, I am reading the book HBase, the definitive guide, at the beginning of chapter 3, it is mentioned in order to reduce performance impact for clients to update the same row (lock contention issues for automatic write), batch update is preferred. My questions is, for MR job, what are the batch update methods we could leverage to resolve the issue? And for API client, what are the batch update methods we could leverage to resolve the issue? Do you actually have a problem where there is contention on a single row? Use methods like http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#put(java.util.List) or the batch methods listed earlier in the API.
You should set autoflush to false too: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableInterface.html#isAutoFlush() Even batching, a highly contended row might hold up inserts... but for sure you actually have this problem in the first place? St.Ack
Re: Extremely slow when loading small amount of data from HBase
You have are 4000 regions on an 8 node cluster? I think you need to bring that *way* down… re: something like 40 regions Yep… around there. See… http://hbase.apache.org/book.html#bigger.regions On 9/5/12 8:06 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: But I think you should also look at why we have so many regions... Because even if you merge them manually now, you might face the same issu soon. 2012/9/5, n keywal nkey...@gmail.com: Hi, With 8 regionservers, yes, you can. Target a few hundreds by default imho. N. On Wed, Sep 5, 2012 at 4:55 AM, 某因幡 tewil...@gmail.com wrote: +HBase users. -- Forwarded message -- From: Dmitriy Ryaboy dvrya...@gmail.com Date: 2012/9/4 Subject: Re: Extremely slow when loading small amount of data from HBase To: u...@pig.apache.org u...@pig.apache.org I think the hbase folks recommend something like 40 regions per node per table, but I might be misremembering something. Have you tried emailing the hbase users list? On Sep 4, 2012, at 3:39 AM, 某因幡 tewil...@gmail.com wrote: After merging ~8000 regions to ~4000 on an 8-node cluster the things is getting better. Should I continue merging? 2012/8/29 Dmitriy Ryaboy dvrya...@gmail.com: Can you try the same scans with a regular hbase mapreduce job? If you see the same problem, it's an hbase issue. Otherwise, we need to see the script and some facts about your table (how many regions, how many rows, how big a cluster, is the small range all on one region server, etc) On Aug 27, 2012, at 11:49 PM, 某因幡 tewil...@gmail.com wrote: When I load a range of data from HBase simply using row key range in HBaseStorageHandler, I find that the speed is acceptable when I'm trying to load some tens of millions rows or more, while the only map ends up in a timeout when it's some thousands of rows. What is going wrong here? Tried both Pig-0.9.2 and Pig-0.10.0. -- language: Chinese, Japanese, English -- language: Chinese, Japanese, English -- language: Chinese, Japanese, English
Re: batch update question
Hi there, if you look in the source code for HTable there is a list of Put objects. That's the buffer, and it's a client-side buffer. On 9/5/12 12:04 PM, Lin Ma lin...@gmail.com wrote: Thank you Stack for the details directions! 1. You are right, I have not met with any real row contention issues. My purpose is understanding the issue in advance, and also from this issue to understand HBase generals better; 2. For the comments from API Url page you referred -- If isAutoFlushhttp://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client /HTableInterface.html#isAutoFlush%28%29is false, the update is buffered until the internal buffer is full., I am confused what is the buffer? Buffer at client side or buffer in region server? Is there a way to configure its size to hold until flushing? 3. Why batch could resolve contention on the same raw issue in theory, compared to non-batch operation? Besides preparation the solution in my mind in advance, I want to learn a bit about why. :-) regards, Lin On Wed, Sep 5, 2012 at 4:00 AM, Stack st...@duboce.net wrote: On Sun, Sep 2, 2012 at 2:13 AM, Lin Ma lin...@gmail.com wrote: Hello guys, I am reading the book HBase, the definitive guide, at the beginning of chapter 3, it is mentioned in order to reduce performance impact for clients to update the same row (lock contention issues for automatic write), batch update is preferred. My questions is, for MR job, what are the batch update methods we could leverage to resolve the issue? And for API client, what are the batch update methods we could leverage to resolve the issue? Do you actually have a problem where there is contention on a single row? Use methods like http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.htm l#put(java.util.List) or the batch methods listed earlier in the API. You should set autoflush to false too: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableInte rface.html#isAutoFlush() Even batching, a highly contended row might hold up inserts... but for sure you actually have this problem in the first place? St.Ack
Re: batch update question
Hi there, for more information about the hbase client, see… http://hbase.apache.org/book.html#client On 9/5/12 12:59 PM, Doug Meil doug.m...@explorysmedical.com wrote: Hi there, if you look in the source code for HTable there is a list of Put objects. That's the buffer, and it's a client-side buffer.
Re: Reading in parallel from table's regions in MapReduce
Hi there- Yes, there is an input split for each region of the source table of an MR job. There is a blurb on that in the RefGuide... http://hbase.apache.org/book.html#splitter On 9/4/12 11:17 AM, Ioakim Perros imper...@gmail.com wrote: Hello, I would be grateful if someone could shed some light on the following: Each M/R map task is reading data from a separate region of a table. From the jobtracker's GUI, at the map completion graph, I notice that although the data read by the mappers is different, they read data sequentially - as if the table had a lock that permits only one mapper at a time to read from its regions. Does this lock hypothesis make sense? Is there any way I could avoid this useless delay? Thanks in advance and regards, Ioakim
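For context, this is roughly how a table-backed MR job is set up so that TableInputFormat yields one input split (and one map task) per region; the table name and the RegionParallelScan/MyMapper classes are hypothetical placeholders, not from the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class RegionParallelScan {

  // Trivial pass-through mapper; real per-row logic would go here.
  static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "scan-mytable");      // "mytable" is a placeholder
    job.setJarByClass(RegionParallelScan.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch more rows per RPC
    scan.setCacheBlocks(false);  // don't churn the block cache during a full scan

    // TableInputFormat creates one input split (one map task) per region of "mytable".
    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, MyMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    job.waitForCompletion(true);
  }
}

If the maps appear to run one after another, it is more likely a shortage of free map slots in the scheduler than any table-level lock; HBase does not serialize readers across regions.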
Re: md5 hash key and splits
Stack, re: Where did you read that?, I think he might also be referring to this... http://hbase.apache.org/book.html#important_configurations On 8/30/12 8:04 PM, Mohit Anchlia mohitanch...@gmail.com wrote: In general, isn't it better to split the regions so that the load can be spread across the cluster to avoid hotspots? I read about pre-splitting here: http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/ On Thu, Aug 30, 2012 at 4:30 PM, Amandeep Khurana ama...@gmail.com wrote: Also, you might have read that an initial loading of data can be better distributed across the cluster if the table is pre-split rather than starting with a single region and splitting (possibly aggressively, depending on the throughput) as the data loads in. Once you are in a stable state with regions distributed across the cluster, there is really no benefit in terms of spreading load by managing splitting manually vs. letting HBase do it for you. At that point it's about what Ian mentioned - predictability of latencies by avoiding splits happening at a busy time. On Thu, Aug 30, 2012 at 4:26 PM, Ian Varley ivar...@salesforce.com wrote: The Facebook devs have mentioned in public talks that they pre-split their tables and don't use automated region splitting. But as far as I remember, the reason for that isn't predictability of spreading load, so much as predictability of uptime and latency (they don't want an automated split to happen at a random busy time). Maybe that's what you mean, Mohit? Ian On Aug 30, 2012, at 5:45 PM, Stack wrote: On Thu, Aug 30, 2012 at 7:35 AM, Mohit Anchlia mohitanch...@gmail.com wrote: From what I've read it's advisable to do manual splits since you are able to spread the load in a more predictable way. If I am missing something please let me know. Where did you read that? St.Ack
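As an illustration of pre-splitting (not taken from the thread), a table can be created with explicit split points through the admin API; the table name, column family, and split keys below are invented for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("mytable");  // hypothetical table
    desc.addFamily(new HColumnDescriptor("cf"));              // hypothetical family

    // With md5-hashed (hex) row keys, evenly spaced hex prefixes make reasonable
    // split points, so the initial write load lands on several regions at once.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("20000000"), Bytes.toBytes("40000000"),
        Bytes.toBytes("60000000"), Bytes.toBytes("80000000"),
        Bytes.toBytes("a0000000"), Bytes.toBytes("c0000000"),
        Bytes.toBytes("e0000000")
    };
    admin.createTable(desc, splits);  // table starts life with 8 regions
    admin.close();
  }
}

Whether fixed split points like these are appropriate depends on the key distribution; for md5-prefixed keys they are roughly uniform, which is the case this thread is about.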
Re: Hbase Bulk Load Java Sample Code
Hi there, in addition there is a fair amount of documentation about bulk loads and importtsv in the Hbase RefGuide. http://hbase.apache.org/book.html#importtsv On 8/27/12 9:34 AM, Ioakim Perros imper...@gmail.com wrote: On 08/27/2012 04:18 PM, o brbrs wrote: Hi, I'm new at hbase and I want to do a bulk load from hdfs to hbase with java. Is there any sample code which uses the importtsv and completebulkload libraries from java? Thanks. Hi, Here is a sample configuration of a bulk loading job consisting only of map tasks:

Configuration config = HBaseConfiguration.create();
config.set(TableOutputFormat.OUTPUT_TABLE, tableNAME);
Path inputPath = new Path(inputStringPath);

Job job = new Job(config, "Sample job");
job.setMapOutputKeyClass(mapperKey);
job.setMapOutputValueClass(mapperValue);
FileInputFormat.setInputPaths(job, inputPath);
job.setInputFormatClass(inputFormat);
// directory on HDFS where HFiles will be placed before bulk loading
FileOutputFormat.setOutputPath(job, new Path(HFileoutputPath));
job.setOutputFormatClass(HFileOutputFormat.class);
job.setJarByClass(caller);
job.setMapperClass(mapper);

// tableNAME is a String naming a table which must already exist in HBase
HTable hTable = new HTable(config, tableNAME);
// check the API for the complete functionality of this method
HFileOutputFormat.configureIncrementalLoad(job, hTable);
job.waitForCompletion(true);

// after the job's completion, we have to load the HFiles
// into the specified HBase table
LoadIncrementalHFiles lihf = new LoadIncrementalHFiles(config);
lihf.doBulkLoad(new Path(HFileoutputPath), hTable);

Create a map task which produces key/value pairs just as you expect them to exist in your HBase table (e.g., key: ImmutableBytesWritable, value: Put) and you're done. Regards, Ioakim
Re: Pig, HBaseStorage, HBase, JRuby and Sinatra
I think somewhere in here in the RefGuide would work… http://hbase.apache.org/book.html#other.info.sites On 8/27/12 1:20 PM, Stack st...@duboce.net wrote: On Mon, Aug 27, 2012 at 6:32 AM, Russell Jurney russell.jur...@gmail.com wrote: I wrote a tutorial around HBase, JRuby and Pig that I thought would be of interest to the HBase users list: http://hortonworks.com/blog/pig-as-hadoop-connector-part-two-hbase-jruby-and-sinatra/ Thanks Russell. Should we add a link in the refguide? Where would you put it (and I'll do the edit). St.Ack
Re: how does a client locate a region/tablet?
For further information about the catalog tables and region-regionserver assignment, see this… http://hbase.apache.org/book.html#arch.catalog On 8/19/12 7:36 AM, Lin Ma lin...@gmail.com wrote: Thank you Stack, especially for the smart six-round-trip guess for the puzzle. :-) 1. Yeah, the client caches locations, not the data. -- does it mean that each client will cache all location information of an HBase cluster, i.e. which physical server owns which region? Supposing each region has 128M bytes, for a big cluster (petabyte level), total data size / 128M is not a trivial number; I'm not sure if there is any overhead for the client? 2. I'm a bit confused by what you mean by not the data? The client-cached location information should be the data in the META table, which is the region / physical server mapping data. Why do you say not the data (do you mean the real content of each region)? regards, Lin On Sun, Aug 19, 2012 at 12:40 PM, Stack st...@duboce.net wrote: On Sat, Aug 18, 2012 at 2:13 AM, Lin Ma lin...@gmail.com wrote: Hello guys, I am referencing the Bigtable paper about how a client locates a tablet. In section 5.1 Tablet location, it is mentioned that the client will cache all tablet locations; I think it means the client will cache the root tablet of the METADATA table, and all other tablets of the METADATA table (which means the client caches the whole METADATA table?). My question is whether HBase is implemented in the same or a similar way. My concern or confusion is, supposing each tablet or region file is 128M bytes, it will be a very large amount of space (i.e. memory footprint) for each client to cache all tablets or region files of the METADATA table. Is it doable or feasible in real HBase clusters? Thanks. Yeah, the client caches locations, not the data. BTW: another confusion of mine is that in the Bigtable paper, section 5.1 Tablet location, it is mentioned that If the client's cache is stale, the location algorithm could take up to six round-trips, because stale cache entries are only discovered upon misses (assuming that METADATA tablets do not move very frequently). I do not know how the six round trips are counted; if anyone could answer this puzzle, it would be great. :-) I'm not sure what the 6 is about either. Here is a guesstimate: 1. Go to the cached location of the server for a particular user region, but the server says it does not have that region; the client's location is stale. 2. Go back to the client's cached meta region that holds the user region w/ the row we want, but its location is stale too. 3. Go to the root location to find the new location of meta, but the root has moved; what the client has is stale. 4. Find the new root location and do a lookup of the meta region location. 5. Go to the meta region location to find the new user region location. 6. Go to the server w/ the user region. St.Ack
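As a small, hedged illustration of the lookup being discussed (not from the thread), the HTable API of that era lets you ask for a row's region location explicitly; the table and row names below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class LocateRegion {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");   // hypothetical table name

    // The first lookup walks the catalog (-ROOT-/.META. in that era) and caches
    // the region's location; later requests for rows in the same region are
    // answered from the client-side cache until a miss reveals it is stale.
    HRegionLocation loc = table.getRegionLocation(Bytes.toBytes("some-row"));
    System.out.println("region location: " + loc);

    table.close();
  }
}

The cache holds only region boundaries and server names, not region contents, which is why it stays small even for large tables.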
Re: Hbase Schema
re: Q2 Yes, you can have the same CF name in different tables. Column Family names are embedded in each KeyValue; see http://hbase.apache.org/book.html#regions.arch for more detail. re: Q3 It depends on what you need. A common pattern is using composite keys where the lead portion represents some natural grouping of data (e.g., a userid) but is also hashed to provide distribution across the cluster (a sketch follows below). re: Q4 Read the RefGuide! http://hbase.apache.org/book.html On 7/11/12 3:16 PM, grashmi13 rashmi.maheshw...@rsystems.com wrote: Hi, In RDBMS we have multiple DB schemas / Oracle user instances. Similarly, can we have multiple db schemas in hbase? If yes, can we have multiple schemas on one hadoop-hbase cluster? If multiple schemas are possible, how can we define them? Using configuration or programmatically? Q2: Can we have the same column family name in multiple tables? If yes, does it impact performance to have the same column family name in multiple tables? Q3: Sequential keys improve read performance and random keys improve write performance. Which way should one go? Q4: What are best practices to improve hadoop+hbase performance? Q5: When one program is deleting a table while another program is accessing a row of that table, what would the impact be? Can we have some sort of lock while reading or while deleting a table? Q6: As everything in the application is in byte form, what would happen if the hbase db and the application are using different character sets? Can we sync both to a particular character set by configuration or programmatically?
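A minimal sketch of the Q3 composite-key pattern, assuming a userid grouping plus a short MD5 prefix for distribution; the key layout and names are illustrative, not a prescription:

import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

public class CompositeKey {

  // Row key layout: md5(userId)[0..3] + userId + "_" + timestamp.
  // The short hash prefix spreads users across regions, while all rows for one
  // user stay adjacent and can still be read with a single range scan.
  public static byte[] buildRowKey(String userId, long timestamp) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] hashPrefix = Bytes.head(md5.digest(Bytes.toBytes(userId)), 4);
    return Bytes.add(hashPrefix, Bytes.toBytes(userId + "_" + timestamp));
  }

  public static void main(String[] args) throws Exception {
    System.out.println(Bytes.toStringBinary(buildRowKey("user42", 1341961200000L)));
  }
}

The trade-off is that a purely time-ordered scan across all users is no longer a single contiguous range; you would have to visit every hash bucket.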
Re: Stargate: ScannerModel
One other thing… re: I tried using rowFilter but it is quite slow. If you didn't use startRow/stopRow for the Scan, you will be filtering all the rows in the table (albeit on the RegionServer, but still… all the rows). On 6/28/12 4:56 AM, N Keywal nkey...@gmail.com wrote: (moving this to the user mailing list, with the dev one in bcc) From what you said it should be customerid_MIN_TX_ID to customerid_MAX_TX_ID. But only if the customerid size is constant. Note that with this rowkey design there will be very few regions involved, so it's unlikely to be parallelized. N. On Thu, Jun 28, 2012 at 7:43 AM, sameer sameer_therat...@infosys.com wrote: Hello, I want to know what the parameters for scan.setStartRow and scan.setStopRow are. My requirement is that I have a table with the key as customerid_transactionId. I want to scan all the rows whose key contains the customer id that I have. I tried using RowFilter but it is quite slow. If I am using the scan's setStartRow and setStopRow, then what would I give as parameters? Thanks, Sameer
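To make the startRow/stopRow idea concrete, here is a minimal sketch assuming string keys of the form customerId + "_" + transactionId; the table name and customer id are made up for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class CustomerScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "transactions");   // hypothetical table name
    String customerId = "cust0042";                    // hypothetical customer id

    // Keys look like "<customerId>_<txId>". Start at the prefix and stop at the
    // prefix with '_' bumped to the next character ('`'), which gives an exclusive
    // upper bound covering exactly the keys that begin with "<customerId>_".
    Scan scan = new Scan(Bytes.toBytes(customerId + "_"),
                         Bytes.toBytes(customerId + "`"));
    scan.setCaching(100);

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()));
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}

Because the scan is bounded to one key prefix, only the region(s) holding that range are touched, which is exactly why it beats a RowFilter over the whole table.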