Re: Scan vs map-reduce

2014-04-14 Thread Doug Meil

re:  "my first version is using 20,000 Get"

Just throwing this out there, but have you looked at multi-get?  Multi-get
will group the gets by RegionServer internally.
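
As a point of reference, a minimal sketch of that multi-get on the 0.94-era
client API (the table name and row-key list are placeholders):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

public class MultiGetSketch {
  public static Result[] fetch(List<byte[]> rowKeys) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "myTable");  // placeholder table
    try {
      List<Get> gets = new ArrayList<Get>(rowKeys.size());
      for (byte[] rowKey : rowKeys) {
        gets.add(new Get(rowKey));
      }
      return table.get(gets);  // grouped by RegionServer internally
    } finally {
      table.close();
    }
  }
}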

You are doing a lot of IO for a web-app so this is going to be tough to
make "fast", but there are ways to make it "faster."

But since you only have 1,000,000 rows you might not have many regions, so
this might wind up all going on the same RegionServer.




On 4/14/14, 7:52 AM, Li Li fancye...@gmail.com wrote:

I need to get about 20,000 rows from the table. The table has about
1,000,000 rows.
My first version used 20,000 Gets and I found it very slow, so I
modified it to a scan and filter unrelated rows in the client.
Maybe I should write a coprocessor. BTW, is there any filter available
for me? Something like a SQL statement where rowkey in ('abc', 'abd',
...), i.e. a very long IN list.
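
For reference, the 0.94 client does not appear to ship an exact IN(...)
filter, but a FilterList in MUST_PASS_ONE mode over per-key RowFilters
approximates one. A hedged sketch with placeholder row keys (the multi-get
suggested above is usually the simpler route):

import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.RowFilter;

public class InStyleFilterSketch {
  public static Scan buildScan(List<byte[]> wantedRowKeys) {
    // OR together one RowFilter per wanted row key, like "rowkey IN (...)".
    FilterList orList = new FilterList(FilterList.Operator.MUST_PASS_ONE);
    for (byte[] rowKey : wantedRowKeys) {
      orList.addFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
          new BinaryComparator(rowKey)));
    }
    Scan scan = new Scan();
    scan.setFilter(orList);
    return scan;
  }
}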

On Mon, Apr 14, 2014 at 7:46 PM, Jean-Marc Spaggiari
jean-m...@spaggiari.org wrote:
 Hi Li Li,

 If you have more than one region, MR might be useful. MR will scan all the
 regions in parallel. If you do a full scan from the client API with no
 parallelism, then the MR job might be faster. But it will take more
 resources on the cluster and might impact the SLAs of the other clients,
 if any.

 JM


 2014-04-14 2:42 GMT-04:00 Mohammad Tariq donta...@gmail.com:

 Well, it depends. Could you please provide some more details? It will
 help us in giving a proper answer.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Mon, Apr 14, 2014 at 11:38 AM, Li Li fancye...@gmail.com wrote:

  I have a full table scan which costs about 10 minutes. It seems to be a
  bottleneck for our application. If I use map-reduce to rewrite it, will
  it be faster?
 




Re: How to generate a large dataset quickly.

2014-04-14 Thread Doug Meil

 
re:  "So, I execute 3.2Mill of Put's in HBase."

There will be 3.2 million Puts, but they won't be sent over one at a time if
autoFlush on HTable is false.  By default, HTable should be using a 2 MB
write buffer, and it groups the Puts by RegionServer.
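
A hedged sketch of that client-side buffering on the 0.94-era HTable API
(the table name and Put list are placeholders):

import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

public class BufferedPutSketch {
  public static void write(List<Put> puts) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "myTable");  // placeholder
    try {
      table.setAutoFlush(false);                  // buffer Puts on the client
      table.setWriteBufferSize(2 * 1024 * 1024);  // 2 MB, the default buffer size
      for (Put put : puts) {
        table.put(put);  // queued locally, flushed in batches grouped by RS
      }
    } finally {
      table.flushCommits();  // push whatever is left in the buffer
      table.close();
    }
  }
}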






On 4/14/14, 2:21 PM, Guillermo Ortiz konstt2...@gmail.com wrote:

Is there some benchmark about how long it could take to insert data into
HBase, to have a reference?
The output of my Mapper has 3.2 million records, so I execute 3.2 million
Puts in HBase.

Well, the data has to be copied and sent to the reducers, but with a 1Gb
network it shouldn't take too much time. I'll check Ganglia.


2014-04-14 18:16 GMT+02:00 Ted Yu yuzhih...@gmail.com:

 I looked at revision history for HFileOutputFormat.java
 There was one patch, HBASE-8949, which went into 0.94.11 but it
shouldn't
 affect throughput much.

 If you can use Ganglia (or some similar tool) to pinpoint what caused the
 low ingest rate, that would give us more clues.

 BTW, is upgrading to a newer release, such as 0.98.1 (which contains
 HBASE-8755), an option for you?

 Cheers


 On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz konstt2...@gmail.com
 wrote:

  I'm using 0.94.6-cdh4.4.0.
 
  I use the bulkload:
  FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
  FileOutputFormat.setOutputPath(job, hbasePath);
  HTable table = new HTable(jConf, HBASE_TABLE);
  HFileOutputFormat.configureIncrementalLoad(job, table);
 
  It seems to take a really long time when it starts to execute the Puts
  to HBase in the reduce phase.
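
For reference, a hedged sketch of the load step that typically follows a
configureIncrementalLoad job, reusing the hbasePath and table variables from
the snippet above (0.94-era API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadSketch {
  // After the job finishes writing HFiles under hbasePath, move them into
  // the table's regions (the same thing the completebulkload tool does).
  public static void load(Configuration conf, Path hbasePath, HTable table)
      throws Exception {
    new LoadIncrementalHFiles(conf).doBulkLoad(hbasePath, table);
  }
}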
 
 
 
  2014-04-14 14:35 GMT+02:00 Ted Yu yuzhih...@gmail.com:
 
   Which hbase release did you run mapreduce job ?
  
   Cheers
  
   On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz konstt2...@gmail.com
  wrote:
  
I want to create a large dataset for HBase with different versions and
numbers of rows. It's about 10M rows and 100 versions, to do some
benchmarks.

What's the fastest way to create it? I'm generating the dataset with a
MapReduce job of 100,000 rows and 10 versions. It takes 17 minutes and the
size is around 7Gb. I don't know if I could do it more quickly. The
bottleneck is when the mappers write the output and when the output is
transferred to the reducers.
  
 




Re: HFile size writeup in HBase Blog

2014-04-12 Thread Doug Meil

Thanks Ted!

I can add that to the to-do list.  Also have plans for read/write
performance numbers too in a follow-up blog.






On 4/11/14, 6:00 PM, Ted Yu yuzhih...@gmail.com wrote:

Nice writeup, Doug.

Do you have plan to profile Prefix Tree data block encoding ?

Cheers


On Fri, Apr 11, 2014 at 3:14 PM, Doug Meil
doug.m...@explorysmedical.comwrote:

 Hey folks,

 Stack published a writeup I did on the HBase blog on the effects of
rowkey
 size, column-name size, CF compression, data block encoding and KV
storage
 approach on HFile size.  For example, had large row keys vs. small row
 keys, used Snappy vs. LZO vs. etc., used prefix vs. fast-diff, used a KV
 per column vs. a single KV per row.  We tried 'em all... and wrote it
up.

 http://blogs.apache.org/hbase/


 Doug Meil
 Chief Software Architect, Explorys
 doug.m...@explorysmedical.com






HFile size writeup in HBase Blog

2014-04-11 Thread Doug Meil
Hey folks,

Stack published a writeup I did on the HBase blog on the effects of rowkey 
size, column-name size, CF compression, data block encoding and KV storage 
approach on HFile size.  For example, had large row keys vs. small row keys, 
used Snappy vs. LZO vs. etc., used prefix vs. fast-diff, used a KV per column 
vs. a single KV per row.  We tried 'em all... and wrote it up.

http://blogs.apache.org/hbase/


Doug Meil
Chief Software Architect, Explorys
doug.m...@explorysmedical.com




Re: How to get Last access time of a record

2014-02-26 Thread Doug Meil

Hi there,

On top of what Vladimir already said…

re:  "Table1: 80 m records say Author, Table2 : 5k records say Category"

Just 80 million records?  HBase tends to be overkill for relatively low
data volumes.

But if you wish to proceed down this path, to extend what was already said,
rather than thinking of it in terms of an RDBMS two-table design, create a
pre-joined table that has data from both tables as the query target.


As for the LRU cache, "premature optimization is the root of all evil".
:-)

Best of luck!


On 2/24/14, 4:38 PM, Vikram Singh Chandel vikramsinghchan...@gmail.com
wrote:

Hi Vladimir,
We are planning to have around 40 GB for L1 and 150 GB for L2, and when this
size is breached we have to start cleaning L1 and L2.
For this cleaning (deletion of records) I need the LRU info at the record
level, i.e. delete all records which have not been used in the past 15 days
or more.
We will save this LRU info in a Metric column family.

What we thought of was using a Post Get Observer to write the value to the
Last Read column of the Metric column family.
We will later use this info for deletion of records.

Is there any other, simpler way? As you said, the block cache is at the table
level (if I am correct), but we need the info at the record level.

Thanks


On Tue, Feb 25, 2014 at 1:42 AM, Vladimir Rodionov
vrodio...@carrieriq.comwrote:

 I recommend you work a little bit more on design.
 NoSQL in general and HBase in particular are not very good at joining
 tables, but very good at point and range queries.

 Sure, you can do some optimizations in your current approach: create the
 CACHE table as IN_MEMORY, set a TTL of say 1 day (or less, depending
 on the data volume you are able to store), and utilize HBase's internal
 block cache (which is LRU) for that table.

 Best regards,
 Vladimir Rodionov
 Principal Platform Engineer
 Carrier IQ, www.carrieriq.com
 e-mail: vrodio...@carrieriq.com

 
 From: Vikram Singh Chandel [vikramsinghchan...@gmail.com]
 Sent: Monday, February 24, 2014 11:38 AM
 To: user@hbase.apache.org
 Subject: Re: How to get Last access time of a record

 Hi Vladimir,

 We are going to implement a cache in HBase; let me give you an example.

 We have two tables:
 Table1: 80m records, say Author
 Table2: 5k records, say Category
 Query: get details of all publications by Author XYZ, broken down by
 Category.

 We fire a Get on Table 1 to get a list of publication ids (hashed).
 Then we do a scan on Table 2 to get the list of publications for each
 category, and then we do an intersection
 of both lists, and in the end we get the details from the publication table.

 Now suppose the same query comes again: instead of doing all this computation
 again, we are going to save the intersected results
 in a table we are calling the L2 Cache (there's an L1 also).

 Hope that gives you an idea of what we are trying to achieve.
 Now if you can help, please do.





 On Tue, Feb 25, 2014 at 12:20 AM, Vladimir Rodionov 
 vrodio...@carrieriq.com
  wrote:

  Interesting. You want to use HBase as a cache. What data are you going to
  cache? Is it some kind of cold storage
  on tapes or Blu-Ray disks? Just curious.
 
  Best regards,
  Vladimir Rodionov
  Principal Platform Engineer
  Carrier IQ, www.carrieriq.com
  e-mail: vrodio...@carrieriq.com
 
  
  From: Vikram Singh Chandel [vikramsinghchan...@gmail.com]
  Sent: Monday, February 24, 2014 4:25 AM
  To: user@hbase.apache.org
  Subject: Re: How to get Last access time of a record
 
  Hi
  HBase provides a cache on unprocessed data; we are implementing a second
  level of caching on processed data,
  e.g. on intersected data between two tables, or on post-processed data.
 
 
  On Mon, Feb 24, 2014 at 5:02 PM, haosdent haosd...@gmail.com wrote:
 
    HBase already maintains a cache.
  
   we can get last accessed time for a record
  
   I think you could get this from your application level.
  
  
   On Mon, Feb 24, 2014 at 7:21 PM, Vikram Singh Chandel 
   vikramsinghchan...@gmail.com wrote:
  
Hi
   
 We are planning to implement a caching mechanism for our HBase data model;
 for that we have to remove the *LRU (least recently used) records* from
 the cached table.

 Is there any way by which we can get the last accessed time for a record?
 The access will primarily be
 via *Range Scan and Get*.
   
--
*Regards*
   
*VIKRAM SINGH CHANDEL*
   
Please do not print this email unless it is absolutely
  necessary,Reduce.
Reuse. Recycle. Save our planet.
   
  
  
  
   --
   Best Regards,
   Haosdent Huang
  
 
 
 
  --
  *Regards*
 
  *VIKRAM SINGH CHANDEL*
 
  Please do not print this email unless it is absolutely
necessary,Reduce.
  Reuse. Recycle. Save our planet.
 

Re: Question on efficient, ordered composite keys

2014-01-14 Thread Doug Meil

Hey there,

re:  efficient, correctly ordered, byte[] serialized composite row keys?

I was the guy behind 7221, and that patch had the first part and the last
part, but not the middle part (correctly ordered), because the patch
relied on the HBase built-in implementations, which have the aforementioned
order issue.

James already threw out a good option, but you could also take the 7221
patch and use it yourself and change the conversions to use Orderly or
something that has the type conversions that are suitable for your
purposes.

Once HBase fixes the type conversion issue, some form of built-in utility
for creating composite keys is critical because building composite keys is
one of the most asked questions on the dist-list (what 7221 was attempting
to address).





on 1/14/14 4:01 PM, James Taylor jtay...@salesforce.com wrote:

Hi Henning,
My favorite implementation of efficient composite row keys is Phoenix. We
support composite row keys whose byte representation sorts according to
the
natural sort order of the values (inspired by Lily). You can use our type
system independent of querying/inserting data with Phoenix, the advantage
being that when you want to support adhoc querying through SQL using
Phoenix, it'll just work.

Thanks,
James


On Tue, Jan 14, 2014 at 7:02 AM, Ted Yu yuzhih...@gmail.com wrote:

 Please take a look at HBASE-8089 which is an umbrella JIRA.
 Some of its subtasks are in 0.96

 bq. claiming that short keys (as well as short column names) are
relevant
 bq. Is that also true in 0.94.x?

 That is true in 0.94.x

 Cheers


 On Tue, Jan 14, 2014 at 6:56 AM, Henning Blohm henning.bl...@zfabrik.de
 wrote:

  Hi,
 
  for an application still running on HBase 0.90.4 (but moving to 0.94.6) we
  are thinking about using more efficient composite row keys compared to what
  we use today (fixed-length strings with a / separator).
 
  I ran into http://hbase.apache.org/book/rowkey.design.html claiming
that
  short keys (as well as short column names) are relevant also when
using
  compression (as there is no compression in caches/indices). Is that
also
  true in 0.94.x?
 
  If so, is there some support for efficient, correctly ordered, byte[]
  serialized composite row keys? I ran into HBASE-7221 
  https://issues.apache.org/jira/browse/HBASE-7221 and HBASE-7692.
 
  For some time it seemed Orderly (https://github.com/ndimiduk/orderly)
 was
  suggested but then abandoned again in favor of ... nothing really.
 
  So, in short, do you have any favorite / suggested implementation?
 
  Thanks,
  Henning
 




Re: hbase read performance tuning failed

2014-01-07 Thread Doug Meil

In addition to what Lars just said about the blocksize, this is a similar
question to another one that somebody asked, and it's always good to make
sure that you understand where your data is. As a sanity check, make sure
it's not all on one or two RSs (look at the hbase web pages or with tools
like Hannibal).


Also, you definitely want to turn HBase checksumming on - and when you
do so you'll need to re-create the HFiles (i.e., you can't just change the
config and bounce the HBase cluster).  That's a significant reduction in
I/O.

Likewise, if you are doing a full-scan, make sure that you select only the
attributes you need...

See this for more:  http://hbase.apache.org/book.html#perf.reading
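
A hedged sketch of that attribute selection (plus scanner caching); the
table, column family and qualifier names are placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class NarrowScanSketch {
  public static void scan() throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "myTable");  // placeholder
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr1"));  // only what you need
    scan.setCaching(500);        // more rows per RPC than the default of 1
    scan.setCacheBlocks(false);  // don't churn the block cache on a full scan
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // process the row
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}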





On 1/7/14 1:24 PM, lars hofhansl la...@apache.org wrote:

If increasing hbase.client.scanner.caching makes no difference you have
another issue.
How many rows do you expect your scan to return?

On contemporary hardware I manage to scan a few million KeyValues (i.e.
columns) per second and per CPU core.
Note that for scan performance you want to increase the BLOCKSIZE.


-- Lars




 From: LEI Xiaofeng le...@ihep.ac.cn
To: user@hbase.apache.org
Sent: Monday, January 6, 2014 11:06 PM
Subject: hbase read performance tuning failed
 

Hi,
I am running hbase-0.94.6-cdh4.5.0 and have set up a cluster of 5 nodes. The
random read performance is OK, but the scan performance is poor.
I tried to increase hbase.client.scanner.caching to 100 to improve the
scan performance but it made no difference. And when I tried to make
smaller blocks by setting BLOCKSIZE when creating tables, to get better
random read performance, it made no difference either.
So, I am wondering if anyone could give some advice to solve this problem.



Thanks



Re: Hbase Performance Issue

2014-01-06 Thread Doug Meil

In addition to what everybody else said, look at *where* the regions are
for the target table.  There may be 5 regions (for example), but look to
see if they are all on the same RS.





On 1/6/14 5:45 AM, Nicolas Liochon nkey...@gmail.com wrote:

It's very strange that you don't see a perf improvement when you increase
the number of nodes.
Did nothing you've done change the performance in the end?

You may want to check:
 - the number of regions for this table. Are all the region servers busy?
Do you have splits on the table?
 - How much data you actually write. Is compression enabled on this
table?
 - Do you have compactions? You may want to change the max store file
settings for infrequent write loads (see
http://gbif.blogspot.fr/2012/07/optimizing-writes-in-hbase.html).

It would be interesting to test as well the 0.96 release.



On Sun, Jan 5, 2014 at 2:12 AM, Vladimir Rodionov
vrodio...@carrieriq.comwrote:


 I think in this case, writing data to HDFS or HFile directly (for
 subsequent bulk loading)
 is the best option. HBase will never compete in write speed with HDFS.

 Best regards,
 Vladimir Rodionov
 Principal Platform Engineer
 Carrier IQ, www.carrieriq.com
 e-mail: vrodio...@carrieriq.com

 
 From: Ted Yu [yuzhih...@gmail.com]
 Sent: Saturday, January 04, 2014 2:33 PM
 To: user@hbase.apache.org
 Subject: Re: Hbase Performance Issue

 There're 8 items under:
 http://hbase.apache.org/book.html#perf.writing

 I guess you have through all of them :-)


 On Sat, Jan 4, 2014 at 1:34 PM, Akhtar Muhammad Din
 akhtar.m...@gmail.comwrote:

   Thanks guys for your precious time.
   Vladimir, as Ted rightly said, I want to improve write performance
   currently (of course I want to read data as fast as possible later on).
   Kevin, my current understanding of bulk load is that you generate
   StoreFiles and later load them through a command-line program. I don't
   want to do any manual step. Our system is getting data every 15 minutes,
   so the requirement is to automate it completely through the client API.
 
 





Re: Online/Realtime query with filter and join?

2013-12-02 Thread Doug Meil

You are going to want to figure out a rowkey (or a set of tables with
rowkeys) to restrict the number of I/O's. If you just slap Impala in front
of HBase (or even Phoenix, for that matter) you could write SQL against it,
but if it winds up doing a full scan of an HBase table underneath you
won't get your sub-100ms response time.

Note:  I'm not saying you can't do this with Impala or Phoenix, I'm just
saying start with the rowkeys first so that you limit the I/O.  Then start
adding frameworks as needed (and/or build a schema with Phoenix in the
same rowkey exercise).

Such response-time requirements make me think that this is for application
support, so why the requirement for SQL? Might want to start writing it as
a Java program first.









On 11/29/13 4:32 PM, Mourad K mourad...@gmail.com wrote:

You might want to consider something like Impala or Phoenix, I presume
you are trying to do some report query for dashboard or UI?
MapReduce is certainly not adequate as there is too much latency on
startup. If you want to give this a try, cdh4 and Impala are a good start.

Mouradk

On 29 Nov 2013, at 10:33, Ramon Wang ra...@appannie.com wrote:

 The general performance requirement for each query is less than 100 ms,
 that's the average level. Sounds crazy, but yes we need to find a way
for
 it.
 
 Thanks
 Ramon
 
 
 On Fri, Nov 29, 2013 at 5:01 PM, yonghu yongyong...@gmail.com wrote:
 
  The question is what you mean by real-time. What is your performance
  requirement? In my opinion, I don't think MapReduce is suitable for
  real-time data processing.
 
 
 On Fri, Nov 29, 2013 at 9:55 AM, Azuryy Yu azury...@gmail.com wrote:
 
  you can try Phoenix.
 On 2013-11-29 3:44 PM, Ramon Wang ra...@appannie.com wrote:
 
 Hi Folks
 
  It seems to be impossible, but I still want to check if there is a way
  we can do complex queries on HBase with ORDER BY, JOIN, etc. like we have
  with a normal RDBMS. We are asked to provide such a solution for it; any
  ideas? Thanks for your help.

  BTW, I think maybe Impala from CDH would be a way to go, but I haven't
  had time to check it yet.
 
 Thanks
 Ramon
 



Re: hbase schema design

2013-09-18 Thread Doug Meil

Don't forget to look at this section for hbase schema design examples.

http://hbase.apache.org/book.html#schema.casestudies
 







On 9/17/13 1:52 PM, Adrian CAPDEFIER chivas314...@gmail.com wrote:

Thanks for the tip. In the data warehousing world I used to call them
surrogate keys - I wonder if there's any difference between the two.


On Tue, Sep 17, 2013 at 6:41 PM, Vladimir Rodionov
vrodio...@carrieriq.comwrote:

   Is there built-in functionality to generate (integer) surrogate values
   in HBase that can be used in the rowkey, or does it need to be hand-coded
   from scratch?

  There is no such functionality in HBase. What you are asking for is known
  as dictionary compression:
  a unique 1-1 association between arbitrary strings and numeric values.

 Best regards,
 Vladimir Rodionov
 Principal Platform Engineer
 Carrier IQ, www.carrieriq.com
 e-mail: vrodio...@carrieriq.com

 
 From: Ted Yu [yuzhih...@gmail.com]
 Sent: Tuesday, September 17, 2013 9:53 AM
 To: user@hbase.apache.org
 Subject: Re: hbase schema design

 I guess you were referring to section 6.3.2

 bq. rowkey is stored and/ or read for every cell value

 The above is true.

 bq. the event description is a string of 0.1 to 2Kb

 You can enable Data Block encoding to reduce storage.

 Cheers
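
For reference, a hedged sketch of enabling data block encoding on a column
family at table-creation time (0.94-era API; the table and family names are
placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

public class EncodedTableSketch {
  public static void create() throws Exception {
    HColumnDescriptor cf = new HColumnDescriptor("d");       // placeholder family
    cf.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);    // shrinks repeated keys/qualifiers on disk
    HTableDescriptor desc = new HTableDescriptor("events");  // placeholder table
    desc.addFamily(cf);
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    admin.createTable(desc);
    admin.close();
  }
}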



 On Tue, Sep 17, 2013 at 9:44 AM, Adrian CAPDEFIER
chivas314...@gmail.com
 wrote:

  Howdy all,
 
  I'm trying to use hbase for the first time (plenty of other experience
 with
  RDBMS database though), and I have a couple of questions after reading
 The
  Book.
 
   I am a bit confused by the advice to reduce the row size in the HBase
   book. It states that every cell value is accompanied by its coordinates
   (row, column and timestamp). I'm just trying to be thorough, so am I to
   understand that the rowkey is stored and/or read for every cell value in a
   record, or just once per column family in a record?
 
  I am intrigued by the rows as columns design as described in the book
at
  http://hbase.apache.org/book.html#rowkey.design. To make a long story
  short, I will end up with a table to store event types and number of
  occurrences in each day. I would prefer to have the event description
as
  the row key and the dates when it happened as columns - up to 7300 for
  roughly 20 years.
  However, the event description is a string of 0.1 to 2Kb and if it is
  stored for each cell value, I will need to use a surrogate (shorter)
 value.
 
   Is there built-in functionality to generate (integer) surrogate values
   in HBase that can be used in the rowkey, or does it need to be hand-coded
   from scratch?
 





Re: Scan all the rows of a table with Column Family only.

2013-09-13 Thread Doug Meil

I might be mis-understanding your question, but if you just call addFamily
on the Scan instance then all column qualifiers will be returned in the
scan.

Note:  this does go against one of the performance recommendations (Scan
Attribute Selection) in..

http://hbase.apache.org/book.html#perf.reading

… but if it works for your app, go for it.
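
A minimal sketch of that (the family name is a placeholder):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class FamilyScanSketch {
  public static Scan familyOnlyScan() {
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf"));  // returns every qualifier in "cf"
    return scan;
  }
}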




On 9/11/13 7:37 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:

Hi Pavan,

Have you taken a look at the existing HBase filters?
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/package-sum
mary.html

Maybe FamiliyFilter is what you are looking for?

JM


2013/9/11 Pavan Sudheendra pavan0...@gmail.com

 Hi all,
  How do I scan all the rows of an HBase table with only the column family?

  Column Family -- cf
  Column Qualifier -- \x00\x00\x06T,\x00\x00\x05d,\x00\x00\x00\x00 etc.

  The column qualifiers are random so I won't know them beforehand. Any idea
  of how I can do this with the Java API?

 --
 Regards-
 Pavan




Re: Programming practices for implementing composite row keys

2013-09-05 Thread Doug Meil

Greetings, 

Other food for thought on some case studies on composite rowkey design are
in the refguide:

http://hbase.apache.org/book.html#schema.casestudies






On 9/5/13 12:15 PM, Anoop John anoop.hb...@gmail.com wrote:

Hi,
  Have a look at Phoenix [1].  There you can define a composite RK
model and it handles the negative-number ordering.  The scan model you
mentioned will also be well supported with a start/stop RK on entity1 and
a SkipScanFilter
for the others.

-Anoop-

[1] https://github.com/forcedotcom/phoenix


On Thu, Sep 5, 2013 at 8:58 PM, Shahab Yunus shahab.yu...@gmail.com
wrote:

 Ah! I didn't know about HBASE-8693. Good information. Thanks Ted.

 Regards,
 Shahab


 On Thu, Sep 5, 2013 at 10:53 AM, Ted Yu yuzhih...@gmail.com wrote:

  For #2 and #4, see HBASE-8693 'DataType: provide extensible type API'
 which
  has been integrated to 0.96
 
  Cheers
 
 
  On Thu, Sep 5, 2013 at 7:14 AM, Shahab Yunus shahab.yu...@gmail.com
  wrote:
 
   My 2 cents:
  
    1- Yes, that is one way to do it. You can also use a fixed length for
    every attribute participating in the composite key. HBase scans would be
    more fitting to this pattern as well, I believe (?). It's basically a
    trade-off between space (all that padding increasing the key size) versus
    the complexity involved in choosing and handling a delimiter and the
    consequent parsing of keys, etc.
  
   2- I personally have not heard about this. As far as I understand,
this
   goes against the whole idea of HBase scanning and prefix and fuzzy
  filters
   will not be possible this way. This should not be followed.
  
   3- See replies to 1  2
  
   4- The sorting of the keys, by default, is binary comparator. It is
a
 bit
   tricky as far as I know and the last I checked. Some tips here:
  
  
 
 
http://stackoverflow.com/questions/17248510/hbase-filters-not-working-for
-negative-integers
  
    Can you normalize them (or take an absolute value) before reading and
    writing (of course at the cost of performance), if that is possible,
    i.e. if keys with the same magnitude but a different sign cannot exist,
    nor different entities?
    This depends on your business logic and the type/nature of the data.
  
   Regards,
   Shahab
  
  
   On Thu, Sep 5, 2013 at 10:03 AM, praveenesh kumar 
 praveen...@gmail.com
   wrote:
  
Hello people,
   
I have a scenario which requires creating composite row keys for
my
  hbase
table.
   
Basically it would be entity1,entity2,entity3.
   
 Searches would be based on entity1 and then on entity2 and 3. I know I
 can do a row start/stop scan on entity1 first and then put row filters on
 entity2 and entity3.
   
My question is what are the best programming principles to
implement
   these
keys.
   
1. Just use simple delimiters entity1:entity2:entity3.
   
 2. Create complex datatypes like Java structures. I don't know if anyone
 uses structures as keys, and if they do, can someone please point out for
 which scenarios they would be a good fit. Do they fit well for this
 scenario?

 3. What are the pros and cons of both 1 and 2 when it comes to data
 retrieval?

 4. My entity1 can also be negative. Does that make any special difference
 where HBase ordering is concerned? How can I tackle this scenario?

Any help on how to implement composite row keys would be highly
  helpful.
   I
want to understand how the community deals with implementing
 composite
   row
keys.
   
Regards
Praveenesh
   
  
 




Re: Suggestion need on desinging Flatten table for HBase given scenario

2013-09-05 Thread Doug Meil

Greetings,

The refguide has some case studies on composite rowkey design that might be 
helpful.

http://hbase.apache.org/book.html#schema.casestudies



From: Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com
Reply-To: user@hbase.apache.org
Date: Thursday, September 5, 2013 1:05 AM
To: user@hbase.apache.org
Subject: Suggestion need on desinging Flatten table for HBase given scenario


Dear All,


For the below one-to-many relationship column sets, I require a suggestion on
how to design a flattened HBase table... Kindly refer to the attached image
for the scenario...

Please let me know if my scenario is not clearly explained...

regards,
Rams



Re: HBase - stable versions

2013-09-04 Thread Doug Meil

It's a very good point.  Most people will go to 0.96 when CDH and
Hortonworks support it.






On 9/4/13 2:55 PM, Shahab Yunus shahab.yu...@gmail.com wrote:

This may be a newbie or dumb question, but I believe this does not affect or
apply to HBase distributions by other vendors like Hortonworks or Cloudera.
If someone is using one of the distribution versions provided by them,
then it is up to those vendors (and not the people and community here) what
and for how long they are going to support it.

Regards,
Shahab


On Wed, Sep 4, 2013 at 1:33 PM, James Taylor jtay...@salesforce.com
wrote:

 +1 to what Nicolas said.

 That goes for Phoenix as well. It's open source too. We do plan to port
to
 0.96 when our user community (Salesforce.com, of course, being one of
them)
 demands it.

 Thanks,
 James


 On Wed, Sep 4, 2013 at 10:11 AM, Nicolas Liochon nkey...@gmail.com
 wrote:

   It's open source. My personal point of view is that if someone is willing
   to spend time on the backport, there should be no issue, if the regression
   risk is clearly acceptable and the rolling restart possible. If it's
   necessary (i.e. there is no agreement on the risk level), then we could
   as well go for a 94.12.1 solution. I don't think we need to create this
   branch now: this branch should be created only when and if we cannot find
   an agreement on a specific jira.
 
  Nicolas
 
 
 
  On Wed, Sep 4, 2013 at 6:53 PM, lars hofhansl la...@apache.org
wrote:
 
   I should also explicitly state that we (Salesforce) will stay with
0.94
   for the foreseeable future.
  
    We will continue to backport fixes that we need. If those are not
    acceptable or accepted into the open source 0.94 branch, they will have
    to go into a Salesforce-internal repository.
   I would really like to avoid that (essentially a fork), so I would
 offer
   to start having stable tags, i.e. we keep making changes in 0.94.x,
and
   declare (say) 0.94.12 stable and have 0.94.12.1, etc, releases (much
 like
   what is done in Linux)
  
   We also currently have no resources to port Phoenix over to 0.96
(but
 if
   somebody wanted to step up, that would be greatly appreciated, of
  course).
  
   Thoughts? Comments? Concerns?
  
   -- Lars
  
  
   - Original Message -
   From: lars hofhansl la...@apache.org
   To: hbase-dev d...@hbase.apache.org; hbase-user 
 user@hbase.apache.org
   Cc:
   Sent: Tuesday, September 3, 2013 5:30 PM
   Subject: HBase - stable versions
  
   With 0.96 being imminent we should start a discussion about
continuing
   support for 0.94.
  
   0.92 became stale pretty soon after 0.94 was released.
   The relationship between 0.94 and 0.96 is slightly different,
though:
  
   1. 0.92.x could be upgraded to 0.94.x without downtime
   2. 0.92 clients and servers are mutually compatible with 0.94
clients
 and
   servers
   3. the user facing API stayed backward compatible
  
   None of the above is true when moving from 0.94 to 0.96+.
   Upgrade from 0.94 to 0.96 will require a one-way upgrade process
  including
   downtime, and client and server need to be upgraded in lockstep.
  
   I would like to have an informal poll about who's using 0.94 and is
   planning to continue to use it; and who is planning to upgrade from
 0.94
  to
   0.96.
   Should we officially continue support for 0.94? How long?
  
   Thanks.
  
   -- Lars
  
  
 




Re: Writing data to hbase from reducer

2013-08-28 Thread Doug Meil

A MapReduce job reading your data from HDFS and emitting Puts against
the target table in the Mapper should do it, since it looks like there isn't
any transform happening...

http://hbase.apache.org/book/mapreduce.example.html
 
Likewise, what Harsh said a few days ago.
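
A hedged sketch of that shape, patterned on the book example; the table
name, column family and CSV input format (source,destination,connection)
are assumptions:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HdfsToHBaseMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
  private static final byte[] CF = Bytes.toBytes("cf");   // assumed family

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] parts = line.toString().split(",");          // source,destination,connection
    Put put = new Put(Bytes.toBytes(parts[0].trim()));
    put.add(CF, Bytes.toBytes("destination"), Bytes.toBytes(parts[1].trim()));
    put.add(CF, Bytes.toBytes("connection"), Bytes.toBytes(parts[2].trim()));
    context.write(new ImmutableBytesWritable(put.getRow()), put);
  }
  // Driver side (map-only job writing straight to the table):
  //   TableMapReduceUtil.initTableReducerJob("targetTable", null, job);
  //   job.setNumReduceTasks(0);
}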

On 8/27/13 6:33 PM, Harsh J ha...@cloudera.com wrote:

You can use HBase's MultiTableOutputFormat:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTab
leOutputFormat.html

An example can be found in this blog post:
http://www.wildnove.com/2011/07/19/tutorial-hadoop-and-hbase-multitableout
putformat/






On 8/28/13 12:49 PM, jamal sasha jamalsha...@gmail.com wrote:

Hi,
I have data in the form:

source, destination, connection

This data is saved in HDFS.

I want to read this data and put it in an HBase table, something like:
Column1 (Source)  | Column2 (Destination) | Column3 (Connection Type)
Row: vertex A     | vertex B              | connection

How do I do this?
Thanks



Re: Newbie in hbase Trying to run an example

2013-08-28 Thread Doug Meil

"cf" in this example is a column family, and it needs to exist in the
tables (both input and output) before the job is submitted.
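
For reference, a hedged sketch of creating such a table from the Java API
(the shell equivalent would be along the lines of: create 'summary', 'cf';
the table name here is a placeholder):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTableSketch {
  public static void createIfMissing() throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    try {
      if (!admin.tableExists("summary")) {             // placeholder output table
        HTableDescriptor desc = new HTableDescriptor("summary");
        desc.addFamily(new HColumnDescriptor("cf"));   // family used by the job
        admin.createTable(desc);
      }
    } finally {
      admin.close();
    }
  }
}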





On 8/26/13 3:01 PM, jamal sasha jamalsha...@gmail.com wrote:

Hi,
  I am new to hbase, so few noob questions.

So, I created a table in hbase:
A quick scan gives me the following:
hbase(main):001:0> scan 'test'
ROW                  COLUMN+CELL
 row1                column=cf:word, timestamp=1377298314160, value=foo
 row2                column=cf:word, timestamp=1377298326124, value=bar
 row3                column=cf:word, timestamp=1377298332856, value=bar foo
 row4                column=cf:word, timestamp=1377298347602, value=bar world foo
Now, I want to do the word count and write the result back to another table
in HBase.
So I followed the code given below:
http://hbase.apache.org/book.html#mapreduce
(Snapshot at the end.)
Now, I am getting an error:

java.lang.NullPointerException
at java.lang.String.<init>(String.java:601)
at org.rdf.HBaseExperiment$MyMapper.map(HBaseExperiment.java:42)
at org.rdf.HBaseExperiment$MyMapper.map(HBaseExperiment.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.
java:1093)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

Line 42 points to
*public static final byte[] ATTR1 = "attr1".getBytes();*

Now I think attr1 is a family qualifier.
I am wondering, what exactly is a family qualifier?
Do I need to set something while creating the table, just like I did with cf
when I was creating the table?
Similarly, what do I need to do on the output table as well?
So, what I am saying is: what do I need to do in the hbase shell so that I
can run this word count example?
Thanks





import java.io.IOException;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.co_occurance.Pair;
import org.co_occurance.PairsMethod;
import org.co_occurance.PairsMethod.MeanReducer;
import org.co_occurance.PairsMethod.PairsMapper;

public class HBaseExperiment {
  public static class MyMapper extends TableMapper<Text, IntWritable> {
    public static final byte[] CF = "cf".getBytes();
    *public static final byte[] ATTR1 = "attr1".getBytes();*

    private final IntWritable ONE = new IntWritable(1);
    private Text text = new Text();

    public void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      String val = new String(value.getValue(CF, ATTR1));
      //text.set(val); // we can only emit Writables...
      text.set(value.toString());
      context.write(text, ONE);
    }
  }

  public static class MyTableReducer extends TableReducer<Text, IntWritable,
      ImmutableBytesWritable> {
    public static final byte[] CF = "cf".getBytes();
    public static final byte[] COUNT = "count".getBytes();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int i = 0;
      for (IntWritable val : values) {
        i += val.get();
      }
      Put put = new Put(Bytes.toBytes(key.toString()));
      put.add(CF, COUNT, Bytes.toBytes(i));

      context.write(null, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration config = HBaseConfiguration.create();
    Job job = new Job(config, "ExampleSummary");
    job.setJarByClass(HBaseExperiment.class); // class that contains mapper and reducer

    Scan scan = new Scan();
    scan.setCaching(500);// 1 is the default in Scan, which will be

Re: Kudos for Phoenix

2013-07-11 Thread Doug Meil

You still have to register the view with Phoenix and define which CFs and
columns you are accessing, so this isn't entirely free form...

create view myTable (cf VARCHAR primary key,
                     cf.attr1 VARCHAR,
                     cf.attr2 VARCHAR);

… however, myTable in the above example is the HBase table you created
outside Phoenix, so Phoenix doesn't need to copy any data, etc.

 






On 7/10/13 10:13 PM, Bing Jiang jiangbinglo...@gmail.com wrote:

Hi, Doug.
If you build a view upon tables not managed by Phoenix, can it be applied to
any column family or qualifier?

I want to know your design details.


2013/7/11 Doug Meil doug.m...@explorysmedical.com

 Hi folks,

 I just wanted to give a shout out to the Phoenix framework, and
 specifically for the ability to create view against an HBase table
whose
 schema was not being managed by Phoenix.  That's a really nice feature
and
 I'm not sure how many folks realize this.  I was initially nervous that
 this was only for data created with Phoenix, but that's not the case,
so if
 you're looking for a lightweight framework for SQL-on-HBase I'd check it
 out.  For this particular scenario it's probably better for ad-hoc data
 exploration, but often that's what people are looking to do.

 Doug Meil
 Chief Software Architect, Explorys
 doug.m...@explorysmedical.com





-- 
Bing Jiang
Tel:(86)134-2619-1361
weibo: http://weibo.com/jiangbinglover
BLOG: http://blog.sina.com.cn/jiangbinglover
National Research Center for Intelligent Computing Systems
Institute of Computing technology
Graduate University of Chinese Academy of Science





Re: small hbase doubt

2013-07-11 Thread Doug Meil

Compression only applies to data on disk.  Over the wire (i.e., RS to
client) it is uncompressed.






On 7/11/13 9:24 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:

Hi Alok,

What do you mean by query?

Gets are done based on the key, and Snappy and LZO are used to compress the
values. So only when a row fits your needs will HBase decompress the value
and send it back to you...

Does that answer your question?

JM

2013/7/11 Alok Singh Mahor alokma...@gmail.com

 Hello everyone,
 could anyone answer a small query?

 Does HBase decompress data before executing a query, or does it execute
 queries on compressed data? And how do Snappy and LZO actually behave?

 thanks





Re: Kudos for Phoenix

2013-07-11 Thread Doug Meil

This particular use case is effectively a full scan on the table, but with
server-side filters.  Internally, HBase still has to scan all the data -
there's no magic.




On 7/11/13 9:59 PM, Bing Jiang jiangbinglo...@gmail.com wrote:

Could you give us the test performance numbers, especially using the view of the table?


2013/7/11 Doug Meil doug.m...@explorysmedical.com


 You still have to register the view to phoenix and define which CF's and
 columns you are accessing, so this isn't entirely free form...

 create view
 myTable (cf VARCHAR primary key,
 cf.attr1 VARCHAR, cf.attr2
 VARCHAR);

 … however, myTable in the above example is the HBase table you created
 outside Phoenix, so Phoenix doesn't need to copy any data, etc..








 On 7/10/13 10:13 PM, Bing Jiang jiangbinglo...@gmail.com wrote:

 Hi, Doug.
 If build view upon Phoenix uncontrolled tables, whether it can be used
to
 column family or qualifier?
 
 I want to know your design details.
 
 
 2013/7/11 Doug Meil doug.m...@explorysmedical.com
 
  Hi folks,
 
  I just wanted to give a shout out to the Phoenix framework, and
  specifically for the ability to create view against an HBase table
 whose
  schema was not being managed by Phoenix.  That's a really nice
feature
 and
  I'm not sure how many folks realize this.  I was initially nervous
that
  this was only for data created with Phoenix, but that's not the case,
 so if
  you're looking for a lightweight framework for SQL-on-HBase I'd
check it
  out.  For this particular scenario it's probably better for ad-hoc
data
  exploration, but often that's what people are looking to do.
 
  Doug Meil
  Chief Software Architect, Explorys
  doug.m...@explorysmedical.com
 
 
 
 
 
 --
 Bing Jiang
 Tel:(86)134-2619-1361
 weibo: http://weibo.com/jiangbinglover
 BLOG: http://blog.sina.com.cn/jiangbinglover
 National Research Center for Intelligent Computing Systems
 Institute of Computing technology
 Graduate University of Chinese Academy of Science




-- 
Bing Jiang
Tel:(86)134-2619-1361
weibo: http://weibo.com/jiangbinglover
BLOG: http://blog.sina.com.cn/jiangbinglover
National Research Center for Intelligent Computing Systems
Institute of Computing technology
Graduate University of Chinese Academy of Science




Kudos for Phoenix

2013-07-10 Thread Doug Meil
Hi folks,

I just wanted to give a shout out to the Phoenix framework, and specifically 
for the ability to create view against an HBase table whose schema was not 
being managed by Phoenix.  That's a really nice feature and I'm not sure how 
many folks realize this.  I was initially nervous that this was only for data 
created with Phoenix, but that's not the case, so if you're looking for a 
lightweight framework for SQL-on-HBase I'd check it out.  For this particular 
scenario it's probably better for ad-hoc data exploration, but often that's 
what people are looking to do.

Doug Meil
Chief Software Architect, Explorys
doug.m...@explorysmedical.com




Re: RefGuide schema design examples

2013-04-21 Thread Doug Meil

Thanks everybody, much appreciated!





On 4/20/13 5:40 AM, varun kumar varun@gmail.com wrote:

+1


On Sat, Apr 20, 2013 at 1:23 PM, Ravindranath Akila 
ravindranathak...@gmail.com wrote:

 +1

 R. A.
 On 20 Apr 2013 12:07, Viral Bajaria viral.baja...@gmail.com wrote:

  +1!
 
 
  On Fri, Apr 19, 2013 at 4:09 PM, Marcos Luis Ortiz Valmaseda 
  marcosluis2...@gmail.com wrote:
 
   Wow, great work, Doug.
  
  
   2013/4/19 Doug Meil doug.m...@explorysmedical.com
  
Hi folks,
   
I reorganized the Schema Design case studies 2 weeks ago and
  consolidated
them into here, plus added several cases common on the dist-list.
   
http://hbase.apache.org/book.html#schema.casestudies
   
Comments/suggestions welcome.  Thanks!
   
   
Doug Meil
Chief Software Architect, Explorys
doug.m...@explorysmedical.com
   
   
   
  
  
   --
   Marcos Ortiz Valmaseda,
   *Data-Driven Product Manager* at PDVSA
   *Blog*: http://dataddict.wordpress.com/
   *LinkedIn: *http://www.linkedin.com/in/marcosluis2186
   *Twitter*: @marcosluis2186 http://twitter.com/marcosluis2186
  
 




-- 
Regards,
Varun Kumar.P





RefGuide schema design examples

2013-04-19 Thread Doug Meil
Hi folks,

I reorganized the Schema Design case studies 2 weeks ago and consolidated them 
into here, plus added several cases common on the dist-list.

http://hbase.apache.org/book.html#schema.casestudies

Comments/suggestions welcome.  Thanks!


Doug Meil
Chief Software Architect, Explorys
doug.m...@explorysmedical.com




Re: Re: HBase random read performance

2013-04-15 Thread Doug Meil

Hi there, regarding this...

 We are passing 10,000 random row-keys as input, while HBase is taking
 around 17 secs to return 10,000 records.


….  Given that you are generating 10,000 random keys, your multi-get is
very likely hitting all 5 nodes of your cluster.


Historically, multi-Get used to first sort the requests by RS and then
*serially* go to each RS to process the multi-Get.  I'm not sure whether the
current (0.94.x) behavior multi-threads or not.

One thing you might want to consider is confirming that client behavior,
and if it's not multi-threading then perform a test that does the same RS
sorting via...

http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#
getRegionLocation%28byte[]%29

…. and then spin up your own threads (one per target RS) and see what
happens.
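
A hedged sketch of that idea (placeholder table name; one HTable per thread
because HTable is not thread-safe):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

public class PerServerMultiGetSketch {
  public static List<Future<Result[]>> fetch(List<byte[]> rowKeys) throws Exception {
    final Configuration conf = HBaseConfiguration.create();
    HTable locator = new HTable(conf, "myTable");            // placeholder table
    Map<String, List<Get>> byServer = new HashMap<String, List<Get>>();
    for (byte[] rowKey : rowKeys) {
      HRegionLocation loc = locator.getRegionLocation(rowKey);
      String server = loc.getHostname() + ":" + loc.getPort();
      List<Get> bucket = byServer.get(server);
      if (bucket == null) {
        bucket = new ArrayList<Get>();
        byServer.put(server, bucket);
      }
      bucket.add(new Get(rowKey));
    }
    locator.close();
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, byServer.size()));
    List<Future<Result[]>> futures = new ArrayList<Future<Result[]>>();
    for (final List<Get> bucket : byServer.values()) {
      futures.add(pool.submit(new Callable<Result[]>() {
        public Result[] call() throws Exception {
          HTable table = new HTable(conf, "myTable");        // one instance per thread
          try {
            return table.get(bucket);
          } finally {
            table.close();
          }
        }
      }));
    }
    pool.shutdown();
    return futures;
  }
}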



On 4/15/13 9:04 AM, Ankit Jain ankitjainc...@gmail.com wrote:

Hi Liang,

Thanks for the reply.

Ans 1:
I tried using an HFile block size of 32 KB and the bloom filter is enabled.
The random read performance is 10,000 records in 23 secs.

Ans 2:
We are retrieving all 10,000 rows in one call.

Ans 3:
Disk detail:
Model Number:   ST2000DM001-1CH164
Serial Number:  Z1E276YF

Please suggest some more optimizations

Thanks,
Ankit Jain

On Mon, Apr 15, 2013 at 5:11 PM, 谢良 xieli...@xiaomi.com wrote:

  First, it probably won't help to set the block size to 4KB; please refer to
  the beginning of HFile.java:

  Smaller blocks are good
  * for random access, but require more memory to hold the block index,
and
 may
  * be slower to create (because we must flush the compressor stream at
the
  * conclusion of each data block, which leads to an FS I/O flush).
 Further, due
  * to the internal caching in Compression codec, the smallest possible
 block
  * size would be around 20KB-30KB.

  Second, is it a single-threaded test client or multi-threaded? We can't
  expect too much if the requests are issued one by one.

  Third, could you provide more info about your DN disk counts and I/O
  utilization?

 Thanks,
 Liang
 
  From: Ankit Jain [ankitjainc...@gmail.com]
  Sent: April 15, 2013 18:53
  To: user@hbase.apache.org
  Subject: Re: HBase random read performance

 Hi Anoop,

 Thanks for reply..

  I tried setting the HFile block size to 4KB and also enabled the bloom
  filter (ROW). The maximum read performance that I was able to achieve is
  10,000 records in 14 secs (the record size is 1.6KB).

 Please suggest some tuning..

 Thanks,
 Ankit Jain



 On Mon, Apr 15, 2013 at 4:12 PM, Rishabh Agrawal 
 rishabh.agra...@impetus.co.in wrote:

  Interesting. Can you explain why this happens?
 
  -Original Message-
  From: Anoop Sam John [mailto:anoo...@huawei.com]
  Sent: Monday, April 15, 2013 3:47 PM
  To: user@hbase.apache.org
  Subject: RE: HBase random read performance
 
   Ankit,
    I guess you might be using the default HFile block size,
   which is 64KB.
   For random gets a lower value will be better. Try with something like 8KB
   and check the latency.

   Yes, of course blooms can help (if a major compaction was not done at the
   time of testing).
 
  -Anoop-
  
  From: Ankit Jain [ankitjainc...@gmail.com]
  Sent: Saturday, April 13, 2013 11:01 AM
  To: user@hbase.apache.org
  Subject: HBase random read performance
 
  Hi All,
 
  We are using HBase 0.94.5 and Hadoop 1.0.4.
 
   We have an HBase cluster of 5 nodes (5 regionservers and 1 master node).
   Each regionserver has 8 GB RAM.

   We have loaded 25 million records into the HBase table; the table is
   pre-split into 16 regions and all the regions are equally loaded.

   We are getting very low random read performance while performing multi-get
   from HBase.

   We are passing 10,000 random row-keys as input, while HBase is taking
   around 17 secs to return 10,000 records.
 
  Please suggest some tuning to increase HBase read performance.
 
  Thanks,
  Ankit Jain
  iLabs
 
 
 
  --
  Thanks,
  Ankit Jain
 
  
 
 
 
 
 
 
  NOTE: This message may contain information that is confidential,
  proprietary, privileged or otherwise protected by law. The message is
  intended solely for the named addressee. If received in error, please
  destroy and notify the sender. Any use of this email is prohibited
when
  received in error. Impetus does not represent, warrant and/or
guarantee,
  that the integrity of this communication has been maintained nor that
the
  communication is free of errors, virus, interception or interference.
 



 --
 Thanks,
 Ankit Jain




-- 
Thanks,
Ankit Jain



Re: ANN: HBase Refcard available

2013-04-09 Thread Doug Meil

You beat me to it!   :-)

I just realized that right when I hit enter on my previous email.






On 4/9/13 2:05 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Hi Stack (cleaning your inbox? ;))

Looks like Doug did it a while back -
https://issues.apache.org/jira/browse/HBASE-6574 ?

Otis
--
HBASE Performance Monitoring - http://sematext.com/spm/index.html





On Tue, Apr 9, 2013 at 2:00 PM, Stack st...@duboce.net wrote:
 Make a patch for the reference guide that points to this Otis?  Or just
 tell me where to insert?
 Thanks,
 St.Ack


 On Wed, Aug 8, 2012 at 4:14 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com
 wrote:

 Hi,

 We wrote an HBase Refcard and published it via DZone.  Here is our very
 brief announcement:
 http://blog.sematext.com/2012/08/06/announcing-hbase-refcard/ .  The
PDF
 refcard can be had from http://refcardz.dzone.com/refcardz/hbase .

 Otis
 
 Performance Monitoring for Solr / ElasticSearch / HBase -
 http://sematext.com/spm







Re: schema design: rows vs wide columns

2013-04-08 Thread Doug Meil


For the record, the refGuide mentions potential issues of CF lumpiness
that you mentioned:

http://hbase.apache.org/book.html#number.of.cfs
 

6.2.1. Cardinality of ColumnFamilies

Where multiple ColumnFamilies exist in a single table, be aware of the
cardinality (i.e., number of rows).
  If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion
rows, ColumnFamilyA's data will likely be spread
  across many, many regions (and RegionServers).  This makes mass
scans for ColumnFamilyA less efficient.
  




… anything that needs to be updated/added for this?





On 4/8/13 12:39 AM, lars hofhansl la...@apache.org wrote:

I think the main problem is that all CFs have to be flushed if one gets
large enough to require a flush.
(Does anyone remember why exactly that is? And do we still need that now
that the memstoreTS is stored in the HFiles?)


So things are fine as long as all CFs have roughly the same size. But if
you have one that gets a lot of data and many others that are smaller,
we'd end up with a lot of unnecessary and small store files from the
smaller CFs.

Anything else known that is bad about many column families?


-- Lars




 From: Andrew Purtell apurt...@apache.org
To: user@hbase.apache.org user@hbase.apache.org
Sent: Sunday, April 7, 2013 3:52 PM
Subject: Re: schema design: rows vs wide columns
 
Is there a pointer to evidence/experiment backed analysis of this
question?
I'm sure there is some basis for this text in the book but I recommend we
strike it. We could replace it with YCSB or LoadTestTool driven latency
graphs for different workloads maybe. Although that would also be a big
simplification of 'schema design' considerations, it would not be so
starkly lacking background.

On Sunday, April 7, 2013, Ted Yu wrote:

 From http://hbase.apache.org/book.html#number.of.cfs :

 HBase currently does not do well with anything above two or three column
 families so keep the number of column families in your schema low.

 Cheers

 On Sun, Apr 7, 2013 at 3:04 PM, Stack st...@duboce.net
 wrote:

  On Sun, Apr 7, 2013 at 11:58 AM, Ted yuzhih...@gmail.com
 wrote:
 
   With regard to number of column families, 3 is the recommended
maximum.
  
 
  How did you come up w/ the number '3'?  Is it a 'hard' 3? Or does it
  depend?  If the latter, on what does it depend?
  Thanks,
  St.Ack
 



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)





Re: HBase Types: Explicit Null Support

2013-04-01 Thread Doug Meil

Hmmm… good question.

I think that fixed-width support is important for a great many rowkey
construct cases, so I'd rather see something like losing MIN_VALUE and
keeping fixed width.




On 4/1/13 2:00 PM, Nick Dimiduk ndimi...@gmail.com wrote:

Heya,

Thinking about data types and serialization. I think null support is an
important characteristic for the serialized representations, especially
when considering the compound type. However, doing so is directly
incompatible with fixed-width representations for numerics. For instance,
if we want to have a fixed-width signed long stored on 8 bytes, where do
you put null? float and double types can cheat a little by folding negative
and positive NaNs into a single representation (this isn't strictly
correct!), leaving a place to represent null. In the long example case, the
obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
will allocate an additional encoding which can be used for null. My
experience working with scientific data, however, makes me wince at the
idea.

The variable-width encodings have it a little easier. There's already
enough going on that it's simpler to make room.

Remember, the final goal is to support order-preserving serialization.
This
imposes some limitations on our encoding strategies. For instance, it's
not
enough to simply encode null, it really needs to be encoded as 0x00 so as
to sort lexicographically earlier than any other value.
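
For illustration only, a hedged sketch of the fixed-width option discussed
above (reserve the encoding of Long.MIN_VALUE for null and flip the sign bit
so unsigned byte order matches numeric order); this is not the project's
API, just the idea:

public class NullableLongEncodingSketch {
  /** Encode a nullable long on exactly 8 bytes, preserving sort order; null sorts first. */
  public static byte[] encode(Long value) {
    if (value != null && value.longValue() == Long.MIN_VALUE) {
      throw new IllegalArgumentException("Long.MIN_VALUE is reserved to represent null");
    }
    long v = (value == null) ? Long.MIN_VALUE : value.longValue();
    long bits = v ^ Long.MIN_VALUE;  // flip the sign bit
    byte[] out = new byte[8];
    for (int i = 7; i >= 0; i--) {   // big-endian, so lexicographic order matches numeric order
      out[i] = (byte) bits;
      bits >>>= 8;
    }
    return out;                      // null encodes as eight 0x00 bytes
  }
}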

What do you think? Any ideas, experiences, etc?

Thanks,
Nick





Re: HBase type support

2013-03-18 Thread Doug Meil

Sorry I'm late to this thread, but I was the guy behind HBASE-7221, and the
algorithms specifically mentioned were MD5 and Murmur (not SHA-1).  An
implementation of Murmur already exists in HBase, and the MD5
implementation was the one that ships with Java.

The intent was to include hashing appropriate for use with key
distribution of rowkeys in tables as is often suggested on the dist-lists.
 SHA-1 is probably overkill for the rowkey case, but I wouldn't want to
stop anybody from using SHA-1 if it was appropriate for their needs.
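
For anyone following along, a hedged sketch of the kind of rowkey-distribution
hashing being described, using the JDK's MD5 (illustrative only, not the
HBASE-7221 API):

import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKeySketch {
  /** Prefix the natural key with its MD5 digest so writes spread across regions. */
  public static byte[] saltedRowKey(byte[] naturalKey) throws Exception {
    byte[] digest = MessageDigest.getInstance("MD5").digest(naturalKey);
    return Bytes.add(digest, naturalKey);  // 16-byte hash prefix + original key
  }
}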





On 3/18/13 8:02 AM, Michel Segel michael_se...@hotmail.com wrote:

Andrew,

I was aware of your employer, and I am pretty sure they have already dealt
with the issue of exporting encryption software and probably hardware too.

Neither of us is a lawyer, and from what I do know of dealing with
government bureaucracies, it's not always as simple as just filing the
correct paperwork. (Sometimes it is, sometimes not so much, YMMV...)

Putting in the hooks for encryption is probably a good idea. Shipping the
encryption with the release, or making it part of the official release, not
so much. Sorry, I'm being a bit conservative here.

IMHO I think fixing other issues would be of a higher priority, but
that's just me ;-)

Sent from a remote device. Please excuse any typos...

Mike Segel

On Mar 17, 2013, at 12:12 PM, Andrew Purtell apurt...@apache.org wrote:

 This then leads to another question... suppose Apache does add
encryption
 to Hadoop. While the Apache organization does have the proper paperwork
in
 place, what then happens to Cloudera, Hortonworks, EMC, IBM, Intel, etc
?
 
 Well I can't put that question aside since you've brought it up now
 twice and encryption feature candidates for Apache Hadoop and Apache
HBase
  are something I have been working on. It's a valid question, but since, as
  you admit, you don't know what you are talking about, perhaps stating
  uninformed opinions can be avoided. Only the latter is what I object to. I think
the
 short answer is as an Apache contributor I'm concerned about the Apache
 product. Downstream repackagers can take whatever action needed
including
 changes, since it is open source, or feedback about it representing a
 hardship. At this point I have heard nothing like that. I work for Intel
 and can say we are good with it.
 
 On Sunday, March 17, 2013, Michael Segel wrote:
 
 Its not a question of FUD, but that certain types of
encryption/decryption
 code falls under the munitions act.
 See: http://www.fas.org/irp/offdocs/eo_crypt_9611_memo.htm
 
 Having said that, there is this:
 http://www.bis.doc.gov/encryption/encfaqs6_17_02.html
 
 In short, I don't as a habit export/import encryption technology so I
am
 not up to speed on the current state of the laws.
 Which is why I have to question the current state of the US encryption
 laws.
 
 This then leads to another question... suppose Apache does add
encryption
 to Hadoop. While the Apache organization does have the proper
paperwork in
 place, what then happens to Cloudera, Hortonworks, EMC, IBM, Intel,
etc ?
 
 But lets put that question aside.
 
 The point I was trying to make was that the core Sun JVM does support
MD5
 and SHA-1 out of the box, so that anyone running Hadoop and using the
 1.6_xx or the 1.7_xx versions of the JVM will have these packages.
 
 Adding hooks that use these classes is a no-brainer.  However, beyond
 this... you tell me.
 
 -Mike
 
 On Mar 16, 2013, at 7:59 AM, Andrew Purtell apurt...@apache.org
wrote:
 
 The ASF avails itself of an exception to crypto export which only
 requires
 a bit of PMC housekeeping at release time. So "is not [ok]" is FUD. I
 humbly request we refrain from FUD here. See
 http://www.apache.org/dev/crypto.html. To the best of our knowledge we
 expect this to continue, though the ASF has not updated this policy
yet
 for
 recent regulation updates.
 
 On Saturday, March 16, 2013, Michel Segel wrote:
 
 I also want to add that you could add MD5 and SHA-1, but I'd check
on us
 laws... I think these are ok, however other encryption/decryption
code
 is
 not.
 
 They are part of the std sun java libraries ...
 
 Sent from a remote device. Please excuse any typos...
 
 Mike Segel
 
 On Mar 16, 2013, at 7:18 AM, Michel Segel michael_se...@hotmail.com
 wrote:
 
 Isn't that what you get through add on frameworks like TSDB and
Kiji ?
 Maybe not on the client side, but frameworks that extend HBase...
 
 Sent from a remote device. Please excuse any typos...
 
 Mike Segel
 
 On Mar 16, 2013, at 12:45 AM, lars hofhansl la...@apache.org
wrote:
 
 I think generally we should keep HBase a byte[] based key value
store.
 What we should add to HBase are tools that would allow client side
 apps
 (or libraries) to built functionality on top of plain HBase.
 
 Serialization that maintains a correct semantic sort order is
 important
 as a building block, so is code that can build up correctly
serialized
 and
 sortable compound keys, as well as hashing 

Re: question about pre-splitting regions

2013-02-15 Thread Doug Meil

Good to hear!  Given your experience, I'd appreciate your feedback on the
section 6.3.6. Relationship Between RowKeys and Region Splits in...

http://hbase.apache.org/book.html#schema.creation

... because it's on that same topic.  Any other points to add to this?
Thanks!





On 2/14/13 11:08 PM, Viral Bajaria viral.baja...@gmail.com wrote:

I was able to figure it out. I had to use the createTable api which took
splitKeys instead of the startKey, endKey and numPartitions.
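For anyone finding this thread later, a hedged sketch of that splitKeys variant (the table, family, and prefix values below are illustrative, and it assumes an existing HBaseAdmin named hbaseAdmin):

  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.util.Bytes;

  HTableDescriptor desc = new HTableDescriptor("myTable");
  desc.addFamily(new HColumnDescriptor("d"));

  // One split point per known 8-byte prefix, so each prefix lands in its own region.
  byte[][] splitKeys = new byte[][] {
      Bytes.toBytes(2L),
      Bytes.toBytes(3L),
      Bytes.toBytes(4L)
  };
  hbaseAdmin.createTable(desc, splitKeys);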

If anyone comes across this issue and needs more feedback feel free to
ping
me.

Thanks,
Viral

On Thu, Feb 14, 2013 at 7:30 PM, Viral Bajaria
viral.baja...@gmail.comwrote:

 Hi,

 I am creating a new table and want to pre-split the regions and am
seeing
 some weird behavior.

 My table is designed as a composite of multiple fixed length byte arrays
 separated by a control character (for simplicity sake we can say the
 separator is _underscore_). The prefix of this rowkey is deterministic
 (i.e. a length of 8 bytes) and I know beforehand how many different prefixes
 I will see in the near future. The values after the prefix are not
 deterministic. I wanted to create a pre-split table based on the number of
 prefix combinations that I know.

 I ended up doing something like this:
 hbaseAdmin.createTable(tableName, Bytes.toBytes(1L),
 Bytes.toBytes(maxCombinationPrefixValue), maxCombinationPrefixValue)

 The create table worked fine and as expected it created the number of
 partitions. But when I write data to the table, I still see all the
writes
 hitting a single region instead of hitting different regions based on
the
 prefix. Is my thinking of splitting by prefix values flawed ? Do I have
to
 split by some real rowkeys (though it's impossible for me to know what
 rowkeys will show up except the row prefix which is much more
 deterministic).

 For some reason I think I have a flawed understanding of the createTable
 API and that is causing the issue for me ? Should I use the byte[][]
 prefixes method and not the one that I am using right now ?

 Any suggestions/pointers ?

 Thanks,
 Viral






Re: Join Using MapReduce and Hbase

2013-01-24 Thread Doug Meil

Hi there-

Here is a comment in the RefGuide on joins in the HBase data model.

http://hbase.apache.org/book.html#joins

Short answer, you need to do it yourself (e.g., either with an in-memory
hashmap or instantiating an HTable of the other table, depending on your
situation).
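A hedged sketch of the second option (instantiating an HTable of the other table from the mapper); the table, family, and qualifier names are placeholders, and a real job would also tune things like scanner caching:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.Text;

  public class JoinMapper extends TableMapper<ImmutableBytesWritable, Text> {
    private HTable rightTable;

    @Override
    protected void setup(Context context) throws IOException {
      Configuration conf = HBaseConfiguration.create(context.getConfiguration());
      rightTable = new HTable(conf, "right_table");   // the table being joined to
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result leftRow, Context context)
        throws IOException, InterruptedException {
      // The join key is stored in the scanned (left) row; look up the matching right row.
      byte[] joinKey = leftRow.getValue(Bytes.toBytes("cf"), Bytes.toBytes("fk"));
      if (joinKey == null) {
        return;
      }
      Result rightRow = rightTable.get(new Get(joinKey));
      if (!rightRow.isEmpty()) {
        context.write(row, new Text(Bytes.toString(
            rightRow.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))));
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      rightTable.close();
    }
  }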

For other MR examples, see this...

http://hbase.apache.org/book.html#mapreduce.example




On 1/24/13 8:19 AM, Vikas Jadhav vikascjadha...@gmail.com wrote:

Hi, I am working on a join operation using MapReduce,
so if anyone has useful information please share it:
example code or a new technique along with the existing one.
Thank you.
-- 
Thanks and Regards,
Vikas Jadhav




Re: Loading data, hbase slower than Hive?

2013-01-20 Thread Doug Meil

Hi there-

On top of what everybody else said, for more info on rowkey design and
pre-splitting see http://hbase.apache.org/book.html#schema (as well as
other threads in this dist-list on that topic).





On 1/19/13 4:12 PM, Mohammad Tariq donta...@gmail.com wrote:

Hello Austin,

  I am sorry for the late response.

Asaf has made a very valid point. Rowkey design is very crucial,
especially if the data is going to be sequential (timeseries kind of thing).
You may end up with a hotspotting problem. Use pre-split tables
or hash the keys to avoid that. It'll also allow you to fetch the results
faster.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Sun, Jan 20, 2013 at 1:20 AM, Asaf Mesika asaf.mes...@gmail.com
wrote:

 Start by telling us your row key design.
 Check for pre splitting your table regions.
  I managed to get to 25MB/sec write throughput in HBase using 1 region
  server. If your data is evenly spread you can get around 7 times that in a
  10 region-server environment. That should mean that 1 gig should take 4 sec.


 On Friday, January 18, 2013, praveenesh kumar wrote:

  Hey,
  Can someone throw some pointers on what would be the best practice for
 bulk
  imports in hbase ?
  That would be really helpful.
 
  Regards,
  Praveenesh
 
  On Thu, Jan 17, 2013 at 11:16 PM, Mohammad Tariq donta...@gmail.com
 javascript:;
  wrote:
 
   Just to add to whatever all the heavyweights have said above, your
MR
 job
   may not be as efficient as the MR job corresponding to your Hive
query.
  You
   can enhance the performance by setting the mapred config parameters
  wisely
   and by tuning your MR job.
  
   Warm Regards,
   Tariq
   https://mtariq.jux.com/
   cloudfront.blogspot.com
  
  
   On Thu, Jan 17, 2013 at 10:39 PM, ramkrishna vasudevan 
   ramkrishna.s.vasude...@gmail.com javascript:; wrote:
  
 Hive is more for batch and HBase is more for real-time data.
   
Regards
Ram
   
On Thu, Jan 17, 2013 at 10:30 PM, Anoop John
anoop.hb...@gmail.com
 javascript:;
  
wrote:
   
 In case of Hive data insertion means placing the file under
table
  path
   in
 HDFS.  HBase need to read the data and convert it into its
format.
(HFiles)
 MR is doing this work..  So this makes it clear that HBase will
be
slower.
 :)  As Michael said the read operation...



 -Anoop-

 On Thu, Jan 17, 2013 at 10:14 PM, Austin Chungath 
  austi...@gmail.com javascript:;
 wrote:

Hi,
  Problem: hive took 6 mins to load a data set, hbase took 1 hr
14
   mins.
  It's a 20 gb data set approx 230 million records. The data is
in
   hdfs,
  single text file. The cluster is 11 nodes, 8 cores.
 
  I loaded this in hive, partitioned by date and bucketed into
32
 and
 sorted.
  Time taken is 6 mins.
 
  I loaded the same data into hbase, in the same cluster by
 writing a
   map
  reduce code. It took 1hr 14 mins. The cluster wasn't running
  anything
 else
  and assuming that the code that i wrote is good enough, what
is
 it
   that
  makes hbase slower than hive in loading the data?
 
  Thanks,
  Austin
 

   
  
 





Re: How to de-nomarlize for this situation in HBASE Table

2013-01-18 Thread Doug Meil

Hi there, 

I'd recommend reading the Schema Design chapter in the RefGuide because
there are some good tips and hard-learned lessons.

http://hbase.apache.org/book.html#schema

Also, all your examples use composite row keys (not a surprise, a very
common pattern) and one thing I would like to draw your attention to is
this patch for composite row building.  Feedback appreciated, because
there isn't currently any utility support in Hbase for this.

https://issues.apache.org/jira/browse/HBASE-7221

(Also, WibiData and Sematext have done good work in key-utility generation
utilities too ...)




On 1/18/13 12:18 AM, Ramasubramanian Narayanan
ramasubramanian.naraya...@gmail.com wrote:

Hi,

Is there any other way instead of using HOME/Work/etc.? We expect some 10
such types may come in the future, hence asking.

regards,
Rams

On Fri, Jan 18, 2013 at 10:24 AM, Sonal Goyal sonalgoy...@gmail.com
wrote:

 A rowkey is associated with the complete row. So you could have client
id
 as the rowkey. Hbase allows different qualifiers within a column
family, so
 you could potentially do the following:

 1. You could have qualifiers like home address street 1, home address
 street 2, home address city, office address street 1 etc kind of
qualifiers
 under physical address column family.
 2. If you access entire address and not city, state individually, you
could
 have the complete address concatenated and saved in one qualifier under
 physical address family using qualifiers like home, office, extra.
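A hedged sketch of option 1 above (the table, family, and qualifier names are illustrative, and it assumes a Configuration named conf is already in scope): one row per client id, with per-address-type qualifiers under a single column family.

  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  HTable table = new HTable(conf, "client");
  Put put = new Put(Bytes.toBytes("client-00042"));          // rowkey = client id
  byte[] addr = Bytes.toBytes("addr");                       // one CF for all addresses
  put.add(addr, Bytes.toBytes("home_street1"),   Bytes.toBytes("1 Main St"));
  put.add(addr, Bytes.toBytes("home_city"),      Bytes.toBytes("Cleveland"));
  put.add(addr, Bytes.toBytes("office_street1"), Bytes.toBytes("100 Work Ave"));
  put.add(addr, Bytes.toBytes("office_city"),    Bytes.toBytes("Columbus"));
  table.put(put);
  table.close();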

 A good link to get started is
 http://hbase.apache.org/book/datamodel.html#conceptual.view

 Best Regards,
 Sonal
 Real Time Analytics for BigData https://github.com/sonalgoyal/crux
 Nube Technologies http://www.nubetech.co

 http://in.linkedin.com/in/sonalgoyal




 On Fri, Jan 18, 2013 at 10:09 AM, Ramasubramanian Narayanan 
 ramasubramanian.naraya...@gmail.com wrote:

  Hi Sonal,
 
  In that case, the problem is how to store multiple physical address
sets
 in
  the same column family.. what rowkey to be used for this scenario..
 
  A Physical address will contain the following fields (need to store
  multiple physical address like this):
  Physical address type : Home/office/other/etc
  Address line1:
  ..
  ..
  Address line 4:
  State :
  City:
  Country:
 
  regards,
  Rams
 
 
  On Fri, Jan 18, 2013 at 10:00 AM, Sonal Goyal sonalgoy...@gmail.com
  wrote:
 
   How about client id as the rowkey, with column families as physical
   address, email address, telephone address? within each cf, you could
 have
   various qualifiers. For eg in physical address, you could have home
  Street,
   office street etc.
  
   Best Regards,
   Sonal
   Real Time Analytics for BigData https://github.com/sonalgoyal/crux
   Nube Technologies http://www.nubetech.co
  
   http://in.linkedin.com/in/sonalgoyal
  
  
  
  
   On Fri, Jan 18, 2013 at 9:46 AM, Ramasubramanian Narayanan 
   ramasubramanian.naraya...@gmail.com wrote:
  
Hi Sonal,
   
1. will fetch all demographic details of customer based on client
ID
2. Fetch the particular type of address along with other
demographic
  for
   a
client.. for example, HOME Physical address or HOME Telephone
address
  or
office Email address etc.,
   
regards,
Rams
   
On Fri, Jan 18, 2013 at 9:29 AM, Sonal Goyal
sonalgoy...@gmail.com
wrote:
   
 What are your data access patterns?

 Best Regards,
 Sonal
 Real Time Analytics for BigData 
 https://github.com/sonalgoyal/crux
 Nube Technologies http://www.nubetech.co

 http://in.linkedin.com/in/sonalgoyal




 On Fri, Jan 18, 2013 at 9:04 AM, Ramasubramanian Narayanan 
 ramasubramanian.naraya...@gmail.com wrote:

  Hi,
 
  I have the following relational tables.. I want to denormalize
 and
bring
 it
  all into single HBASE table... Pls help how it could be done..
 
 
  1. Client Master Table
  2. Physical Address Table (there might be 'n' number of
address
  that
can
 be
  captured against each client ID)
  3. Email Address Table (there might be 'n' number of address
that
  can
be
  captured against each client ID)
  4. Telephone Address Table (there might be 'n' number of
address
  that
can
  be captured against each client ID)
 
 
  For the tables 2 to 4, there are multiple fields like which is
 the
 Address
  type (home/office,etc), bad address, good address,
communication
address,
  time to call etc.,
 
  Please help me to clarify the following :
 
  1. Whether we can bring this to a single HBASE table?
  2. Having fields like phone number1, phone number 2 etc. is
not
  a
   good
  approach for this scenario...
  3. Whether we can have in the same table by populating these
  multiple
 rows
  for the same customer with different rowkey?
 For e.g.
 For Client Records  - Rowkey can be Client 

Re: Loading data, hbase slower than Hive?

2013-01-18 Thread Doug Meil

Hi there,

See this section of the HBase RefGuide for information about bulk loading.

http://hbase.apache.org/book.html#arch.bulk.load






On 1/18/13 12:57 PM, praveenesh kumar praveen...@gmail.com wrote:

Hey,
Can someone throw some pointers on what would be the best practice for
bulk
imports in hbase ?
That would be really helpful.

Regards,
Praveenesh

On Thu, Jan 17, 2013 at 11:16 PM, Mohammad Tariq donta...@gmail.com
wrote:

 Just to add to whatever all the heavyweights have said above, your MR
job
 may not be as efficient as the MR job corresponding to your Hive query.
You
 can enhance the performance by setting the mapred config parameters
wisely
 and by tuning your MR job.

 Warm Regards,
 Tariq
 https://mtariq.jux.com/
 cloudfront.blogspot.com


 On Thu, Jan 17, 2013 at 10:39 PM, ramkrishna vasudevan 
 ramkrishna.s.vasude...@gmail.com wrote:

   Hive is more for batch and HBase is more for real-time data.
 
  Regards
  Ram
 
  On Thu, Jan 17, 2013 at 10:30 PM, Anoop John anoop.hb...@gmail.com
  wrote:
 
   In case of Hive data insertion means placing the file under table
path
 in
   HDFS.  HBase need to read the data and convert it into its format.
  (HFiles)
   MR is doing this work..  So this makes it clear that HBase will be
  slower.
   :)  As Michael said the read operation...
  
  
  
   -Anoop-
  
   On Thu, Jan 17, 2013 at 10:14 PM, Austin Chungath
austi...@gmail.com
   wrote:
  
  Hi,
Problem: hive took 6 mins to load a data set, hbase took 1 hr 14
 mins.
It's a 20 gb data set approx 230 million records. The data is in
 hdfs,
single text file. The cluster is 11 nodes, 8 cores.
   
I loaded this in hive, partitioned by date and bucketed into 32
and
   sorted.
Time taken is 6 mins.
   
I loaded the same data into hbase, in the same cluster by writing
a
 map
reduce code. It took 1hr 14 mins. The cluster wasn't running
anything
   else
and assuming that the code that i wrote is good enough, what is it
 that
makes hbase slower than hive in loading the data?
   
Thanks,
Austin
   
  
 





Re: Reagrding HBase Hadoop multiple scan objects issue

2013-01-18 Thread Doug Meil

Hi there-

You probably want to review this section of the RegGuide:
http://hbase.apache.org/book.html#mapreduce

re:  it's inefficient to have one scan object to scan everything.


It is.  But in the MapReduce case, there is a Map-task for each input
split (see the RefGuide for details), and therefore a Scanner instance per
Map-task.
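To make the per-split scanner point concrete, here is a hedged sketch of a typical job setup (the table name is a placeholder, and IdentityTableMapper is used only to keep the example short). The single Scan below is just a template; the framework creates one input split per region, and each map task opens its own scanner over that split's key range.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.IdentityTableMapper;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.mapreduce.Job;

  public class PerRegionScanJob {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "per-region-scan");
      job.setJarByClass(PerRegionScanJob.class);

      Scan scan = new Scan();        // template scan; each map task scans only its split
      scan.setCaching(500);          // bigger caching for MR scans
      scan.setCacheBlocks(false);    // don't churn the block cache with a full-table scan

      TableMapReduceUtil.initTableMapperJob("myTable", scan,
          IdentityTableMapper.class, ImmutableBytesWritable.class, Result.class, job);
      job.setNumReduceTasks(0);      // map-only: one map task (and scanner) per region
      job.waitForCompletion(true);
    }
  }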



On 1/18/13 5:43 PM, Xu, Leon guodo...@amazon.com wrote:

Hi HBase users,

I am currently trying to set up a denormalization map-reduce job for my
HBase Table.
Since our table contains a large volume of data, it's inefficient to have
one scan object scan everything. We only need to process those
records that have changes. I am planning to have multiple scan objects,
each of which specifies a range, given that we keep track of
which rows have been changed.
Therefore I am trying to set up the map-reduce job with multiple scan
objects; is this possible?
I am seeing some posts online suggesting extending the InputFormat class
and changing getSplits; is this the most efficient way?

Using a filter seems not very efficient in my case because it basically
still scans the whole table, right? It just filters out certain
records.

Can you point me to the right direction?


Thanks
Leon




Re: Constructing rowkeys and HBASE-7221

2013-01-17 Thread Doug Meil

Thanks Aaron!

I will take a look at Kiji.  And I think it underscores the need for some
type of utility for rowkey building/parsing being available in HBase,
because one of the first things folks tend to do is start building their
own keybuilder utility when they start using Hbase (same sentiment also
expressed by others in the HBASE-7221 ticket comments).

It's good that you have full control over the rowkey (i.e., byte[]) as a
backstop, but HBase should also try to make things a bit easier for some
common cases.  I think it will help adoption.

The general idea is a FixedLengthRowKey and a VariableLengthRowKey along
with a RowKeySchema class, and I think that the variant you bring up is a
great idea (e.g., prefix vs. hash).  Let's keep this ball rolling!



On 1/16/13 2:06 PM, Aaron Kimball akimbal...@gmail.com wrote:

Hi Doug,

This HBase feature is really interesting. It is quite related to some work
we're doing on Kiji, our schema management project. In particular, we've
also been focusing on building composite row keys correctly. One thing
that
jumped out at me in that ticket is that with a composition of md5hash and
other (string, int, etc) components, you probably don't want the whole
hash. If you're using that to shard your rows more efficiently across
regions, you might want to just use a subset of the md5 bytes as a prefix.
It might be a good idea to offer users control of this.

Our own thoughts on this on the Kiji side are being tracked at
https://jira.kiji.org/browse/schema-3 where we have a design doc that goes
into a bit more detail.

Cheers,
- Aaron


On Tue, Jan 15, 2013 at 2:01 PM, Doug Meil
doug.m...@explorysmedical.comwrote:


 Hi there, well, this request for input fell like a thud.  :-)

 But I think perhaps it has to do with the fact that I sent it to the
 dev-list instead of the user-list, as people that are actively writing
 HBase itself (devs) need less help with such keybuilding utilities.

 So one last request for feedback, but this time aimed at users of HBase:
 how has your key-building experience been?

 Thanks!



 On 1/7/13 11:04 AM, Doug Meil doug.m...@explorysmedical.com wrote:

 
 Greetings folks-
 
 I would like to restart the conversation on
 https://issues.apache.org/jira/browse/HBASE-7221 because there continue
 to be conversations on the dist-list about creating composite rowkeys,
 and while HBase makes just about anything possible, it doesn¹t make
much
 easy in this respect.
 
 What I'm lobbying for is a utility class (see the v3 patch in
HBASE-7221)
 that can both create and read rowkeys (so this isn't just a one-way
 builder pattern).
 
 This is currently stuck because it was noted that Bytes has an issue
with
 sort-order of numbers specifically if you have both negative and
positive
 values, which is really a different issue, but because this patch uses
 Bytes it's related.
 
 What are people's thoughts on this topic in general, and the v3 version
 of the patch specifically?  (and the last set of comments).  Thanks!
 
 One of the unit tests shows the example of usage.  The last set of
 comments suggested that RowKey be renamed FixedLengthRowKey, which I
 think is a good idea.  A follow-on patch could include
 VariableLengthRowKey for folks that use strings in the rowkeys.
 
 
   public void testCreate() throws Exception {
 
 int elements[] = {RowKeySchema.SIZEOF_MD5_HASH,
 RowKeySchema.SIZEOF_INT, RowKeySchema.SIZEOF_LONG};
 RowKeySchema schema = new RowKeySchema(elements);
 
 RowKey rowkey = schema.createRowKey();
 rowkey.setHash(0, hashVal);
 rowkey.setInt(1, intVal);
 rowkey.setLong(2, longVal);
 
 byte bytes[] = rowkey.getBytes();
  Assert.assertEquals("key length", schema.getRowKeyLength(),
 bytes.length);
 
  Assert.assertEquals("e1", rowkey.getInt(1), intVal);
  Assert.assertEquals("e2", rowkey.getLong(2), longVal);
   }
 
 Doug Meil
 Chief Software Architect, Explorys
 doug.m...@explorys.com
 






Re: Just joined the user group and have a question

2013-01-17 Thread Doug Meil
Hi there-

If you're absolutely new to HBase, you might want to check out the HBase
RefGuide, in particular the architecture, performance, and troubleshooting
chapters, first.

http://hbase.apache.org/book.html

In terms of determining why your region servers just die, I think you
need to read the background information then provide more information on
your cluster and what you're trying to do because although there are a lot
of people on this dist-list that want to help, you're not giving folks a
whole lot to go on.




On 1/17/13 12:24 PM, Chalcy Raja chalcy.r...@careerbuilder.com wrote:

Hi HBASE Gurus,



I am Chalcy Raja and I joined the hbase group yesterday.  I am already a
member of hive and sqoop user groups.  Looking forward to learn and share
information about hbase here!



Have a question:  We have a cluster where we run hive jobs and also
hbase.  There are stability issues, like region servers just dying.  We are
looking into fine tuning.  What I read about performance, and also heard
from another user, is to separate mapreduce from hbase.  How do I do that?
If I understand that as running tasktrackers on some nodes and hbase region
servers on others, then we will run into data locality issues and I believe
it will perform poorly.



Definitely I am not the only one running into this issue.  Any thoughts
on how to resolve this issue?



Thanks,

Chalcy




Re: Constructing rowkeys and HBASE-7221

2013-01-15 Thread Doug Meil

Hi there, well, this request for input fell like a thud.  :-)

But I think perhaps it has to do with the fact that I sent it to the
dev-list instead of the user-list, as people that are actively writing
HBase itself (devs) need less help with such keybuilding utilities.

So one last request for feedback, but this time aimed at users of HBase:
how has your key-building experience been?

Thanks!



On 1/7/13 11:04 AM, Doug Meil doug.m...@explorysmedical.com wrote:


Greetings folks-

I would like to restart the conversation on
https://issues.apache.org/jira/browse/HBASE-7221 because there continue
to be conversations on the dist-list about creating composite rowkeys,
and while HBase makes just about anything possible, it doesn't make much
easy in this respect.

What I'm lobbying for is a utility class (see the v3 patch in HBASE-7221)
that can both create and read rowkeys (so this isn't just a one-way
builder pattern).

This is currently stuck because it was noted that Bytes has an issue with
sort-order of numbers specifically if you have both negative and positive
values, which is really a different issue, but because this patch uses
Bytes it's related.

What are people's thoughts on this topic in general, and the v3 version
of the patch specifically?  (and the last set of comments).  Thanks!

One of the unit tests shows the example of usage.  The last set of
comments suggested that RowKey be renamed FixedLengthRowKey, which I
think is a good idea.  A follow-on patch could include
VariableLengthRowKey for folks that use strings in the rowkeys.


  public void testCreate() throws Exception {

int elements[] = {RowKeySchema.SIZEOF_MD5_HASH,
RowKeySchema.SIZEOF_INT, RowKeySchema.SIZEOF_LONG};
RowKeySchema schema = new RowKeySchema(elements);

RowKey rowkey = schema.createRowKey();
rowkey.setHash(0, hashVal);
rowkey.setInt(1, intVal);
rowkey.setLong(2, longVal);

byte bytes[] = rowkey.getBytes();
Assert.assertEquals("key length", schema.getRowKeyLength(),
bytes.length);

Assert.assertEquals("e1", rowkey.getInt(1), intVal);
Assert.assertEquals("e2", rowkey.getLong(2), longVal);
  }

Doug Meil
Chief Software Architect, Explorys
doug.m...@explorys.com





Re: One weird problem of my MR job upon hbase table.

2013-01-07 Thread Doug Meil

Hi there, 

The HBase RefGuide has a comprehensive case study on such a case.  This
might not be the exact problem, but the diagnostic approach should help.

http://hbase.apache.org/book.html#casestudies.slownode





On 1/4/13 10:37 PM, Liu, Raymond raymond@intel.com wrote:

Hi

I am encountering a weird lagging map task issue here:

I have a small hadoop/hbase cluster with 1 master node and 4 regionserver
nodes, all with 16 CPUs, and map and reduce slots set to 24.

A few tables are created with regions distributed evenly across the region
nodes (say 16 regions for each region server). Also, each region has
almost the same number of KVs of very similar size. All tables had a
major_compact done to ensure data locality.

I have a MR job which simply does a local region scan in every map task (so
16 map tasks for each regionserver node).

In theory, every map task should finish in a similar time.

But in reality, the map tasks for some regions on the same region server
always lag behind a lot, costing 150~250% of the other map tasks' average
time.

If this happened on a single region server for every table, I might
suspect a disk issue or some other reason that brings down the performance
of this region server.

But the weird thing is that, for each single table, almost all
the map tasks on the same single regionserver lag behind, while for
different tables, the lagging regionserver is different! And the
regions and region sizes are distributed evenly, which I double checked
many times. (I even tried setting the replica count to 4 to ensure every
node has a copy of the local data.)

Say for table 1, all map tasks on regionserver node 2 are slow, while for
table 2, maybe all map tasks on regionserver node 3 are slow. With table 1,
it will always be regionserver node 2 which is slow regardless of cluster
restarts, and the slowest map task will always be the very same one. And
it won't go away even if I do a major compact again.

So, could anyone give me some clue on what might possibly lead to
this weird behavior? Any wild guess is welcome!

(BTW, I didn't encounter this issue a few days ago with the same table.
I did restart the cluster and made a few changes to the config file during
that period, but restoring the config file doesn't help.)


Best Regards,
Raymond Liu






Re: Is it necessary to set MD5 on rowkey?

2012-12-18 Thread Doug Meil

Hi there-

You don't want a filter for this, use a Scan with the lead portion of the
key.

http://hbase.apache.org/book.html#datamodel

See 5.7.3. Scans
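A hedged sketch of what that looks like when the rowkey leads with a date string (the table name, key layout, and date values are illustrative, and a Configuration named conf is assumed):

  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  HTable table = new HTable(conf, "events");
  Scan scan = new Scan();
  scan.setStartRow(Bytes.toBytes("20121201"));   // start of the date range (inclusive)
  scan.setStopRow(Bytes.toBytes("20121219"));    // end of the date range (exclusive)
  ResultScanner scanner = table.getScanner(scan);
  try {
    for (Result r : scanner) {
      // process each row in the date range
    }
  } finally {
    scanner.close();
    table.close();
  }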

On a related topic, this is a utility in process to make composite key
construction easier.

https://issues.apache.org/jira/browse/HBASE-7221





On 12/18/12 4:20 AM, bigdata bigdatab...@outlook.com wrote:

Many articles tell me that an MD5 rowkey, or MD5 on part of it, is a good
method to balance the records stored across different regions. But if I want
to search some sequential rowkey records, such as with a date as the rowkey
(or part of it), I cannot use a rowkey filter to scan a range of date values
in one pass once the date is MD5'd. How do I balance this issue?
Thanks.

 




Re: Reply: Reply: what is the max size for one region and what is the max size of region for one server

2012-12-17 Thread Doug Meil

Hi there,

When sizing your data, don't forget to read this ...

http://hbase.apache.org/book.html#schema.creation

and

http://hbase.apache.org/book.html#regions.arch

9.7.5.4. KeyValue

You need to understand how HBase stores data internally during initial design
to avoid problems down the line.  Keep the keys as small as reasonable, and
likewise the CF and column names.




On 12/17/12 6:07 AM, Nicolas Liochon nkey...@gmail.com wrote:

I think it's safer to use a newer version (0.94): there are a lot of things
around performance & volume in the 0.92 & 0.94. As well, there are many
more bug-fix releases on the 0.94.

For the number of regions, there is no maximum written in stone. Having too
many regions will essentially impact performance. As I said, having
60TB of data per machine is not standard today (the points are: that's a lot
of disk for a single machine; what's the impact if you lose a node; what will
be the network load, ...). I suppose all this is documented in the usual
books on HBase.


On Mon, Dec 17, 2012 at 11:26 AM, tgh guanhua.t...@ia.ac.cn wrote:

 number of region for ONE server?



Re: Reg:delete performance on HBase table

2012-12-05 Thread Doug Meil

Hi there,

You probably want to read this section on the RefGuide about deleting from
HBase.

http://hbase.apache.org/book.html#perf.deleting





On 12/5/12 8:31 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:

Hi Manoj,

Delete in HBase is like a put.

If you want to delete the entire table (drop) then it will be very
fast. My test table has 100M rows and it's taking a few seconds to
delete (one CF and one C only). But if you want to delete the rows one
by one (like 190M rows out of more) then it's like doing 190M puts.

HTH.

JM

2012/12/5, Manoj Babu manoj...@gmail.com:
 Hi All,

 I have a doubt about delete performance in an HBase table.

 I have 190 million rows in an Oracle table and it took hardly 4 hours to
 delete them. If I have the same 190 million rows in an HBase table, roughly
 how much time will HBase take to delete the rows (based on a row key range),
 and how does HBase handle deletes internally?


 Thanks in advance!
 Cheers!
 Manoj.






Re: CopyTable utility fails on larger tables

2012-12-05 Thread Doug Meil

I agree it shouldn't fail (slow is one thing, fail is something else), but
regarding "the HBase Master Web UI showed only one region for the destination
table", you probably want to pre-split your destination table.

It's writing to one region, splitting, writing to those regions,
splitting, etc.
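For anyone else hitting this, a hedged sketch of pre-splitting the destination before running CopyTable (the region count, family, and table name are made up): with roughly uniform binary keys like the ones in the log below, an even split of the key space is enough.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
  HTableDescriptor dest = new HTableDescriptor("dest_table");
  dest.addFamily(new HColumnDescriptor("t"));

  // 32 regions spread evenly over the binary key space, created before the copy starts.
  byte[] start = new byte[] { (byte) 0x00 };
  byte[] end   = new byte[] { (byte) 0xFF };
  admin.createTable(dest, start, end, 32);
  admin.close();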




On 12/5/12 10:42 AM, David Koch ogd...@googlemail.com wrote:

Hello,

I can copy relatively small tables (10gb, 7million rows) using the
built-in
HBase (0.92.1-cdh4.0.1) CopyTable utility but copying larger tables, say
150gb, 100million rows does not work.

The failed CopyTable job required 128 mappers according to the Job Tracker
UI, all of these failed in the first attempt after 15 minutes, the job
then
ran another 1 hour while remaining at 0%. However, according  to the
counters many rows apparently had been mapped and emitted. Checking with
HBase shell, I could not perform any action on the destination table
(scan,
get, count) and the HBase Master Web UI showed only one region for the
destination table. I checked the log file on this region server and saw
attached log record (extract).

What precautions should I take when copying tables? Do certain settings
need to be de-activated for the duration of the job?

Thank you,

/David


2012-12-05 15:50:40,406 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region
_xxx_EH_xxx,{\xF0\xE4\xA2?!EQ\xB8\xC9tE\x19\x92
\x08,1354713876229.a75fba31d9883ed7be4ed4a7be0e592f. due to global heap
pressure
2012-12-05 15:50:49,086 INFO org.apache.hadoop.hbase.regionserver.Store:
Added hdfs://
x-1.xx.net:8020/hbase/_xxx_EH_xxx/a75fba31d9883ed7be4ed4a7be0e
592f/t/1788b9f6f9594e2e9efe4ea5230d134c,
entries=418152, sequenceid=1440152048, memsize=217.0m, filesize=145.9m
2012-12-05 15:50:49,088 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Finished memstore flush of ~217.8m/228416264, currentsize=33.0m/34555320
for region _xxx_EH_xxx,{\xF0\xE4\xA2?!EQ\xB8\xC9tE\x19\x92
\x08,1354713876229.a75fba31d9883ed7be4ed4a7be0e592f. in 8682ms,
sequenceid=1440152048, compaction requested=true
2012-12-05 15:50:49,088 WARN
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region
_xxx_EH_xxx,,1354713876229.37825c623850b16013ab0bf902d02746. has too
many store files; delaying flush up to 9ms
2012-12-05 15:50:49,760 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder, call 
multi(org.apache.hadoop.hbase.client.MultiAction@44848967),
rpc version=1, client version=29, methodsFingerPrint=54742778 from
5.39.67.13:56290: output error
2012-12-05 15:50:49,760 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 9 on 60020 caught: java.nio.channels.ClosedChannelException
at 
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1663
)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseSer
ver.java:934)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.ja
va:1013)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Call.sendResponseIfReady(HBaseServ
er.java:419)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1356)

2012-12-05 15:50:49,763 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
listener on 60020: readAndProcess threw exception java.io.IOException:
Connection reset by peer. Count of bytes read: 0
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
at sun.nio.ch.IOUtil.read(IOUtil.java:171)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
at
org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1686)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseSer
ver.java:1130)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:7
13)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.doRunLoop(HBaseSer
ver.java:505)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.ja
va:480)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.
java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java
:908)
at java.lang.Thread.run(Thread.java:662)
2012-12-05 15:50:49,792 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
listener on 60020: readAndProcess threw exception java.io.IOException:
Connection reset by peer. Count of bytes read: 0
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
at sun.nio.ch.IOUtil.read(IOUtil.java:171)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
at

Re: Data Locality, HBase? Or Hadoop?

2012-12-03 Thread Doug Meil

Hi there-

This is also discussed in the Regions section in the RefGuide:

http://hbase.apache.org/book.html#regions.arch

9.7.3. Region-RegionServer Locality




On 12/3/12 10:08 AM, Kevin O'dell kevin.od...@cloudera.com wrote:

JM,

  If you have disabled the balancer and are manually moving regions, you
will need to run a compaction on those regions.  That is the only(logical)
way of bringing the data local.  HDFS does not have a concept of HBase
locality.  HBase locality is all managed through major and minor
compactions.

On Mon, Dec 3, 2012 at 10:04 AM, Jean-Marc Spaggiari 
jean-m...@spaggiari.org wrote:

 Hi,

 I'm wondering who is taking care of the data locality. Is it hadoop? Or
 hbase?

 Let's say I have disabled the load balancer and I'm manually moving a
 region to a specific server. Who is going to take care that the data
 is going to be on the same datanode as the regionserver I moved the
 region to? Is hadoop going to see that my region is now on this region
 server and make sure my data is moved there too? Or is hbase going to
 ask hadoop to do it?

 Or, since I moved it manually, there is not any data locality
guaranteed?

 Thanks,

 JM




-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera




Re: Multiple regionservers on a single node

2012-12-03 Thread Doug Meil

Hi there, 

Not tried multi-RS on a single node, but have you looked at the off-heap
cache?  It's a part of 0.92.x.  From what I understand that feature was
designed with this case in mind (I.e., trying to do a lot of caching, but
don't want to introduce GC issues in RS).

https://issues.apache.org/jira/browse/HBASE-4027



On 12/3/12 4:39 PM, Ishan Chhabra ichha...@rocketfuel.com wrote:

Hi,
Has anybody tried to run multiple RegionServers on a single physical
node? Are there deep technical issues or minor impediments that would
hinder this?

We are trying to do this because we are facing a lot of GC pauses on the
large heap sizes (~70G) that we are using, which leads to a lot of
timeouts
in our latency critical application. More processes with smaller heaps
would help in mitigating this issue.

Any experience or thoughts on this would help.
Thanks!

-- 
*Ishan Chhabra *| Rocket Scientist | Rocketfuel Inc. | *m *650 556 6803




Re: Connecting to standalone HBase from a remote client

2012-11-27 Thread Doug Meil

Hi there-

re:  From what I have understood, these properties are not for Hbase but
for the Hbase client which we write. They tell the client where to look for
ZK.

Yep.  That's how it works.  Then the client looks up ROOT/META and then
the client talks directly to the RegionServers.

http://hbase.apache.org/book.html#client
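A hedged sketch of the client side of that (the host name and table are placeholders): the only things the client needs are the ZK quorum and port, and everything else is discovered from there.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;

  Configuration conf = HBaseConfiguration.create();
  conf.set("hbase.zookeeper.quorum", "hbase-server.example.com");   // where ZK is running
  conf.set("hbase.zookeeper.property.clientPort", "2181");
  HTable table = new HTable(conf, "mytable");   // ROOT/META lookup and RS connections follow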





On 11/27/12 8:52 AM, Mohammad Tariq donta...@gmail.com wrote:

Hello Matan,

  From what I have understood, these properties are not for Hbase but
for the Hbase client which we write. They tell the client where to look
for
ZK.

Hmaster registers its address with ZK. And from there client will come to
know where to look for Hmaster. And if the Hmaster registers its address
as
'localhost', the client will take it as the 'localhost', which is client's
'localhost' and not the 'localhost' where Hmaster is running. So, if you
have the IP and hostname of the Hmaster in your /etc/hosts file the client
can reach that machine without any problem as there is proper DNS
resolution available.

But this just is what I think. I need approval from the heavyweights.

Stack sir??



Regards,
Mohammad Tariq



On Tue, Nov 27, 2012 at 5:57 PM, matan ma...@cloudaloe.org wrote:

 Thanks guys,

 Excuse my ignorance, but having sort of agreed that the configuration
that
 determines which-server-should-be-contacted-for-what is on the HBase
 server, I am not sure how any of the practical suggestions made should
 solve the issue, and enable connecting from a remote client.

 Let me delineate - setting /etc/hosts on my client side seems in this
 regard not relevant in that view. And the other suggestion for
 hbase-site.xml configuration I have already got covered, as my client
code
 successfully connects to zookeeper (the configuration properties
mentioned
 on this thread are zookeeper specific according to my interpretation of
 documentation, I don't directly see how they should solve the problem).
 Perhaps Mohammad you can explain why those zookeeper properties relate
to
 how the master references itself towards zookeeper?

 Should I take it from St.Ack that there is currently no way to specify
the
 master's remotely accessible server/ip in the HBase configuration?

 Anyway, my HBase server's /etc/hosts has just one line now, in case it
got
 lost on the thread -
 127.0.0.1 localhost 'server-name'. Everything works fine on the HBase
 server itself, the same client code runs perfectly there.

 Thanks again,
 Matan

 On Mon, Nov 26, 2012 at 10:15 PM, Tariq [via Apache HBase] 
 ml-node+s679495n4034419...@n3.nabble.com wrote:

  Hello Nicolas,
 
You are right. It has been deprecated. Thank you for updating my
  knowledge base..:)
 
  Regards,
  Mohammad Tariq
 
 
 
  On Tue, Nov 27, 2012 at 12:17 AM, Nicolas Liochon [hidden email] wrote:
 
   Hi Mohammad,
  
   Your answer was right, just that specifying the master address is
not
   necessary (anymore I think). But it does no harm.
   Changing the /etc/hosts (as you did) is right too.
   Lastly, if the cluster is standalone and accessed locally, having
  localhost
   in ZK will not be an issue. However, it's perfectly possible to
have a
   standalone cluster accessed remotely, so you don't want to have the
  master
   to write I'm on the server named localhost in this case. I expect
it
   won't be an issue for communications between the region servers or
hdfs
  as
   they would be all on the same localhost...
  
   Cheers,
  
   Nicolas
  
   On Mon, Nov 26, 2012 at 7:16 PM, Mohammad Tariq [hidden email] wrote:
  
what
  
 
 




 --
 View this message in context:
 
http://apache-hbase.679495.n3.nabble.com/Connecting-to-standalone-HBase-f
rom-a-remote-client-tp4034362p4034439.html
 Sent from the HBase User mailing list archive at Nabble.com.





Re: Expert suggestion needed to create table in Hbase - Banking

2012-11-26 Thread Doug Meil

Hi there, somebody already wisely mentioned the link to the # of CF's
entry, but here are a few other entries that can save you some heartburn
if you read them ahead of time.

http://hbase.apache.org/book.html#datamodel

http://hbase.apache.org/book.html#schema

http://hbase.apache.org/book.html#architecture





On 11/26/12 5:28 AM, Mohammad Tariq donta...@gmail.com wrote:

Hello sir,

You might become a victim of RS hotspotting, since the customerIDs will
be sequential (I assume). To keep things simple, HBase puts all the rows with
similar keys on the same RS. But it becomes a bottleneck in the long run
as all the data keeps going to the same region.

HTH

Regards,
Mohammad Tariq



On Mon, Nov 26, 2012 at 3:53 PM, Ramasubramanian Narayanan 
ramasubramanian.naraya...@gmail.com wrote:

 Hi,
 Thanks! Can we have the customer number as the RowKey for the customer
 (client) master table? Please help in educating me on the advantage and
 disadvantage of having customer number as the Row key...

  Also, we may need to implement SCD2 in that table.. will it work if I have
  it like that?

  Or

  is SCD2 not needed, and instead can we achieve the same by increasing the
  number of versions that it will hold?

 pls suggest...

 regards,
 Rams

 On Mon, Nov 26, 2012 at 1:10 PM, Li, Min m...@microstrategy.com wrote:

  When 1 cf needs to split, the other 599 cfs will split at the same time. So
  many fragments will be produced when you use so many column families.
  Actually, many cfs can be merged into only one cf with specific tags in the
  rowkey. For example, the rowkey of a customer address can be uid+'AD', and
  the customer profile can be uid+'PR'.
 
  Min
  -Original Message-
  From: Ramasubramanian Narayanan [mailto:
  ramasubramanian.naraya...@gmail.com]
  Sent: Monday, November 26, 2012 3:05 PM
  To: user@hbase.apache.org
  Subject: Expert suggestion needed to create table in Hbase - Banking
 
  Hi,
 
I have a requirement of physicalising the logical model... I have a
  client model which has 600+ entities...
 
Need suggestion how to go about physicalising it...
 
I have few other doubts :
    1) Is it good to create a single table for all the 600+ columns?
    2) Should we have different column families for different groups, or can
   it be under a single column family? For example, can we have customer
   address as a different column family?
 
Please help on this..
 
 
  regards,
  Rams
 





Re: Paging On HBASE like solr

2012-11-22 Thread Doug Meil

Hi there-

Then don't use an end-row and break out of the loop when you hit 100 rows.
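A hedged sketch of that approach (startRow, rowCount, and the HTable named table are assumed to come from the caller): start at the last rowkey seen and stop once rowCount rows have been collected, so deletions in between never shrink the page.

  Scan scan = new Scan(Bytes.toBytes(startRow));    // no stop row
  scan.setCaching(rowCount);                        // fetch roughly one page per RPC
  ResultScanner scanner = table.getScanner(scan);
  List<Result> page = new ArrayList<Result>(rowCount);
  try {
    for (Result r : scanner) {
      page.add(r);
      if (page.size() >= rowCount) {
        break;                                      // got a full page; stop scanning
      }
    }
  } finally {
    scanner.close();
  }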





On 11/22/12 5:16 AM, Vajrakumar vajra.ku...@pointcross.com wrote:

Hello Doug,
First of all thanks for taking time to reply.

As far as my knowledge goes, the below two lines take the rowkey as a
parameter representing start and end.

scan.setStartRow( Bytes.toBytes(row));   // start key is
inclusive
scan.setStopRow( Bytes.toBytes(row +  (char)0));  // stop key is
exclusive


But,
In my case, irrespective of rowkey, I always need 100 rows. If I go with this
concept and 5 rows are deleted in between 1 and 100, then it will give me 95
but not 100. But I always need 100 rows (I mean whatever rowCount I pass).


And since after usage there may be deletions or additions of rows in the DB, I
can't keep track of rows for this paging..
Paging needs a fixed number of rows in each page always.




-Original Message-
From: Doug Meil [mailto:doug.m...@explorysmedical.com]
Sent: 22 November 2012 00:21
To: user@hbase.apache.org
Subject: Re: Paging On HBASE like solr


Hi there,

Pretty similar approach with Hbase.  See the Scan class.

http://hbase.apache.org/book.html#data_model_operations






On 11/21/12 1:04 PM, Vajrakumar vajra.ku...@pointcross.com wrote:

Hello all,
As we do paging in solr using start and rowCount I need to implement
same through hbase.

In Detail:
I have 1000 rows data which I need to display in 10 pages each page
containing 100 rows.
So on click of next page we will send current rowStart
(1,101,201,301,401,501...) and rowCount (100 for all the pages) to a
method which will query hbase and return me the result.

One solution is to always query more than rowCount starting from the
rowkey of the last passed row, and in a for loop count depending on row key
and return when it reaches 100 (i.e., rowCount). But it's a poor solution,
I know.

Thanks in advance.

Sent from Samsung Mobile








Re: Paging On HBASE like solr

2012-11-21 Thread Doug Meil

Hi there,

Pretty similar approach with Hbase.  See the Scan class.

http://hbase.apache.org/book.html#data_model_operations






On 11/21/12 1:04 PM, Vajrakumar vajra.ku...@pointcross.com wrote:

Hello all,
As we do paging in solr using start and rowCount I need to implement same
through hbase.

In Detail:
I have 1000 rows data which I need to display in 10 pages each page
containing 100 rows.
So on click of next page we will send current rowStart
(1,101,201,301,401,501...) and rowCount (100 for all the pages) to a
method which will query hbase and return me the result.

One solution is to always query more than rowCount starting from the
rowkey of the last passed row, and in a for loop count depending on row key
and return when it reaches 100 (i.e., rowCount). But it's a poor solution, I
know.

Thanks in advance.

Sent from Samsung Mobile




Re: Region hot spotting

2012-11-21 Thread Doug Meil

Hi there-

If he's using monotonically increasing keys the pre splits won't help
because the same region is going to get all the writes.

http://hbase.apache.org/book.html#rowkey.design
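For reference, a hedged sketch of the usual workaround (the bucket count and key layout are illustrative): prepend a small hash-derived bucket byte so sequential timestamps fan out across several key ranges. The table should also be pre-split on that bucket byte, and reading a date range then means one scan per bucket.

  import org.apache.hadoop.hbase.util.Bytes;

  public class SaltedKeys {
    static final int NUM_BUCKETS = 10;

    // Layout: 1 bucket byte + 8-byte timestamp + 8-byte sequence.
    public static byte[] rowKey(long timestamp, long sequence) {
      int bucket = (Long.valueOf(sequence).hashCode() & 0x7fffffff) % NUM_BUCKETS;
      byte[] key = new byte[1 + 8 + 8];
      key[0] = (byte) bucket;                   // spreads writes across NUM_BUCKETS ranges
      System.arraycopy(Bytes.toBytes(timestamp), 0, key, 1, 8);
      System.arraycopy(Bytes.toBytes(sequence), 0, key, 9, 8);
      return key;
    }
  }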





On 11/21/12 12:33 PM, Suraj Varma svarma...@gmail.com wrote:

Ajay:
Why would you not want to specify splits while creating table? If your
0-10 prefix is at random ... why not pre-split with that?

Without presplitting, as Ram says, you cannot avoid region hotspotting
until table starts automatic splits.
--S

On Wed, Nov 21, 2012 at 3:46 AM, Ajay Bhosle
ajay.bho...@relianceada.com wrote:
 Thanks for your comments,

  I am already prefixing the timestamp with an integer in the range of 1..10;
  also, hbase.hregion.max.filesize is defined as 256 MB. Still it is
  hotspotting.

 Thanks
 Ajay

 -Original Message-
 From: ramkrishna vasudevan [mailto:ramkrishna.s.vasude...@gmail.com]
 Sent: Wednesday, November 21, 2012 2:14 PM
 To: user@hbase.apache.org
 Subject: Re: Region hot spotting

 Hi
  This link is pretty useful.  But still, there too it says that if you don't
  pre-split, you need to wait until the region gets split before the salting
  helps you avoid hotspotting.

  Mohammad, I am just pointing this out to note the usefulness of presplitting;
  yours is definitely a good pointer for Ajay. :)

 Regards
 Ram

 On Wed, Nov 21, 2012 at 1:59 PM, Mohammad Tariq donta...@gmail.com
wrote:

 Hello Ajay,

  You can use 'salting' if you don't want to presplit your table. You
might
 this link useful :


 
http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspottin
g-d
 espite-writing-records-with-sequential-keys/

 HTH

 Regards,
 Mohammad Tariq



 On Wed, Nov 21, 2012 at 1:49 PM, ramkrishna vasudevan 
 ramkrishna.s.vasude...@gmail.com wrote:

  Hotspotting is bound to happen until the region starts splitting and
 gets
  assigned to diff region servers.
 
  Regards
  Ram
 
  On Wed, Nov 21, 2012 at 12:49 PM, Ajay Bhosle
  ajay.bho...@relianceada.comwrote:
 
   Hi,
  
  
  
   I am inserting some data in hbase which is getting hot spotted in a
   particular server. The format of the row key is (0 or
   1)|[timestamp]_[sequence].  Basically I want to add log
information to
   hbase
   and search the records based on range of dates.
  
  
  
    Can someone suggest any configuration changes or any ideas on how the
    row key should be designed? I do not want to specify the splits while
    creating the table.
  
  
  
   Thanks
  
   Ajay
  
  
  
  
  
 










Re: Development work focused on HFile v2

2012-11-03 Thread Doug Meil


The HBase RefGuide has a big entry in the appendix on HFile v2.






On 11/3/12 5:34 PM, Marcos Ortiz mlor...@uci.cu wrote:

Regards to all HBase users.
I'm looking for all available information about the current development
of HFile
version 2 to write a blog post talking about the main differences
between HFile and
HFile version 2.
What I'm looking for?
- JIRA issues
- Research papers
- General discussions about this topic

Any help is welcome.
Thanks
-- 

Marcos Luis Ortíz Valmaseda
about.me/marcosortiz http://about.me/marcosortiz
@marcosluis2186 http://twitter.com/marcosluis2186







Re: how to copy oracle to HBASE, just like goldengate

2012-11-02 Thread Doug Meil

Additionally, don't take it for granted that an RDBMS and HBase are the
same thing.  Check out these sections of the RefGuide if you haven't
already.

http://hbase.apache.org/book.html#datamodel

http://hbase.apache.org/book.html#schema




On 11/1/12 11:01 PM, Shumin Wu shumin...@gmail.com wrote:

Have you taken a look at the Sqoop (http://sqoop.apache.org/) tool?

Shumin

On Thu, Nov 1, 2012 at 6:44 PM, Xiang Hua bea...@gmail.com wrote:

 Hi,
   Is there any tool to 'copy' the whole Oracle data of an instance into
  'hbase'?


   Best R.
huaxiang





Re: Does hbase.hregion.max.filesize have a limit?

2012-11-01 Thread Doug Meil

Hi there-

re:  The max file size the whole cluster can store for one CF is 60G,
right?

No, the max file-size for a region, in your example, is 60GB.  When the
data exceeds that the region will split - and then you'll have 2 regions
with 60GB limit.  

Check out this section of the RefGuide:

http://hbase.apache.org/book.html#regions.arch

Which explains how regions work and how data is distributed across your
cluster.

The trick is that you don't want regions too small, but you also don't want
them too big - because you'll wind up with what the ref guide describes in
this chapter...


9.7.1. Region Size

HBase scales by having regions across many servers. Thus if
  you have 2 regions for 16GB data, on a 20 node machine your data
  will be concentrated on just a few machines - nearly the entire
   cluster will be idle.  This really can't be stressed enough,
since a
  common problem is loading 200MB data into HBase then wondering
why
  your awesome 10 node cluster isn't doing anything.





On 11/1/12 4:09 AM, Cheng Su scarcer...@gmail.com wrote:

Thank you for your answer.
The max file size the whole cluster can store for one CF is 60G, right?
Maybe the only way is to split the large table into small tables...

On Thu, Nov 1, 2012 at 3:05 PM, ramkrishna vasudevan
ramkrishna.s.vasude...@gmail.com wrote:
  Can multiple region servers run on one real machine?
  (I guess not though)
  No.. Every RS runs on a different physical machine.

  max.file.size actually applies per region.  Suppose you create a table and
  then insert 20G of data; it will get explicitly split into further regions.
  Yes, all 60G of data can be stored on one physical machine, but that means
  the data is logically served by 3 regions.
 Does this help you?

 Regards
 Ram

 On Thu, Nov 1, 2012 at 12:15 PM, Cheng Su scarcer...@gmail.com wrote:

  Does that mean the max file size of 1 cf is 20G? If I have 3 region
  servers, then 60G total?
  I have a very large table; the size of one cf (which contains only one
  column) may exceed 60G.
  Is there any chance to store the data without increasing machines?

  Can multiple region servers run on one real machine?
  (I guess not though)

 On Thu, Nov 1, 2012 at 1:35 PM, lars hofhansl lhofha...@yahoo.com
wrote:
  The tribal knowledge would say about 20G is the max.
  The fellas from Facebook will have a more definite answer.
 
  -- Lars
 
 
 
  
   From: Cheng Su scarcer...@gmail.com
  To: user@hbase.apache.org
  Sent: Wednesday, October 31, 2012 10:22 PM
  Subject: Does hbase.hregion.max.filesize have a limit?
 
  Hi, all.
 
  I have a simple question: does hbase.hregion.max.filesize have a
limit?
  May I specify a very large value to this? like 40G or more? (don't
  consider the performance)
  I didn't find any description about this from official site or
google.
 
  Thanks.
 
  --
 
  Regards,
  Cheng Su



 --

 Regards,
 Cheng Su




-- 

Regards,
Cheng Su





Re: Best technique for doing lookup with Secondary Index

2012-10-26 Thread Doug Meil

Hey folks, for the record there are samples of using importtsv for
preparing HFiles in here...

http://hbase.apache.org/book.html#importtsv






On 10/26/12 12:44 AM, anil gupta anilgupt...@gmail.com wrote:

Hi Anoop,

Yes i use bulk loading for loading table A. I wrote my own mapper as
Importtsv wont suffice my requirements. :) No, i dont call HTable#put()
from my mapper. I was thinking about trying out calling HTable#put() from
my mapper and see the outcome.

 I meant to say that when we use MR job (ex. importtsv) then WAL is not
used. Sorry, if i misunderstood someone.

Thanks,
Anil

On Thu, Oct 25, 2012 at 9:06 PM, Anoop Sam John anoo...@huawei.com
wrote:

 Hi Anil,
   Some confusion after seeing your reply.
 You use bulk loading?  You created your own mapper?  You call
HTable#put()
 from mappers?

 I think confusion in another thread also..  I was refering to the
 HFileOutputReducer.. There is a TableOutputFormat also... In
 TableOutputFormat it will try put to the HTable...  Here write to WAL is
 applicable...


  [HFileOutputReducer] : As we discussed in another thread, in case of bulk
  loading the approach is that the MR job creates KVs and writes them to files,
  and this file is written as an HFile. Yes this will contain all meta information,
 trailer etc... Finally only HBase cluster need to be contacted just to
load
 this HFile(s) into HBase cluster.. Under corresponding regions.  This
will
 be the fastest way for bulk loading of huge data...


 -Anoop-
 
 From: anil gupta [anilgupt...@gmail.com]
 Sent: Friday, October 26, 2012 3:40 AM
 To: user@hbase.apache.org
 Subject: Re: Best technique for doing lookup with Secondary Index

 Anoop:  In prePut hook u call HTable#put()?
 Anil: Yes i call HTable#put() in prePut. Is there better way of doing
it?

 Anoop: Why use the network calls from server side here then?
 Anil: I thought this is a cleaner approach since i am using BulkLoader.
I
 decided not to run two jobs since i am generating a UniqueIdentifier at
 runtime in bulkloader.

 Anoop: can not handle it from client alone?
 Anil: I cannot handle it from client since i am using BulkLoader. Is it
a
 good idea to create Htable instance on B and do put in my mapper? I
might
 try this idea.

 Anoop: You can have a look at Lily project.
 Anil: It's little late for us to evaluate Lily now and at present we
dont
 need complex secondary index since our data is immutable.

 Ram: what is rowkey B here?
 Anil: Suppose i am storing customer events in table A. I have two
 requirement for data query:
 1. Query customer events on basis of customer_Id and event_ID.
 2. Query customer events on basis of event_timestamp and customer_ID.

 70% of querying is done by query#1, so i will create
 customer_Idevent_ID as row key of Table A.
 Now, in order to support fast results for query#2, i need to create a
 secondary index on A. I store that secondary index in B, rowkey of B is
 event_timestampcustomer_ID  .Every row stores the corresponding
rowkey
 of A.

 Ram:How is the startRow determined for every query?
 Anil: Its determined by a very simple application logic.

 Thanks,
 Anil Gupta

 On Wed, Oct 24, 2012 at 10:16 PM, Ramkrishna.S.Vasudevan 
 ramkrishna.vasude...@huawei.com wrote:

  Just out of curiosity,
   The secondary index is stored in table B as rowkey B --
   family:rowkey
   A
  what is rowkey B here?
   1. Scan the secondary table by using prefix filter and startRow.
  How is the startRow determined for every query ?
 
  Regards
  Ram
 
   -Original Message-
   From: Anoop Sam John [mailto:anoo...@huawei.com]
   Sent: Thursday, October 25, 2012 10:15 AM
   To: user@hbase.apache.org
   Subject: RE: Best technique for doing lookup with Secondary Index
  
   I build the secondary table B using a prePut RegionObserver.
  
   Anil,
  In prePut hook u call HTable#put()?  Why use the network
calls
   from server side here then? can not handle it from client alone? You
   can have a look at Lily project.   Thoughts after seeing ur idea on
put
   and scan..
  
   -Anoop-
   
   From: anil gupta [anilgupt...@gmail.com]
   Sent: Thursday, October 25, 2012 3:10 AM
   To: user@hbase.apache.org
   Subject: Best technique for doing lookup with Secondary Index
  
   Hi All,
  
   I am using HBase 0.92.1. I have created a secondary index on table
A.
   Table A stores immutable data. I build the secondary table B
using a
   prePut RegionObserver.
  
   The secondary index is stored in table B as rowkey B --
   family:rowkey
   A  . rowkey A is the column qualifier. Every row in B will
only on
   have one column and the name of that column is the rowkey of A. So
the
   value is blank. As per my understanding, accessing column qualifier
is
   faster than accessing value. Please correct me if i am wrong.
  
  
   HBase Querying approach:
   1. Scan the secondary table by using prefix filter and startRow.
   2. Do a batch get on 

Re: Hbase sequential row merging in MapReduce job

2012-10-19 Thread Doug Meil

As long as you know your keyspace, you should be able to create your own
splits.  See TableInputFormatBase for the default implementation (which is
1 input split per region)
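
As an illustration only (not from the ref guide), here is a rough sketch of
overriding getSplits() when the metric prefixes are known up front. The prefix
list and table name are assumptions, and a real implementation would also look
up region locations for locality hints:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

public class PrefixSplitTableInputFormat extends TableInputFormat {

  // Hypothetical list of key prefixes that must each stay in one mapper.
  private static final String[] PREFIXES = {"metric1_", "metric2_", "metric3_"};

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    byte[] tableName = Bytes.toBytes(context.getConfiguration().get(INPUT_TABLE));
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (int i = 0; i < PREFIXES.length; i++) {
      byte[] start = Bytes.toBytes(PREFIXES[i]);
      // Stop row is the next prefix, or empty (end of table) for the last one.
      byte[] stop = (i + 1 < PREFIXES.length)
          ? Bytes.toBytes(PREFIXES[i + 1]) : HConstants.EMPTY_END_ROW;
      // No locality hint supplied here; that is left out of the sketch.
      splits.add(new TableSplit(tableName, start, stop, ""));
    }
    return splits;
  }
}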





On 10/19/12 9:32 AM, Eric Czech eczec...@gmail.com wrote:

Hi everyone,

Is there any way to create an InputSplit for a MapReduce job (reading from
an HBase table) that guarantees sequential rows with some shared key
prefix
will end up in the same mapper?

For example, if I have sequential keys like this:

metric1_2010,
metric1_2011,
metric1_2012,
metric2_2011,
metric2_2012,
...

I want a mapper that will definitely see all the rows with keys that start
with metric1.

Is there a way to do this?

Thank you!




Re: Coprocessor end point vs MapReduce?

2012-10-18 Thread Doug Meil

To echo what Mike said about KISS, would you use triggers for a large
time-sensitive batch job in an RDBMS?  It's possible, but probably not.
Then you might want to think twice about using co-processors for such a
purpose with HBase.





On 10/17/12 9:50 PM, Michael Segel michael_se...@hotmail.com wrote:

Run your weekly job in a low priority fair scheduler/capacity scheduler
queue. 

Maybe its just me, but I look at Coprocessors as a similar structure to
RDBMS triggers and stored procedures.
You need to restrain and use them sparingly otherwise you end up creating
performance issues.

Just IMHO.

-Mike

On Oct 17, 2012, at 8:44 PM, Jean-Marc Spaggiari
jean-m...@spaggiari.org wrote:

 I don't have any concern about the time it's taking. It's more about
 the load it's putting on the cluster. I have other jobs that I need to
 run (secondary index, data processing, etc.). So the more time this
 new job is taking, the less CPU the others will have.
 
 I tried the M/R and I really liked the way it's done. So my only
 concern will really be the performance of the delete part.
 
 That's why I'm wondering what's the best practice to move a row to
 another table.
 
 2012/10/17, Michael Segel michael_se...@hotmail.com:
 If you're going to be running this weekly, I would suggest that you
stick
 with the M/R job.
 
 Is there any reason why you need to be worried about the time it takes
to do
 the deletes?
 
 
 On Oct 17, 2012, at 8:19 PM, Jean-Marc Spaggiari
jean-m...@spaggiari.org
 wrote:
 
 Hi Mike,
 
 I'm expecting to run the job weekly. I initially thought about using
 end points because I found HBASE-6942 which was a good example for my
 needs.
 
 I'm fine with the Put part for the Map/Reduce, but I'm not sure about
 the delete. That's why I look at coprocessors. Then I figure that I
 also can do the Put on the coprocessor side.
 
 On a M/R, can I delete the row I'm dealing with based on some criteria
 like timestamp? If I do that, I will not do bulk deletes, but I will
 delete the rows one by one, right? Which might be very slow.
 
 If in the future I want to run the job daily, might that be an issue?
 
 Or should I go with the initial idea of doing the Put with the M/R job
 and the delete with HBASE-6942?
 
 Thanks,
 
 JM
 
 
 2012/10/17, Michael Segel michael_se...@hotmail.com:
 Hi,
 
 I'm a firm believer in KISS (Keep It Simple, Stupid)
 
 The Map/Reduce (map job only) is the simplest and least prone to
 failure.
 
 Not sure why you would want to do this using coprocessors.
 
 How often are you running this job? It sounds like its going to be
 sporadic.
 
 -Mike
 
 On Oct 17, 2012, at 7:11 PM, Jean-Marc Spaggiari
 jean-m...@spaggiari.org
 wrote:
 
 Hi,
 
 Can someone please help me to understand the pros and cons between
 those 2 options for the following usecase?
 
 I need to transfer all the rows between 2 timestamps to another
table.
 
 My first idea was to run a MapReduce to map the rows and store them
on
 another table, and then delete them using an end point coprocessor.
 But the more I look into it, the more I think the MapReduce is not a
 good idea and I should use a coprocessor instead.
 
 BUT... The MapReduce framework guarantee me that it will run against
 all the regions. I tried to stop a regionserver while the job was
 running. The region moved, and the MapReduce restarted the job from
 the new location. Will the coprocessor do the same thing?
 
 Also, I found the webconsole for the MapReduce with the number of
 jobs, the status, etc. Is there the same thing with the
coprocessors?
 
 Are all coprocessors running at the same time on all regions, which
 mean we can have 100 of them running on a regionserver at a time? Or
 are they running like the MapReduce jobs based on some configured
 values?
 
 Thanks,
 
 JM
 
 
 
 
 
 
 






Re: Coprocessor end point vs MapReduce?

2012-10-18 Thread Doug Meil

I agree with the concern and there isn't a ton of guidance on this area
yet. 



On 10/18/12 2:01 PM, Michael Segel michael_se...@hotmail.com wrote:

Doug, 

One thing that concerns me is that a lot of folks are gravitating to
Coprocessors and may be using them for the wrong thing.
Has anyone done any sort of research as to some of the limitations and
negative impacts on using coprocessors?

While I haven't really toyed with the idea of bulk deletes, periodic
deletes is probably not a good use of coprocessors however using them
to synchronize tables would be a valid use case.

Thx

-Mike

On Oct 18, 2012, at 7:36 AM, Doug Meil doug.m...@explorysmedical.com
wrote:

 
 To echo what Mike said about KISS, would you use triggers for a large
 time-sensitive batch job in an RDBMS?  It's possible, but probably not.
 Then you might want to think twice about using co-processors for such a
 purpose with HBase.
 
 
 
 
 
 On 10/17/12 9:50 PM, Michael Segel michael_se...@hotmail.com wrote:
 
 Run your weekly job in a low priority fair scheduler/capacity scheduler
 queue. 
 
 Maybe its just me, but I look at Coprocessors as a similar structure to
 RDBMS triggers and stored procedures.
 You need to restrain and use them sparingly otherwise you end up
creating
 performance issues.
 
 Just IMHO.
 
 -Mike
 
 On Oct 17, 2012, at 8:44 PM, Jean-Marc Spaggiari
 jean-m...@spaggiari.org wrote:
 
 I don't have any concern about the time it's taking. It's more about
 the load it's putting on the cluster. I have other jobs that I need to
 run (secondary index, data processing, etc.). So the more time this
 new job is taking, the less CPU the others will have.
 
 I tried the M/R and I really liked the way it's done. So my only
 concern will really be the performance of the delete part.
 
 That's why I'm wondering what's the best practice to move a row to
 another table.
 
 2012/10/17, Michael Segel michael_se...@hotmail.com:
 If you're going to be running this weekly, I would suggest that you
 stick
 with the M/R job.
 
 Is there any reason why you need to be worried about the time it
takes
 to do
 the deletes?
 
 
 On Oct 17, 2012, at 8:19 PM, Jean-Marc Spaggiari
 jean-m...@spaggiari.org
 wrote:
 
 Hi Mike,
 
 I'm expecting to run the job weekly. I initially thought about using
 end points because I found HBASE-6942 which was a good example for
my
 needs.
 
 I'm fine with the Put part for the Map/Reduce, but I'm not sure
about
 the delete. That's why I look at coprocessors. Then I figure that I
 also can do the Put on the coprocessor side.
 
 On a M/R, can I delete the row I'm dealing with based on some
criteria
 like timestamp? If I do that, I will not do bulk deletes, but I will
 delete the rows one by one, right? Which might be very slow.
 
 If in the future I want to run the job daily, might that be an
issue?
 
 Or should I go with the initial idea of doing the Put with the M/R
job
 and the delete with HBASE-6942?
 
 Thanks,
 
 JM
 
 
 2012/10/17, Michael Segel michael_se...@hotmail.com:
 Hi,
 
 I'm a firm believer in KISS (Keep It Simple, Stupid)
 
 The Map/Reduce (map job only) is the simplest and least prone to
 failure.
 
 Not sure why you would want to do this using coprocessors.
 
 How often are you running this job? It sounds like its going to be
 sporadic.
 
 -Mike
 
 On Oct 17, 2012, at 7:11 PM, Jean-Marc Spaggiari
 jean-m...@spaggiari.org
 wrote:
 
 Hi,
 
 Can someone please help me to understand the pros and cons between
 those 2 options for the following usecase?
 
 I need to transfer all the rows between 2 timestamps to another
 table.
 
 My first idea was to run a MapReduce to map the rows and store
them
 on
 another table, and then delete them using an end point
coprocessor.
 But the more I look into it, the more I think the MapReduce is
not a
 good idea and I should use a coprocessor instead.
 
 BUT... The MapReduce framework guarantee me that it will run
against
 all the regions. I tried to stop a regionserver while the job was
 running. The region moved, and the MapReduce restarted the job
from
 the new location. Will the coprocessor do the same thing?
 
 Also, I found the webconsole for the MapReduce with the number of
 jobs, the status, etc. Is there the same thing with the
 coprocessors?
 
 Are all coprocessors running at the same time on all regions,
which
 mean we can have 100 of them running on a regionserver at a time?
Or
 are they running like the MapReduce jobs based on some configured
 values?
 
 Thanks,
 
 JM
 
 
 
 
 
 
 
 
 
 
 
 






Re: Problems using unqualified hostname on hbase

2012-10-17 Thread Doug Meil

Hi there.  You generally don't want to run with 2 clusters like that (HBase on
one, HDFS on the other) because your regions will have 0% locality.

For more information on this topic, see…

http://hbase.apache.org/book.html#regions.arch.locality




On 10/17/12 12:19 PM, Richard Tang tristartom.t...@gmail.com wrote:

Hello, everyone,
I have problems using hbase based on unqualified hostname. My ``hbase``
runs  in a cluster and ``hdfs`` on another cluster. While using fully
qualified name on ``hbase``, for properties like ``hbase.rootdir`` and
``hbase.zookeeper.quorum``, there is no problem. But when I change them to
be shorter unqualified names, like ``node4`` and ``node2``, (which are
resolved to local ip address by ``/etc/hosts``, like ``10.0.0.8``), the
hbase cluster begins to throw ``Connect refused`` messages. Anyone encounter
the same problem here? What is the possible reason behind all these? Thanks.
Regards,
Richard




Re: bulk load

2012-10-14 Thread Doug Meil

Yep.  Bulk-loads are an extremely useful way of loading data.  That would
be 2 jobs since those are 2 tables.

For more info on bulk loading, see…
http://hbase.apache.org/book.html#arch.bulk.load
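
To make the "two jobs" concrete, a hedged driver sketch follows; the table
names, paths, and the mapper class are placeholders rather than a tested
implementation. Each job writes HFiles for one table and then loads them with
LoadIncrementalHFiles:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiTableBulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Hypothetical targets: table name -> HFile staging directory.
    String[][] targets = { {"stats_table", "/tmp/hfiles_stats"},
                           {"event_table", "/tmp/hfiles_events"} };
    for (String[] t : targets) {
      Job job = new Job(conf, "bulkload-" + t[0]);
      job.setJarByClass(MultiTableBulkLoadDriver.class);
      // MyBulkLoadMapper is a placeholder for a mapper that emits Puts/KVs
      // for table t[0] only.
      job.setMapperClass(MyBulkLoadMapper.class);
      FileInputFormat.addInputPath(job, new Path("/input/raw"));  // assumed input
      FileOutputFormat.setOutputPath(job, new Path(t[1]));

      HTable table = new HTable(conf, t[0]);
      // Wires in the reducer/partitioner/output format so the HFiles line up
      // with the table's current region boundaries.
      HFileOutputFormat.configureIncrementalLoad(job, table);

      if (job.waitForCompletion(true)) {
        // Hand the generated HFiles to the regions of this table.
        new LoadIncrementalHFiles(conf).doBulkLoad(new Path(t[1]), table);
      }
      table.close();
    }
  }
}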





On 10/14/12 10:58 AM, yutoo yanio yutoo.ya...@gmail.com wrote:

hi
i want to bulk load my data, but using map/reduce and HFileOutputFormat has
a problem: it can only bulk load to one table.
but i want to load my data to more tables: one table for statistics, one
table for all of the data, ...

i am thinking about this approach : manually load data(locally or
map/reduce) and create HFiles for all tables and append incrementally to
tables.

is this approach good?

please help me.
thanks.




Re: Remove the row in MR job?

2012-10-12 Thread Doug Meil

I'm not entirely sure of the use-case, but here are some thoughts on this…

re:  should I take the table from the pool, and simply call the delete
method?

Yep, you can construct an HTable instance within a MR job.  But use the
delete that takes a list because the single-delete will invoke an RPC for
each one (not great over an MR job).

Construct the HTable instance at the Mapper level (not map-method level)
and keep a buffer of deletes in a List.  At the end of the job, send any
un-processed deletes in the cleanup method.
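
Something along these lines (a sketch only - the table name, batch size, and
the Put emission are placeholders):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;

// Buffers Deletes and flushes them in batches, plus once more in cleanup().
public class PurgeMapper extends TableMapper<ImmutableBytesWritable, Put> {

  private HTable targetTable;                       // table the rows are deleted from
  private List<Delete> deleteBuffer = new ArrayList<Delete>();
  private static final int BATCH_SIZE = 1000;       // assumed batch size

  @Override
  protected void setup(Context context) throws IOException {
    targetTable = new HTable(context.getConfiguration(), "source_table"); // assumed name
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    // ... emit the Put(s) for the destination table here ...
    deleteBuffer.add(new Delete(row.get()));
    if (deleteBuffer.size() >= BATCH_SIZE) {
      targetTable.delete(deleteBuffer);             // one batched RPC instead of 1000
      deleteBuffer.clear();
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    if (!deleteBuffer.isEmpty()) {
      targetTable.delete(deleteBuffer);             // send whatever is left
    }
    targetTable.close();
  }
}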


I'm not entirely sure why you'd want to delete every row in a table (as
opposed to processing all the rows in Table1 and generating an entirely
new Table2).  And then drop Table1 when you're done with it.  That seems
like it would be less hassle than deleting every row (since the table is
empty anyway).






On 10/12/12 1:20 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:

Hi,

I have a table which I want to parse over a MR job.

Today, I'm using a scan to parse all the rows. Each row is retrieve,
removed, and the parsed (feeding 2 other tables)

The goal is to parse all the content while some process might still be
adding some more.

On the map method from the MR job, can I delete the row I'm working
with? If so, how should I do? should I take the table from the pool,
and simply call the delete method? The issue is, doing a delete for
each line will take a while. I would prefer to batch them, but I don't
know when will be the last line, so it's difficult to know when to
send the batch.  Is there a way to say to the MR job to delete this
line? Also, what's the impact on the MR job if I delete the row it's
working one?

Or is the MR job not the best way to do that?

Thanks,

JM





Re: Remove the row in MR job?

2012-10-12 Thread Doug Meil

Just throwing an idea out there, but if you rotate tables you could
probably do what you want..

1)  Table1 is being written throughout the day
2)  It's time to kick off the MR job, but before the job is submitted
Table2 is now configured to be the 'write' table
3)  MR job processes all the data in Table1.  Table1 is dropped/truncated
when finished.
4)  Table2 continues to get writes
5)  Now it's time to run the MR job again, Table1 is now configured to be
the 'write' table and Table2 is processed by the MR job.
6)  Continue rotating between the tables

Something like this is probably going to be a lot easier to manage than
doing deletes of what you've read.




On 10/12/12 3:47 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:

Hi Doug,

Thanks for the suggestion. I like the idea of simply deleting the
table, however, I'm not sure if I can implement it.

Basically, I have one process which is constantly feeding the table,
and, once a day, I want to run a MR job to proccess this table (Which
will emtpy it).

While I'm processing it, I still want to other process to have the
ability to store data.

Since I can't rename the table because this functionnaly doesn't
exist, I need to have the 2 working on the same table.

Maybe what I can do is work on the column name. Like, I store on a
different column every day based on the day number and I just run MR
on all the columns except today. After that, I can delete all the
columns except the one for the current day. The issue is if the MR is
taking more than 24h...

Also, is that fast to delete a column?

JM

2012/10/12 Doug Meil doug.m...@explorysmedical.com:

 I'm not entirely sure of the use-case, but here are some thoughts on
thisŠ

 re:  should I take the table from the pool, and simply call the delete
 method?

 Yep, you can construct an HTable instance within a MR job.  But use the
 delete that takes a list because the single-delete will invoke an RPC
for
 each one (not great over an MR job).

 Construct the HTable instance at the Mapper level (not map-method level)
 and keep a buffer of deletes in a List.  At the end of the job, send any
 un-processed deletes in the cleanup method.


 I'm not entirely sure why you'd want to delete every row in a table (as
 opposed to processing all the rows in Table1 and generating an entirely
 new Table2).  And then drop Table1 when you're done with it.  That seems
 like it would be less hassle than deleting every row (since the table is
 empty anyway).






 On 10/12/12 1:20 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org
wrote:

Hi,

I have a table which I want to parse over a MR job.

Today, I'm using a scan to parse all the rows. Each row is retrieve,
removed, and the parsed (feeding 2 other tables)

The goal is to parse all the content while some process might still be
adding some more.

On the map method from the MR job, can I delete the row I'm working
with? If so, how should I do? should I take the table from the pool,
and simply call the delete method? The issue is, doing a delete for
each line will take a while. I would prefer to batch them, but I don't
know when will be the last line, so it's difficult to know when to
send the batch.  Is there a way to say to the MR job to delete this
line? Also, what's the impact on the MR job if I delete the row it's
working one?

Or is the MR job not the best way to do that?

Thanks,

JM








Re: HBase table - distinct values

2012-10-10 Thread Doug Meil

Typically this is something done as a MapReduce job.

http://hbase.apache.org/book.html#mapreduce.example

7.2.4. HBase MapReduce Summary to HBase Example



However, if this is an operation to be performed frequently by an
application then doing frequent MapReduce jobs for summaries probably
isn't the best idea.  Either produce periodic summaries into another Hbase
table, or denormalize and keep track of the required summaries upon data
load.
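
As a rough sketch of that summary-table approach for the "count distinct
deptno" case - the table names, column family, and qualifiers below are
assumptions:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class DistinctDeptSummary {

  static class DeptMapper extends TableMapper<Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      byte[] dept = value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("deptno"));
      if (dept != null) {
        context.write(new Text(Bytes.toString(dept)), ONE);   // one count per row
      }
    }
  }

  static class DeptReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      long count = 0;
      for (IntWritable v : values) count += v.get();
      Put put = new Put(Bytes.toBytes(key.toString()));        // one row per distinct deptno
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(count));
      context.write(null, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "distinct-deptno-summary");
    job.setJarByClass(DistinctDeptSummary.class);
    Scan scan = new Scan();
    scan.setCaching(500);
    scan.setCacheBlocks(false);              // don't churn the block cache from MR
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("deptno"));
    TableMapReduceUtil.initTableMapperJob("emp", scan, DeptMapper.class,
        Text.class, IntWritable.class, job);
    TableMapReduceUtil.initTableReducerJob("emp_dept_summary", DeptReducer.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The distinct count is then just the number of rows in the summary table.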



On 10/10/12 6:59 AM, raviprasa...@polarisft.com
raviprasa...@polarisft.com wrote:

Hi all,
  Is it possible to select distinct value from Hbase table.

Example :- 
   what is the equivalent code for the below Oracle code in Hbase?

  Select count (distinct deptno) from emp ;

Regards
Raviprasad. T






Re: How to select distinct column from a Hbase table

2012-10-10 Thread Doug Meil

Here's a Scan example from the Hbase ref guide…

http://hbase.apache.org/book.html#scan

… but this assumes you are asking about simply getting a distinct column
from a table, as opposed to doing a distinct count query which was in
another email thread today.



On 10/10/12 11:20 AM, Ramasubramanian Narayanan
ramasubramanian.naraya...@gmail.com wrote:

Hi,

Can someone help in providing the query to select the distinct column from
a Hbase table. If not should we do it only in Pig script?

regards,
Rams




Re: key design

2012-10-10 Thread Doug Meil
Hi there-

Given the fact that the userid is in the lead position of the key in both
approaches, I'm not sure that he'd have a region hotspotting problem
because the userid should be able to offer some spread.
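
For what it's worth, a small sketch of how the two layouts could be built and
range-scanned for one user. This is a method-body fragment (Put, Scan, Bytes
imported); the separator-free concatenation, the "cf" family, and the date
formats are assumptions:

// Sketch only: userId, content, and the time range are placeholders.
String userId = "user123";
long ts = System.currentTimeMillis();
long startTs = ts - 86400000L, endTs = ts;   // example range: last 24 hours

// Option 1 (tall-narrow): rowkey = userid + timestamp, content in one column.
byte[] rowOpt1 = Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(ts));
Put p1 = new Put(rowOpt1);
p1.add(Bytes.toBytes("cf"), Bytes.toBytes("content"), Bytes.toBytes("payload"));

// Option 2 (flat-wide): rowkey = userid + yyyyMMdd, one qualifier per HHmmss.
byte[] rowOpt2 = Bytes.add(Bytes.toBytes(userId), Bytes.toBytes("20121010"));
Put p2 = new Put(rowOpt2);
p2.add(Bytes.toBytes("cf"), Bytes.toBytes("153000"), Bytes.toBytes("payload"));

// Range query for one user between two timestamps (option 1):
Scan scan = new Scan();
scan.setStartRow(Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(startTs)));
scan.setStopRow(Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(endTs)));

Using a fixed-width encoding for the timestamp (Bytes.toBytes(long)) keeps the
lexicographic sort consistent for non-negative timestamps.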




On 10/10/12 12:55 PM, Jerry Lam chiling...@gmail.com wrote:

Hi:

So you are saying you have ~3TB of data stored per day?

Using the second approach, all data for one day will go to only 1
regionserver no matter what you do because HBase doesn't split a single
row.

Using the first approach, data will spread across regionservers but there
will be hotspotting on each regionserver during writes since this is a
time-series problem.

Best Regards,

Jerry

On Wed, Oct 10, 2012 at 11:24 AM, yutoo yanio yutoo.ya...@gmail.com
wrote:

 hi
 i have a question about key  column design.
 in my application we have 3,000,000,000 record in every day
 each record contain : user-id, time stamp, content(max 1KB).
 we need to store records for one year, this means we will have about
 1,000,000,000,000 after 1 year.
 we just search a user-id over rang of time stamp
 table can design in two way
 1.key=userid-timestamp and column:=content
 2.key=userid-MMdd and column:HHmmss=content


  in the first design we have a tall-narrow table but with very, very many
records; in the second design we have a flat-wide table.
 which of them have better performance?

 thanks.





Re: Regarding Hbase tuning - Configuration at table level

2012-10-10 Thread Doug Meil

Re:  JD's suggestion, this and more exciting and useful things can be
found in these sections of the Hbase ref guide.

http://hbase.apache.org/book.html#perf.reading
http://hbase.apache.org/book.html#perf.writing

Well, maybe not exciting, but certainly useful.  :-)




On 10/10/12 2:04 PM, Jean-Daniel Cryans jdcry...@apache.org wrote:

At the table level there's only deferred log flush that would help,
and only with writes at the cost of some durability.

J-D

On Wed, Oct 10, 2012 at 8:26 AM, Ramasubramanian Narayanan
ramasubramanian.naraya...@gmail.com wrote:
 Hi,

 What are all the configurations that can be done at Hbase table level to
  improve the performance of Hbase table for both read and write...

 regards,
 Rams





Re: HBase Key Design : Doubt

2012-10-10 Thread Doug Meil

Correct.

If you do 2 Puts for row key A-B-C-D on different days, the second Put
logically replaces the first and the earlier Put becomes a previous
version.  Unless you specifically want older versions, you won't get them
in either Gets or Scans.

Definitely want to read this…

http://hbase.apache.org/book.html#datamodel

See this for more information about the internal KeyValue structure.

http://hbase.apache.org/book.html#regions.arch
9.7.5.4. KeyValue


Older versions are kept around as long as the table descriptor says so
(e.g., max versions).  See the StoreFile and Compactions entries in the
RefGuide for more information on the internals.
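
A quick sketch of asking for those older versions explicitly (table name is
assumed; this is fragment-level illustration only):

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "mytable");        // assumed table name
Get get = new Get(Bytes.toBytes("A-B-C-D"));
get.setMaxVersions(10);   // without this, only the newest version comes back
Result result = table.get(get);
// Each KeyValue carries its own timestamp, so the different days are visible:
for (KeyValue kv : result.raw()) {
  System.out.println(Bytes.toString(kv.getQualifier()) + " @ " + kv.getTimestamp());
}
table.close();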




On 10/10/12 3:24 PM, Jerry Lam chiling...@gmail.com wrote:

correct me if I'm wrong. The version applies to the individual cell (ie.
row key, column family and column qualifier) not (row key, column family).


On Wed, Oct 10, 2012 at 3:13 PM, Narayanan K knarayana...@gmail.com
wrote:

 Hi all,

 I have a usecase wherein I need to find the unique of some things in
HBase
 across dates.

 Say, on 1st Oct, A-B-C-D appeared, hence I insert a row with rowkey :
 A-B-C-D.
 On 2nd Oct, I get the same value A-B-C-D and I don't want to redundantly
 store the row again with a new rowkey - A-B-C-D for 2nd Oct
 i.e I will not want to have 20121001-A-B-C-D and 20121002-A-B-C-D as 2
 rowkeys in the table.

 Eg: If I have 1st Oct , 2nd Oct as 2 column families and if number of
 versions are set to 1, only 1 row will be present in for both the dates
 having rowkey A-B-C-D.
 Hence if I need to find unique number of times A-B-C-D appeared during
Oct
 1 and Oct 2, I just need to take rowcount of the row A-B-C-D by
filtering
 over the 2 column families.
 Similarly, if we have 10  date column families, and I need to scan only
for
 2 dates, then it scans only those store files having the specified
column
 families. This will make scanning faster.

 But here the design problem is that I cant add more column families to
the
 table each day.

 I would need to store data every day and I read that HBase doesnt work
well
 with more than 3 column families.

 The other option is to have one single column family and store dates as
 qualifiers : date:d1, date:d2 But here if there are 30 date
qualifiers
 under date column family, to scan a single date qualifier or may be
range
 of 2-3 dates will have to scan through the entire data of all d1 to d30
 qualifiers in the date column family which would be slower compared to
 having separate column families for the each date..

 Please share your thoughts on this. Also any alternate design
suggestions
 you might have.

 Regards,
 Narayanan





Re: HBase client slows down

2012-10-09 Thread Doug Meil

It's one of those "it depends" answers.

See this first…

http://hbase.apache.org/book.html#perf.writing

… Additionally, one thing to understand is where you are writing data.
Either keep track of the requests per RS over the period (e.g., the web
interface), or you can also track it on the client side with...

http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#
getRegionLocation%28byte[],%20boolean%29


… to know if you are continually hitting the same RS or spreading the load.
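
For example, a hedged fragment of that client-side check (table name and key
are placeholders):

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "timeseries");     // assumed table name
byte[] rowKey = Bytes.toBytes("some-row-key");     // placeholder key
// 'true' forces a fresh lookup instead of trusting the cached location.
HRegionLocation loc = table.getRegionLocation(rowKey, true);
System.out.println(Bytes.toString(rowKey) + " -> " + loc);  // region + host:port
table.close();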



On 10/9/12 1:27 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

I just have 5 stress client threads writing timeseries data. What I see is
after few mts HBaseClient slows down and starts to take 4 secs. Once I
kill
the client and restart it stays at sustainable rate for about 2 mts and
then again it slows down. I am wondering if there is something I should be
doing on the HBaseclient side? All the request are similar in terms of
data.




Re: HBase client slows down

2012-10-09 Thread Doug Meil

So you're running on a single regionserver?




On 10/9/12 1:44 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

I am using HTableInterface as a pool but I don't see any setautoflush
method. I am using 0.92.1 jar.

Also, how can I see if RS is getting overloaded? I looked at the UI and I
don't see anything obvious:

requestsPerSecond=0, numberOfOnlineRegions=1, numberOfStores=1,
numberOfStorefiles=1, storefileIndexSizeMB=0, rootIndexSizeKB=1,
totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, memstoreSizeMB=27,
readRequestsCount=126, writeRequestsCount=96157, compactionQueueSize=0,
flushQueueSize=0, usedHeapMB=44, maxHeapMB=3976, blockCacheSizeMB=8.79,
blockCacheFreeMB=985.34, blockCacheCount=11, blockCacheHitCount=23,
blockCacheMissCount=28, blockCacheEvictedCount=0, blockCacheHitRatio=45%,
blockCacheHitCachingRatio=67%, hdfsBlocksLocalityIndex=100

On Tue, Oct 9, 2012 at 10:32 AM, Doug Meil
doug.m...@explorysmedical.comwrote:


 It's one of those it depends answers.

 See this firstŠ

 http://hbase.apache.org/book.html#perf.writing

 Š Additionally, one thing to understand is where you are writing data.
 Either keep track of the requests per RS over the period (e.g., the web
 interface), or you can also track it on the client side with...

 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.htm
l#
 getRegionLocation%28byte[],%20boolean%29


 Š to know if you are continually hitting the same RS or spreading the
load.



 On 10/9/12 1:27 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 I just have 5 stress client threads writing timeseries data. What I
see is
 after few mts HBaseClient slows down and starts to take 4 secs. Once I
 kill
 the client and restart it stays at sustainable rate for about 2 mts and
 then again it slows down. I am wondering if there is something I
should be
 doing on the HBaseclient side? All the request are similar in terms of
 data.







Re: Column Qualifier space requirements

2012-10-03 Thread Doug Meil

Hi there, there is a separate Store per ColumnFamily, which results in
separate StoreFiles on disk.

http://hbase.apache.org/book.html#regions.arch


This section has a description of what the files look like on HDFS…

http://hbase.apache.org/book.html#trouble.namenode



On 10/3/12 10:35 AM, Fuad Efendi f...@efendi.ca wrote:

Hi Anoop,


Thanks for the response!

- I thought that each Column Family is associated with separate MapFile in
Hadoop... but it was before 0.20... I found details at
http://www.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/

-Fuad







Re: Bulk Loads and Updates

2012-10-03 Thread Doug Meil

Hi there-

re:  All 20 versions will get loaded but the 10 oldest will be deleted
during
the next major compaction.

Yep, that's what is expected to happen.

For information on KeyValue structure and compaction algorithm, see…


http://hbase.apache.org/book.html#regions.arch

For info on bulk loading, see..

http://hbase.apache.org/book.html#arch.bulk.load





On 10/3/12 4:12 PM, Paul Mackles pmack...@adobe.com wrote:

Keys in hbase are a combination of rowkey/column/timestamp.

Two records with the same rowkey but different column will result in two
different cells with the same rowkey which is probably what you expect.

For two records with the same rowkey and same column, the timestamp will
normally differentiate them but in the case of a bulk load, the timestamp
could be the same so it may actually be a tie and both will be stored.
There are no updates in bulk loads.

All 20 versions will get loaded but the 10 oldest will be deleted during
the next major compaction.

I would definitely recommend setting up small scale tests for all of the
above scenarios to confirm.

On 10/3/12 3:35 PM, Juan P. gordoslo...@gmail.com wrote:

Hi guys,
I've been reading up on bulk load using MapReduce jobs and I wanted to
validate something.

If I the input I wanted to load into HBase produced the same key for
several lines. How will HBase handle that?

I understand the MapReduce job will create StoreFiles which the region
servers just pick up and make available to the users. But is there a
validation to treat the first as insert and the rest as updates?

What about the limit on the number of versions of a key HBase can have?
If
I want to have 10 versions, but the bulk load has 20 values for the same
key, will it only keep the last 10?

Thanks,
Juan






Re: HBase vs. HDFS

2012-10-02 Thread Doug Meil

Hi there, 

Another thing to consider on top of the scan-caching is that HBase is
doing more in the process of scanning the table.  See...

http://hbase.apache.org/book.html#conceptual.view

http://hbase.apache.org/book.html#regions.arch


... Specifically, processing the KeyValues, potentially merging rows between
StoreFiles, checking for un-flushed updates in the MemStore per CF.



On 10/1/12 8:54 PM, Doug Meil doug.m...@explorysmedical.com wrote:


Hi there-

Might want to start with thisŠ

http://hbase.apache.org/book.html#perf.reading

Š if you're using default scan caching (which is 1) that would explain a
lot.




On 10/1/12 7:01 PM, Juan P. gordoslo...@gmail.com wrote:

Hi guys,
I'm trying to get familiarized with HBase and one thing I noticed is that
reads seem to very slow. I just tried doing a scan 'my_table' to get
120K
records and it took about 50 seconds to print it all out.

In contrast hadoop fs -cat my_file.csv where my_file.csv has 120K lines
completed in under a second.

Is that possible? Am I missing something about HBase reads?

Thanks,
Joni







Re: HBase vs. HDFS

2012-10-02 Thread Doug Meil

If you take Hbase out of it and think of it from the standpoint of 2
programs, one of which opens a file and writes the output to another file,
and the other one which actually processes each row and then writes out
results, the 2nd one is going to be slower because it's doing more,
ceteris paribus.  HBase is like the 2nd program in your test.




On 10/2/12 8:46 AM, gordoslocos gordoslo...@gmail.com wrote:

Thank you all! Setting a cache size helped a great deal. It's still
slower though.

I think it might be possible that the overhead of processing the data
from the table might be the cause.

I guess if HBase adds an indirection to the HDFS then it makes sense that
it'd be slower, right?

On 02/10/2012, at 09:28, Doug Meil doug.m...@explorysmedical.com wrote:

 
 Hi there, 
 
 Another thing to consider on top of the scan-caching is that that HBase
is
 doing more in the process of scanning the table.  See...
 
 http://hbase.apache.org/book.html#conceptual.view
 
 http://hbase.apache.org/book.html#regions.arch
 
 
 ... Specifically, processing the KeyValues, potentially merging rows
between
 StoreFiles, checking for un-flushed updates in the MemStore per CF.
 
 
 
 On 10/1/12 8:54 PM, Doug Meil doug.m...@explorysmedical.com wrote:
 
 
 Hi there-
 
 Might want to start with thisŠ
 
 http://hbase.apache.org/book.html#perf.reading
 
 Š if you're using default scan caching (which is 1) that would explain
a
 lot.
 
 
 
 
 On 10/1/12 7:01 PM, Juan P. gordoslo...@gmail.com wrote:
 
 Hi guys,
 I'm trying to get familiarized with HBase and one thing I noticed is
that
 reads seem to very slow. I just tried doing a scan 'my_table' to get
 120K
 records and it took about 50 seconds to print it all out.
 
 In contrast hadoop fs -cat my_file.csv where my_file.csv has 120K
lines
 completed in under a second.
 
 Is that possible? Am I missing something about HBase reads?
 
 Thanks,
 Joni
 
 





Re: HBase table row key design question.

2012-10-02 Thread Doug Meil

Hi there, while this isn't an answer to some of the specific design
questions, this chapter in the RefGuide can be helpful for general design..

http://hbase.apache.org/book.html#schema





On 10/2/12 10:28 AM, Jason Huang jason.hu...@icare.com wrote:

Hello,

I am designing a HBase table for users and hope to get some
suggestions for my row key design. Thanks...

This user table will have columns which include user information such
as names, birthday, gender, address, phone number, etc... The first
time user comes to us we will ask all these information and we should
generate a new row in the table with a unique row key. The next time
the same user comes in again we will ask for his/her names and
birthday and our application should quickly get the row(s) in the
table which meets the name and birthday provided.

Here is what I am thinking as row key:

{first 6 digit of user's first name}_{first 6 digit of user's last
name}_{birthday in MMDD}_{timestamp when user comes in for the
first time}

However, I see a few questions from this row key:

(1) Although it is not very likely but there could be some small
chances that two users with same name and birthday came in at the same
day. And the two requests to generate new user came at the same time
(the timestamps were defined in the HTable API and happened to be of
the same value before calling the put method). This means the row key
design above won't guarantee a unique row key. Any suggestions on how
to modify it and ensure a unique ID?

(2) Sometimes we will only have part of user's first name and/or last
name. In that case, we will need to perform a scan and return multiple
matches to the client. To avoid scanning the whole table, if we have
user's first name, we can set start/stop row accordingly. But then if
we only have user's last name, we can't set up a good start/stop row.
What's even worse, if the user provides a sounds-like first or last
name, then our scan won't be able to return good possible matches.
Does anyone ever use names as part of the row key and encounter this
type of issue?

(3) The row key seems to be long (30+ chars), will this affect our
read/write performance? Maybe it will increase the storage a bit (say
we have 3 million rows per month)? In other words, does the length of
the row key matter a lot?

thanks!

Jason





Re: Column Qualifier space requirements

2012-10-01 Thread Doug Meil

Hi there, take a look at the Hbase Refguide here...

http://hbase.apache.org/book.html#regions.arch

For this section...

9.7.5.4. KeyValue






On 10/1/12 9:50 AM, Fuad Efendi f...@efendi.ca wrote:

Hi,

Is column qualifier physically stored in a Cell? Or pointer to it? Do we
need to care about long size such as
my_very_long_qualifier:1

(size of a value is small in comparison to size of qualifierŠ)

thanks






Re: HBase vs. HDFS

2012-10-01 Thread Doug Meil

Hi there-

Might want to start with thisŠ

http://hbase.apache.org/book.html#perf.reading

… if you're using default scan caching (which is 1) that would explain a
lot.
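
For example (a minimal sketch, table name assumed; the value 500 is just an
illustration, not a recommendation):

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "my_table");    // assumed table name
Scan scan = new Scan();
scan.setCaching(500);   // e.g. 500 rows per RPC instead of the default of 1
ResultScanner scanner = table.getScanner(scan);
try {
  for (Result r : scanner) {
    // process r
  }
} finally {
  scanner.close();      // always release the scanner
  table.close();
}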




On 10/1/12 7:01 PM, Juan P. gordoslo...@gmail.com wrote:

Hi guys,
I'm trying to get familiarized with HBase and one thing I noticed is that
reads seem to very slow. I just tried doing a scan 'my_table' to get
120K
records and it took about 50 seconds to print it all out.

In contrast hadoop fs -cat my_file.csv where my_file.csv has 120K lines
completed in under a second.

Is that possible? Am I missing something about HBase reads?

Thanks,
Joni




Re: Clarification regarding major compaction logic

2012-09-23 Thread Doug Meil

Hi there, for background on the file selection algorithm for compactions,
see...

http://hbase.apache.org/book.html#regions.arch

9.7.5.5. Compaction






On 9/23/12 9:59 AM, Monish r monishs...@gmail.com wrote:

Hi guys,

i would like to clarify the following regarding Major Compaction

1) When TTL is set for a column family and major compaction is triggered
by
user

- Does it act on the region only when *time since last major compaction is
  > TTL*?

2) Does major compaction go through the index of a region to find out that
there is data to be acted upon and then start the rewriting  ( or ) does
it
rewrite without any pre checks about the data  inside the region ?

3) If major compaction for a region results in a empty region , does the
empty region get deleted or left as such ?

Regards,
R.Monish




Re: What is the best value to be used in rowkey

2012-09-22 Thread Doug Meil

Hi there, you probably want to read this…

http://hbase.apache.org/book.html#schema





On 9/22/12 10:29 AM, Ramasubramanian Narayanan
ramasubramanian.naraya...@gmail.com wrote:

Hi,

Can anyone suggest what is the best value that can be used for a rowkey in
a hbase table which will not produce duplicates at any point of time. For
example timestamp with nano seconds may get duplicated if we are loading
in
a batch file.

regards,
Rams




Re: HBase Multi-Threaded Writes

2012-09-19 Thread Doug Meil

Hi there,

You haven't described much about your environment, but there are two
things you might want to consider for starters:

1)  Is the table pre-split?  (I.e., if it isn't, there is one region)
2)  If it is, are all the writes hitting the same region?

For other write tips, see this…

http://hbase.apache.org/book.html#perf.writing
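
For example, if the table was created without splits, a sketch of pre-splitting
at creation time follows. This is a fragment only; the table name, family, and
split points are assumptions and depend entirely on your keyspace:

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);

HTableDescriptor desc = new HTableDescriptor("mytable");   // assumed table name
desc.addFamily(new HColumnDescriptor("cf"));               // assumed column family

// Assumed split points - in practice these come from your real key distribution.
byte[][] splits = new byte[][] {
    Bytes.toBytes("2"), Bytes.toBytes("4"),
    Bytes.toBytes("6"), Bytes.toBytes("8") };

admin.createTable(desc, splits);   // table starts out with 5 regions instead of 1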





On 9/19/12 2:53 PM, Pankaj Misra pankaj.mi...@impetus.co.in wrote:

Dear All,

I have created 2 clients with multi-threading support to perform
concurrent writes to HBase with initial expectation that with multiple
threads I should be able to write faster. The clients that I created are
using the Native HBase API and Thrift API.

To my surprise, the performance with multi-threaded clients dropped for
both the clients consistently when compared to single threaded
ingestion. As I increase the number of threads the writes performance
degrades consistently. With a single thread ingestion both the clients
perform far better, but I intend to use HBase in a multi-threaded
environment, wherein I am facing challenges with the performance.

Since I am relatively new to HBase, please do excuse me if I am asking
something very basic, but any suggestions around this would be extremely
helpful.

Thanks and Regards
Pankaj Misra








Re: HBase Multi-Threaded Writes

2012-09-19 Thread Doug Meil

re:  pseudo-distributed mode

Ok, so you're doing a local test.  The benefit you get with multiple
regions per table spread across multiple RegionServers is that
you can engage more of the cluster in your workload.  You can't really do
that on a local test.




On 9/19/12 4:48 PM, Pankaj Misra pankaj.mi...@impetus.co.in wrote:

Thank you so much Doug.

You are  right there is only one region to start with as I am not
pre-splitting them. So for a given set of writes, all are hitting the
same region.

I will have the table pre-split as described, and test again. Will the
number of region servers also impact the writes performance?

My environment is HBase 0.94.1 with Hadoop 0.23.1, running on Oracle JVM
1.6. I am running hbase in a pseudo-distributed mode. Please find below
my hbase-site.xml, which has very basic configurations.
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>


Thanks and Regards
Pankaj Misra



From: Doug Meil [doug.m...@explorysmedical.com]
Sent: Thursday, September 20, 2012 1:48 AM
To: user@hbase.apache.org
Subject: Re: HBase Multi-Threaded Writes

Hi there,

You haven't described much about your environment, but there are two
things you might want to consider for starters:

1)  Is the table pre-split?  (I.e., if it isn't, there is one region)
2)  If it is, are all the writes hitting the same region?

For other write tips, see thisŠ

http://hbase.apache.org/book.html#perf.writing





On 9/19/12 2:53 PM, Pankaj Misra pankaj.mi...@impetus.co.in wrote:

Dear All,

I have created 2 clients with multi-threading support to perform
concurrent writes to HBase with initial expectation that with multiple
threads I should be able to write faster. The clients that I created are
using the Native HBase API and Thrift API.

To my surprise, the performance with multi-threaded clients dropped  for
the both the clients consistently when compared to single threaded
ingestion. As I increase the number of threads the writes performance
degrades consistently. With a single thread ingestion both the clients
perform far better, but I intend to use HBase in a multi-threaded
environment, wherein I am facing challenges with the performance.

Since I am relatively new to HBase, please do excuse me if I am asking
something very basic, but any suggestions around this would be extremely
helpful.

Thanks and Regards
Pankaj Misra













Re: HBase Multi-Threaded Writes

2012-09-19 Thread Doug Meil

You probably want to do a review of these chapters too...

http://hbase.apache.org/book.html#architecture
http://hbase.apache.org/book.html#datamodel
http://hbase.apache.org/book.html#schema





On 9/19/12 4:48 PM, Pankaj Misra pankaj.mi...@impetus.co.in wrote:

Thank you so much Doug.

You are  right there is only one region to start with as I am not
pre-splitting them. So for a given set of writes, all are hitting the
same region.

I will have the table pre-split as described, and test again. Will the
number of region servers also impact the writes performance?

My environment is HBase 0.94.1 with Hadoop 0.23.1, running on Oracle JVM
1.6. I am running hbase in a pseudo-distributed mode. Please find below
my hbase-site.xml, which has very basic configurations.
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>


Thanks and Regards
Pankaj Misra



From: Doug Meil [doug.m...@explorysmedical.com]
Sent: Thursday, September 20, 2012 1:48 AM
To: user@hbase.apache.org
Subject: Re: HBase Multi-Threaded Writes

Hi there,

You haven't described much about your environment, but there are two
things you might want to consider for starters:

1)  Is the table pre-split?  (I.e., if it isn't, there is one region)
2)  If it is, are all the writes hitting the same region?

For other write tips, see thisŠ

http://hbase.apache.org/book.html#perf.writing





On 9/19/12 2:53 PM, Pankaj Misra pankaj.mi...@impetus.co.in wrote:

Dear All,

I have created 2 clients with multi-threading support to perform
concurrent writes to HBase with initial expectation that with multiple
threads I should be able to write faster. The clients that I created are
using the Native HBase API and Thrift API.

To my surprise, the performance with multi-threaded clients dropped  for
the both the clients consistently when compared to single threaded
ingestion. As I increase the number of threads the writes performance
degrades consistently. With a single thread ingestion both the clients
perform far better, but I intend to use HBase in a multi-threaded
environment, wherein I am facing challenges with the performance.

Since I am relatively new to HBase, please do excuse me if I am asking
something very basic, but any suggestions around this would be extremely
helpful.

Thanks and Regards
Pankaj Misra













Re: About hbase metadata

2012-09-18 Thread Doug Meil

Hi there, 

Additionally, see this section in the RefGuideŠ

http://hbase.apache.org/book.html#arch.catalog






On 9/18/12 5:06 AM, Mohammad Tariq donta...@gmail.com wrote:

Hello Ram,

 You can scan the '.META.' and '-ROOT-' tables. Alternatively you can
also visit the 'hmaster web console' by pointing your web browser at
hmaster_hostname:60010.

You can use MetaUtils class to manipulate hbase meta tables.

What kind of mapping you want to do and where do you want to specify it?

Regards,
Mohammad Tariq



On Tue, Sep 18, 2012 at 1:46 PM, Ramasubramanian 
ramasubramanian.naraya...@gmail.com wrote:

 Hi,

 1. Where can I see the metadata of hbase?
 2. Can we able to edit it?
 3. Can we specify column mapping for a table?

 Regards,
 Rams





Re: Hbase Scan - number of columns make the query performance way different

2012-09-13 Thread Doug Meil

Hi there, I don't know the specifics of your environment, but ...

http://hbase.apache.org/book.html#perf.reading
11.8.2. Scan Attribute Selection


… describes paying attention to the number of columns you are returning,
particularly when using HBase as a MR source.  In short, returning only
the columns you need means you are reducing the data transferred between
the RS and the client and the number of KV's evaluated in the RS, etc.
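
Concretely, a sketch of what "query 4" looks like on the client (family and
qualifier names are assumed; fragment only):

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "users");       // assumed table name
Scan scan = new Scan();
// Ask only for the two columns you need instead of the whole row.
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("userid"));
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("username"));
scan.setCaching(500);    // and fetch a batch of rows per RPC while you're at it
ResultScanner scanner = table.getScanner(scan);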




On 9/13/12 10:12 AM, Shengjie Min kelvin@gmail.com wrote:

Hi,

I found an interesting difference between hbase scan query.

I have a hbase table which has a lot of columns in a single column family.
eg. let's say I have a users table, then userid, username, email  etc
etc 15 fields all together are in the single columnFamily.

if you are familiar with RDBMS,

query 1: select * from users
vs
query 2: select userid, username from users

in mysql, these two has a difference, the query 2 will be obviously
faster,
but two queries won't give you a huge difference from performance
perspective.

In Hbase, I noticed that:

query 3: scan 'users',   // this is basically return me all 15 fields
vs
query 4: scan 'users', {COLUMNS=['cf:userid','cf:username']}// this
is
return me only two fields: userid , username

query 3 here takes way longer than query 4, Given a big data set. In my
test, I have around 1,000,000 user records. You are talking about query 3
-
100 secs VS query 4 - a few secs.


Can anybody explain to me, why the width of the resultset in HBASE can
impact the performance that much?


Shengjie Min




Re: Optimizing table scans

2012-09-12 Thread Doug Meil

Hi there, 

See this for info on the block cache in the RegionServer..

http://hbase.apache.org/book.html
9.6.4. Block Cache

… and see this for batching on the scan parameter...

http://hbase.apache.org/book.html#perf.reading
11.8.1. Scan Caching






On 9/12/12 9:55 AM, Amit Sela am...@infolinks.com wrote:

I allocate 10GB per RegionServer.
An average row size is ~200 Bytes.
The network is 1GB.

It would be great if anyone could elaborate on the difference between
Cache
and Batch parameters.

Thanks.

On Wed, Sep 12, 2012 at 4:04 PM, Michael Segel
michael_se...@hotmail.comwrote:

 How much memory do you have?
 What's the size of the underlying row?
 What does your network look like? 1GBe or 10GBe?

 There's more to it, and I think that you'll find that YMMV on what is an
 optimum scan size...

 HTH

 -Mike

 On Sep 12, 2012, at 7:57 AM, Amit Sela am...@infolinks.com wrote:

  Hi all,
 
  I'm trying to find the sweet spot for the cache size and batch size
 Scan()
  parameters.
 
  I'm scanning one table using HTable.getScanner() and iterating over
the
  ResultScanner retrieved.
 
  I did some testing and got the following results:
 
   For scanning *100* rows:
  
   Cache   Batch          Total execution time (sec)
   1       -1 (default)   112
   1       5000           110
   1       1              110
   1       2              110
  
   Cache   Batch          Total execution time (sec)
   1000    -1 (default)   116
   1       -1 (default)   110
   2       -1 (default)   115
  
   Cache   Batch          Total execution time (sec)
   5000    10             26
   2       10             25
   5       10             26
   5000    5              15
   2       5              14
   5       5              14
   1000    1              6
   5000    1              5
   1       1              4
   2       1              4
   5       1              4
  
   I don't understand why a lower batch size gives such an improvement?
  
   Thanks,
  
   Amit.






Re: HDFS footprint of a table

2012-09-11 Thread Doug Meil

Hi there, see...

http://hbase.apache.org/book.html#regions.arch

… And in particular focus on…

9.7.5.4. KeyValue
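
As a rough back-of-the-envelope only - ignoring HFile block indexes, bloom
filters, compression, and HDFS replication - each cell is roughly a fixed
KeyValue overhead plus the row key, family, qualifier, and value lengths:

public class CellSizeEstimate {
  // KeyValue = 4B key length + 4B value length
  //          + 2B row length + row + 1B family length + family
  //          + qualifier + 8B timestamp + 1B key type + value
  static long cellSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
    final int FIXED = 4 + 4 + 2 + 1 + 8 + 1;   // 20 bytes of per-cell framing
    return FIXED + rowLen + familyLen + qualifierLen + valueLen;
  }
  public static void main(String[] args) {
    // e.g. 16-byte row key, 1-byte CF name, 8-byte qualifier, 200-byte value:
    System.out.println(cellSize(16, 1, 8, 200) + " bytes per cell");  // 245
  }
}

Multiply by the cell count (and then by the HDFS replication factor) for a
rough raw footprint before compression.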






On 9/11/12 3:35 AM, Lin Ma lin...@gmail.com wrote:

Hi guys,

Supposing I have a table in HBase, how to estimate its storage footprint?
Thanks.

regards,
Lin




Re: Regarding column family

2012-09-11 Thread Doug Meil

Hi there, additionally, see..

http://hbase.apache.org/book.html#regions.arch

… and focus on 9.7.5.4. KeyValue because the CF name is actually a part
of each KV.  




On 9/11/12 4:03 AM, n keywal nkey...@gmail.com wrote:

Yes, because there is one store (hence set of files) per column family.
See this: http://hbase.apache.org/book.html#number.of.cfs

On Tue, Sep 11, 2012 at 9:52 AM, Ramasubramanian 
ramasubramanian.naraya...@gmail.com wrote:

 Hi,

 Does column family play any role during loading a file into hbase from
 hdfs in terms of performance?

 Regards,
 Rams




Re: More rows or less rows and more columns

2012-09-11 Thread Doug Meil

re:  You may want to update this section

Good point.  I will add.





On 9/11/12 6:59 AM, Michel Segel michael_se...@hotmail.com wrote:

Option c: depending on the use case, add a structure to your columns to
store the data.
You may want to update this section


Sent from a remote device. Please excuse any typos...

Mike Segel

On Sep 10, 2012, at 12:30 PM, Harsh J ha...@cloudera.com wrote:

 Hey Mohit,
 
 See http://hbase.apache.org/book.html#schema.smackdown.rowscols
 
 On Mon, Sep 10, 2012 at 10:56 PM, Mohit Anchlia
mohitanch...@gmail.com wrote:
 Is there any recommendation on how many columns one should have per
row? My
 columns are  200 bytes. This will help me to decide if I should shard
my
 rows with id + some date/time value.
 
 
 
 -- 
 Harsh J
 





Re: Regarding rowkey

2012-09-11 Thread Doug Meil

Hi there, have you read this?

http://hbase.apache.org/book.html#performance

And especially this?

http://hbase.apache.org/book.html#perf.writing


How many nodes is the cluster?  Is the target table pre-split?  And if it
is, are you sure that the rows aren't winding up on a single region?





On 9/11/12 1:39 PM, Ramasubramanian
ramasubramanian.naraya...@gmail.com wrote:

Hi,

What can be used as a rowkey to improve performance while loading into
hbase? Currently I am using a sequence. It takes some 11-odd minutes to
load 1 million records with 147 columns.

Regards,
Rams




Re: Tomb Stone Marker

2012-09-10 Thread Doug Meil

Hi there...

In this chapter...

http://hbase.apache.org/book.html#datamodel

.. it explains that the updates are just a view.  There is a merge
happening across CFs and versions (and delete-markers)..

In this...

http://hbase.apache.org/book.html#regions.arch
9.7.5.5. Compaction

... it explains how and when the delete markers are removed in the
compaction process.





On 9/10/12 2:50 AM, Monish r monishs...@gmail.com wrote:

Hi,
Thanks for the link. If the metadata information for a delete is part of
the KeyValue, then when does this update happen?

Is the region rewritten by minor compaction?
Or is the region rewritten for a set of batched deletes?



On Sun, Sep 9, 2012 at 6:42 PM, Doug Meil
doug.m...@explorysmedical.com wrote:


 Hi there,

 See 9.7.5.4. KeyValue...

 http://hbase.apache.org/book.html#regions.arch

 … the tombstone is one of the key types.



 On 9/9/12 5:21 AM, Monish r monishs...@gmail.com wrote:

 Hi,
 I need some clarifications regarding the tombstone marker.
 I was wondering where exactly the tombstone markers are stored when a row
 is deleted.

 Are they kept in some memory area and updated in the HFile during minor
 compaction?
 If they are updated in the HFile, then what part of an HFile contains this
 information?
 
 Regards,
 R.Monish







Re: HBase aggregate query

2012-09-10 Thread Doug Meil

Hi there, if there are common questions I'd suggest creating summary
tables of the pre-aggregated results.

http://hbase.apache.org/book.html#mapreduce.example

7.2.4. HBase MapReduce Summary to HBase Example
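
A minimal sketch of such a summary job follows (the column family "d", the
qualifier names, and the table names "eventlog" / "eventlog_summary" are
assumptions for illustration, as is storing timespent as a string -- adapt to
the real schema). It groups by month+scene in the mapper and writes the count
and sum into the summary table in the reducer:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.hbase.mapreduce.TableReducer;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;

  public class EventlogSummary {

    static class SummaryMapper extends TableMapper<Text, LongWritable> {
      protected void map(ImmutableBytesWritable row, Result value, Context ctx)
          throws IOException, InterruptedException {
        byte[] d = Bytes.toBytes("d");
        String date  = Bytes.toString(value.getValue(d, Bytes.toBytes("eventdate"))); // yyyy-MM-dd
        String scene = Bytes.toString(value.getValue(d, Bytes.toBytes("scene")));
        long spent   = Long.parseLong(Bytes.toString(value.getValue(d, Bytes.toBytes("timespent"))));
        // group key: month|scene
        ctx.write(new Text(date.substring(0, 7) + "|" + scene), new LongWritable(spent));
      }
    }

    static class SummaryReducer extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {
      protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
          throws IOException, InterruptedException {
        long count = 0, sum = 0;
        for (LongWritable v : values) { count++; sum += v.get(); }
        Put put = new Put(Bytes.toBytes(key.toString()));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("count"), Bytes.toBytes(count));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("sum_timespent"), Bytes.toBytes(sum));
        ctx.write(new ImmutableBytesWritable(put.getRow()), put);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "eventlog monthly summary");
      job.setJarByClass(EventlogSummary.class);
      Scan scan = new Scan();
      scan.setCaching(500);        // bigger read batches for the MR scan
      scan.setCacheBlocks(false);  // don't pollute the block cache
      TableMapReduceUtil.initTableMapperJob("eventlog", scan,
          SummaryMapper.class, Text.class, LongWritable.class, job);
      TableMapReduceUtil.initTableReducerJob("eventlog_summary", SummaryReducer.class, job);
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }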




On 9/10/12 10:03 AM, iwannaplay games funnlearnfork...@gmail.com wrote:

Hi,

I want to run a query like

select month(eventdate), scene, count(1), sum(timespent) from eventlog
group by month(eventdate), scene

in HBase. Through Hive it's taking a lot of time for 40 million records.
Is there any syntax in HBase to get this result? In SQL Server it takes
around 9 minutes; how long might it take in HBase?

Regards
Prabhjot





Re: scan a table with 2 column families.

2012-09-09 Thread Doug Meil

Hi there, the scan will merge the results between the CFs… for more
information see these two chapters in the HBase RefGuide.

http://hbase.apache.org/book.html#datamodel

http://hbase.apache.org/book.html#mapreduce
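
As a concrete sketch (the table name "person" is an assumption, the CF and
qualifier names are from the question below, and 'conf' comes from
HBaseConfiguration.create()) -- asking for both columns in one Scan lets the
RegionServer read each family's store files and merge them into a single
Result per row:

  HTable table = new HTable(conf, "person");
  Scan scan = new Scan();
  scan.addColumn(Bytes.toBytes("health1"), Bytes.toBytes("height"));
  scan.addColumn(Bytes.toBytes("health2"), Bytes.toBytes("weight"));
  ResultScanner scanner = table.getScanner(scan);
  for (Result r : scanner) {
    byte[] height = r.getValue(Bytes.toBytes("health1"), Bytes.toBytes("height"));
    byte[] weight = r.getValue(Bytes.toBytes("health2"), Bytes.toBytes("weight"));
    // both values arrive in the same Result for each row
  }
  scanner.close();
  table.close();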




On 9/9/12 6:41 AM, huaxiang huaxi...@asiainfo-linkage.com wrote:

Hi,

  If a table has two column families, each is stored in its own store file
in a region on a RegionServer.

  What will the scan process be for values in these two column families?

  For example, I want to scan jack's health1:height and health2:weight, each
in a different column family.

  Will it first scan health1:height in one store file, then health2:weight in
another store file, and then merge the two scan results?

Thanks!

huaxiang





Re: Tomb Stone Marker

2012-09-09 Thread Doug Meil

Hi there,

See 9.7.5.4. KeyValue...

http://hbase.apache.org/book.html#regions.arch

… the tombstone is one of the key types.



On 9/9/12 5:21 AM, Monish r monishs...@gmail.com wrote:

Hi,
I need some clarifications regarding the tombstone marker.
I was wondering where exactly the tombstone markers are stored when a row
is deleted.

Are they kept in some memory area and updated in the HFile during minor
compaction?
If they are updated in the HFile, then what part of an HFile contains this
information?

Regards,
R.Monish




Re: batch update question

2012-09-06 Thread Doug Meil

For the 2nd part of the question, if you have 10 Puts it's more efficient to 
send a single RS message with 10 Puts than send 10 RS messages with 1 Put 
apiece.  There are 2 words to be careful with, and those are always and 
never, because there is an exception: if you are using the client writeBuffer 
and each of those 10 Puts are going to a different RegionServer, then you 
haven't really gained much.

To answer the next question of how you know where the Puts are going, see this 
method…

http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#getRegionLocation%28byte[],%20boolean%29

Because the Hbase client talks directly to each RS, it has to know the region 
boundaries.
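
Putting those two points together, a rough sketch of the buffered-write
pattern (table/CF names are illustrative; the usual java.util and
org.apache.hadoop.hbase.client imports are assumed):

  HTable table = new HTable(conf, "mytable");
  table.setAutoFlush(false);                  // buffer Puts client-side
  table.setWriteBufferSize(2 * 1024 * 1024);  // 2MB, which is also the default

  List<Put> puts = new ArrayList<Put>();
  for (int i = 0; i < 10; i++) {
    Put put = new Put(Bytes.toBytes("row-" + i));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("qual"), Bytes.toBytes("value-" + i));
    puts.add(put);
  }
  table.put(puts);       // queued in the write buffer, grouped by RegionServer on flush
  table.flushCommits();  // force the buffered Puts out

  // to see which RegionServer a given row would go to:
  HRegionLocation loc = table.getRegionLocation(Bytes.toBytes("row-0"), false);
  System.out.println(loc.getHostname());
  table.close();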



From: Lin Ma lin...@gmail.com
Date: Thursday, September 6, 2012 11:54 AM
To: user@hbase.apache.org, Doug Meil doug.m...@explorysmedical.com
Cc: st...@duboce.net
Subject: Re: batch update question

Thank you Doug,

Very effective reply. :-)

- Why can batch update resolve the contention issue on the same row? Could you
elaborate a bit more or show me an example?
- Does batch update always have better performance than single updates (when we
measure total throughput)?

regards,
Lin

On Thu, Sep 6, 2012 at 12:59 AM, Doug Meil 
doug.m...@explorysmedical.com wrote:

Hi there, if you look in the source code for HTable there is a list of Put
objects.  That's the buffer, and it's a client-side buffer.





On 9/5/12 12:04 PM, Lin Ma lin...@gmail.com wrote:

Thank you Stack for the details directions!

1. You are right, I have not met with any real row contention issues. My
purpose is understanding the issue in advance, and also from this issue to
understand HBase generals better;
2. For the comment on the API page you referred to -- "If isAutoFlush
(http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableInterface.html#isAutoFlush%28%29)
is false, the update is buffered until the internal buffer is full." -- I am
confused: what is the buffer? A buffer at the client side or a buffer in the
region server? Is there a way to configure its size to hold until flushing?
3. Why can batching resolve contention on the same row in theory, compared
to a non-batch operation? Besides preparing the solution in my mind in
advance, I want to learn a bit about why. :-)

regards,
Lin

On Wed, Sep 5, 2012 at 4:00 AM, Stack 
st...@duboce.net wrote:

 On Sun, Sep 2, 2012 at 2:13 AM, Lin Ma 
 lin...@gmail.com wrote:
  Hello guys,
 
  I am reading the book HBase, the definitive guide, at the beginning
of
  chapter 3, it is mentioned in order to reduce performance impact for
  clients to update the same row (lock contention issues for automatic
  write), batch update is preferred. My questions is, for MR job, what
are
  the batch update methods we could leverage to resolve the issue? And
for
  API client, what are the batch update methods we could leverage to
 resolve
  the issue?
 

 Do you actually have a problem where there is contention on a single
row?

 Use methods like


http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.htm
l#put(java.util.List)
 or the batch methods listed earlier in the API.  You should set
 autoflush to false too:


http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableInte
rface.html#isAutoFlush()

 Even batching, a highly contended row might hold up inserts... but for
 sure you actually have this problem in the first place?

 St.Ack






Re: Extremely slow when loading small amount of data from HBase

2012-09-05 Thread Doug Meil

You have around 4000 regions on an 8-node cluster?  I think you need to bring
that *way* down…  

re:  something like 40 regions


Yep… around there.  See…


http://hbase.apache.org/book.html#bigger.regions



On 9/5/12 8:06 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:

But I think you should also look at why you have so many regions...
Because even if you merge them manually now, you might face the same
issue soon.

2012/9/5, n keywal nkey...@gmail.com:
 Hi,

 With 8 regionservers, yes, you can. Target a few hundreds by default
imho.

 N.

 On Wed, Sep 5, 2012 at 4:55 AM, 某因幡 tewil...@gmail.com wrote:

 +HBase users.


 -- Forwarded message --
 From: Dmitriy Ryaboy dvrya...@gmail.com
 Date: 2012/9/4
 Subject: Re: Extremely slow when loading small amount of data from
HBase
 To: u...@pig.apache.org u...@pig.apache.org


 I think the hbase folks recommend something like 40 regions per node
 per table, but I might be misremembering something. Have you tried
 emailing the hbase users list?

 On Sep 4, 2012, at 3:39 AM, 某因幡 tewil...@gmail.com wrote:

  After merging ~8000 regions to ~4000 on an 8-node cluster, things are
  getting better.
  Should I continue merging?
 
 
  2012/8/29 Dmitriy Ryaboy dvrya...@gmail.com:
  Can you try the same scans with a regular hbase mapreduce job? If
you
 see the same problem, it's an hbase issue. Otherwise, we need to see
the
 script and some facts about your table (how many regions, how many
rows,
 how big a cluster, is the small range all on one region server, etc)
 
  On Aug 27, 2012, at 11:49 PM, 某因幡 tewil...@gmail.com wrote:
 
 When I load a range of data from HBase simply using a row key range in
  HBaseStorageHandler, I find that the speed is acceptable when I'm
  trying to load some tens of millions of rows or more, while the only map
  ends up in a timeout when it's some thousands of rows.
  What is going wrong here? Tried both Pig-0.9.2 and Pig-0.10.0.
 
 
  --
  language: Chinese, Japanese, English
 
 
 
  --
  language: Chinese, Japanese, English


 --
 language: Chinese, Japanese, English







Re: batch update question

2012-09-05 Thread Doug Meil

Hi there, if you look in the source code for HTable there is a list of Put
objects.  That's the buffer, and it's a client-side buffer.





On 9/5/12 12:04 PM, Lin Ma lin...@gmail.com wrote:

Thank you Stack for the details directions!

1. You are right, I have not met with any real row contention issues. My
purpose is understanding the issue in advance, and also from this issue to
understand HBase generals better;
2. For the comment on the API page you referred to -- "If isAutoFlush
(http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableInterface.html#isAutoFlush%28%29)
is false, the update is buffered until the internal buffer is full." -- I am
confused: what is the buffer? A buffer at the client side or a buffer in the
region server? Is there a way to configure its size to hold until flushing?
3. Why can batching resolve contention on the same row in theory, compared
to a non-batch operation? Besides preparing the solution in my mind in
advance, I want to learn a bit about why. :-)

regards,
Lin

On Wed, Sep 5, 2012 at 4:00 AM, Stack st...@duboce.net wrote:

 On Sun, Sep 2, 2012 at 2:13 AM, Lin Ma lin...@gmail.com wrote:
  Hello guys,
 
  I am reading the book HBase, the definitive guide, at the beginning
of
  chapter 3, it is mentioned in order to reduce performance impact for
  clients to update the same row (lock contention issues for automatic
  write), batch update is preferred. My questions is, for MR job, what
are
  the batch update methods we could leverage to resolve the issue? And
for
  API client, what are the batch update methods we could leverage to
 resolve
  the issue?
 

 Do you actually have a problem where there is contention on a single
row?

 Use methods like

 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.htm
l#put(java.util.List)
 or the batch methods listed earlier in the API.  You should set
 autoflush to false too:

 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableInte
rface.html#isAutoFlush()

 Even batching, a highly contended row might hold up inserts... but for
 sure you actually have this problem in the first place?

 St.Ack





Re: batch update question

2012-09-05 Thread Doug Meil

Hi there, for more information about the hbase client, see…

http://hbase.apache.org/book.html#client





On 9/5/12 12:59 PM, Doug Meil doug.m...@explorysmedical.com wrote:


Hi there, if you look in the source code for HTable there is a list of Put
objects.  That's the buffer, and it's a client-side buffer.





On 9/5/12 12:04 PM, Lin Ma lin...@gmail.com wrote:

Thank you Stack for the details directions!

1. You are right, I have not met with any real row contention issues. My
purpose is understanding the issue in advance, and also from this issue
to
understand HBase generals better;
2. For the comment on the API page you referred to -- "If isAutoFlush
(http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableInterface.html#isAutoFlush%28%29)
is false, the update is buffered until the internal buffer is full." -- I am
confused: what is the buffer? A buffer at the client side or a buffer in the
region server? Is there a way to configure its size to hold until flushing?
3. Why can batching resolve contention on the same row in theory, compared
to a non-batch operation? Besides preparing the solution in my mind in
advance, I want to learn a bit about why. :-)

regards,
Lin

On Wed, Sep 5, 2012 at 4:00 AM, Stack st...@duboce.net wrote:

 On Sun, Sep 2, 2012 at 2:13 AM, Lin Ma lin...@gmail.com wrote:
  Hello guys,
 
  I am reading the book HBase, the definitive guide, at the beginning
of
  chapter 3, it is mentioned in order to reduce performance impact for
  clients to update the same row (lock contention issues for automatic
  write), batch update is preferred. My questions is, for MR job, what
are
  the batch update methods we could leverage to resolve the issue? And
for
  API client, what are the batch update methods we could leverage to
 resolve
  the issue?
 

 Do you actually have a problem where there is contention on a single
row?

 Use methods like

 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.ht
m
l#put(java.util.List)
 or the batch methods listed earlier in the API.  You should set
 autoflush to false too:

 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableInt
e
rface.html#isAutoFlush()

 Even batching, a highly contended row might hold up inserts... but for
 sure you actually have this problem in the first place?

 St.Ack






Re: Reading in parallel from table's regions in MapReduce

2012-09-04 Thread Doug Meil

Hi there-

Yes, there is an input split for each region of the source table of a MR
job.

There is a blurb on that in the RefGuide...

http://hbase.apache.org/book.html#splitter





On 9/4/12 11:17 AM, Ioakim Perros imper...@gmail.com wrote:

Hello,

I would be grateful if someone could shed a light to the following:

Each M/R map task is reading data from a separate region of a table.
From the jobtracker's GUI, at the map completion graph, I notice that
although the data read by the mappers is different, they read data
sequentially - as if the table had a lock that permits only one mapper at a
time to read data from each region.

Does this lock hypothesis make sense? Is there any way I could avoid
this useless delay?

Thanks in advance and regards,
Ioakim





Re: md5 hash key and splits

2012-08-31 Thread Doug Meil

Stack, re:  Where did you read that?, I think he might also be referring
to this...

http://hbase.apache.org/book.html#important_configurations
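
For reference, pre-splitting at table-creation time is only a few lines (a
sketch; the table name, column family and the 16 hex buckets are assumptions,
and splitting on hex characters only makes sense if the row keys are evenly
distributed, e.g. md5-prefixed):

  Configuration conf = HBaseConfiguration.create();
  HBaseAdmin admin = new HBaseAdmin(conf);

  HTableDescriptor desc = new HTableDescriptor("mytable");
  desc.addFamily(new HColumnDescriptor("cf"));

  // split points for keys that start with a hex character:
  // regions [,1), [1,2), ... [f,) -- 16 regions total
  String hex = "123456789abcdef";
  byte[][] splits = new byte[hex.length()][];
  for (int i = 0; i < hex.length(); i++) {
    splits[i] = Bytes.toBytes(String.valueOf(hex.charAt(i)));
  }
  admin.createTable(desc, splits);
  admin.close();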






On 8/30/12 8:04 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

In general, isn't it better to split the regions so that the load can be
spread across the cluster to avoid hotspots?

I read about pre-splitting here:

http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting
-despite-writing-records-with-sequential-keys/

On Thu, Aug 30, 2012 at 4:30 PM, Amandeep Khurana ama...@gmail.com
wrote:

 Also, you might have read that an initial loading of data can be better
 distributed across the cluster if the table is pre-split rather than
 starting with a single region and splitting (possibly aggressively,
 depending on the throughput) as the data loads in. Once you are in a
stable
 state with regions distributed across the cluster, there is really no
 benefit in terms of spreading load by managing splitting manually v/s
 letting HBase do it for you. At that point it's about what Ian
mentioned -
 predictability of latencies by avoiding splits happening at a busy time.

 On Thu, Aug 30, 2012 at 4:26 PM, Ian Varley ivar...@salesforce.com
 wrote:

  The Facebook devs have mentioned in public talks that they pre-split
 their
  tables and don't use automated region splitting. But as far as I
 remember,
  the reason for that isn't predictability of spreading load, so much as
  predictability of uptime  latency (they don't want an automated
split to
  happen at a random busy time). Maybe that's what you mean, Mohit?
 
  Ian
 
  On Aug 30, 2012, at 5:45 PM, Stack wrote:
 
  On Thu, Aug 30, 2012 at 7:35 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
  From what I;ve read it's advisable to do manual splits since you are
able
  to spread the load in more predictable way. If I am missing something
  please let me know.
 
 
  Where did you read that?
  St.Ack
 
 





Re: Hbase Bulk Load Java Sample Code

2012-08-27 Thread Doug Meil

Hi there, in addition there is a fair amount of documentation about bulk
loads and importtsv in the Hbase RefGuide.

http://hbase.apache.org/book.html#importtsv





On 8/27/12 9:34 AM, Ioakim Perros imper...@gmail.com wrote:

On 08/27/2012 04:18 PM, o brbrs wrote:
 Hi,

 I'm new to HBase and I want to do a bulk load from HDFS to HBase with
Java.
 Is there any sample Java code which uses the importtsv and completebulkload
 libraries?

 Thanks.

Hi,

Here is a sample configuration of a bulk loading job consisting only of
map tasks:

 Configuration config = HBaseConfiguration.create();

 config.set(TableOutputFormat.OUTPUT_TABLE, tableNAME);
 Path inputPath = new Path(inputStringPath);

 Job job = new Job(config, "Sample job");

 job.setMapOutputKeyClass(mapperKey);
 job.setMapOutputValueClass(mapperValue);

 FileInputFormat.setInputPaths(job, inputPath);
 job.setInputFormatClass(inputFormat);
 // directory in HDFS where the HFiles will be placed before bulk loading
 FileOutputFormat.setOutputPath(job, new Path(HFileoutputPath));

 job.setOutputFormatClass(HFileOutputFormat.class);

 job.setJarByClass(caller);
 job.setMapperClass(mapper);

 // tableNAME is a String naming a table which has to already exist in HBase
 HTable hTable = new HTable(config, tableNAME);
 // check the respective API for the complete functionality of this method
 HFileOutputFormat.configureIncrementalLoad(job, hTable);

 job.waitForCompletion(true);

 /* after the job's completion, we have to load the HFiles
  * into HBase's specified table */
 LoadIncrementalHFiles lihf = new LoadIncrementalHFiles(config);
 lihf.doBulkLoad(new Path(HFileoutputPath), hTable);

Create a map task which produces key/value pairs just as you expect them
to exist in your HBase table (e.g. an ImmutableBytesWritable key and a Put
value) and you're done.

Regards,
Ioakim




Re: Pig, HBaseStorage, HBase, JRuby and Sinatra

2012-08-27 Thread Doug Meil

I think somewhere in here in the RefGuide would work…

http://hbase.apache.org/book.html#other.info.sites






On 8/27/12 1:20 PM, Stack st...@duboce.net wrote:

On Mon, Aug 27, 2012 at 6:32 AM, Russell Jurney
russell.jur...@gmail.com wrote:
 I wrote a tutorial around HBase, JRuby and Pig that I thought would be
of
 interest to the HBase users list:
 
http://hortonworks.com/blog/pig-as-hadoop-connector-part-two-hbase-jruby-
and-sinatra/


Thanks Russell. Should we add a link in the refguide?  Where would you
put it (and I'll do the edit).
St.Ack





Re: how client location a region/tablet?

2012-08-23 Thread Doug Meil

For further information about the catalog tables and region-regionserver
assignment, see this…

http://hbase.apache.org/book.html#arch.catalog






On 8/19/12 7:36 AM, Lin Ma lin...@gmail.com wrote:

Thank you Stack, especially for the smart 6 round trip guess for the
puzzle. :-)

1. Yeah, the client caches locations, not the data. -- does it mean that
each client will cache all location information of an HBase cluster,
i.e. which physical server owns which region? Supposing each region has
128M bytes, for a big cluster (petabyte level) the total data size / 128M is
not a trivial number; is there any overhead for the client?
2. A bit confused by what you mean by "not the data". For the client-cached
location information, it should be the data in the METADATA table, which is
the region / physical server mapping data. Why do you say not data (do you
mean the real content of each region)?

regards,
Lin

On Sun, Aug 19, 2012 at 12:40 PM, Stack st...@duboce.net wrote:

 On Sat, Aug 18, 2012 at 2:13 AM, Lin Ma lin...@gmail.com wrote:
  Hello guys,
 
  I am referencing the Big Table paper about how a client locates a
tablet.
  In section 5.1 Tablet location, it is mentioned that client will cache
 all
  tablet locations, I think it means client will cache root tablet in
  METADATA table, and all other tablets in METADATA table (which means
 client
  cache the whole METADATA table?). My question is, whether HBase
 implements
  in the same or similar way? My concern or confusion is, supposing each
  tablet or region file is 128M bytes, it will be very huge space (i.e.
  memory footprint) for each client to cache all tablets or region
files of
  METADATA table. Is it doable or feasible in real HBase clusters?
Thanks.
 

 Yeah, the client caches locations, not the data.


  BTW: another confusion from me is in the paper of Big Table section
5.1
  Tablet location, it is mentioned that If the client's cache is stale,
 the
  location algorithm could take up to six round-trips, because stale
cache
  entries are only discovered upon misses (assuming that METADATA
tablets
 do
  not move very frequently)., I do not know how the 6 times round trip
 time
  is calculated, if anyone could answer this puzzle, it will be great.
:-)
 

 I'm not sure what the 6 is about either.  Here is a guesstimate:

 1. Go to cached location for a server for a particular user region,
 but server says that it does not have a region, the client location is
 stale
 2. Go back to client cached meta region that holds user region w/ row
 we want, but its location is stale.
 3. Go to root location, to find new location of meta, but the root
 location has moved what the client has is stale
 4. Find new root location and do lookup of meta region location
 5. Go to meta region location to find new user region
 6. Go to server w/ user region

 St.Ack





Re: Hbase Schema

2012-07-11 Thread Doug Meil

re:  Q2

Yes you can have the same CF name in different tables.  Column Family
names are embedded in each KeyValue.

See:  http://hbase.apache.org/book.html#regions.arch  for more detail

re:  Q3

It depends on what you need.  A common pattern is using composite keys
where the lead portion represents some natural grouping of data (e.g., a
userid) but that is also hashed to provide distribution across the cluster.
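
A sketch of that composite-key pattern (the 4-hex-character MD5 prefix, the
"|" separator and the field layout are illustrative assumptions; it assumes
java.security.MessageDigest and org.apache.hadoop.hbase.util.Bytes imports):

  // key layout: <4 hex chars of md5(userid)> | userid | timestamp
  // the prefix spreads users across regions, while all rows for one user
  // still sort together because the prefix is a pure function of the userid
  public static byte[] makeRowKey(String userId, long timestamp) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] digest = md5.digest(Bytes.toBytes(userId));
    String prefix = String.format("%02x%02x", digest[0] & 0xFF, digest[1] & 0xFF);
    return Bytes.toBytes(prefix + "|" + userId + "|" + timestamp);
  }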

re:  Q4

Read the RefGuide! 

http://hbase.apache.org/book.html




On 7/11/12 3:16 PM, grashmi13 rashmi.maheshw...@rsystems.com wrote:


Hi,

In an RDBMS we have multiple DB schemas / Oracle user instances.

Similarly, can we have multiple DB schemas in HBase? If yes, can we have
multiple schemas on one Hadoop/HBase cluster? If multiple schemas are
possible, how can we define them -- through configuration or programmatically?

Q2: Can we have the same column family name in multiple tables? If yes, does
it impact performance if we have the same column family name in multiple
tables?

Q3: Sequential keys improve read performance and random keys improve write
performance. Which way should one go?

Q4: What are best practices to improve Hadoop+HBase performance?

Q5: When one program is deleting a table, another program is accessing a row
of that table. What would the impact be? Can we have some sort of lock
while reading or while deleting a table?

Q6: As everything in the application is in byte form, what would happen if the
HBase DB and the application are using different character sets? Can we sync
both to a particular character set by configuration or programmatically?
-- 
View this message in context:
http://old.nabble.com/Hbase-Schema-tp34147582p34147582.html
Sent from the HBase User mailing list archive at Nabble.com.




Re: Stargate: ScannerModel

2012-06-28 Thread Doug Meil

One other thing…

re:  I tried using rowFilter but it is quite slow.

If you didn't use startRow/stopRow for the Scan you will be filtering all
the rows in the table (albeit on the RS, but still… all the rows)
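
A sketch for the customerid_transactionId key described below ('conf' comes
from HBaseConfiguration.create(); the table name and the String variable
customerId are assumptions). The stop row is the start prefix with its last
byte incremented, so the scan covers exactly the keys beginning with
customerId + "_":

  HTable table = new HTable(conf, "transactions");
  byte[] startRow = Bytes.toBytes(customerId + "_");
  byte[] stopRow = Arrays.copyOf(startRow, startRow.length);
  stopRow[stopRow.length - 1]++;   // '_' -> '`', an exclusive upper bound for the prefix

  Scan scan = new Scan(startRow, stopRow);
  scan.setCaching(500);
  ResultScanner scanner = table.getScanner(scan);
  for (Result r : scanner) {
    // every row whose key starts with customerId + "_"
  }
  scanner.close();
  table.close();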





On 6/28/12 4:56 AM, N Keywal nkey...@gmail.com wrote:

(moving this to the user mailing list, with the dev one in bcc)

From what you said it should be

customerid_MIN_TX_ID to customerid_MAX_TX_ID
But only if customerid size is constant.

Note that with this rowkey design there will be very few regions
involved, so it's unlikely to be parallelized.

N.


On Thu, Jun 28, 2012 at 7:43 AM, sameer sameer_therat...@infosys.com
wrote:
 Hello,

 I want to know what the parameters are for scan.setStartRow and
scan.setStopRow.

 My requirement is that I have a table with the key as
customerid_transactionId.

 I want to scan all the rows whose key contains the customer Id that I have.

 I tried using rowFilter but it is quite slow.

 If I am using the scan's setStartRow and setStopRow, then what would I give
 as parameters?

 Thanks,
 Sameer

 --
 View this message in context:
http://apache-hbase.679495.n3.nabble.com/Stargate-ScannerModel-tp2975161p
4019139.html
 Sent from the HBase - Developer mailing list archive at Nabble.com.




