Re: Scan performance

2013-06-22 Thread Anoop John
Have a look at FuzzyRowFilter

-Anoop-
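FuzzyRowFilter takes a list of (row key template, mask) pairs; in the 0.94-era API a mask byte of 0 means "this position must match" and 1 means "any byte" (verify against your version). A minimal sketch of building such a pair for keys shaped like vid,sid,event, assuming a fixed-width sid (FuzzyRowFilter needs fixed positions) — the helper name and pattern syntax here are illustrative, not HBase API:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: build the (template, mask) byte-array pair that FuzzyRowFilter
// expects. '?' marks a fuzzy position (sid is unknown at query time);
// all other bytes are fixed. Mask byte 0 = must match, 1 = any byte.
public class FuzzyKey {
    static byte[][] fuzzyPair(String pattern) {
        byte[] template = pattern.getBytes(StandardCharsets.UTF_8);
        byte[] mask = new byte[template.length];
        for (int i = 0; i < template.length; i++) {
            if (template[i] == '?') {
                template[i] = 0;  // placeholder byte, ignored by the filter
                mask[i] = 1;      // fuzzy: any byte matches here
            }                     // else mask[i] stays 0: fixed position
        }
        return new byte[][] { template, mask };
    }

    public static void main(String[] args) {
        // Row keys look like vid,sid,event; sid assumed 8 bytes wide here.
        byte[][] p = fuzzyPair("vid1,????????,Logon");
        // With the HBase client on the classpath this would become roughly:
        // scan.setFilter(new FuzzyRowFilter(
        //     Arrays.asList(new Pair<byte[], byte[]>(p[0], p[1]))));
        System.out.println(Arrays.toString(p[1]));
    }
}
```

This lets the scanner seek directly between candidate rows instead of examining every row in the vid range.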

On Sat, Jun 22, 2013 at 9:20 AM, Tony Dean tony.d...@sas.com wrote:

 I understand more, but have additional questions about the internals...

 So, in this example I have 6000 rows x 40 columns in this table.  In this
 test my startRow and stopRow do not narrow the scan criteria; therefore all
 6000x40 KVs must be included in the search and thus read from disk into
 memory.

 The first filter that I used was:
 Filter f2 = new SingleColumnValueFilter(cf, qualifier,
  CompareFilter.CompareOp.EQUALS, value);

 This means that HBase must look for the qualifier column on all 6000 rows.
  As you mention I could add certain columns to a different cf; but
 unfortunately, in my case there is no such small set of columns that will
 need to be compared (filtered on).  I could try to use indexes so that a
 complete row key can be calculated from a secondary index in order to
 perform a faster search against data in a primary table.  This requires
 additional tables and maintenance that I would like to avoid.

 I did try a row key filter with regex hoping that it would limit the
 number of rows that were read from disk.
 Filter f2 = new RowFilter(CompareFilter.CompareOp.EQUAL, new
 RegexStringComparator(row_regexpr));

 My row keys are something like: vid,sid,event.  sid is not known at query
 time so I can use a regex similar to: vid,.*,Logon where Logon is the event
 that I am looking for in a particular visit.  In my test data this should
 have narrowed the scan to 1 row X 40 columns.  The best I could do for
 start/stop row is: vid,0 and vid,~ respectively.  I guess that is still
 going to cause all 6000 rows to be scanned, but the filtering should be
 more specific with the rowKey filter.  However, I did not see any
 performance improvement.  Anything obvious?

 Do you have any other ideas to help out with performance when row key is:
 vid,sid,event and sid is not known at query time which leaves a gap in the
 start/stop row?  Too bad regex can't be used in start/stop row
 specification.  That's really what I need.

 Thanks again.
 -Tony

 -Original Message-
 From: Vladimir Rodionov [mailto:vrodio...@carrieriq.com]
 Sent: Friday, June 21, 2013 8:00 PM
 To: user@hbase.apache.org; lars hofhansl
 Subject: RE: Scan performance

 Lars,
 I thought that a column family is the locality group, and that placing
 columns which are frequently accessed together into the same column family
 (locality group) is the obvious performance improvement tip. What are the
 essential column families for in this context?

 As for the original question: unless you place your column into a separate
 column family in Table 2, you will need to scan (load from disk if not
 cached) ~40x more data for the second table (because you have 40 columns).
 This may explain why you see such a difference in execution time if all
 data needs to be loaded first from HDFS.

 Best regards,
 Vladimir Rodionov
 Principal Platform Engineer
 Carrier IQ, www.carrieriq.com
 e-mail: vrodio...@carrieriq.com

 
 From: lars hofhansl [la...@apache.org]
 Sent: Friday, June 21, 2013 3:37 PM
 To: user@hbase.apache.org
 Subject: Re: Scan performance

 HBase is a key value (KV) store. Each column is stored in its own KV; a
 row is just a set of KVs that happen to have the same row key (which is the
 first part of the key).
 I tried to summarize this here:
 http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html

 In the StoreFiles all KVs are sorted in row/column order, but HBase still
 needs to skip over many KVs in order to reach the next row. So more disk
 and memory IO is needed.

 If you are using 0.94, there is a new feature: essential column families. If
 you always search by the same column, you can place that one in its own
 column family and all other columns in another column family. In that case
 your scan performance should be close to identical.


 -- Lars
 

 From: Tony Dean tony.d...@sas.com
 To: user@hbase.apache.org user@hbase.apache.org
 Sent: Friday, June 21, 2013 2:08 PM
 Subject: Scan performance




 Hi,

 I hope that you can shed some light on these 2 scenarios below.

 I have 2 small tables of 6000 rows.
 Table 1 has only 1 column in each of its rows.
 Table 2 has 40 columns in each of its rows.
 Other than that the two tables are identical.

 In both tables there is only 1 row that contains a matching column that I
 am filtering on.   And the Scan performs correctly in both cases by
 returning only the single result.

 The code looks something like the following:

 Scan scan = new Scan(startRow, stopRow);  // the start/stop rows should include all 6000 rows
 scan.addColumn(cf, qualifier);  // only return the column that I am interested in (should only be in 1 row and only 1 version)

 Filter f1 = new InclusiveStopFilter(stopRow);
 Filter f2 = new SingleColumnValueFilter(cf, qualifier, CompareFilter.CompareOp.EQUALS, value);
 

Re: Logging for MR Job

2013-06-22 Thread Suraj Varma
Did you try passing in the log level via generic options?
E.g. I can switch the log level of a running job via:
hadoop jar hadoop-mapreduce-examples.jar pi -D mapred.map.child.log.level=DEBUG 10 10
hadoop jar hadoop-mapreduce-examples.jar pi -D mapred.map.child.log.level=INFO 10 10

--Suraj



On Fri, Jun 21, 2013 at 4:41 PM, Joel Alexandre joel.alexan...@gmail.com wrote:

 Hi,

 I'm running some HBase MR jobs through the bin/hadoop jar command line.

 How can I change the log level for those specific executions without
 changing hbase/conf/log4j.properties?

 In my jar there is a log4j.properties file, but it is being ignored.

 Thanks,
 Joel
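The -D flags Suraj shows work because bin/hadoop parses Hadoop's generic options before the job's own arguments. A jar-local log4j.properties is usually ignored because the task JVM's classpath puts the cluster's conf directory ahead of the job jar. The per-task log levels map to these job properties (MRv1-era names; verify against your Hadoop version):

```
# MRv1-era job properties controlling task-JVM log levels
# (map-side key is from this thread; the reduce-side key is the
# matching property -- check your Hadoop version's documentation):
mapred.map.child.log.level=DEBUG
mapred.reduce.child.log.level=DEBUG
```

These can be set per job with -D on the command line, or programmatically via Configuration.set(...) before submitting the job.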



difference between major and minor compactions?

2013-06-22 Thread yun peng
Hi, All

I am asking about the different practices of major and minor compactions. My
current understanding is that a minor compaction, triggered automatically,
usually runs alongside online query serving (but in the background), so it
is important to make it as lightweight as possible, to minimise the downtime
(pause time) of online queries.

In contrast, a major compaction is invoked at off-peak time and can usually
be assumed to have resources exclusively. It may have a different
performance optimization goal.

Correct me if I'm wrong, but let me know whether HBase does design its
compaction mechanisms this way.

Regards,
Yun


Re: difference between major and minor compactions?

2013-06-22 Thread Jean-Marc Spaggiari
Hi Yun,

A few links:
- http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
=> There is a small paragraph about compactions which explains when
they are triggered.
- http://hbase.apache.org/book/regions.arch.html (section 9.7.6.5)

You are almost right. The only thing is that HBase doesn't know when
your off-peak time is, so a major compaction can be triggered anytime if a
minor compaction is promoted to be a major one.

JM

2013/6/22 yun peng pengyunm...@gmail.com:
 Hi, All

 I am asking the different practices of major and minor compaction... My
 current understanding is that minor compaction, triggered automatically,
 usually run along with online query serving (but in background), so that it
 is important to make it as lightweight as possible... to minimise downtime
 (pause time) of online query.

 In contrast, the major compaction is invoked in  offpeak time and usually
 can be assume to have resource exclusively. It may have a different
 performance optimization goal...

 Correct me if wrong, but let me know if HBase does design different
 compaction mechanism this way..?

 Regards,
 Yun


Re: difference between major and minor compactions?

2013-06-22 Thread yun peng
Thanks, JM.
It seems like the sole difference between major and minor compaction is the
number of files (all, or just a subset of, the storefiles). It is mentioned
very briefly in http://hbase.apache.org/book/regions.arch.html that
Sometimes a minor compaction will ... promote itself to being a major
compaction. What exactly does sometimes mean here? And is there any policy
in HBase that allows an application to specify when to promote a minor
compaction to a major one (e.g. so a user or some monitoring service can
specify that now is off-peak time)?
Yun



On Sat, Jun 22, 2013 at 8:51 AM, Jean-Marc Spaggiari 
jean-m...@spaggiari.org wrote:

 Hi Yun,

 Few links:
 - http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
 = There is a small paragraph about compactions which explain when
 they are triggered.
 - http://hbase.apache.org/book/regions.arch.html 9.7.6.5

 You are almost right. Only thing is that HBase doesn't know when is
 your offpeak, so a major compaction can be triggered anytime if the
 minor is promoted to be a major one.

 JM

 2013/6/22 yun peng pengyunm...@gmail.com:
  Hi, All
 
  I am asking the different practices of major and minor compaction... My
  current understanding is that minor compaction, triggered automatically,
  usually run along with online query serving (but in background), so that
 it
  is important to make it as lightweight as possible... to minimise
 downtime
  (pause time) of online query.
 
  In contrast, the major compaction is invoked in  offpeak time and usually
  can be assume to have resource exclusively. It may have a different
  performance optimization goal...
 
  Correct me if wrong, but let me know if HBase does design different
  compaction mechanism this way..?
 
  Regards,
  Yun



Re: Scan performance

2013-06-22 Thread lars hofhansl
Essential column families help when you filter on one column but want to
return *other* columns for the rows that matched the column.

Check out HBASE-5416.

-- Lars




 From: Vladimir Rodionov vrodio...@carrieriq.com
To: user@hbase.apache.org user@hbase.apache.org; lars hofhansl 
la...@apache.org 
Sent: Friday, June 21, 2013 5:00 PM
Subject: RE: Scan performance
 

Lars,
I thought that a column family is the locality group, and that placing columns
which are frequently accessed together into
the same column family (locality group) is the obvious performance improvement
tip. What are the essential column families for in this context?

As for the original question: unless you place your column into a separate
column family in Table 2, you will
need to scan (load from disk if not cached) ~40x more data for the second
table (because you have 40 columns). This may explain why you see such a
difference in execution time if all data needs to be loaded first from HDFS.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodio...@carrieriq.com


From: lars hofhansl [la...@apache.org]
Sent: Friday, June 21, 2013 3:37 PM
To: user@hbase.apache.org
Subject: Re: Scan performance

HBase is a key value (KV) store. Each column is stored in its own KV; a row is
just a set of KVs that happen to have the same row key (which is the first part
of the key).
I tried to summarize this here: 
http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html

In the StoreFiles all KVs are sorted in row/column order, but HBase still needs 
to skip over many KVs in order to reach the next row. So more disk and memory 
IO is needed.

If you are using 0.94, there is a new feature: essential column families. If
you always search by the same column, you can place that one in its own column
family and all other columns in another column family. In that case your scan
performance should be close to identical.


-- Lars


From: Tony Dean tony.d...@sas.com
To: user@hbase.apache.org user@hbase.apache.org
Sent: Friday, June 21, 2013 2:08 PM
Subject: Scan performance




Hi,

I hope that you can shed some light on these 2 scenarios below.

I have 2 small tables of 6000 rows.
Table 1 has only 1 column in each of its rows.
Table 2 has 40 columns in each of its rows.
Other than that the two tables are identical.

In both tables there is only 1 row that contains a matching column that I am 
filtering on.   And the Scan performs correctly in both cases by returning only 
the single result.

The code looks something like the following:

Scan scan = new Scan(startRow, stopRow);  // the start/stop rows should include all 6000 rows
scan.addColumn(cf, qualifier);  // only return the column that I am interested in (should only be in 1 row and only 1 version)

Filter f1 = new InclusiveStopFilter(stopRow);
Filter f2 = new SingleColumnValueFilter(cf, qualifier, CompareFilter.CompareOp.EQUALS, value);
scan.setFilter(new FilterList(f1, f2));

scan.setTimeRange(0, Long.MAX_VALUE);
scan.setMaxVersions(1);

ResultScanner rs = t.getScanner(scan);
for (Result result : rs)
{
    ...
}

For table 1, rs.next() takes about 30ms.
For table 2, rs.next() takes about 180ms.

Both are returning the exact same result.  Why is it taking so much longer on 
table 2 to get the same result?  The scan depth is the same.  The only 
difference is the column width.  But I’m filtering on a single column and 
returning only that column.

Am I missing something?  As I increase the number of columns, the response time 
gets worse.  I do expect the response time to get worse when increasing the 
number of rows, but not by increasing the number of columns since I’m returning 
only 1 column in
both cases.

I appreciate any comments that you have.

-Tony



Tony Dean
SAS Institute Inc.
Principal Software Developer
919-531-6704          …


Re: difference between major and minor compactions?

2013-06-22 Thread Jean-Marc Spaggiari
Hi Yun,

There are more differences.

Minor compactions do not remove the delete markers and the deleted
cells. They only merge the small files into a bigger one. Only a major
compaction (in 0.94) will deal with the deleted cells. There are also
some more compaction mechanisms coming in trunk with nice features.

Look at: https://issues.apache.org/jira/browse/HBASE-7902
https://issues.apache.org/jira/browse/HBASE-7680

Minor compactions are promoted to major compactions when the
compaction policy decides to compact all the files. If all the files
need to be merged, then we can run a major compaction, which will do
the same thing as the minor one, but with the bonus of removing the
cells marked for deletion.

JM
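The promotion JM describes is driven by compaction file selection, which is itself configurable. A hedged hbase-site.xml sketch of the 0.94-era knobs (property names and defaults should be checked against your exact version; the values shown are illustrative):

```xml
<!-- Disable time-based major compactions (default is 1 day) so they can
     be triggered manually at known off-peak times, e.g. from the HBase
     shell: major_compact 'mytable' -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>
<!-- Minimum number of StoreFiles before a minor compaction is considered -->
<property>
  <name>hbase.hstore.compactionThreshold</name>
  <value>3</value>
</property>
<!-- Upper bound on files per compaction; when selection ends up including
     all of a store's files, the minor compaction is promoted to a major -->
<property>
  <name>hbase.hstore.compaction.max</name>
  <value>10</value>
</property>
```

With time-based majors disabled, a cron job or monitoring service can issue major_compact during off-peak hours, which is close to the "application decides" behavior Yun asks about.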

2013/6/22 yun peng pengyunm...@gmail.com:
 Thanks, JM
 It seems like the sole difference btwn major and minor compaction is the
 number of files (to be all or just a subset of storefiles). It mentioned
 very briefly in
 http://hbase.apache.org/book/regions.arch.html that
 Sometimes a minor compaction will ... promote itself to being a major
 compaction. What does sometime exactly mean here? or any policy in HBase
 that allow application to specify when to promote a minor compaction to be
 a major (like user or some monitoring service can specify now is offpeak
 time?)
 Yun



 On Sat, Jun 22, 2013 at 8:51 AM, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org wrote:

 Hi Yun,

 Few links:
 - http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
 = There is a small paragraph about compactions which explain when
 they are triggered.
 - http://hbase.apache.org/book/regions.arch.html 9.7.6.5

 You are almost right. Only thing is that HBase doesn't know when is
 your offpeak, so a major compaction can be triggered anytime if the
 minor is promoted to be a major one.

 JM

 2013/6/22 yun peng pengyunm...@gmail.com:
  Hi, All
 
  I am asking the different practices of major and minor compaction... My
  current understanding is that minor compaction, triggered automatically,
  usually run along with online query serving (but in background), so that
 it
  is important to make it as lightweight as possible... to minimise
 downtime
  (pause time) of online query.
 
  In contrast, the major compaction is invoked in  offpeak time and usually
  can be assume to have resource exclusively. It may have a different
  performance optimization goal...
 
  Correct me if wrong, but let me know if HBase does design different
  compaction mechanism this way..?
 
  Regards,
  Yun



Re: Scan performance

2013-06-22 Thread lars hofhansl
Yep, generally you should design your keys such that the start/stop key can
efficiently narrow the scope.

If that really cannot be done (and you should try hard), the 2nd-best option
is skip scans.
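For Tony's vid,0 / vid,~ guess, the usual trick is to derive the stop row from the prefix itself: copy the prefix and increment its last byte (carrying over 0xFF bytes), which bounds the scan to exactly the rows starting with that prefix. A plain-Java sketch (the helper name is mine; HBase's Bytes utility class offers similar helpers):

```java
import java.util.Arrays;

// Sketch: tightest stop row for a prefix scan. startRow = prefix,
// stopRow = prefix with its last byte incremented (exclusive bound).
// Carries through 0xFF bytes; a prefix of all 0xFF has no finite
// stop row, so null here means "scan to end of table".
public class PrefixStop {
    static byte[] stopRowForPrefix(byte[] prefix) {
        byte[] stop = Arrays.copyOf(prefix, prefix.length);
        for (int i = stop.length - 1; i >= 0; i--) {
            if (stop[i] != (byte) 0xFF) {
                stop[i]++;
                return Arrays.copyOf(stop, i + 1); // drop trailing 0xFFs
            }
        }
        return null; // all 0xFF: no upper bound
    }
}
```

So new Scan(prefix, stopRowForPrefix(prefix)) bounds the scan to the vid prefix; it still cannot skip within the range, which is where the seek hints come in.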

Filters in HBase allow for providing the scanner framework with hints about
where to go next.
They can skip to the next column (to avoid looking at many versions), to the
next row (to avoid looking at many columns), or they can provide a custom seek
hint to a specific key value. The latter is what FuzzyRowFilter does.


-- Lars
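Concretely, a filter that knows sid is fixed-width can, on a mismatch at row vid,sid,event, hint the scanner straight to the next candidate row: same vid, the sid bytes incremented, and the target event re-appended. That is the kind of jump FuzzyRowFilter computes. A pure-Java sketch of just the hint computation (the key layout and widths are assumptions from Tony's description; in a real 0.94 filter this byte[] would back the value returned from FilterBase.getNextKeyHint alongside ReturnCode.SEEK_NEXT_USING_HINT):

```java
import java.nio.charset.StandardCharsets;

// Sketch: compute a seek-hint row for keys laid out as vid,sid,event
// with a fixed-width sid (offsets/widths below are assumptions, not
// HBase API). On a mismatch, instead of examining every row, a filter
// can hint the scanner to: same vid, sid bytes incremented by one in
// byte order, target event re-appended.
public class SeekHint {
    static byte[] nextCandidate(byte[] row, int sidOffset, int sidLen, byte[] eventSuffix) {
        byte[] hint = new byte[sidOffset + sidLen + eventSuffix.length];
        System.arraycopy(row, 0, hint, 0, sidOffset + sidLen);
        // byte-wise increment of the sid region (rightmost byte first)
        for (int i = sidOffset + sidLen - 1; i >= sidOffset; i--) {
            if (++hint[i] != 0) break; // no carry needed, stop
        }
        System.arraycopy(eventSuffix, 0, hint, sidOffset + sidLen, eventSuffix.length);
        return hint;
    }

    public static void main(String[] args) {
        byte[] row = "vid1,0001,Logoff".getBytes(StandardCharsets.UTF_8);
        // "vid1," is 5 bytes, sid is 4 bytes at offset 5, then ",Logon"
        byte[] hint = nextCandidate(row, 5, 4, ",Logon".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(hint, StandardCharsets.UTF_8)); // vid1,0002,Logon
    }
}
```

The increment is over raw bytes, not decimal digits, which is exactly what a scanner seek needs: the next possible sid in byte order.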




 From: Anoop John anoop.hb...@gmail.com
To: user@hbase.apache.org 
Sent: Friday, June 21, 2013 11:58 PM
Subject: Re: Scan performance
 

Have a look at FuzzyRowFilter

-Anoop-

On Sat, Jun 22, 2013 at 9:20 AM, Tony Dean tony.d...@sas.com wrote:

 I understand more, but have additional questions about the internals...

 So, in this example I have 6000 rows x 40 columns in this table.  In this
 test my startRow and stopRow do not narrow the scan criteria; therefore all
 6000x40 KVs must be included in the search and thus read from disk into
 memory.

 The first filter that I used was:
 Filter f2 = new SingleColumnValueFilter(cf, qualifier,
  CompareFilter.CompareOp.EQUALS, value);

 This means that HBase must look for the qualifier column on all 6000 rows.
  As you mention I could add certain columns to a different cf; but
 unfortunately, in my case there is no such small set of columns that will
 need to be compared (filtered on).  I could try to use indexes so that a
 complete row key can be calculated from a secondary index in order to
 perform a faster search against data in a primary table.  This requires
 additional tables and maintenance that I would like to avoid.

 I did try a row key filter with regex hoping that it would limit the
 number of rows that were read from disk.
 Filter f2 = new RowFilter(CompareFilter.CompareOp.EQUAL, new
 RegexStringComparator(row_regexpr));

 My row keys are something like: vid,sid,event.  sid is not known at query
 time so I can use a regex similar to: vid,.*,Logon where Logon is the event
 that I am looking for in a particular visit.  In my test data this should
 have narrowed the scan to 1 row X 40 columns.  The best I could do for
 start/stop row is: vid,0 and vid,~ respectively.  I guess that is still
 going to cause all 6000 rows to be scanned, but the filtering should be
 more specific with the rowKey filter.  However, I did not see any
 performance improvement.  Anything obvious?

 Do you have any other ideas to help out with performance when row key is:
 vid,sid,event and sid is not known at query time which leaves a gap in the
 start/stop row?  Too bad regex can't be used in start/stop row
 specification.  That's really what I need.

 Thanks again.
 -Tony

 -Original Message-
 From: Vladimir Rodionov [mailto:vrodio...@carrieriq.com]
 Sent: Friday, June 21, 2013 8:00 PM
 To: user@hbase.apache.org; lars hofhansl
 Subject: RE: Scan performance

 Lars,
 I thought that a column family is the locality group, and that placing
 columns which are frequently accessed together into the same column family
 (locality group) is the obvious performance improvement tip. What are the
 essential column families for in this context?

 As for the original question: unless you place your column into a separate
 column family in Table 2, you will need to scan (load from disk if not
 cached) ~40x more data for the second table (because you have 40 columns).
 This may explain why you see such a difference in execution time if all
 data needs to be loaded first from HDFS.

 Best regards,
 Vladimir Rodionov
 Principal Platform Engineer
 Carrier IQ, www.carrieriq.com
 e-mail: vrodio...@carrieriq.com

 
 From: lars hofhansl [la...@apache.org]
 Sent: Friday, June 21, 2013 3:37 PM
 To: user@hbase.apache.org
 Subject: Re: Scan performance

 HBase is a key value (KV) store. Each column is stored in its own KV; a
 row is just a set of KVs that happen to have the same row key (which is the
 first part of the key).
 I tried to summarize this here:
 http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html

 In the StoreFiles all KVs are sorted in row/column order, but HBase still
 needs to skip over many KVs in order to reach the next row. So more disk
 and memory IO is needed.

 If you are using 0.94, there is a new feature: essential column families. If
 you always search by the same column, you can place that one in its own
 column family and all other columns in another column family. In that case
 your scan performance should be close to identical.


 -- Lars
 

 From: Tony Dean tony.d...@sas.com
 To: user@hbase.apache.org user@hbase.apache.org
 Sent: Friday, June 21, 2013 2:08 PM
 Subject: Scan performance




 Hi,

 I hope that you can shed some light on these 2 scenarios below.

 I have 2 small tables of 6000 rows.
 Table 1 has only 1 column in each of its rows.
 

how many servers in an HBase cluster

2013-06-22 Thread myhbase
Hello All,

I have learned HBase almost entirely from papers and books. According to my
understanding, HBase is the kind of architecture which is more applicable
to a big cluster: we should have many HDFS nodes and many HBase (region
server) nodes. If we only have several servers (5-8), it seems HBase is
not a good choice; please correct me if I am wrong. In addition, with how
many nodes can we usually start to consider an HBase solution, and how
much physical memory and other hardware resource should each node have? Any
reference documents or cases? Thanks.

--Ning



Re: how many servers in an HBase cluster

2013-06-22 Thread Jean-Marc Spaggiari
Hi Ning,

I'm personally running HBase in production with only 8 nodes.

As you will see here: http://wiki.apache.org/hadoop/Hbase/PoweredBy
some are also running small clusters.

So I would say it depends more on your needs than on the size.

I would say the minimum is 4, to make sure you have your factor-3
replication and some stability if a node fails, but you might be good
also with 3. And there is almost no maximum.

Regarding memory, the more, the merrier... You also need to make sure
you have many disks per server. Forget it if you have just 1. I'm
able to run with 3, but that's the limit. 5 is a good number, and some are
running with 12...

Again, it depends on whether your application is more read intensive, or CPU
intensive, etc. Can you tell us a bit more about what you want to
achieve?

Thanks,

JM

2013/6/22 myhbase myhb...@126.com:
 Hello All,

 I learn hbase almost from papers and books, according to my
 understanding, HBase is the kind of architecture which is more appliable
 to a big cluster. We should have many HDFS nodes, and many HBase(region
 server) nodes. If we only have several severs(5-8), it seems hbase is
 not a good choice, please correct me if I am wrong. In addition, how
 many nodes usually we can start to consider the hbase solution and how
 about the physic mem size and other hardware resource in each node, any
 reference document or cases? Thanks.

 --Ning



Re: how many servers in an HBase cluster

2013-06-22 Thread Mohammad Tariq
Hello there,

IMHO, 5-8 servers are sufficient to start with. But it's all
relative to the data you have and the intensity of your reads/writes. You
should have different strategies, though, based on whether it's 'read' or
'write'. You actually can't define 'big' in absolute terms. My cluster
might be big for me, but for someone else it might still not be big enough,
or for someone it might be very big. Long story short, it depends on your
needs. If you are able to achieve your goal with 5-8 RSs, then having more
machines would be wasteful, I think.

But you should always keep in mind that HBase is kinda greedy when it comes
to memory. For a decent load, 4G is sufficient, IMHO. But it again depends
on the operations you are going to perform. If you have large clusters where
you are planning to run MR jobs frequently, you are better off with an
additional 2G.


Warm Regards,
Tariq
cloudfront.blogspot.com


On Sat, Jun 22, 2013 at 7:51 PM, myhbase myhb...@126.com wrote:

 Hello All,

 I learn hbase almost from papers and books, according to my
 understanding, HBase is the kind of architecture which is more appliable
 to a big cluster. We should have many HDFS nodes, and many HBase(region
 server) nodes. If we only have several severs(5-8), it seems hbase is
 not a good choice, please correct me if I am wrong. In addition, how
 many nodes usually we can start to consider the hbase solution and how
 about the physic mem size and other hardware resource in each node, any
 reference document or cases? Thanks.

 --Ning




Re: difference between major and minor compactions?

2013-06-22 Thread yun peng
I am more interested in the CompactionPolicy hooks that allow an application
to influence a bit how compaction should go... It looks like there is a new
API in the 0.97 version, ExploringCompactionPolicy
(http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/compactions/ExploringCompactionPolicy.html),
which allows an application to decide when we should have a major compaction.

Stripe compaction is very interesting; I will look into it. Thanks.
Yun


On Sat, Jun 22, 2013 at 9:24 AM, Jean-Marc Spaggiari 
jean-m...@spaggiari.org wrote:

 Hi Yun,

 There is more differences.

 The minor compactions are not remove the delete flags and the deleted
 cells. It only merge the small files into a bigger one. Only the major
 compaction (in 0.94) will deal with the delete cells. There is also
 some more compaction mechanism coming in trunk with nice features.

 Look at: https://issues.apache.org/jira/browse/HBASE-7902
 https://issues.apache.org/jira/browse/HBASE-7680

 Minor compactions are promoted to major compactions when the
 compaction policy decide to compact all the files. If all the files
 need to be merged, then we can run a major compaction which will do
 the same thing as the minor one, but with the bonus of deleting the
 required marked cells.

 JM

 2013/6/22 yun peng pengyunm...@gmail.com:
  Thanks, JM
  It seems like the sole difference btwn major and minor compaction is the
  number of files (to be all or just a subset of storefiles). It mentioned
  very briefly in
  http://hbase.apache.org/book/regions.arch.html that
  Sometimes a minor compaction will ... promote itself to being a major
  compaction. What does sometime exactly mean here? or any policy in
 HBase
  that allow application to specify when to promote a minor compaction to
 be
  a major (like user or some monitoring service can specify now is offpeak
  time?)
  Yun
 
 
 
  On Sat, Jun 22, 2013 at 8:51 AM, Jean-Marc Spaggiari 
  jean-m...@spaggiari.org wrote:
 
  Hi Yun,
 
  Few links:
  - http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
  = There is a small paragraph about compactions which explain when
  they are triggered.
  - http://hbase.apache.org/book/regions.arch.html 9.7.6.5
 
  You are almost right. Only thing is that HBase doesn't know when is
  your offpeak, so a major compaction can be triggered anytime if the
  minor is promoted to be a major one.
 
  JM
 
  2013/6/22 yun peng pengyunm...@gmail.com:
   Hi, All
  
   I am asking the different practices of major and minor compaction...
 My
   current understanding is that minor compaction, triggered
 automatically,
   usually run along with online query serving (but in background), so
 that
  it
   is important to make it as lightweight as possible... to minimise
  downtime
   (pause time) of online query.
  
   In contrast, the major compaction is invoked in  offpeak time and
 usually
   can be assume to have resource exclusively. It may have a different
   performance optimization goal...
  
   Correct me if wrong, but let me know if HBase does design different
   compaction mechanism this way..?
  
   Regards,
   Yun
 



Re: how many servers in an HBase cluster

2013-06-22 Thread Mohammad Tariq
Oh, you already have heavyweight's input :).

Thanks JM.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sat, Jun 22, 2013 at 8:05 PM, Mohammad Tariq donta...@gmail.com wrote:

 Hello there,

 IMHO, 5-8 servers are sufficient enough to start with. But it's
 all relative to the data you have and the intensity of your reads/writes.
 You should have different strategies though, based on whether it's 'read'
 or 'write'. You actually can't define 'big' in absolute terms. My cluster
 might be big for me, but for someone else it might still be not big enough
 or for someone it might be very big. Long story short it depends on your
 needs. If you are able to achieve your goal with 5-8 RSs, then having more
 machines will be a wastage, I think.

 But you should always keep in mind that HBase is kinda greedy when it
 comes to memory. For a decent load 4G is sufficient, IMHO. But it again
 depends on operations you are gonna perform. If you have large clusters
 where you are planning to run MR jobs frequently you are better off with
 additional 2G.


 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, Jun 22, 2013 at 7:51 PM, myhbase myhb...@126.com wrote:

 Hello All,

 I learn hbase almost from papers and books, according to my
 understanding, HBase is the kind of architecture which is more appliable
 to a big cluster. We should have many HDFS nodes, and many HBase(region
 server) nodes. If we only have several severs(5-8), it seems hbase is
 not a good choice, please correct me if I am wrong. In addition, how
 many nodes usually we can start to consider the hbase solution and how
 about the physic mem size and other hardware resource in each node, any
 reference document or cases? Thanks.

 --Ning





Any mechanism in Hadoop to run in background

2013-06-22 Thread yun peng
Hi, All...
We have a use case where we intend to run MapReduce in the background, while
the server serves online operations. The MapReduce jobs have lower priority
compared to the online jobs.

I know this is a different use case of MapReduce compared to its
originally targeted scenario (where MapReduce largely owns resources
exclusively)... But I want to know if there are any tuning knobs that allow
MapReduce to run at low priority/with limited resources.

Thanks,
Yun
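For MRv1 (current at the time of this thread) there are two hedged knobs: per-job priority, and scheduler-level limits. Property names below are MRv1-era and should be verified against your Hadoop version:

```
# Submit the background job at low priority; the MRv1 JobTracker orders
# jobs by VERY_LOW/LOW/NORMAL/HIGH/VERY_HIGH:
mapred.job.priority=VERY_LOW

# Alternatively (or additionally), run the Fair or Capacity Scheduler and
# put background jobs in a pool/queue with a small share, so the online
# workload keeps most of the slots.
```

E.g. hadoop jar job.jar -D mapred.job.priority=VERY_LOW ... Note that with the default scheduler, priority affects scheduling order, not preemption: a long-running low-priority task still holds its slot once started, which is why a capped pool/queue is often the safer choice.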


Re: how many servers in an HBase cluster

2013-06-22 Thread myhbase

Thanks for your response.

Now, if 5 servers are enough, how should I install and configure my nodes?
If I need 3 replicas in case of data loss, I should have at least 3
datanodes; we still have the namenode, regionservers, HMaster, and
zookeeper nodes, so some of them must be installed on the same machine. The
datanode seems to be the disk-I/O-sensitive node while the region server is
memory sensitive; can I install them on the same machine? Any suggestions on
the deployment plan?

My business requirement is that writes greatly outnumber reads (7:3),
and I have another concern: I have a field which will be 8~15KB in
data size. I am not sure whether there will be any problem in HBase
when it runs compactions and splits regions.

Oh, you already have heavyweight's input :).

Thanks JM.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sat, Jun 22, 2013 at 8:05 PM, Mohammad Tariq donta...@gmail.com wrote:


Hello there,

 IMHO, 5-8 servers are sufficient to start with. But it's
all relative to the data you have and the intensity of your reads/writes.
You should have different strategies, though, based on whether it's 'read'
or 'write'. You actually can't define 'big' in absolute terms. My cluster
might be big for me, but for someone else it might still not be big enough,
or for someone it might be very big. Long story short, it depends on your
needs. If you are able to achieve your goal with 5-8 RSs, then having more
machines would be a waste, I think.

But you should always keep in mind that HBase is kinda greedy when it
comes to memory. For a decent load, 4G is sufficient, IMHO. But it again
depends on the operations you are going to perform. If you have large
clusters where you are planning to run MR jobs frequently, you are better
off with an additional 2G.


Warm Regards,
Tariq
cloudfront.blogspot.com


On Sat, Jun 22, 2013 at 7:51 PM, myhbase myhb...@126.com wrote:


Hello All,

I have learned HBase almost entirely from papers and books. According to my
understanding, HBase is the kind of architecture that is more applicable
to a big cluster: we should have many HDFS nodes and many HBase (region
server) nodes. If we only have several servers (5-8), it seems HBase is
not a good choice; please correct me if I am wrong. In addition, at how
many nodes can we usually start to consider an HBase solution, and what
about the physical memory size and other hardware resources on each node?
Any reference documents or cases? Thanks.

--Ning







Re: how many servers in an HBase cluster

2013-06-22 Thread Mohammad Tariq
With 8 machines you can do something like this :

Machine 1 - NN+JT
Machine 2 - SNN+ZK1
Machine 3 - HM+ZK2
Machine 4-8 - DN+TT+RS
(You can run ZK3 on a slave node with some additional memory).

DN and RS run on the same machine. Although RSs are said to hold the data,
the data is actually stored in the DNs; replication is managed at the HDFS
level, so you don't have to worry about it.

You can visit http://hbase.apache.org/book/perf.writing.html to
see how to write efficiently into HBase. With a small field there should
not be any problem except storage and increased metadata, as you'll have
many small cells. If possible, club several small fields together into
one cell.

HTH

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sat, Jun 22, 2013 at 8:31 PM, myhbase myhb...@126.com wrote:

 Thanks for your response.

 Now if 5 servers are enough, how can I install  and configure my nodes? If
 I need 3 replicas in case data loss, I should at least have 3 datanodes, we
 still have namenode, regionserver and HMaster nodes, zookeeper nodes, some
 of them must be installed in the same machine. The datanode seems the disk
 IO sensitive node while region server is the mem sensitive, can I install
 them in the same machine? Any suggestion on the deployment plan?

 My business requirement is that the write is much more than read(7:3), and
 I have another concern that I have a field which will have the 8~15KB in
  data size, I am not sure, there will be any problem in hbase when it runs
 compaction and split in regions.

  Oh, you already have heavyweight's input :).

 Thanks JM.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, Jun 22, 2013 at 8:05 PM, Mohammad Tariq donta...@gmail.com
 wrote:

  Hello there,

  IMHO, 5-8 servers are sufficient enough to start with. But it's
 all relative to the data you have and the intensity of your reads/writes.
 You should have different strategies though, based on whether it's 'read'
 or 'write'. You actually can't define 'big' in absolute terms. My cluster
 might be big for me, but for someone else it might still be not big
 enough
 or for someone it might be very big. Long story short it depends on your
 needs. If you are able to achieve your goal with 5-8 RSs, then having
 more
 machines will be a wastage, I think.

 But you should always keep in mind that HBase is kinda greedy when it
 comes to memory. For a decent load 4G is sufficient, IMHO. But it again
 depends on operations you are gonna perform. If you have large clusters
 where you are planning to run MR jobs frequently you are better off with
 additional 2G.


 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, Jun 22, 2013 at 7:51 PM, myhbase myhb...@126.com wrote:

  Hello All,

 I learn hbase almost from papers and books, according to my
 understanding, HBase is the kind of architecture which is more appliable
 to a big cluster. We should have many HDFS nodes, and many HBase(region
 server) nodes. If we only have several severs(5-8), it seems hbase is
 not a good choice, please correct me if I am wrong. In addition, how
 many nodes usually we can start to consider the hbase solution and how
 about the physic mem size and other hardware resource in each node, any
 reference document or cases? Thanks.

 --Ning







Re: how many servers in an HBase cluster

2013-06-22 Thread Jean-Marc Spaggiari
You HAVE TO run a ZK3; otherwise there is no point in having ZK2, since any
single ZK failure will be an issue. You need an odd number of ZK servers so
that a quorum (a majority) can survive one failure...

Also, if you don't run MR jobs, you don't need the TT and JT. Otherwise,
everything below is correct. But there are many other options; it all
depends on your needs and the hardware you have ;)

JM
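JM's point about an odd-sized ensemble can be made concrete with a minimal
zoo.cfg sketch; the hostnames and paths below are placeholders, not part of
the original thread:

```properties
# zoo.cfg -- identical on all three ZK nodes (hostnames are examples)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# Three voting members: a 3-node ensemble keeps its quorum (2 of 3)
# through the loss of any single server; an even count adds no safety.
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

Each node also needs a `myid` file under `dataDir` containing its own server
number (1, 2, or 3).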








Re: how many servers in an HBase cluster

2013-06-22 Thread Kevin O'dell
If you run ZK with a DN/TT/RS, please make sure to dedicate a hard drive and
a core to the ZK process; I have seen many strange occurrences otherwise.
 
 
 
 



Re: how many servers in an HBase cluster

2013-06-22 Thread Mohammad Tariq
Yeah, I forgot to mention that the no. of ZKs should be odd. Perhaps those
parentheses made that statement look optional. Just to clarify: it is
mandatory.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sat, Jun 22, 2013 at 9:45 PM, Kevin O'dell kevin.od...@cloudera.comwrote:

 If you run ZK with a DN/TT/RS please make sure to dedicate a hard drive and
 a core to the ZK process. I have seen many strange occurrences.
 On Jun 22, 2013 12:10 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org
 wrote:

  You HAVE TO run a ZK3, or else you don't need to have ZK2 and any ZK
  failure will be an issue. You need to have an odd number of ZK
  servers...
 
  Also, if you don't run MR jobs, you don't need the TT and JT... Else,
  everything below is correct. But there is many other options, all
  depend on your needs and the hardware you have ;)
 
  JM
 
  2013/6/22 Mohammad Tariq donta...@gmail.com:
   With 8 machines you can do something like this :
  
   Machine 1 - NN+JT
   Machine 2 - SNN+ZK1
   Machine 3 - HM+ZK2
   Machine 4-8 - DN+TT+RS
   (You can run ZK3 on a slave node with some additional memory).
  
   DN and RS run on the same machine. Although RSs are said to hold the
  data,
   the data is actually stored in DNs. Replication is managed at HDFS
 level.
   You don't have to worry about that.
  
   You can visit this link 
 http://hbase.apache.org/book/perf.writing.html
  to
   see how to write efficiently into HBase. With a small field there
 should
   not be any problem except storage and increased metadata, as you'll
 have
   many small cells. If possible club several small fields into one and
 put
   them together in one cell.
  
   HTH
  
   Warm Regards,
   Tariq
   cloudfront.blogspot.com
  
  
   On Sat, Jun 22, 2013 at 8:31 PM, myhbase myhb...@126.com wrote:
  
   Thanks for your response.
  
   Now if 5 servers are enough, how can I install  and configure my
 nodes?
  If
   I need 3 replicas in case data loss, I should at least have 3
  datanodes, we
   still have namenode, regionserver and HMaster nodes, zookeeper nodes,
  some
   of them must be installed in the same machine. The datanode seems the
  disk
   IO sensitive node while region server is the mem sensitive, can I
  install
   them in the same machine? Any suggestion on the deployment plan?
  
   My business requirement is that the write is much more than read(7:3),
  and
   I have another concern that I have a field which will have the 8~15KB
 in
data size, I am not sure, there will be any problem in hbase when it
  runs
   compaction and split in regions.
  
Oh, you already have heavyweight's input :).
  
   Thanks JM.
  
   Warm Regards,
   Tariq
   cloudfront.blogspot.com
  
  
   On Sat, Jun 22, 2013 at 8:05 PM, Mohammad Tariq donta...@gmail.com
   wrote:
  
Hello there,
  
IMHO, 5-8 servers are sufficient enough to start with. But
  it's
   all relative to the data you have and the intensity of your
  reads/writes.
   You should have different strategies though, based on whether it's
  'read'
   or 'write'. You actually can't define 'big' in absolute terms. My
  cluster
   might be big for me, but for someone else it might still be not big
   enough
   or for someone it might be very big. Long story short it depends on
  your
   needs. If you are able to achieve your goal with 5-8 RSs, then
 having
   more
   machines will be a wastage, I think.
  
   But you should always keep in mind that HBase is kinda greedy when
 it
   comes to memory. For a decent load 4G is sufficient, IMHO. But it
  again
   depends on operations you are gonna perform. If you have large
  clusters
   where you are planning to run MR jobs frequently you are better off
  with
   additional 2G.
  
  
   Warm Regards,
   Tariq
   cloudfront.blogspot.com
  
  
   On Sat, Jun 22, 2013 at 7:51 PM, myhbase myhb...@126.com wrote:
  
Hello All,
  
   I learn hbase almost from papers and books, according to my
   understanding, HBase is the kind of architecture which is more
  appliable
   to a big cluster. We should have many HDFS nodes, and many
  HBase(region
   server) nodes. If we only have several severs(5-8), it seems hbase
 is
   not a good choice, please correct me if I am wrong. In addition,
 how
   many nodes usually we can start to consider the hbase solution and
  how
   about the physic mem size and other hardware resource in each node,
  any
   reference document or cases? Thanks.
  
   --Ning
  
  
  
  
  
 



Re: running MR job and puts on the same table

2013-06-22 Thread Jean-Marc Spaggiari
Hi Rohit,

The list is a bad idea. When you have millions of rows per region, are
you going to put millions of them in memory in your list?

Your MR will scan the entire table, row by row. If you modify the
current row, the scanner will not look at it again when it searches for
the next one, so there is no real issue with that.

Also, instead of doing puts one by one, I recommend that you buffer
them (say, 100 at a time) and put them as a batch. Don't forget to
push the remaining ones at the end of the job. The drawback is that if the
MR job crashes, some rows will already have been processed but not marked
as processed...

JM
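JM's buffering suggestion can be sketched roughly as below, inside a
TableMapper against the 0.94-era client API. The table name, column family,
qualifier, and batch size are made up for illustration; this is a sketch,
not a tested implementation, and it needs a live cluster to run:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

// Marks each scanned row processed=true, flushing puts in batches of 100.
public class MarkProcessedMapper
    extends TableMapper<ImmutableBytesWritable, Result> {
  private static final int BATCH_SIZE = 100;          // illustrative size
  private final List<Put> buffer = new ArrayList<Put>();
  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    table = new HTable(context.getConfiguration(), "mytable"); // hypothetical
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    Put put = new Put(row.get());
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("processed"),
        Bytes.toBytes(true));
    buffer.add(put);
    if (buffer.size() >= BATCH_SIZE) {                // flush a full batch
      table.put(buffer);
      buffer.clear();
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    if (!buffer.isEmpty()) {                          // push the remainder
      table.put(buffer);
    }
    table.close();
  }
}
```

As JM notes, if the job dies between flushes, rows processed since the last
batch are not yet marked; the mapper must tolerate reprocessing them.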

2013/6/22 Rohit Kelkar rohitkel...@gmail.com:
 I have a use case where I push data into my HTable in waves, followed by
 Mapper-only processing. Currently, once a row is processed in map(), I
 immediately mark it as processed=true; for this, inside the map I execute a
 table.put(isprocessed=true). I am not sure if modifying the table like this
 is a good idea. I am also concerned that I am modifying the same table that
 I am running the MR job on.
 So I am thinking of another approach where I accumulate the processed rows
 in a list (or a better, more compact data structure) and use the cleanup
 method of the MR job to execute all the table.put(isprocessed=true) calls
 at once. What is the suggested best practice?

 - R


Re: Scan performance

2013-06-22 Thread James Taylor
Hi Tony,
Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix), a SQL 
skin over HBase? It has a skip scan that will let you model a multi-part row 
key and skip through it efficiently, as you've described. Take a look at this 
blog post for more info: 
http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html?m=1

Regards,
James

On Jun 22, 2013, at 6:29 AM, lars hofhansl la...@apache.org wrote:

 Yep, generally you should design your keys such that start/stopKey can 
 efficiently narrow the scope.
 
 If that really cannot be done (and you should try hard), the 2nd-best option 
 is skip scans.
 
 Filters in HBase can provide the scanner framework with hints about where 
 to go next.
 They can skip to the next column (to avoid looking at many versions), to the 
 next row (to avoid looking at many columns), or they can provide a custom 
 seek hint to a specific key value. The latter is what FuzzyRowFilter does.
 
 
 -- Lars
 
 
 
 
 From: Anoop John anoop.hb...@gmail.com
 To: user@hbase.apache.org
 Sent: Friday, June 21, 2013 11:58 PM
 Subject: Re: Scan performance
 
 
 Have a look at FuzzyRowFilter
 
 -Anoop-
 
 On Sat, Jun 22, 2013 at 9:20 AM, Tony Dean tony.d...@sas.com wrote:
 
 I understand more, but have additional questions about the internals...
 
 So, in this example I have 6000 rows X 40 columns in this table.  In this
 test my startRow and stopRow do not narrow the scan criteria, therefore all
 6000x40 KVs must be included in the search and thus read from disk into
 memory.
 
 The first filter that I used was:
 Filter f2 = new SingleColumnValueFilter(cf, qualifier,
 CompareFilter.CompareOp.EQUALS, value);
 
 This means that HBase must look for the qualifier column on all 6000 rows.
 As you mention I could add certain columns to a different cf; but
 unfortunately, in my case there is no such small set of columns that will
 need to be compared (filtered on).  I could try to use indexes so that a
 complete row key can be calculated from a secondary index in order to
 perform a faster search against data in a primary table.  This requires
 additional tables and maintenance that I would like to avoid.
 
 I did try a row key filter with regex hoping that it would limit the
 number of rows that were read from disk.
 Filter f2 = new RowFilter(CompareFilter.CompareOp.EQUAL, new
 RegexStringComparator(row_regexpr));
 
 My row keys are something like: vid,sid,event.  sid is not known at query
 time so I can use a regex similar to: vid,.*,Logon where Logon is the event
 that I am looking for in a particular visit.  In my test data this should
 have narrowed the scan to 1 row X 40 columns.  The best I could do for
 start/stop row is: vid,0 and vid,~ respectively.  I guess that is still
 going to cause all 6000 rows to be scanned, but the filtering should be
 more specific with the rowKey filter.  However, I did not see any
 performance improvement.  Anything obvious?
 
 Do you have any other ideas to help out with performance when row key is:
 vid,sid,event and sid is not known at query time which leaves a gap in the
 start/stop row?  Too bad regex can't be used in start/stop row
 specification.  That's really what I need.
 
 Thanks again.
 -Tony
 
 -Original Message-
 From: Vladimir Rodionov [mailto:vrodio...@carrieriq.com]
 Sent: Friday, June 21, 2013 8:00 PM
 To: user@hbase.apache.org; lars hofhansl
 Subject: RE: Scan performance
 
 Lars,
 I thought that a column family is the locality group, and that placing
 columns which are frequently accessed together into the same column family
 (locality group) is the obvious performance improvement tip. What are the
 essential column families for in this context?
 
 As for the original question.. Unless you place your column into a separate
 column family in Table 2, you will need to scan (load from disk if not
 cached) ~40x more data for the second table (because you have 40 columns).
 This may explain why you see such a difference in execution time if all
 data needs to be loaded first from HDFS.
 
 Best regards,
 Vladimir Rodionov
 Principal Platform Engineer
 Carrier IQ, www.carrieriq.com
 e-mail: vrodio...@carrieriq.com
 
 
 From: lars hofhansl [la...@apache.org]
 Sent: Friday, June 21, 2013 3:37 PM
 To: user@hbase.apache.org
 Subject: Re: Scan performance
 
 HBase is a key value (KV) store. Each column is stored in its own KV; a
 row is just a set of KVs that happen to have the same row key (which is the
 first part of the key).
 I tried to summarize this here:
 http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html
 
 In the StoreFiles all KVs are sorted in row/column order, but HBase still
 needs to skip over many KVs in order to reach the next row. So more disk
 and memory IO is needed.
 
 If you are using 0.94 there is a new feature: essential column families. If
 you always search by the same column you can place that one in its own
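Putting the thread's advice together, a FuzzyRowFilter sketch for the
vid,sid,event key shape might look as follows. This assumes fixed-width key
parts (8 bytes each here) because FuzzyRowFilter needs the fuzzy positions at
fixed offsets; Tony's variable-length comma-separated keys would first need
padding to a fixed layout. Class names are from the 0.94-era client API, and
the method is purely illustrative:

```java
import java.util.Arrays;
import java.util.Collections;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Pair;

public class FuzzyScanSketch {
  // Build a scan matching rows of the form vid + <any sid> + event.
  public static Scan logonScan(byte[] vid, byte[] event) {
    byte[] template = new byte[24];
    byte[] mask = new byte[24];          // 0 = byte must match, 1 = any byte
    System.arraycopy(vid, 0, template, 0, 8);
    Arrays.fill(mask, 8, 16, (byte) 1);  // sid is unknown at query time
    System.arraycopy(event, 0, template, 16, 8);

    Scan scan = new Scan();
    scan.setFilter(new FuzzyRowFilter(
        Collections.singletonList(new Pair<byte[], byte[]>(template, mask))));
    return scan;
  }
}
```

Unlike the RowFilter-with-regex approach, the fuzzy filter can issue seek
hints to jump over non-matching sid ranges instead of examining every row.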
 

Re: running MR job and puts on the same table

2013-06-22 Thread Rohit Kelkar
Thanks JM. I am not so concerned about holding those rows in memory, because
they are mostly ordered integers and I would be using a bitset, so I have
some leeway in that sense. My dilemma was between
1. updating instantly within the map
2. bulk updating at the end of the map
Yes, I do understand the drawback of 2 if the map crashes. I am ready to
incur that penalty if it avoids any inconsistent behaviour in HBase.

- R





Re: Any mechanism in Hadoop to run in background

2013-06-22 Thread Suraj Varma
Yes, you can change your task tracker startup script to use nice and ionice
and restart the task tracker process. The mappers and reducers spawned by
this task tracker will inherit the niceness.

See the first comment in
http://blog.cloudera.com/blog/2011/04/hbase-dos-and-donts/
Quoting:
change the hadoop-0.20-tasktracker script so the process is started like this:

daemon nice -n 19 ionice -c2 -n7 /usr/lib/hadoop-0.20/bin/hadoop-daemon.sh --config "/etc/hadoop-0.20/conf" start tasktracker $DAEMON_FLAGS

--S






Re: difference between major and minor compactions?

2013-06-22 Thread Suraj Varma
 In contrast, the major compaction is invoked in offpeak time and usually
 can be assumed to have resources exclusively.

There is no resource exclusivity with major compactions. It is just more
resource _intensive_, because a major compaction will rewrite all the store
files to end up with a single store file per store, as described in section
9.7.6.5 (Compaction) of the HBase book.

So it is because it is so resource _intensive_ that for large clusters
folks prefer to have managed compactions (i.e., turn off automatic major
compactions and run them off-hours) so that they don't affect latencies for
low-latency consumers, for instance.
--S



On Sat, Jun 22, 2013 at 7:35 AM, yun peng pengyunm...@gmail.com wrote:

 I am more interested in the available CompactionPolicy hooks that allow an
 application to influence how compaction should go... It looks like there is
 a new API in the 0.97 trunk,
 *ExploringCompactionPolicy*
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/compactions/ExploringCompactionPolicy.html
 ,
 which lets the application decide when a major compaction should happen.

 For stripe compaction, it is very interesting, will look into it. Thanks.
 Yun


 On Sat, Jun 22, 2013 at 9:24 AM, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org wrote:

  Hi Yun,
 
  There are more differences.
 
  Minor compactions do not remove the delete markers or the deleted
  cells; they only merge the small files into a bigger one. Only a major
  compaction (in 0.94) will deal with the deleted cells. There are also
  more compaction mechanisms coming in trunk with nice features.
 
  Look at: https://issues.apache.org/jira/browse/HBASE-7902
  https://issues.apache.org/jira/browse/HBASE-7680
 
  Minor compactions are promoted to major compactions when the
  compaction policy decides to compact all the files. If all the files
  need to be merged, then we can run a major compaction, which will do
  the same thing as the minor one, but with the bonus of deleting the
  cells marked for deletion.
 
  JM
 
  2013/6/22 yun peng pengyunm...@gmail.com:
   Thanks, JM
   It seems like the sole difference btwn major and minor compaction is
 the
   number of files (to be all or just a subset of storefiles). It
 mentioned
   very briefly in
   http://hbase.apache.org/book
  http://hbase.apache.org/book/regions.arch.htmlthat
   Sometimes a minor compaction will ... promote itself to being a major
   compaction. What does sometime exactly mean here? or any policy in
  HBase
   that allow application to specify when to promote a minor compaction to
  be
   a major (like user or some monitoring service can specify now is
 offpeak
   time?)
   Yun
  
  
  
   On Sat, Jun 22, 2013 at 8:51 AM, Jean-Marc Spaggiari 
   jean-m...@spaggiari.org wrote:
  
    Hi Yun,
   
    Few links:
    - http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
    = There is a small paragraph about compactions which explains when
    they are triggered.
    - http://hbase.apache.org/book/regions.arch.html section 9.7.6.5
   
    You are almost right. The only thing is that HBase doesn't know when
    your offpeak time is, so a major compaction can be triggered anytime
    if a minor one is promoted to be a major one.
   
    JM
  
   2013/6/22 yun peng pengyunm...@gmail.com:
     Hi, All
    
     I am asking about the different practices for major and minor
     compactions... My current understanding is that a minor compaction,
     triggered automatically, usually runs alongside online query serving
     (but in the background), so it is important to make it as lightweight
     as possible, to minimise the downtime (pause time) of online queries.
    
     In contrast, a major compaction is invoked at offpeak time and can
     usually be assumed to have resources exclusively. It may have a
     different performance optimization goal...
    
     Correct me if I am wrong, but let me know whether HBase does design
     the two compaction mechanisms differently in this way..?
    
     Regards,
     Yun
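The managed-compaction setup Suraj describes is, in practice, a config change
plus a scheduled trigger. As a sketch (the table name is a placeholder):

```xml
<!-- hbase-site.xml: disable time-based automatic major compactions
     by setting the major-compaction interval to 0 -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>
```

A cron job can then issue `echo "major_compact 'mytable'" | hbase shell`
during off-peak hours; minor compactions continue to run automatically.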
  
 



Re: running MR job and puts on the same table

2013-06-22 Thread Jean-Marc Spaggiari
Hi Rohit,

It will alway be consistent. I don't see why there will be any
un-consistency with the scenario your described below.

JM

2013/6/22 Rohit Kelkar rohitkel...@gmail.com:
 Thanks JM, I am not so concerned about holding those rows in memory because
 they are mostly ordered integers and I would be using a bitset. So I have
 some leeway in that sense. My dilemma was
 1. updating instantly within the map
 2. bulk updating at the end of the map
 Yes I do understand the drawback with 2 if map crashes. I am ready to incur
 that penalty if that avoids any inconsistent behaviour on hbase.

 - R


 On Sat, Jun 22, 2013 at 12:16 PM, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org wrote:

 Hi Rahit,

 The list is a bad idea. When you will have millions of lines per
 regions, are going to pu millions of them in memory in your list?

 Your MR will scan the entire table, row by row. If you modify the
 current row, when the scanner will search for the next one, it will
 not look at current one. So there is no real issue with that.

 Also, instead of doing puts one by one, I'd recommend you buffer
 them (say, 100 at a time) and put them as a batch. Don't forget to
 push the remaining ones at the end of the job. The drawback is that if
 the MR job crashes you will have some rows already processed but not
 marked as processed...

 JM

 2013/6/22 Rohit Kelkar rohitkel...@gmail.com:
  I have a usecase where I push data in my HTable in waves followed by
  Mapper-only processing. Currently once a row is processed in map I
  immediately mark it as processed=true. For this, inside the map I execute
  a table.put(isprocessed=true). I am not sure if modifying the table like
  this is a good idea. I am also concerned that I am modifying the same
  table that I am running the MR job on.
  So I am thinking of another approach where I accumulate the processed
  rows in a list (or a better compact data structure) and use the cleanup
  method of the MR job to execute all the table.put(isprocessed=true) at
  once. What is the suggested best practice?
 
  - R
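The buffer-and-flush pattern JM recommends (batch the puts, push the remainder at the end of the job) can be sketched in plain Java. `BatchBuffer` is a made-up helper for illustration, not an HBase class; in a real mapper the sink would be a batch `table.put(List<Put>)` called from `map()` and `cleanup()`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch of JM's advice: buffer items and flush them in
// fixed-size batches, pushing any remainder at the end of the job.
// BatchBuffer is a hypothetical helper, not part of the HBase API.
class BatchBuffer<T> {
    private final int batchSize;
    private final List<T> buffer = new ArrayList<>();
    private final Consumer<List<T>> sink; // e.g. a batch table.put

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    void add(T item) {                 // called once per processed row
        buffer.add(item);
        if (buffer.size() >= batchSize) flush();
    }

    void flush() {                     // call this from cleanup() too
        if (!buffer.isEmpty()) {
            sink.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}

public class BatchBufferDemo {
    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        BatchBuffer<String> b =
            new BatchBuffer<>(100, batch -> batchSizes.add(batch.size()));
        for (int i = 0; i < 250; i++) b.add("row-" + i);
        b.flush(); // don't forget the remainder
        System.out.println(batchSizes); // [100, 100, 50]
    }
}
```

As the thread notes, the trade-off is crash semantics: a crash between flushes leaves some rows processed but unmarked.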



Re: how many servers in a hbase cluster

2013-06-22 Thread iain wright
Hi Mohammad,

I am curious why you chose not to put the third ZK on the NN+JT? I was
planning on doing that on a new cluster and want to confirm it would be
okay.


-- 
Iain Wright
Cell: (562) 852-5916

http://www.labctsi.org/
This email message is confidential, intended only for the recipient(s)
named above and may contain information that is privileged, exempt from
disclosure under applicable law. If you are not the intended recipient, do
not disclose or disseminate the message to anyone except the intended
recipient. If you have received this message in error, or are not the named
recipient(s), please immediately notify the sender by return email, and
delete all copies of this message.


On Sat, Jun 22, 2013 at 10:05 AM, Mohammad Tariq donta...@gmail.com wrote:

 Yeah, I forgot to mention that the no. of ZKs should be odd. Perhaps those
 parentheses made that statement look optional. Just to clarify: it is
 mandatory.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, Jun 22, 2013 at 9:45 PM, Kevin O'dell kevin.od...@cloudera.com
 wrote:

  If you run ZK with a DN/TT/RS please make sure to dedicate a hard drive
 and
  a core to the ZK process. I have seen many strange occurrences.
  On Jun 22, 2013 12:10 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org
 
  wrote:
 
    You HAVE TO run a ZK3; otherwise there is no point in having ZK2, and
    any ZK failure will be an issue. You need to have an odd number of ZK
    servers...
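The arithmetic behind the odd-number rule can be sketched quickly: a ZooKeeper ensemble needs a strict majority (n/2 + 1) of its servers up, so an even member buys no extra fault tolerance. `ZkQuorum` below is an illustrative helper, not ZooKeeper API:

```java
// Hedged sketch of why an odd ensemble size is recommended: availability
// requires a strict majority of servers, so tolerated failures only
// increase at every ODD ensemble size.
public class ZkQuorum {
    static int toleratedFailures(int ensembleSize) {
        int majority = ensembleSize / 2 + 1; // strict majority needed
        return ensembleSize - majority;      // failures survivable
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.println(n + " servers tolerate "
                + toleratedFailures(n) + " failure(s)");
        }
        // 1 -> 0, 2 -> 0 (no better than 1), 3 -> 1,
        // 4 -> 1 (no better than 3), 5 -> 2
    }
}
```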
  
    Also, if you don't run MR jobs, you don't need the TT and JT... Else,
    everything below is correct. But there are many other options; it all
    depends on your needs and the hardware you have ;)
  
   JM
  
   2013/6/22 Mohammad Tariq donta...@gmail.com:
 With 8 machines you can do something like this :

 Machine 1 - NN+JT
 Machine 2 - SNN+ZK1
 Machine 3 - HM+ZK2
 Machine 4-8 - DN+TT+RS
 (You can run ZK3 on a slave node with some additional memory.)

 DN and RS run on the same machine. Although RSs are said to hold the data,
 the data is actually stored in DNs. Replication is managed at the HDFS
 level; you don't have to worry about that.

 You can visit this link http://hbase.apache.org/book/perf.writing.html to
 see how to write efficiently into HBase. With a small field there should
 not be any problem except storage and increased metadata, as you'll have
 many small cells. If possible, club several small fields into one and put
 them together in one cell.
   
HTH
   
Warm Regards,
Tariq
cloudfront.blogspot.com
   
   
On Sat, Jun 22, 2013 at 8:31 PM, myhbase myhb...@126.com wrote:
   
 Thanks for your response.

 Now if 5 servers are enough, how can I install and configure my nodes? If
 I need 3 replicas to guard against data loss, I should have at least 3
 datanodes; we still have the namenode, regionserver, HMaster and zookeeper
 nodes, so some of them must be installed on the same machine. The datanode
 seems to be the disk-IO-sensitive node, while the region server is memory
 sensitive; can I install them on the same machine? Any suggestion on the
 deployment plan?

 My business requirement is that writes greatly outnumber reads (7:3), and
 I have another concern: one field will be 8~15KB in size, and I am not
 sure whether that will cause any problem in HBase when it runs compaction
 and splits regions.
   
 Oh, you already have heavyweight's input :).
   
Thanks JM.
   
Warm Regards,
Tariq
cloudfront.blogspot.com
   
   
On Sat, Jun 22, 2013 at 8:05 PM, Mohammad Tariq 
 donta...@gmail.com
wrote:
   
  Hello there,

  IMHO, 5-8 servers are sufficient to start with. But it's all
 relative to the data you have and the intensity of your reads/writes.
 You should have different strategies though, based on whether it's
 'read' or 'write'. You actually can't define 'big' in absolute terms.
 My cluster might be big for me, but for someone else it might still be
 not big enough, or for someone it might be very big. Long story short,
 it depends on your needs. If you are able to achieve your goal with 5-8
 RSs, then having more machines will be a waste, I think.

 But you should always keep in mind that HBase is kinda greedy when it
 comes to memory. For a decent load 4G is sufficient, IMHO. But it again
 depends on the operations you are gonna perform. If you have large
 clusters where you are planning to run MR jobs frequently, you are
 better off with an additional 2G.
   
   
Warm Regards,
Tariq
cloudfront.blogspot.com
   
   
On Sat, Jun 22, 2013 at 7:51 PM, myhbase myhb...@126.com wrote:
   
 Hello All,
   
 I have learned HBase mostly from papers and books; according to my
 understanding, HBase is the kind of architecture which is more
 applicable to a big cluster. We 

Re: how many servers in a hbase cluster

2013-06-22 Thread Mohammad Tariq
Hello Iain,

 You would put a lot of pressure on the RAM if you do that. NN
already has high memory requirements, and having JT+ZK on the same
machine as well would be too heavy, IMHO.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sun, Jun 23, 2013 at 4:07 AM, iain wright iainw...@gmail.com wrote:

 Hi Mohammad,

 I am curious why you chose not to put the third ZK on the NN+JT? I was
 planning on doing that on a new cluster and want to confirm it would be
 okay.


 --
 Iain Wright
 Cell: (562) 852-5916

 http://www.labctsi.org/


 On Sat, Jun 22, 2013 at 10:05 AM, Mohammad Tariq donta...@gmail.com
 wrote:

  Yeah, I forgot to mention that no. of ZKs should be odd. Perhaps those
  parentheses made that statement look like an optional statement. Just to
  clarify it was mandatory.
 
  Warm Regards,
  Tariq
  cloudfront.blogspot.com
 
 
  On Sat, Jun 22, 2013 at 9:45 PM, Kevin O'dell kevin.od...@cloudera.com
  wrote:
 
   If you run ZK with a DN/TT/RS please make sure to dedicate a hard drive
  and
   a core to the ZK process. I have seen many strange occurrences.
   On Jun 22, 2013 12:10 PM, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org
  
   wrote:
  
You HAVE TO run a ZK3, or else you don't need to have ZK2 and any ZK
failure will be an issue. You need to have an odd number of ZK
servers...
   
Also, if you don't run MR jobs, you don't need the TT and JT... Else,
everything below is correct. But there is many other options, all
depend on your needs and the hardware you have ;)
   
JM
   
2013/6/22 Mohammad Tariq donta...@gmail.com:
 With 8 machines you can do something like this :

 Machine 1 - NN+JT
 Machine 2 - SNN+ZK1
 Machine 3 - HM+ZK2
 Machine 4-8 - DN+TT+RS
 (You can run ZK3 on a slave node with some additional memory).

 DN and RS run on the same machine. Although RSs are said to hold
 the
data,
 the data is actually stored in DNs. Replication is managed at HDFS
   level.
 You don't have to worry about that.

 You can visit this link 
   http://hbase.apache.org/book/perf.writing.html
to
 see how to write efficiently into HBase. With a small field there
   should
 not be any problem except storage and increased metadata, as you'll
   have
 many small cells. If possible club several small fields into one
 and
   put
 them together in one cell.

 HTH

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, Jun 22, 2013 at 8:31 PM, myhbase myhb...@126.com wrote:

 Thanks for your response.

 Now if 5 servers are enough, how can I install  and configure my
   nodes?
If
 I need 3 replicas in case data loss, I should at least have 3
datanodes, we
 still have namenode, regionserver and HMaster nodes, zookeeper
  nodes,
some
 of them must be installed in the same machine. The datanode seems
  the
disk
 IO sensitive node while region server is the mem sensitive, can I
install
 them in the same machine? Any suggestion on the deployment plan?

 My business requirement is that the write is much more than
  read(7:3),
and
 I have another concern that I have a field which will have the
  8~15KB
   in
  data size, I am not sure, there will be any problem in hbase when
  it
runs
 compaction and split in regions.

  Oh, you already have heavyweight's input :).

 Thanks JM.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, Jun 22, 2013 at 8:05 PM, Mohammad Tariq 
  donta...@gmail.com
 wrote:

  Hello there,

  IMHO, 5-8 servers are sufficient enough to start with.
  But
it's
 all relative to the data you have and the intensity of your
reads/writes.
 You should have different strategies though, based on whether
 it's
'read'
 or 'write'. You actually can't define 'big' in absolute terms.
 My
cluster
 might be big for me, but for someone else it might still be not
  big
 enough
 or for someone it might be very big. Long story short it depends
  on
your
 needs. If you are able to achieve your goal with 5-8 RSs, then
   having
 more
 machines will be a wastage, I think.

 But you should always keep in mind that HBase is kinda greedy
 when
   it
 comes to memory. For a decent load 4G is sufficient, IMHO. But
 it
again
 depends on operations you are gonna perform. If 

RE: difference between major and minor compactions?

2013-06-22 Thread Vladimir Rodionov
Major compactions flood the network, leaving too little bandwidth for other 
operations. The reason why major compactions are so prohibitively expensive 
in HBase is the 2 extra block replicas which need to be created in the 
cluster for every block written locally.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodio...@carrieriq.com


From: Suraj Varma [svarma...@gmail.com]
Sent: Saturday, June 22, 2013 11:51 AM
To: user@hbase.apache.org
Subject: Re: difference between major and minor compactions?

 In contrast, the major compaction is invoked in  offpeak time and usually
 can be assume to have resource exclusively.

There is no resource exclusivity with major compactions. It is just more
resource _intensive_ because a major compaction will rewrite all the store
files to end up with a single store file per store as described in 9.7.6.5
Compaction in the hbase book.

So - it is because it is so resource _intensive_ that for large clusters
folks prefer to have a managed compaction (i.e. turn off major compaction
and run it off hours) so that it doesn't affect latencies for low latency
consumers, for instance.
--S
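The minor/major distinction described in this thread can be modeled in a few lines. This is a toy sketch, not HBase code: a minor compaction merges only some store files, so it must keep delete markers (files it didn't touch may still hold the deleted cells); a major compaction merges all files, so it can drop deleted cells and the markers themselves:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Toy model of the thread's point (illustrative, NOT HBase internals):
// a "store file" maps rowKey -> value, where a null value stands in for
// a delete marker. Minor compaction keeps markers; major purges them.
public class CompactionToy {
    static Map<String, String> compact(List<Map<String, String>> files,
                                       boolean major) {
        Map<String, String> merged = new TreeMap<>();
        for (Map<String, String> f : files) merged.putAll(f); // newer wins
        if (major) merged.values().removeIf(Objects::isNull); // drop deletes
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> older = new TreeMap<>();
        older.put("row-a", "1");
        older.put("row-b", "2");
        Map<String, String> newer = new TreeMap<>();
        newer.put("row-b", null); // delete marker for row-b

        List<Map<String, String>> files = Arrays.asList(older, newer);
        System.out.println(compact(files, false)); // minor: marker kept
        System.out.println(compact(files, true));  // major: row-b gone
    }
}
```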



On Sat, Jun 22, 2013 at 7:35 AM, yun peng pengyunm...@gmail.com wrote:

 I am more concerned with the CompactionPolicy hooks available that allow
 the application to manipulate a bit how compaction should go... It looks
 like the newest API in the 0.97 version is
 *ExploringCompactionPolicy*
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/compactions/ExploringCompactionPolicy.html
 ,
 which allows the application to decide when we should have a major
 compaction.

 For stripe compaction, it is very interesting, will look into it. Thanks.
 Yun


 On Sat, Jun 22, 2013 at 9:24 AM, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org wrote:

  Hi Yun,
 
  There is more differences.
 
  Minor compactions do not remove the delete flags and the deleted
  cells. They only merge the small files into a bigger one. Only the major
  compaction (in 0.94) will deal with the deleted cells. There are also
  some more compaction mechanisms coming in trunk with nice features.
 
  Look at: https://issues.apache.org/jira/browse/HBASE-7902
  https://issues.apache.org/jira/browse/HBASE-7680
 
  Minor compactions are promoted to major compactions when the
  compaction policy decides to compact all the files. If all the files
  need to be merged, then we can run a major compaction which will do
  the same thing as the minor one, but with the bonus of deleting the
  cells marked for deletion.
 
  JM
 
  2013/6/22 yun peng pengyunm...@gmail.com:
   Thanks, JM
   It seems like the sole difference btwn major and minor compaction is
   the number of files (all or just a subset of the storefiles). It is
   mentioned very briefly in
   http://hbase.apache.org/book/regions.arch.html that
   Sometimes a minor compaction will ... promote itself to being a major
   compaction. What does sometimes exactly mean here? Is there any policy
   in HBase that allows the application to specify when to promote a minor
   compaction to be a major one (like a user or some monitoring service
   specifying that now is offpeak time)?
   Yun
  
  
  
   On Sat, Jun 22, 2013 at 8:51 AM, Jean-Marc Spaggiari 
   jean-m...@spaggiari.org wrote:
  
   Hi Yun,
  
   Few links:
   - http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
   = There is a small paragraph about compactions which explain when
   they are triggered.
   - http://hbase.apache.org/book/regions.arch.html 9.7.6.5
  
    You are almost right. The only thing is that HBase doesn't know when
    your offpeak is, so a major compaction can be triggered anytime if a
    minor is promoted to be a major one.
  
   JM
  
    2013/6/22 yun peng pengyunm...@gmail.com:
     Hi, All
    
     I am asking about the different practices of major and minor
     compaction... My current understanding is that minor compaction,
     triggered automatically, usually runs along with online query serving
     (but in background), so it is important to make it as lightweight as
     possible... to minimise downtime (pause time) of online query.
    
     In contrast, the major compaction is invoked in offpeak time and
     usually can be assumed to have resources exclusively. It may have a
     different performance optimization goal...
    
     Correct me if wrong, but let me know if HBase does design different
     compaction mechanisms this way..?
    
     Regards,
     Yun
  
 



Hbase pseudo distributed setup not starting

2013-06-22 Thread Rajkumar
After extracting, I changed the etc/hosts file and made some changes in
the hdfs-site.xml and hbase-env.sh files. I can't see any HBase process
running after issuing the bin/start-hbase.sh command.

my hdfs-site.xml file is
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>

  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>

  <property>
    <name>hbase.master</name>
    <value>localhost:60010</value>
  </property>

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>

  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hduser/hbase/zookeeper</value>
  </property>

</configuration>


my hbase-env.sh is
export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers
export JAVA_HOME=/usr/local/java/jdk1.7.0
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"
export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers
export HBASE_MANAGES_ZK=false


I have also set these environment variables:
export PATH=$PATH:$HADOOP_PREFIX/bin
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
export ZOOKEEPER_HOME=/usr/local/zookeeper
export PATH=$PATH:$ZOOKEEPER_HOME/bin

but still I can't see any HBase process when I type jps in the terminal.



Re: Hbase pseudo distributed setup not starting

2013-06-22 Thread Ulrich Staudinger
is there anything in the log files? check both logs/*.out and logs/*.log


On Sun, Jun 23, 2013 at 6:54 AM, Rajkumar rajkumar22...@gmail.com wrote:





-- 
Ulrich Staudinger, Managing Director and Sr. Software Engineer, ActiveQuant
GmbH

P: +41 79 702 05 95
E: ustaudin...@activequant.com

http://www.activequant.com

AQ-R user? Join our mailing list:
http://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/aqr-user