Re: Scan performance
Have a look at FuzzyRowFilter.

-Anoop-

On Sat, Jun 22, 2013 at 9:20 AM, Tony Dean tony.d...@sas.com wrote:

I understand more, but have additional questions about the internals... So, in this example I have 6000 rows x 40 columns in this table. In this test my startRow and stopRow do not narrow the scan criteria; therefore all 6000x40 KVs must be included in the search and thus read from disk into memory. The first filter that I used was:

Filter f2 = new SingleColumnValueFilter(cf, qualifier, CompareFilter.CompareOp.EQUAL, value);

This means that HBase must look for the qualifier column on all 6000 rows. As you mention, I could add certain columns to a different cf; but unfortunately, in my case there is no small set of columns that will need to be compared (filtered on). I could try to use secondary indexes so that a complete row key can be calculated from a secondary index in order to perform a faster search against data in a primary table, but this requires additional tables and maintenance that I would like to avoid.

I did try a row key filter with a regex, hoping that it would limit the number of rows that were read from disk:

Filter f2 = new RowFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator(row_regexpr));

My row keys are something like: vid,sid,event. sid is not known at query time, so I can use a regex similar to vid,.*,Logon where Logon is the event that I am looking for in a particular visit. In my test data this should have narrowed the scan to 1 row x 40 columns. The best I could do for start/stop row is vid,0 and vid,~ respectively. I guess that is still going to cause all 6000 rows to be scanned, but the filtering should be more specific with the row key filter. However, I did not see any performance improvement. Anything obvious? Do you have any other ideas to help with performance when the row key is vid,sid,event and sid is not known at query time, which leaves a gap in the start/stop row?
Too bad regex can't be used in the start/stop row specification. That's really what I need. Thanks again.

-Tony

-----Original Message-----
From: Vladimir Rodionov [mailto:vrodio...@carrieriq.com]
Sent: Friday, June 21, 2013 8:00 PM
To: user@hbase.apache.org; lars hofhansl
Subject: RE: Scan performance

Lars, I thought that the column family is the locality group, and placing columns which are frequently accessed together into the same column family (locality group) is the obvious performance-improvement tip. What are the essential column families for in this context?

As for the original question: unless you place your column into a separate column family in Table 2, you will need to scan (load from disk, if not cached) ~40x more data for the second table (because you have 40 columns). This may explain why you see such a difference in execution time if all data needs to be loaded first from HDFS.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodio...@carrieriq.com

From: lars hofhansl [la...@apache.org]
Sent: Friday, June 21, 2013 3:37 PM
To: user@hbase.apache.org
Subject: Re: Scan performance

HBase is a key value (KV) store. Each column is stored in its own KV; a row is just a set of KVs that happen to have the same row key (which is the first part of the key). I tried to summarize this here: http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html

In the StoreFiles all KVs are sorted in row/column order, but HBase still needs to skip over many KVs in order to reach the next row, so more disk and memory IO is needed. If you are using 0.94 there is a new feature: essential column families. If you always search by the same column, you can place that one in its own column family and all other columns in another column family. In that case your scan performance should be close to identical.
-- Lars

From: Tony Dean tony.d...@sas.com
To: user@hbase.apache.org
Sent: Friday, June 21, 2013 2:08 PM
Subject: Scan performance

Hi, I hope that you can shed some light on these 2 scenarios below. I have 2 small tables of 6000 rows. Table 1 has only 1 column in each of its rows. Table 2 has 40 columns in each of its rows. Other than that, the two tables are identical. In both tables there is only 1 row that contains a matching column that I am filtering on, and the Scan performs correctly in both cases by returning only the single result. The code looks something like the following:

Scan scan = new Scan(startRow, stopRow); // the start/stop rows should include all 6000 rows
scan.addColumn(cf, qualifier); // only return the column that I am interested in (should only be in 1 row and only 1 version)
Filter f1 = new InclusiveStopFilter(stopRow);
Filter f2 = new SingleColumnValueFilter(cf, qualifier, CompareFilter.CompareOp.EQUAL, value);
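Since each column is stored as its own KV, a scan that evaluates a single-column filter still has to read every KV of every row in range. The following is a standalone Python sketch of that cost model, illustrative only and not HBase code (the row/column naming and the "one matching row" placement are made-up assumptions); it shows why the 40-column table costs roughly 40x the IO of the 1-column table even though both return one row:

```python
# Sketch (not HBase code): each column of a row is its own KeyValue,
# flattened in (row, column) sort order as in a StoreFile. A
# SingleColumnValueFilter-style scan must examine every KV.

def build_table(num_rows, num_cols):
    """Flattened (row, column, value) KVs; exactly one row matches."""
    kvs = []
    for r in range(num_rows):
        for c in range(num_cols):
            value = "hit" if r == 42 and c == 0 else "miss"
            kvs.append((f"row{r:04d}", f"col{c:02d}", value))
    return kvs

def scan_single_column_value(kvs, col, wanted):
    """Count KVs examined; collect rows whose filter column matches."""
    examined, matches = 0, []
    for row, column, value in kvs:
        examined += 1                      # read from disk/cache either way
        if column == col and value == wanted:
            matches.append(row)
    return examined, matches

for cols in (1, 40):
    examined, matches = scan_single_column_value(
        build_table(6000, cols), "col00", "hit")
    print(cols, examined, matches)
# 1  ->   6000 KVs examined, matches ['row0042']
# 40 -> 240000 KVs examined, matches ['row0042']
```

Both scans return the same single row, but the wide table forces ~40x the KV traffic, matching Vladimir's ~40x estimate.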
Re: Logging for MR Job
Did you try passing in the log level via generic options? E.g., I can switch the log level of a running job via:

hadoop jar hadoop-mapreduce-examples.jar pi -D mapred.map.child.log.level=DEBUG 10 10
hadoop jar hadoop-mapreduce-examples.jar pi -D mapred.map.child.log.level=INFO 10 10

--Suraj

On Fri, Jun 21, 2013 at 4:41 PM, Joel Alexandre joel.alexan...@gmail.com wrote:

Hi, I'm running some HBase MR jobs through the bin/hadoop jar command line. How can I change the log level for those specific executions without changing hbase/conf/log4j.properties? In my jar there is a log4j.properties file, but it is being ignored.

Thanks, Joel
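The trick above works because Hadoop's GenericOptionsParser peels `-D key=value` pairs off the command line into the job configuration before the program sees its own arguments. A minimal Python sketch of that parsing idea (illustrative only, not Hadoop's implementation):

```python
# Sketch of how "-D key=value" generic options are separated from the
# job's own arguments, in the spirit of Hadoop's GenericOptionsParser.

def parse_generic_options(argv):
    conf, remaining, i = {}, [], 0
    while i < len(argv):
        if argv[i] == "-D" and i + 1 < len(argv):
            key, _, value = argv[i + 1].partition("=")
            conf[key] = value          # e.g. mapred.map.child.log.level
            i += 2
        else:
            remaining.append(argv[i])  # left for the job itself
            i += 1
    return conf, remaining

conf, args = parse_generic_options(
    ["pi", "-D", "mapred.map.child.log.level=DEBUG", "10", "10"])
print(conf)   # {'mapred.map.child.log.level': 'DEBUG'}
print(args)   # ['pi', '10', '10']
```

This is also why the options must come before the job's positional arguments on the real command line: anything after the program's own argument parsing takes over is no longer treated as a generic option.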
difference between major and minor compactions?
Hi All, I am asking about the different practices of major and minor compaction. My current understanding is that a minor compaction, triggered automatically, usually runs alongside online query serving (but in the background), so it is important to make it as lightweight as possible, to minimise the downtime (pause time) of online queries. In contrast, a major compaction is invoked in off-peak time and can usually be assumed to have resources exclusively, so it may have a different performance-optimization goal. Correct me if I'm wrong, but let me know whether HBase does design its compaction mechanisms this way.

Regards, Yun
Re: difference between major and minor compactions?
Hi Yun,

A few links:
- http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/ = There is a small paragraph about compactions which explains when they are triggered.
- http://hbase.apache.org/book/regions.arch.html section 9.7.6.5

You are almost right. The only thing is that HBase doesn't know when your off-peak is, so a major compaction can be triggered anytime if a minor one is promoted to be a major one.

JM

2013/6/22 yun peng pengyunm...@gmail.com:
Re: difference between major and minor compactions?
Thanks, JM. It seems like the sole difference between major and minor compaction is the number of files (all, or just a subset, of the storefiles). It is mentioned very briefly in http://hbase.apache.org/book/regions.arch.html that "Sometimes a minor compaction will ... promote itself to being a major compaction." What does "sometimes" exactly mean here? Is there any policy in HBase that allows an application to specify when to promote a minor compaction to a major one (e.g., a user or some monitoring service can specify that now is off-peak time)?

Yun

On Sat, Jun 22, 2013 at 8:51 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
Re: Scan performance
Essential column families help when you filter on one column but want to return *other* columns for the rows that matched the column. Check out HBASE-5416.

-- Lars

From: Vladimir Rodionov vrodio...@carrieriq.com
To: user@hbase.apache.org; lars hofhansl la...@apache.org
Sent: Friday, June 21, 2013 5:00 PM
Subject: RE: Scan performance
-- Lars

From: Tony Dean tony.d...@sas.com
To: user@hbase.apache.org
Sent: Friday, June 21, 2013 2:08 PM
Subject: Scan performance

Hi, I hope that you can shed some light on these 2 scenarios below. I have 2 small tables of 6000 rows. Table 1 has only 1 column in each of its rows. Table 2 has 40 columns in each of its rows. Other than that, the two tables are identical. In both tables there is only 1 row that contains a matching column that I am filtering on, and the Scan performs correctly in both cases by returning only the single result. The code looks something like the following:

Scan scan = new Scan(startRow, stopRow); // the start/stop rows should include all 6000 rows
scan.addColumn(cf, qualifier); // only return the column that I am interested in (should only be in 1 row and only 1 version)
Filter f1 = new InclusiveStopFilter(stopRow);
Filter f2 = new SingleColumnValueFilter(cf, qualifier, CompareFilter.CompareOp.EQUAL, value);
scan.setFilter(new FilterList(f1, f2));
scan.setTimeRange(0, Long.MAX_VALUE);
scan.setMaxVersions(1);
ResultScanner rs = t.getScanner(scan);
for (Result result: rs) { }

For table 1, rs.next() takes about 30ms. For table 2, rs.next() takes about 180ms. Both are returning the exact same result. Why is it taking so much longer on table 2 to get the same result? The scan depth is the same; the only difference is the column width. But I'm filtering on a single column and returning only that column. Am I missing something? As I increase the number of columns, the response time gets worse. I do expect the response time to get worse when increasing the number of rows, but not when increasing the number of columns, since I'm returning only 1 column in both cases. I appreciate any comments that you have.

-Tony

Tony Dean
SAS Institute Inc.
Principal Software Developer
919-531-6704
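The essential-column-family optimization Lars mentions (HBASE-5416) can be sketched outside HBase: the column the filter needs lives in a small "essential" family that is always read, and the wide family is loaded only for rows that match. The Python below is an illustration of the idea only, not the HBase implementation, and the row counts and byte sizes are invented for the example:

```python
# Sketch (not HBase code) of the essential-column-family idea from
# HBASE-5416: always read the tiny family the filter needs; read the
# wide family only for rows that actually pass the filter.

def scan_essential(rows, predicate):
    bytes_read = 0
    results = []
    for row in rows:
        bytes_read += len(row["essential"])      # always read: tiny
        if predicate(row["essential"]):
            bytes_read += len(row["wide"])       # read only on a match
            results.append(row)
    return results, bytes_read

# 6000 rows; only the last one matches the filter value.
rows = [{"essential": b"x", "wide": b"y" * 4000} for _ in range(5999)]
rows.append({"essential": b"match", "wide": b"y" * 4000})

hits, cost = scan_essential(rows, lambda v: v == b"match")
print(len(hits), cost)  # 1 10004  (vs ~24 MB if every wide cell were read)
```

Without the split, the scan would read the wide family for all 6000 rows; with it, the wide data is touched once.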
Re: difference between major and minor compactions?
Hi Yun,

There are more differences. Minor compactions do not remove the delete markers or the deleted cells; they only merge small files into a bigger one. Only a major compaction (in 0.94) will deal with the deleted cells. There are also some more compaction mechanisms coming in trunk with nice features. Look at:
https://issues.apache.org/jira/browse/HBASE-7902
https://issues.apache.org/jira/browse/HBASE-7680

Minor compactions are promoted to major compactions when the compaction policy decides to compact all the files. If all the files need to be merged anyway, then we can run a major compaction, which will do the same thing as the minor one, but with the bonus of deleting the cells marked for deletion.

JM

2013/6/22 yun peng pengyunm...@gmail.com:
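The promotion JM describes can be sketched as a toy ratio-based selection policy. The ratio (1.2) and minimum file count (3) mirror HBase's 0.94-era defaults, but the code below is only an illustration of the shape of the algorithm, not HBase's actual implementation:

```python
# Loose sketch (not HBase's code) of ratio-based compaction selection.
# If the policy ends up selecting *all* store files, the minor
# compaction is promoted to a major one, which can also drop deleted
# cells and delete markers.

def select_files(sizes, ratio=1.2, min_files=3):
    """sizes: store file sizes. Returns (selected sizes, is_major)."""
    sizes = sorted(sizes, reverse=True)
    selected = []
    for i, s in enumerate(sizes):
        rest = sum(sizes[i + 1:])
        # take this file and everything smaller if it is not too big
        # relative to the sum of the smaller files
        if s <= rest * ratio:
            selected = sizes[i:]
            break
    if len(selected) < min_files:
        return [], False            # not worth compacting yet
    return selected, len(selected) == len(sizes)

print(select_files([100, 12, 11, 10]))  # ([12, 11, 10], False): minor
print(select_files([12, 11, 10]))       # ([12, 11, 10], True): promoted
```

In the first case the big 100-unit file is left alone, so the compaction stays minor; in the second, every file qualifies, so the same selection runs as a major compaction.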
Re: Scan performance
Yep, generally you should design your keys such that startKey/stopKey can efficiently narrow the scope. If that really cannot be done (and you should try hard), the 2nd-best option is skip scans. Filters in HBase allow for providing the scanner framework with hints about where to go next. They can skip to the next column (to avoid looking at many versions), to the next row (to avoid looking at many columns), or they can provide a custom seek hint to a specific key value. The latter is what FuzzyRowFilter does.

-- Lars

From: Anoop John anoop.hb...@gmail.com
To: user@hbase.apache.org
Sent: Friday, June 21, 2013 11:58 PM
Subject: Re: Scan performance

Have a look at FuzzyRowFilter.

-Anoop-

On Sat, Jun 22, 2013 at 9:20 AM, Tony Dean tony.d...@sas.com wrote:
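FuzzyRowFilter fits the vid,sid,event case because it matches row keys against a pattern plus a per-byte mask, where fixed positions must match and "fuzzy" positions accept any byte; that is what lets it hand the scanner a seek hint instead of testing every row. A standalone Python sketch of the matching idea follows; it is not the HBase implementation, and the fixed-width key layout (4-byte vid, 4-byte sid, 5-byte event) is an assumption made for the example, though the real filter also works best with fixed-width key parts:

```python
# Sketch (not HBase code) of FuzzyRowFilter-style matching: a key
# pattern plus a per-byte mask, where 0 = position must match and
# 1 = "fuzzy" (any byte is accepted).

def fuzzy_match(row, pattern, mask):
    if len(row) != len(pattern):
        return False
    return all(m == 1 or r == p            # masked positions match anything
               for r, p, m in zip(row, pattern, mask))

# Assumed layout: 4-byte vid, 4-byte sid, 5-byte event.
pattern = b"v001" + b"\x00\x00\x00\x00" + b"Logon"
mask    = [0] * 4 + [1] * 4 + [0] * 5    # sid is unknown at query time

print(fuzzy_match(b"v001" + b"s042" + b"Logon", pattern, mask))  # True
print(fuzzy_match(b"v002" + b"s042" + b"Logon", pattern, mask))  # False
print(fuzzy_match(b"v001" + b"s042" + b"Click", pattern, mask))  # False
```

Because the non-fuzzy positions are known, the real filter can also compute the next row key that could possibly match and seek there directly, skipping whole ranges of non-matching rows, which a RowFilter with a regex cannot do.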
how many servers in a hbase cluster
Hello All, I have learned HBase almost entirely from papers and books. According to my understanding, HBase is the kind of architecture which is more applicable to a big cluster: we should have many HDFS nodes and many HBase (region server) nodes. If we only have several servers (5-8), it seems HBase is not a good choice; please correct me if I am wrong. In addition, with how many nodes can we usually start to consider an HBase solution, and how about the physical memory size and other hardware resources in each node? Any reference documents or cases? Thanks.

--Ning
Re: how many servers in a hbase cluster
Hi Ning,

I'm personally running HBase in production with only 8 nodes. As you will see here: http://wiki.apache.org/hadoop/Hbase/PoweredBy, some are also running small clusters. So I would say it depends more on your needs than on the size. I would say the minimum is 4 nodes, to make sure you have your factor-3 replication and some stability if a node fails, but you might be fine with 3. And there is almost no maximum. Regarding memory, the more, the merrier. You also need to make sure you have many disks per server; forget it if you have just 1. I'm able to run with 3, but that's the limit. 5 is a good number, and some are running with 12. Again, it depends on whether your application is more read intensive, or CPU intensive, etc. Can you tell us a bit more about what you want to achieve?

Thanks, JM

2013/6/22 myhbase myhb...@126.com:
Re: how many servers in a hbase cluster
Hello there,

IMHO, 5-8 servers are sufficient to start with. But it's all relative to the data you have and the intensity of your reads/writes. You should have different strategies though, based on whether it's 'read' or 'write'. You actually can't define 'big' in absolute terms: my cluster might be big for me, but for someone else it might still not be big enough, or for someone it might be very big. Long story short, it depends on your needs. If you are able to achieve your goal with 5-8 RSs, then having more machines would be a waste, I think. But you should always keep in mind that HBase is kinda greedy when it comes to memory. For a decent load, 4G is sufficient, IMHO, but it again depends on the operations you are going to perform. If you have large clusters where you are planning to run MR jobs frequently, you are better off with an additional 2G.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sat, Jun 22, 2013 at 7:51 PM, myhbase myhb...@126.com wrote:
Re: difference between major and minor compactions?
I am more concerned with the CompactionPolicy hooks available that allow an application to manipulate a bit how compaction should go. It looks like there is a newer API in the 0.97 version, ExploringCompactionPolicy (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/compactions/ExploringCompactionPolicy.html), which allows the application to influence when we should have a major compaction. As for stripe compaction, it is very interesting; I will look into it. Thanks.

Yun

On Sat, Jun 22, 2013 at 9:24 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
Re: how many servers in a hbase cluster
Oh, you already have a heavyweight's input :). Thanks JM.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sat, Jun 22, 2013 at 8:05 PM, Mohammad Tariq donta...@gmail.com wrote:
Any mechanism in Hadoop to run in background
Hi All, we have a use case intended to run MapReduce in the background while the server serves online operations. The MapReduce job may have lower priority compared to the online jobs. I know this is a different use case of MapReduce compared to its originally targeted scenario (where MapReduce largely owns resources exclusively), but I want to know if there are any tuning knobs that allow MapReduce to run at low priority / with limited resources.

Thanks, Yun
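The behavior the question asks for can be pictured as a priority queue in which online work always drains before background MapReduce-like work. The Python below is a toy illustration of that scheduling idea only; it is not a Hadoop API (Hadoop exposes this kind of control through job priorities and scheduler configuration rather than application code):

```python
# Toy sketch of priority scheduling: background (MR-like) jobs are only
# dispatched when no higher-priority online job is waiting.

import heapq

ONLINE, BACKGROUND = 0, 1   # lower number = higher priority

def run(jobs):
    """jobs: list of (priority, name). Returns the execution order."""
    heap = list(jobs)
    heapq.heapify(heap)
    order = []
    while heap:
        _prio, name = heapq.heappop(heap)  # online entries always pop first
        order.append(name)
    return order

print(run([(BACKGROUND, "mr-scan"), (ONLINE, "get-row"),
           (ONLINE, "put-row"), (BACKGROUND, "mr-aggregate")]))
# ['get-row', 'put-row', 'mr-aggregate', 'mr-scan']
```

A real cluster adds preemption and resource caps on top of simple ordering, which is why the tuning lives in the scheduler rather than in job code.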
Re: how many servers in a hbase cluster
Thanks for your response. Now if 5 servers are enough, how can I install and configure my nodes? If I need 3 replicas against data loss, I should have at least 3 datanodes; we still have the namenode, regionserver and HMaster nodes, and zookeeper nodes, so some of them must be installed on the same machine. The datanode seems to be the disk-IO-sensitive node while the region server is the memory-sensitive one; can I install them on the same machine? Any suggestion on the deployment plan? My business requirement is that writes are much more frequent than reads (7:3), and I have another concern: I have a field which will be 8-15KB in data size, and I am not sure whether there will be any problem in HBase when it runs compaction and splits regions.
Re: how many servers in a hbase cluster
With 8 machines you can do something like this : Machine 1 - NN+JT Machine 2 - SNN+ZK1 Machine 3 - HM+ZK2 Machines 4-8 - DN+TT+RS (You can run ZK3 on a slave node with some additional memory). DN and RS run on the same machine. Although RSs are said to hold the data, the data is actually stored in DNs. Replication is managed at the HDFS level, so you don't have to worry about that. You can visit this link http://hbase.apache.org/book/perf.writing.html to see how to write efficiently into HBase. With a small field there should not be any problem except storage and increased metadata, as you'll have many small cells. If possible, club several small fields into one and put them together in one cell. HTH Warm Regards, Tariq cloudfront.blogspot.com
Re: how many servers in a hbase cluster
You HAVE TO run a ZK3; otherwise you don't need ZK2 at all, and any ZK failure will be an issue. You need to have an odd number of ZK servers... Also, if you don't run MR jobs, you don't need the TT and JT... Else, everything below is correct. But there are many other options; it all depends on your needs and the hardware you have ;) JM
Re: how many servers in a hbase cluster
If you run ZK with a DN/TT/RS, please make sure to dedicate a hard drive and a core to the ZK process. I have seen many strange occurrences.
Re: how many servers in a hbase cluster
Yeah, I forgot to mention that the no. of ZKs should be odd. Perhaps those parentheses made that statement look optional. Just to clarify: it was mandatory. Warm Regards, Tariq cloudfront.blogspot.com
Re: running MR job and puts on the same table
Hi Rohit, The list is a bad idea. When you have millions of lines per region, are you going to put millions of them in memory in your list? Your MR will scan the entire table, row by row. If you modify the current row, the scanner will not look at it again when it searches for the next one, so there is no real issue with that. Also, instead of doing puts one by one, I would recommend buffering them (let's say, 100 by 100) and putting them as a batch. Don't forget to push the remaining ones at the end of the job. The drawback is that if the MR job crashes, you will have some rows already processed but not marked as processed... JM 2013/6/22 Rohit Kelkar rohitkel...@gmail.com: I have a use case where I push data into my HTable in waves, followed by mapper-only processing. Currently, once a row is processed in map() I immediately mark it as processed=true; for this, inside the map I execute a table.put(isprocessed=true). I am not sure if modifying the table like this is a good idea. I am also concerned that I am modifying the same table that I am running the MR job on. So I am thinking of another approach where I accumulate the processed rows in a list (or a better, more compact data structure) and use the cleanup method of the MR job to execute all the table.put(isprocessed=true) calls at once. What is the suggested best practice? - R
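The batching JM suggests can be sketched like this. This is an illustrative buffer only: the flush callback stands in for table.put(List&lt;Put&gt;) on an HTable, and the class name and threshold are made up for the example, not from the thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative batching pattern: accumulate "puts" and flush every
// batchSize items, plus a final flush() for the remainder (call it
// from the job's cleanup() so the tail of the buffer is not lost).
// In a real job the flusher would call table.put(batch) on an
// org.apache.hadoop.hbase.client.HTable.
public class PutBuffer {
    private final int batchSize;
    private final Consumer<List<String>> flusher; // stands in for HTable.put(List<Put>)
    private final List<String> buffer = new ArrayList<>();

    public PutBuffer(int batchSize, Consumer<List<String>> flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    public void add(String put) {
        buffer.add(put);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    public void flush() {
        if (!buffer.isEmpty()) {
            flusher.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

As JM notes, the trade-off is crash behaviour: anything still sitting in the buffer when the task dies was processed but never marked as processed.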
Re: Scan performance
Hi Tony, Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix), a SQL skin over HBase? It has a skip scan that will let you model a multi-part row key and skip through it efficiently, as you've described. Take a look at this blog for more info: http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html?m=1 Regards, James On Jun 22, 2013, at 6:29 AM, lars hofhansl la...@apache.org wrote: Yep, generally you should design your keys such that startKey/stopKey can efficiently narrow the scope. If that really cannot be done (and you should try hard), the 2nd best option is skip scans. Filters in HBase allow for providing the scanner framework with hints about where to go next. They can skip to the next column (to avoid looking at many versions), to the next row (to avoid looking at many columns), or they can provide a custom seek hint to a specific key value. The latter is what FuzzyRowFilter does. -- Lars
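Tony's vid,sid,event key with an unknown sid is exactly the FuzzyRowFilter case: mark the sid bytes as "fuzzy" and the vid/event bytes as fixed. A minimal sketch of building the (row key, fuzzy info) pair the filter expects, assuming fixed-width key parts (the widths and class are illustrative, not from the thread; the real filter is org.apache.hadoop.hbase.filter.FuzzyRowFilter, which takes pairs like these):

```java
import java.util.Arrays;

// Sketch of the (rowKey, fuzzyInfo) pair for FuzzyRowFilter.
// In the fuzzy-info mask, 0 means "this byte must match exactly"
// and 1 means "any byte is fine". FuzzyRowFilter only works when
// the key parts have fixed widths, so the sid slot must be padded.
public class FuzzyMask {
    // Mask: vid and event bytes fixed (0), the unknown sid bytes fuzzy (1).
    public static byte[] mask(int vidLen, int sidLen, int eventLen) {
        byte[] m = new byte[vidLen + sidLen + eventLen];
        Arrays.fill(m, vidLen, vidLen + sidLen, (byte) 1);
        return m;
    }

    // Key template: real vid and event bytes, sid slot left as 0x00
    // placeholders (ignored because the mask marks them fuzzy).
    public static byte[] key(byte[] vid, int sidLen, byte[] event) {
        byte[] k = new byte[vid.length + sidLen + event.length];
        System.arraycopy(vid, 0, k, 0, vid.length);
        System.arraycopy(event, 0, k, vid.length + sidLen, event.length);
        return k;
    }
}
```

The pair would then go into the FuzzyRowFilter on the scan; because the filter issues seek hints rather than testing row by row, it can jump over non-matching sids instead of reading all 6000 rows.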
Re: running MR job and puts on the same table
Thanks JM, I am not so concerned about holding those rows in memory because they are mostly ordered integers and I would be using a bitset, so I have some leeway in that sense. My dilemma was between 1) updating instantly within the map and 2) bulk updating at the end of the map. Yes, I do understand the drawback with 2 if the map crashes. I am ready to incur that penalty if it avoids any inconsistent behaviour in hbase. - R
Re: Any mechanism in Hadoop to run in background
Yes, you can change your task tracker startup script to use nice and ionice and restart the task tracker process. The mappers and reducers spun off by this task tracker will inherit the niceness. See the first comment in http://blog.cloudera.com/blog/2011/04/hbase-dos-and-donts/ Quoting: change the hadoop-0.20-tasktracker script so the process is started like this: daemon nice -n 19 ionice -c2 -n7 /usr/lib/hadoop-0.20/bin/hadoop-daemon.sh --config "/etc/hadoop-0.20/conf" start tasktracker $DAEMON_FLAGS --S On Sat, Jun 22, 2013 at 7:55 AM, yun peng pengyunm...@gmail.com wrote: Hi, All... We have a use case intended to run MapReduce in the background while the server serves online operations. The MapReduce job may have lower priority than the online jobs. I know this is a different use case of MapReduce from its originally targeted scenario (where MapReduce largely owns resources exclusively)... But I want to know if there are any tuning knobs that allow MapReduce to run at low priority / with limited resources. Thanks, Yun
Re: difference between major and minor compactions?
In contrast, the major compaction is invoked in offpeak time and usually can be assumed to have resources exclusively. There is no resource exclusivity with major compactions. It is just more resource _intensive_, because a major compaction will rewrite all the store files to end up with a single store file per store, as described in 9.7.6.5 Compaction in the hbase book. So - it is because it is so resource _intensive_ that for large clusters folks prefer to have a managed compaction (i.e. turn off automatic major compaction and run it off hours) so that it doesn't affect latencies for low-latency consumers, for instance. --S On Sat, Jun 22, 2013 at 7:35 AM, yun peng pengyunm...@gmail.com wrote: I am more concerned with the CompactionPolicy hooks that allow the application to influence how compaction should go... It looks like there is a new API in the 0.97 version, ExploringCompactionPolicy http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/compactions/ExploringCompactionPolicy.html, which lets the application influence when a major compaction should happen. As for stripe compaction, it is very interesting; I will look into it. Thanks. Yun On Sat, Jun 22, 2013 at 9:24 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Yun, There are more differences. Minor compactions do not remove the delete flags and the deleted cells; they only merge the small files into a bigger one. Only the major compaction (in 0.94) will deal with the deleted cells. There are also some more compaction mechanisms coming in trunk with nice features. Look at: https://issues.apache.org/jira/browse/HBASE-7902 https://issues.apache.org/jira/browse/HBASE-7680 Minor compactions are promoted to major compactions when the compaction policy decides to compact all the files. If all the files need to be merged, then we can run a major compaction, which will do the same thing as the minor one, but with the bonus of deleting the marked cells.
JM 2013/6/22 yun peng pengyunm...@gmail.com: Thanks, JM. It seems like the sole difference between major and minor compaction is the number of files (all of the storefiles or just a subset). It is mentioned very briefly in http://hbase.apache.org/book/regions.arch.html that sometimes a minor compaction will ... promote itself to being a major compaction. What does "sometimes" exactly mean here? Is there any policy in HBase that allows the application to specify when to promote a minor compaction to a major one (e.g., so a user or some monitoring service can specify that now is offpeak time)? Yun On Sat, Jun 22, 2013 at 8:51 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Yun, A few links: - http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/ = There is a small paragraph about compactions which explains when they are triggered. - http://hbase.apache.org/book/regions.arch.html 9.7.6.5 You are almost right. The only thing is that HBase doesn't know when your offpeak is, so a major compaction can be triggered anytime if the minor is promoted to be a major one. JM 2013/6/22 yun peng pengyunm...@gmail.com: Hi, All, I am asking about the different practices of major and minor compaction... My current understanding is that minor compaction, triggered automatically, usually runs alongside online query serving (but in the background), so it is important to make it as lightweight as possible... to minimise the downtime (pause time) of online queries. In contrast, the major compaction is invoked in offpeak time and can usually be assumed to have resources exclusively. It may have a different performance optimization goal... Correct me if wrong, but let me know if HBase does design its different compaction mechanisms this way..? Regards, Yun
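The promotion rule JM describes can be condensed into a tiny predicate. This is purely illustrative, not HBase's actual implementation (the real decision lives in the RegionServer's compaction policy): if the policy ends up selecting every store file of the store, the "minor" compaction is run as a major one, since rewriting everything also lets it drop delete markers and deleted cells.

```java
import java.util.List;

// Illustrative sketch of "a minor compaction promotes itself to a major
// compaction when the policy selects all the store files". Not HBase code.
public class CompactionKind {
    public static boolean promoteToMajor(List<String> selectedFiles,
                                         List<String> allStoreFiles) {
        // All files selected => merging them yields one file per store
        // anyway, so run it as a major compaction and reclaim deletes too.
        return selectedFiles.size() == allStoreFiles.size()
                && allStoreFiles.containsAll(selectedFiles);
    }
}
```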
Re: running MR job and puts on the same table
Hi Rohit, It will always be consistent. I don't see why there would be any inconsistency in the scenario you described below. JM
Re: how many servers in a hbase cluster
Hi Mohammad, I am curious why you chose not to put the third ZK on the NN+JT? I was planning on doing that on a new cluster and want to confirm it would be okay. -- Iain Wright Cell: (562) 852-5916 http://www.labctsi.org/ This email message is confidential, intended only for the recipient(s) named above and may contain information that is privileged, exempt from disclosure under applicable law. If you are not the intended recipient, do not disclose or disseminate the message to anyone except the intended recipient. If you have received this message in error, or are not the named recipient(s), please immediately notify the sender by return email, and delete all copies of this message.
But you should always keep in mind that HBase is kinda greedy when it comes to memory. For a decent load, 4G is sufficient, IMHO. But it again depends on the operations you are gonna perform. If you have large clusters where you are planning to run MR jobs frequently, you are better off with an additional 2G. Warm Regards, Tariq cloudfront.blogspot.com On Sat, Jun 22, 2013 at 7:51 PM, myhbase myhb...@126.com wrote: Hello All, I have learned HBase almost entirely from papers and books; according to my understanding, HBase is the kind of architecture which is more applicable to a big cluster. We
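As a hedged illustration of the "4G is sufficient" guideline above: in the 0.94-era distribution the daemon heap is set in conf/hbase-env.sh. The value below is an assumption to adapt to your own load and available RAM, not a recommendation:

```shell
# Sketch of a conf/hbase-env.sh heap setting: give the HBase daemons a
# 4 GB heap, per the rough guideline in this thread. Value is in megabytes.
export HBASE_HEAPSIZE=4096
```

If you later run frequent MR jobs on the same nodes, budget the extra ~2G Tariq mentions on top of this before raising the heap.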
Re: how many servers in a hbase cluster
Hello Iain, You would put a lot of pressure on the RAM if you do that. The NN already has a high memory requirement, and having the JT+ZK on the same machine as well would be too heavy, IMHO. Warm Regards, Tariq cloudfront.blogspot.com On Sun, Jun 23, 2013 at 4:07 AM, iain wright iainw...@gmail.com wrote: Hi Mohammad, I am curious why you chose not to put the third ZK on the NN+JT? I was planning on doing that on a new cluster and want to confirm it would be okay.
RE: difference between major and minor compactions?
Major compactions flood the network, leaving too little bandwidth for other operations. The reason major compactions are so prohibitively expensive in HBase is the 2 extra block replicas which need to be created in the cluster for every block written locally. Best regards, Vladimir Rodionov Principal Platform Engineer Carrier IQ, www.carrieriq.com e-mail: vrodio...@carrieriq.com From: Suraj Varma [svarma...@gmail.com] Sent: Saturday, June 22, 2013 11:51 AM To: user@hbase.apache.org Subject: Re: difference between major and minor compactions? In contrast, the major compaction is invoked at off-peak time and usually can be assumed to have resources exclusively. There is no resource exclusivity with major compactions. It is just more resource _intensive_, because a major compaction will rewrite all the store files to end up with a single store file per store, as described in 9.7.6.5 Compaction in the HBase book. So it is because it is so resource _intensive_ that for large clusters folks prefer to have a managed compaction (i.e. turn off major compaction and run it off hours) so that it doesn't affect latencies for low-latency consumers, for instance. --S On Sat, Jun 22, 2013 at 7:35 AM, yun peng pengyunm...@gmail.com wrote: I am more concerned with the CompactionPolicy available that allows the application to manipulate a bit how compaction should go... It looks like there is a new API in the 0.97 version, *ExploringCompactionPolicy* http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/compactions/ExploringCompactionPolicy.html , which allows the application to decide when we should have a major compaction. Stripe compaction is very interesting; I will look into it. Thanks. Yun On Sat, Jun 22, 2013 at 9:24 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Yun, There are more differences. Minor compactions do not remove the delete flags and the deleted cells; they only merge the small files into a bigger one. Only the major compaction (in 0.94) will deal with the deleted cells.
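The "managed compaction" approach Suraj describes is typically done by disabling the time-based major compaction trigger and kicking compactions off manually during off-peak hours. A hedged sketch of the hbase-site.xml side (the property name is from the 0.94-era defaults; verify against your version's hbase-default.xml):

```xml
<!-- Disable automatic, time-based major compactions; a value of 0 turns
     off the periodic trigger (the 0.94-era default is one day). Minor
     compactions and size-based promotion still happen normally. -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>
```

With that in place, a cron job can issue `echo "major_compact 'mytable'" | hbase shell` (table name here is a made-up example) during your own off-peak window.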
There are also some more compaction mechanisms coming in trunk with nice features. Look at: https://issues.apache.org/jira/browse/HBASE-7902 https://issues.apache.org/jira/browse/HBASE-7680 A minor compaction is promoted to a major compaction when the compaction policy decides to compact all the files. If all the files need to be merged, then we can run a major compaction which will do the same thing as the minor one, but with the bonus of deleting the cells marked for deletion. JM 2013/6/22 yun peng pengyunm...@gmail.com: Thanks, JM. It seems like the sole difference between major and minor compaction is the number of files involved (all of the storefiles, or just a subset). It is mentioned very briefly in http://hbase.apache.org/book/regions.arch.html that Sometimes a minor compaction will ... promote itself to being a major compaction. What does 'sometimes' exactly mean here? Is there any policy in HBase that allows the application to specify when to promote a minor compaction to a major one (like a user or some monitoring service specifying that now is off-peak time)? Yun On Sat, Jun 22, 2013 at 8:51 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Yun, A few links: - http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/ = There is a small paragraph about compactions which explains when they are triggered. - http://hbase.apache.org/book/regions.arch.html 9.7.6.5 You are almost right. The only thing is that HBase doesn't know when your off-peak is, so a major compaction can be triggered anytime if a minor one is promoted to be a major one. JM 2013/6/22 yun peng pengyunm...@gmail.com: Hi, All, I am asking about the different practices of major and minor compaction... My current understanding is that minor compaction, triggered automatically, usually runs along with online query serving (but in the background), so it is important to make it as lightweight as possible... to minimise the downtime (pause time) of online queries.
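JM's promotion rule can be sketched as a simple predicate. This is illustrative plain Java, not HBase's actual compaction-policy code: when the policy's file selection happens to cover every store file, the compaction can run as a major one, because rewriting everything is exactly what allows delete markers to be dropped safely.

```java
import java.util.*;

// Sketch of minor-to-major promotion: a compaction that includes every
// store file is equivalent to a major compaction, so it is promoted and
// gains the right to drop deleted cells and delete markers.
public class CompactionPromotion {
    static boolean promotedToMajor(Set<String> selectedFiles, Set<String> allStoreFiles) {
        return selectedFiles.containsAll(allStoreFiles);
    }

    public static void main(String[] args) {
        Set<String> all = new HashSet<>(Arrays.asList("hf1", "hf2", "hf3"));
        Set<String> subset = new HashSet<>(Arrays.asList("hf1", "hf2"));
        System.out.println(promotedToMajor(subset, all)); // minor: subset only
        System.out.println(promotedToMajor(all, all));    // promoted: all files
    }
}
```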
In contrast, the major compaction is invoked at off-peak time and usually can be assumed to have resources exclusively. It may have a different performance optimization goal... Correct me if wrong, but let me know whether HBase does design the different compaction mechanisms this way..? Regards, Yun
Hbase pseudo distributed setup not starting
After extracting, changing the etc/hosts file, I made some changes in the hdfs-site.xml and hbase-env.sh files. I can't see any HBase process running after issuing the bin/start-hbase.sh command. My hdfs-site.xml file is:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>localhost:60010</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hduser/hbase/zookeeper</value>
  </property>
</configuration>

My hbase-env.sh is:

export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers
export JAVA_HOME=/usr/local/java/jdk1.7.0
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"
export HBASE_MANAGES_ZK=false

I have also set these environment variables:

export PATH=$PATH:$HADOOP_PREFIX/bin
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
export ZOOKEEPER_HOME=/usr/local/zookeeper
export PATH=$PATH:$ZOOKEEPER_HOME/bin

but still I can't see any HBase process when I type jps in the terminal.
Re: Hbase pseudo distributed setup not starting
Is there anything in the log files? Check both logs/*.out and logs/*.log. On Sun, Jun 23, 2013 at 6:54 AM, Rajkumar rajkumar22...@gmail.com wrote: After extracting, changing the etc/hosts file, I made some changes in the hdfs-site.xml and hbase-env.sh files. I can't see any HBase process running after issuing the bin/start-hbase.sh command. [...] but still I can't see any HBase process when I type jps in the terminal. -- Ulrich Staudinger, Managing Director and Sr. Software Engineer, ActiveQuant GmbH P: +41 79 702 05 95 E: ustaudin...@activequant.com http://www.activequant.com AQ-R user? Join our mailing list: http://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/aqr-user
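The log check suggested above can be sketched as a one-liner. This is a hedged example: the directory path assumes HBASE_HOME=/usr/local/hbase as in the poster's environment, and the pattern is just a generic first pass, not an exhaustive diagnostic:

```shell
# Scan the HBase log directory for the first errors after a failed start.
# HBASE_LOG_DIR is an assumed variable; point it at your install's logs dir.
HBASE_LOG_DIR="${HBASE_LOG_DIR:-/usr/local/hbase/logs}"
grep -riE "error|fatal|exception" "$HBASE_LOG_DIR" 2>/dev/null | head -n 20
```

A daemon that dies immediately often leaves the real cause only in the .out file (e.g. a bad JAVA_HOME), so check those too if the .log files are empty.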