RE: Reply: HBase random read performance

2013-04-16 Thread Liu, Raymond
So what is lacking here? Should the actions also be parallelized inside the RS,
per region, instead of just in parallel at the RS level?
Seems this will be rather difficult to implement, and for Get, it might not be
worth it?

 
 I looked
 at src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
 in
 0.94
 
 In processBatchCallback(), starting line 1538,
 
 // step 1: break up into regionserver-sized chunks and build the data structs
 Map<HRegionLocation, MultiAction<R>> actionsByServer =
   new HashMap<HRegionLocation, MultiAction<R>>();
 for (int i = 0; i < workingList.size(); i++) {
 
 So we do group individual action by server.
 
 FYI
 
 On Mon, Apr 15, 2013 at 6:30 AM, Ted Yu yuzhih...@gmail.com wrote:
 
  Doug made a good point.
 
  Take a look at the performance gain for parallel scan (bottom chart
  compared to top chart):
  https://issues.apache.org/jira/secure/attachment/12578083/FDencode.png
 
  See
 
 https://issues.apache.org/jira/browse/HBASE-8316?focusedCommentId=13628300&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13628300
 for explanation of the two methods.
 
  Cheers
 
  On Mon, Apr 15, 2013 at 6:21 AM, Doug Meil
 doug.m...@explorysmedical.comwrote:
 
 
  Hi there, regarding this...
 
   We are passing 10,000 random row-keys as input, while HBase is
   taking around
   17 secs to return 10,000 records.
 
 
  ….  Given that you are generating 10,000 random keys, your multi-get
  is very likely hitting all 5 nodes of your cluster.
 
 
  Historically, multi-Get used to first sort the requests by RS and
  then
  *serially* go the RS to process the multi-Get.  I'm not sure of the
  current (0.94.x) behavior if it multi-threads or not.
 
  One thing you might want to consider is confirming that client
  behavior, and if it's not multi-threading then perform a test that
  does the same RS sorting via...
 
 
  http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable
  .html#
  getRegionLocation%28byte[http://hbase.apache.org/apidocs/org/apache/
  hadoop/hbase/client/HTable.html#getRegionLocation%28byte[
  ]%29
 
  …. and then spin up your own threads (one per target RS) and see what
  happens.
 
 
 
  On 4/15/13 9:04 AM, Ankit Jain ankitjainc...@gmail.com wrote:
 
  Hi Liang,
  
  Thanks Liang for reply..
  
  Ans1:
  I tried using an HFile block size of 32 KB with the bloom filter enabled.
  The random read performance is 10,000 records in 23 secs.
  
  Ans2:
  We are retrieving all the 10,000 rows in one call.
  
  Ans3:
  Disk detail:
  Model Number:   ST2000DM001-1CH164
  Serial Number:  Z1E276YF
  
  Please suggest some more optimization
  
  Thanks,
  Ankit Jain
  
  On Mon, Apr 15, 2013 at 5:11 PM, 谢良 xieli...@xiaomi.com wrote:
  
   First, it's probably helpless to set block size to 4KB, please
   refer to the beginning of HFile.java:
  
Smaller blocks are good
* for random access, but require more memory to hold the block
  index, and  may
* be slower to create (because we must flush the compressor
  stream at the
* conclusion of each data block, which leads to an FS I/O flush).
   Further, due
* to the internal caching in Compression codec, the smallest
  possible  block
* size would be around 20KB-30KB.
  
   Second, is it a single-thread test client or multi-threads? we
   couldn't expect too much if the requests are one by one.
  
   Third, could you provide more info about  your DN disk numbers and
   IO utils ?
  
   Thanks,
   Liang
   
   From: Ankit Jain [ankitjainc...@gmail.com]
   Sent: April 15, 2013 18:53
   To: user@hbase.apache.org
   Subject: Re: HBase random read performance
  
   Hi Anoop,
  
   Thanks for the reply..
  
   I tried setting the HFile block size to 4KB and also enabled the bloom
   filter (ROW). The maximum read performance that I was able to
   achieve is
   10,000 records in 14 secs (size of record is 1.6KB).
  
   Please suggest some tuning..
  
   Thanks,
   Ankit Jain
  
  
  
   On Mon, Apr 15, 2013 at 4:12 PM, Rishabh Agrawal 
   rishabh.agra...@impetus.co.in wrote:
  
Interesting. Can you explain why this happens?
   
-Original Message-
From: Anoop Sam John [mailto:anoo...@huawei.com]
Sent: Monday, April 15, 2013 3:47 PM
To: user@hbase.apache.org
Subject: RE: HBase random read performance
   
Ankit
 I guess you might be having default HFile block
size which is 64KB.
For random gets a lower value will be better. Try with something like
8KB and check the latency?
   
Ya, of course blooms can help (if major compaction was not done at the
time of testing)
   
-Anoop-

From: Ankit Jain [ankitjainc...@gmail.com]
Sent: Saturday, April 13, 2013 11:01 AM
To: user@hbase.apache.org
Subject: HBase random read performance
   
Hi All,
   
 We are using HBase 0.94.5 and Hadoop 1.0.4.

Re: Reply: HBase random read performance

2013-04-16 Thread Nicolas Liochon
I think there is something in the middle that could be done. It was
discussed here a while ago, but without any JIRA created.  See thread:
http://mail-archives.apache.org/mod_mbox/hbase-user/201302.mbox/%3CCAKxWWm19OC+dePTK60bMmcecv=7tc+3t4-bq6fdqeppix_e...@mail.gmail.com%3E

If someone can spend some time on it, I can create the JIRA...

Nicolas


On Tue, Apr 16, 2013 at 9:49 AM, Liu, Raymond raymond@intel.com wrote:

 So what is lacking here? Should the actions also be parallelized inside the RS,
 per region, instead of just in parallel at the RS level?
 Seems this will be rather difficult to implement, and for Get, it might not
 be worth it?

 
  I looked
  at src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
  in
  0.94
 
  In processBatchCallback(), starting line 1538,
 
  // step 1: break up into regionserver-sized chunks and build the data structs
  Map<HRegionLocation, MultiAction<R>> actionsByServer =
    new HashMap<HRegionLocation, MultiAction<R>>();
  for (int i = 0; i < workingList.size(); i++) {
 
  So we do group individual action by server.
 
  FYI
 
  On Mon, Apr 15, 2013 at 6:30 AM, Ted Yu yuzhih...@gmail.com wrote:
 
   Doug made a good point.
  
   Take a look at the performance gain for parallel scan (bottom chart
   compared to top chart):
   https://issues.apache.org/jira/secure/attachment/12578083/FDencode.png
  
   See
  
  https://issues.apache.org/jira/browse/HBASE-8316?focusedCommentId=13628300&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13628300
  for explanation of the two methods.
  
   Cheers
  
   On Mon, Apr 15, 2013 at 6:21 AM, Doug Meil
  doug.m...@explorysmedical.comwrote:
  
  
   Hi there, regarding this...
  
We are passing 10,000 random row-keys as input, while HBase is
taking around
17 secs to return 10,000 records.
  
  
   ….  Given that you are generating 10,000 random keys, your multi-get
   is very likely hitting all 5 nodes of your cluster.
  
  
   Historically, multi-Get used to first sort the requests by RS and
   then
   *serially* go the RS to process the multi-Get.  I'm not sure of the
   current (0.94.x) behavior if it multi-threads or not.
  
   One thing you might want to consider is confirming that client
   behavior, and if it's not multi-threading then perform a test that
   does the same RS sorting via...
  
  
   http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#getRegionLocation%28byte[]%29
  
   …. and then spin up your own threads (one per target RS) and see what
   happens.
  
  
  
   On 4/15/13 9:04 AM, Ankit Jain ankitjainc...@gmail.com wrote:
  
   Hi Liang,
   
   Thanks Liang for reply..
   
   Ans1:
   I tried using an HFile block size of 32 KB with the bloom filter enabled.
   The random read performance is 10,000 records in 23 secs.
   
   Ans2:
   We are retrieving all the 10,000 rows in one call.
   
   Ans3:
   Disk detail:
   Model Number:   ST2000DM001-1CH164
   Serial Number:  Z1E276YF
   
   Please suggest some more optimization
   
   Thanks,
   Ankit Jain
   
   On Mon, Apr 15, 2013 at 5:11 PM, 谢良 xieli...@xiaomi.com wrote:
   
First, it's probably helpless to set block size to 4KB, please
refer to the beginning of HFile.java:
   
 Smaller blocks are good
 * for random access, but require more memory to hold the block
   index, and  may
 * be slower to create (because we must flush the compressor
   stream at the
 * conclusion of each data block, which leads to an FS I/O flush).
Further, due
 * to the internal caching in Compression codec, the smallest
   possible  block
 * size would be around 20KB-30KB.
   
Second, is it a single-thread test client or multi-threads? we
couldn't expect too much if the requests are one by one.
   
Third, could you provide more info about  your DN disk numbers and
IO utils ?
   
Thanks,
Liang

From: Ankit Jain [ankitjainc...@gmail.com]
Sent: April 15, 2013 18:53
To: user@hbase.apache.org
Subject: Re: HBase random read performance
   
Hi Anoop,
   
Thanks for the reply..
   
I tried setting the HFile block size to 4KB and also enabled the bloom
filter (ROW). The maximum read performance that I was able to
achieve is
10,000 records in 14 secs (size of record is 1.6KB).
   
Please suggest some tuning..
   
Thanks,
Ankit Jain
   
   
   
On Mon, Apr 15, 2013 at 4:12 PM, Rishabh Agrawal 
rishabh.agra...@impetus.co.in wrote:
   
 Interesting. Can you explain why this happens?

 -Original Message-
 From: Anoop Sam John [mailto:anoo...@huawei.com]
 Sent: Monday, April 15, 2013 3:47 PM
 To: user@hbase.apache.org
 Subject: RE: HBase random read performance

 Ankit
  I 

Re: Reply: HBase random read performance

2013-04-16 Thread Jean-Marc Spaggiari
Hi Nicolas,

I think it might be good to create a JIRA for that anyway, since it seems that
some users are expecting this behaviour.

My 2¢ ;)

JM

2013/4/16 Nicolas Liochon nkey...@gmail.com

 I think there is something in the middle that could be done. It was
 discussed here a while ago, but without any JIRA created.  See thread:

 http://mail-archives.apache.org/mod_mbox/hbase-user/201302.mbox/%3CCAKxWWm19OC+dePTK60bMmcecv=7tc+3t4-bq6fdqeppix_e...@mail.gmail.com%3E

 If someone can spend some time on it, I can create the JIRA...

 Nicolas


 On Tue, Apr 16, 2013 at 9:49 AM, Liu, Raymond raymond@intel.com
 wrote:

  So what is lacking here? Should the actions also be parallelized inside the RS,
  per region, instead of just in parallel at the RS level?
  Seems this will be rather difficult to implement, and for Get, it might not
  be worth it?
 
  
   I looked
   at src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
   in
   0.94
  
   In processBatchCallback(), starting line 1538,
  
   // step 1: break up into regionserver-sized chunks and build the data structs
   Map<HRegionLocation, MultiAction<R>> actionsByServer =
     new HashMap<HRegionLocation, MultiAction<R>>();
   for (int i = 0; i < workingList.size(); i++) {
  
   So we do group individual action by server.
  
   FYI
  
   On Mon, Apr 15, 2013 at 6:30 AM, Ted Yu yuzhih...@gmail.com wrote:
  
Doug made a good point.
   
Take a look at the performance gain for parallel scan (bottom chart
compared to top chart):
   
 https://issues.apache.org/jira/secure/attachment/12578083/FDencode.png
   
See
   
   https://issues.apache.org/jira/browse/HBASE-8316?focusedCommentId=13628300&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13628300
   for explanation of the two methods.
   
Cheers
   
On Mon, Apr 15, 2013 at 6:21 AM, Doug Meil
   doug.m...@explorysmedical.comwrote:
   
   
Hi there, regarding this...
   
 We are passing 10,000 random row-keys as input, while HBase is
 taking around
 17 secs to return 10,000 records.
   
   
….  Given that you are generating 10,000 random keys, your multi-get
is very likely hitting all 5 nodes of your cluster.
   
   
Historically, multi-Get used to first sort the requests by RS and
then
*serially* go the RS to process the multi-Get.  I'm not sure of the
current (0.94.x) behavior if it multi-threads or not.
   
One thing you might want to consider is confirming that client
behavior, and if it's not multi-threading then perform a test that
does the same RS sorting via...
   
   
   
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#getRegionLocation%28byte[]%29
   
…. and then spin up your own threads (one per target RS) and see
 what
happens.
   
   
   
On 4/15/13 9:04 AM, Ankit Jain ankitjainc...@gmail.com wrote:
   
Hi Liang,

Thanks Liang for reply..

Ans1:
I tried using an HFile block size of 32 KB with the bloom filter enabled.
The random read performance is 10,000 records in 23 secs.

Ans2:
We are retrieving all the 10,000 rows in one call.

Ans3:
Disk detail:
Model Number:   ST2000DM001-1CH164
Serial Number:  Z1E276YF

Please suggest some more optimization

Thanks,
Ankit Jain

On Mon, Apr 15, 2013 at 5:11 PM, 谢良 xieli...@xiaomi.com wrote:

 First, it's probably helpless to set block size to 4KB, please
 refer to the beginning of HFile.java:

  Smaller blocks are good
  * for random access, but require more memory to hold the block
index, and  may
  * be slower to create (because we must flush the compressor
stream at the
  * conclusion of each data block, which leads to an FS I/O
 flush).
 Further, due
  * to the internal caching in Compression codec, the smallest
possible  block
  * size would be around 20KB-30KB.

 Second, is it a single-thread test client or multi-threads? we
 couldn't expect too much if the requests are one by one.

 Third, could you provide more info about  your DN disk numbers
 and
 IO utils ?

 Thanks,
 Liang
 
 From: Ankit Jain [ankitjainc...@gmail.com]
 Sent: April 15, 2013 18:53
 To: user@hbase.apache.org
 Subject: Re: HBase random read performance

 Hi Anoop,

 Thanks for the reply..

 I tried setting the HFile block size to 4KB and also enabled the bloom
 filter (ROW). The maximum read performance that I was able to
 achieve is
 10,000 records in 14 secs (size of record is 1.6KB).

 Please suggest some tuning..

 Thanks,
 Ankit Jain



 On Mon, Apr 15, 2013 at 4:12 PM, Rishabh 

Re: Reply: HBase random read performance

2013-04-16 Thread lars hofhansl
This is fundamentally different, though. A scanner by default scans all regions
serially, because it promises to return all rows in sort order.
A multi-get is already parallelized across regions (and hence across region
servers).


Before we do a lot of work here we should first make sure that nothing else is
wrong with the OP's setup.
17s for 10,000 is not right.


Ankit, what does the IO look like across the machines in the cluster while this 
is happening?

Since you pick 10,000 rows at random, is your expectation that the entire set of rows
will fit into the block cache? Is that the case?

-- Lars




 From: Ted Yu yuzhih...@gmail.com
To: user@hbase.apache.org 
Sent: Monday, April 15, 2013 10:03 AM
Subject: Re: Reply: HBase random read performance
 

This is a related JIRA which should provide noticeable speed up:

HBASE-1935 Scan in parallel

Cheers

On Mon, Apr 15, 2013 at 7:13 AM, Ted Yu yuzhih...@gmail.com wrote:

 I looked
 at src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java in
 0.94

 In processBatchCallback(), starting line 1538,

         // step 1: break up into regionserver-sized chunks and build the data structs
         Map<HRegionLocation, MultiAction<R>> actionsByServer =
           new HashMap<HRegionLocation, MultiAction<R>>();
         for (int i = 0; i < workingList.size(); i++) {

 So we do group individual action by server.

 FYI

 On Mon, Apr 15, 2013 at 6:30 AM, Ted Yu yuzhih...@gmail.com wrote:

 Doug made a good point.

 Take a look at the performance gain for parallel scan (bottom chart
 compared to top chart):
 https://issues.apache.org/jira/secure/attachment/12578083/FDencode.png

 See
 https://issues.apache.org/jira/browse/HBASE-8316?focusedCommentId=13628300&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13628300
 for explanation of the two methods.

 Cheers

 On Mon, Apr 15, 2013 at 6:21 AM, Doug Meil doug.m...@explorysmedical.com
  wrote:


 Hi there, regarding this...

  We are passing 10,000 random row-keys as input, while HBase is taking
 around
  17 secs to return 10,000 records.


 ….  Given that you are generating 10,000 random keys, your multi-get is
 very likely hitting all 5 nodes of your cluster.


 Historically, multi-Get used to first sort the requests by RS and then
 *serially* go the RS to process the multi-Get.  I'm not sure of the
 current (0.94.x) behavior if it multi-threads or not.

 One thing you might want to consider is confirming that client behavior,
 and if it's not multi-threading then perform a test that does the same RS
 sorting via...


 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#getRegionLocation%28byte[]%29

 …. and then spin up your own threads (one per target RS) and see what
 happens.



 On 4/15/13 9:04 AM, Ankit Jain ankitjainc...@gmail.com wrote:

 Hi Liang,
 
 Thanks Liang for reply..
 
 Ans1:
 I tried using an HFile block size of 32 KB with the bloom filter enabled.
 The random read performance is 10,000 records in 23 secs.
 
 Ans2:
 We are retrieving all the 10,000 rows in one call.
 
 Ans3:
 Disk detail:
 Model Number:       ST2000DM001-1CH164
 Serial Number:      Z1E276YF
 
 Please suggest some more optimization
 
 Thanks,
 Ankit Jain
 
 On Mon, Apr 15, 2013 at 5:11 PM, 谢良 xieli...@xiaomi.com wrote:
 
  First, it's probably helpless to set block size to 4KB, please refer
 to
  the beginning of HFile.java:
 
   Smaller blocks are good
   * for random access, but require more memory to hold the block index,
 and
  may
   * be slower to create (because we must flush the compressor stream at
 the
   * conclusion of each data block, which leads to an FS I/O flush).
  Further, due
   * to the internal caching in Compression codec, the smallest possible
  block
   * size would be around 20KB-30KB.
 
  Second, is it a single-thread test client or multi-threads? we
 couldn't
  expect too much if the requests are one by one.
 
  Third, could you provide more info about  your DN disk numbers and IO
  utils ?
 
  Thanks,
  Liang
  
  From: Ankit Jain [ankitjainc...@gmail.com]
  Sent: April 15, 2013 18:53
  To: user@hbase.apache.org
  Subject: Re: HBase random read performance
 
  Hi Anoop,
 
  Thanks for the reply..
 
  I tried setting the HFile block size to 4KB and also enabled the bloom
  filter (ROW). The maximum read performance that I was able to achieve
 is
  10,000 records in 14 secs (size of record is 1.6KB).
 
  Please suggest some tuning..
 
  Thanks,
  Ankit Jain
 
 
 
  On Mon, Apr 15, 2013 at 4:12 PM, Rishabh Agrawal 
  rishabh.agra...@impetus.co.in wrote:
 
   Interesting. Can you explain why this happens?
  
   -Original Message-
   From: Anoop Sam John [mailto:anoo...@huawei.com]
   Sent: Monday, April 15, 2013 3:47 PM
   To: user@hbase.apache.org
   Subject: RE: HBase random read performance
  
   Ankit

Re: Reply: HBase random read performance

2013-04-15 Thread Ankit Jain
Hi Liang,

Thanks Liang for reply..

Ans1:
I tried using an HFile block size of 32 KB with the bloom filter enabled. The
random read performance is 10,000 records in 23 secs.

Ans2:
We are retrieving all the 10,000 rows in one call.

Ans3:
Disk detail:
Model Number:   ST2000DM001-1CH164
Serial Number:  Z1E276YF

Please suggest some more optimization

Thanks,
Ankit Jain

On Mon, Apr 15, 2013 at 5:11 PM, 谢良 xieli...@xiaomi.com wrote:

 First, it's probably helpless to set block size to 4KB, please refer to
 the beginning of HFile.java:

  Smaller blocks are good
  * for random access, but require more memory to hold the block index, and
 may
  * be slower to create (because we must flush the compressor stream at the
  * conclusion of each data block, which leads to an FS I/O flush).
 Further, due
  * to the internal caching in Compression codec, the smallest possible
 block
  * size would be around 20KB-30KB.

 Second, is it a single-thread test client or multi-threads? we couldn't
 expect too much if the requests are one by one.

 Third, could you provide more info about  your DN disk numbers and IO
 utils ?

 Thanks,
 Liang
 
 From: Ankit Jain [ankitjainc...@gmail.com]
 Sent: April 15, 2013 18:53
 To: user@hbase.apache.org
 Subject: Re: HBase random read performance

 Hi Anoop,

 Thanks for the reply..

 I tried setting the HFile block size to 4KB and also enabled the bloom
 filter (ROW). The maximum read performance that I was able to achieve is
 10,000 records in 14 secs (size of record is 1.6KB).

 Please suggest some tuning..

 Thanks,
 Ankit Jain



 On Mon, Apr 15, 2013 at 4:12 PM, Rishabh Agrawal 
 rishabh.agra...@impetus.co.in wrote:

  Interesting. Can you explain why this happens?
 
  -Original Message-
  From: Anoop Sam John [mailto:anoo...@huawei.com]
  Sent: Monday, April 15, 2013 3:47 PM
  To: user@hbase.apache.org
  Subject: RE: HBase random read performance
 
  Ankit
   I guess you might be having default HFile block size
  which is 64KB.
  For random gets a lower value will be better. Try with something like
 8KB
  and check the latency?
 
  Ya, of course blooms can help (if major compaction was not done at the time
  of testing)
 
  -Anoop-
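
  [As a concrete illustration of the block-size and bloom-filter advice in this
  sub-thread, here is a hedged sketch of creating a table tuned for random Gets
  with the 0.94 admin API. The table and family names, the split keys, and the
  32 KB block size (the value the OP tried; Anoop suggests 8 KB, while the
  HFile.java comment quoted elsewhere in the thread puts the practical floor
  around 20-30 KB) are illustrative choices, not settings confirmed by the thread.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.regionserver.StoreFile;
  import org.apache.hadoop.hbase.util.Bytes;

  public class CreateRandomReadTable {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);

      HTableDescriptor desc = new HTableDescriptor("t1");   // illustrative table name
      HColumnDescriptor cf = new HColumnDescriptor("cf");   // illustrative family name
      cf.setBlocksize(32 * 1024);                           // default is 64 KB
      cf.setBloomFilterType(StoreFile.BloomType.ROW);       // row-level bloom filter
      desc.addFamily(cf);

      // pre-split into 16 regions, as the OP did (split key range is illustrative)
      admin.createTable(desc, Bytes.toBytes("row-00000000"),
          Bytes.toBytes("row-99999999"), 16);
      admin.close();
    }
  }
  ]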
  
  From: Ankit Jain [ankitjainc...@gmail.com]
  Sent: Saturday, April 13, 2013 11:01 AM
  To: user@hbase.apache.org
  Subject: HBase random read performance
 
  Hi All,
 
  We are using HBase 0.94.5 and Hadoop 1.0.4.
 
  We have HBase cluster of 5 nodes(5 regionservers and 1 master node). Each
  regionserver has 8 GB RAM.
 
  We have loaded 25 million records in the HBase table; regions are pre-split
  into 16 regions and all the regions are equally loaded.
 
  We are getting very low random read performance while performing multi
 get
  from HBase.
 
  We are passing 10,000 random row-keys as input, while HBase is taking
 around
  17 secs to return 10,000 records.
 
  Please suggest some tuning to increase HBase read performance.
 
  Thanks,
  Ankit Jain
  iLabs
 
 
 
  --
  Thanks,
  Ankit Jain
 
  
 
 
 
 
 
 
 



 --
 Thanks,
 Ankit Jain




-- 
Thanks,
Ankit Jain


Re: Reply: HBase random read performance

2013-04-15 Thread Doug Meil

Hi there, regarding this...

 We are passing 10,000 random row-keys as input, while HBase is taking
around
 17 secs to return 10,000 records.


….  Given that you are generating 10,000 random keys, your multi-get is
very likely hitting all 5 nodes of your cluster.


Historically, multi-Get used to first sort the requests by RS and then
*serially* go the RS to process the multi-Get.  I'm not sure of the
current (0.94.x) behavior if it multi-threads or not.

One thing you might want to consider is confirming that client behavior,
and if it's not multi-threading then perform a test that does the same RS
sorting via...

http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#
getRegionLocation%28byte[]%29

…. and then spin up your own threads (one per target RS) and see what
happens.
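
[A minimal sketch of the client-side test Doug describes here, assuming the
0.94-era HTable API: bucket the Gets by hosting region server via
HTable.getRegionLocation(), then issue one batched get per server on its own
thread. The class name and structure below are illustrative, not code from
this thread.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

public class ParallelMultiGet {

  // Groups row keys by the region server currently hosting them, then runs one
  // multi-get per server on its own thread. Results come back in no particular order.
  public static List<Result> get(final Configuration conf, final byte[] tableName,
      List<byte[]> rowKeys) throws Exception {

    // step 1: bucket row keys by hosting region server (hostname:port)
    Map<String, List<Get>> getsByServer = new HashMap<String, List<Get>>();
    HTable locator = new HTable(conf, tableName);
    try {
      for (byte[] row : rowKeys) {
        HRegionLocation loc = locator.getRegionLocation(row);
        String server = loc.getHostname() + ":" + loc.getPort();
        List<Get> gets = getsByServer.get(server);
        if (gets == null) {
          gets = new ArrayList<Get>();
          getsByServer.put(server, gets);
        }
        gets.add(new Get(row));
      }
    } finally {
      locator.close();
    }

    // step 2: one thread per target region server, each issuing a single batch get
    ExecutorService pool = Executors.newFixedThreadPool(getsByServer.size());
    try {
      List<Future<Result[]>> futures = new ArrayList<Future<Result[]>>();
      for (final List<Get> gets : getsByServer.values()) {
        futures.add(pool.submit(new Callable<Result[]>() {
          public Result[] call() throws Exception {
            HTable table = new HTable(conf, tableName); // HTable is not thread-safe
            try {
              return table.get(gets);
            } finally {
              table.close();
            }
          }
        }));
      }
      List<Result> results = new ArrayList<Result>();
      for (Future<Result[]> f : futures) {
        for (Result r : f.get()) {
          results.add(r);
        }
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }
}

Comparing this against a single HTable.get(List<Get>) over the same keys would
show whether the stock client is already parallelizing per server.]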



On 4/15/13 9:04 AM, Ankit Jain ankitjainc...@gmail.com wrote:

Hi Liang,

Thanks Liang for reply..

Ans1:
I tried using an HFile block size of 32 KB with the bloom filter enabled.
The random read performance is 10,000 records in 23 secs.

Ans2:
We are retrieving all the 10,000 rows in one call.

Ans3:
Disk detail:
Model Number:   ST2000DM001-1CH164
Serial Number:  Z1E276YF

Please suggest some more optimization

Thanks,
Ankit Jain

On Mon, Apr 15, 2013 at 5:11 PM, 谢良 xieli...@xiaomi.com wrote:

 First, it's probably helpless to set block size to 4KB, please refer to
 the beginning of HFile.java:

  Smaller blocks are good
  * for random access, but require more memory to hold the block index,
and
 may
  * be slower to create (because we must flush the compressor stream at
the
  * conclusion of each data block, which leads to an FS I/O flush).
 Further, due
  * to the internal caching in Compression codec, the smallest possible
 block
  * size would be around 20KB-30KB.

 Second, is it a single-thread test client or multi-threads? we couldn't
 expect too much if the requests are one by one.

 Third, could you provide more info about  your DN disk numbers and IO
 utils ?

 Thanks,
 Liang
 
 From: Ankit Jain [ankitjainc...@gmail.com]
 Sent: April 15, 2013 18:53
 To: user@hbase.apache.org
 Subject: Re: HBase random read performance

 Hi Anoop,

 Thanks for the reply..

 I tried setting the HFile block size to 4KB and also enabled the bloom
 filter (ROW). The maximum read performance that I was able to achieve is
 10,000 records in 14 secs (size of record is 1.6KB).

 Please suggest some tuning..

 Thanks,
 Ankit Jain



 On Mon, Apr 15, 2013 at 4:12 PM, Rishabh Agrawal 
 rishabh.agra...@impetus.co.in wrote:

  Interesting. Can you explain why this happens?
 
  -Original Message-
  From: Anoop Sam John [mailto:anoo...@huawei.com]
  Sent: Monday, April 15, 2013 3:47 PM
  To: user@hbase.apache.org
  Subject: RE: HBase random read performance
 
  Ankit
   I guess you might be having default HFile block size
  which is 64KB.
  For random gets a lower value will be better. Try with something like
 8KB
  and check the latency?
 
  Ya, of course blooms can help (if major compaction was not done at the
  time of testing)
 
  -Anoop-
  
  From: Ankit Jain [ankitjainc...@gmail.com]
  Sent: Saturday, April 13, 2013 11:01 AM
  To: user@hbase.apache.org
  Subject: HBase random read performance
 
  Hi All,
 
  We are using HBase 0.94.5 and Hadoop 1.0.4.
 
  We have HBase cluster of 5 nodes(5 regionservers and 1 master node).
Each
  regionserver has 8 GB RAM.
 
  We have loaded 25 million records in the HBase table; regions are
pre-split
  into 16 regions and all the regions are equally loaded.
 
  We are getting very low random read performance while performing multi
 get
  from HBase.
 
  We are passing 10,000 random row-keys as input, while HBase is taking
 around
  17 secs to return 10,000 records.
 
  Please suggest some tuning to increase HBase read performance.
 
  Thanks,
  Ankit Jain
  iLabs
 
 
 
  --
  Thanks,
  Ankit Jain
 
  
 
 
 
 
 
 
 



 --
 Thanks,
 Ankit Jain




-- 
Thanks,
Ankit Jain



Re: Reply: HBase random read performance

2013-04-15 Thread Ted Yu
Doug made a good point.

Take a look at the performance gain for parallel scan (bottom chart
compared to top chart):
https://issues.apache.org/jira/secure/attachment/12578083/FDencode.png

See
https://issues.apache.org/jira/browse/HBASE-8316?focusedCommentId=13628300&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13628300
for explanation of the two methods.

Cheers

On Mon, Apr 15, 2013 at 6:21 AM, Doug Meil doug.m...@explorysmedical.comwrote:


 Hi there, regarding this...

  We are passing 10,000 random row-keys as input, while HBase is taking
 around
  17 secs to return 10,000 records.


 ….  Given that you are generating 10,000 random keys, your multi-get is
 very likely hitting all 5 nodes of your cluster.


 Historically, multi-Get used to first sort the requests by RS and then
 *serially* go the RS to process the multi-Get.  I'm not sure of the
 current (0.94.x) behavior if it multi-threads or not.

 One thing you might want to consider is confirming that client behavior,
 and if it's not multi-threading then perform a test that does the same RS
 sorting via...

 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#
 getRegionLocation%28byte[]%29

 …. and then spin up your own threads (one per target RS) and see what
 happens.



 On 4/15/13 9:04 AM, Ankit Jain ankitjainc...@gmail.com wrote:

 Hi Liang,
 
 Thanks Liang for reply..
 
 Ans1:
 I tried using an HFile block size of 32 KB with the bloom filter enabled.
 The random read performance is 10,000 records in 23 secs.
 
 Ans2:
 We are retrieving all the 10,000 rows in one call.
 
 Ans3:
 Disk detail:
 Model Number:   ST2000DM001-1CH164
 Serial Number:  Z1E276YF
 
 Please suggest some more optimization
 
 Thanks,
 Ankit Jain
 
 On Mon, Apr 15, 2013 at 5:11 PM, 谢良 xieli...@xiaomi.com wrote:
 
  First, it's probably helpless to set block size to 4KB, please refer to
  the beginning of HFile.java:
 
   Smaller blocks are good
   * for random access, but require more memory to hold the block index,
 and
  may
   * be slower to create (because we must flush the compressor stream at
 the
   * conclusion of each data block, which leads to an FS I/O flush).
  Further, due
   * to the internal caching in Compression codec, the smallest possible
  block
   * size would be around 20KB-30KB.
 
  Second, is it a single-thread test client or multi-threads? we couldn't
  expect too much if the requests are one by one.
 
  Third, could you provide more info about  your DN disk numbers and IO
  utils ?
 
  Thanks,
  Liang
  
  From: Ankit Jain [ankitjainc...@gmail.com]
  Sent: April 15, 2013 18:53
  To: user@hbase.apache.org
  Subject: Re: HBase random read performance
 
  Hi Anoop,
 
  Thanks for the reply..
 
  I tried setting the HFile block size to 4KB and also enabled the bloom
  filter (ROW). The maximum read performance that I was able to achieve is
  10,000 records in 14 secs (size of record is 1.6KB).
 
  Please suggest some tuning..
 
  Thanks,
  Ankit Jain
 
 
 
  On Mon, Apr 15, 2013 at 4:12 PM, Rishabh Agrawal 
  rishabh.agra...@impetus.co.in wrote:
 
   Interesting. Can you explain why this happens?
  
   -Original Message-
   From: Anoop Sam John [mailto:anoo...@huawei.com]
   Sent: Monday, April 15, 2013 3:47 PM
   To: user@hbase.apache.org
   Subject: RE: HBase random read performance
  
   Ankit
I guess you might be having default HFile block size
   which is 64KB.
   For random gets a lower value will be better. Try with something like
  8KB
   and check the latency?
  
   Ya, of course blooms can help (if major compaction was not done at the
 time
   of testing)
  
   -Anoop-
   
   From: Ankit Jain [ankitjainc...@gmail.com]
   Sent: Saturday, April 13, 2013 11:01 AM
   To: user@hbase.apache.org
   Subject: HBase random read performance
  
   Hi All,
  
   We are using HBase 0.94.5 and Hadoop 1.0.4.
  
   We have HBase cluster of 5 nodes(5 regionservers and 1 master node).
 Each
   regionserver has 8 GB RAM.
  
   We have loaded 25 million records in the HBase table; regions are
 pre-split
   into 16 regions and all the regions are equally loaded.
  
   We are getting very low random read performance while performing multi
  get
   from HBase.
  
   We are passing 10,000 random row-keys as input, while HBase is taking
  around
   17 secs to return 10,000 records.
  
   Please suggest some tuning to increase HBase read performance.
  
   Thanks,
   Ankit Jain
   iLabs
  
  
  
   --
   Thanks,
   Ankit Jain
  
   
  
  
  
  
  
  

Re: Reply: HBase random read performance

2013-04-15 Thread Ted Yu
I looked
at src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java in
0.94

In processBatchCallback(), starting line 1538,

// step 1: break up into regionserver-sized chunks and build the data structs
Map<HRegionLocation, MultiAction<R>> actionsByServer =
  new HashMap<HRegionLocation, MultiAction<R>>();
for (int i = 0; i < workingList.size(); i++) {

So we do group individual action by server.

FYI
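
[For context, the batching code above is what a plain client-side multi-get
goes through. A minimal sketch of such a call follows; the table name, column
family, and key format are hypothetical, not from the thread.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiGetBenchmark {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "t1");
    Random rnd = new Random();

    // build 10,000 random Gets against a hypothetical key space of 25 million rows
    List<Get> gets = new ArrayList<Get>(10000);
    for (int i = 0; i < 10000; i++) {
      byte[] row = Bytes.toBytes(String.format("row-%08d", rnd.nextInt(25000000)));
      Get g = new Get(row);
      g.addFamily(Bytes.toBytes("cf"));
      gets.add(g);
    }

    long start = System.currentTimeMillis();
    Result[] results = table.get(gets);   // routed through the batching code shown above
    long elapsed = System.currentTimeMillis() - start;
    System.out.println(results.length + " results in " + elapsed + " ms");
    table.close();
  }
}
]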

On Mon, Apr 15, 2013 at 6:30 AM, Ted Yu yuzhih...@gmail.com wrote:

 Doug made a good point.

 Take a look at the performance gain for parallel scan (bottom chart
 compared to top chart):
 https://issues.apache.org/jira/secure/attachment/12578083/FDencode.png

 See
 https://issues.apache.org/jira/browse/HBASE-8316?focusedCommentId=13628300&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13628300
 for explanation of the two methods.

 Cheers

 On Mon, Apr 15, 2013 at 6:21 AM, Doug Meil 
 doug.m...@explorysmedical.comwrote:


 Hi there, regarding this...

  We are passing 10,000 random row-keys as input, while HBase is taking
 around
  17 secs to return 10,000 records.


 ….  Given that you are generating 10,000 random keys, your multi-get is
 very likely hitting all 5 nodes of your cluster.


 Historically, multi-Get used to first sort the requests by RS and then
 *serially* go the RS to process the multi-Get.  I'm not sure of the
 current (0.94.x) behavior if it multi-threads or not.

 One thing you might want to consider is confirming that client behavior,
 and if it's not multi-threading then perform a test that does the same RS
 sorting via...


 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#getRegionLocation%28byte[]%29

 …. and then spin up your own threads (one per target RS) and see what
 happens.



 On 4/15/13 9:04 AM, Ankit Jain ankitjainc...@gmail.com wrote:

 Hi Liang,
 
 Thanks Liang for reply..
 
 Ans1:
 I tried using an HFile block size of 32 KB with the bloom filter enabled.
 The random read performance is 10,000 records in 23 secs.
 
 Ans2:
 We are retrieving all the 10,000 rows in one call.
 
 Ans3:
 Disk detail:
 Model Number:   ST2000DM001-1CH164
 Serial Number:  Z1E276YF
 
 Please suggest some more optimization
 
 Thanks,
 Ankit Jain
 
 On Mon, Apr 15, 2013 at 5:11 PM, 谢良 xieli...@xiaomi.com wrote:
 
  First, it's probably helpless to set block size to 4KB, please refer to
  the beginning of HFile.java:
 
   Smaller blocks are good
   * for random access, but require more memory to hold the block index,
 and
  may
   * be slower to create (because we must flush the compressor stream at
 the
   * conclusion of each data block, which leads to an FS I/O flush).
  Further, due
   * to the internal caching in Compression codec, the smallest possible
  block
   * size would be around 20KB-30KB.
 
  Second, is it a single-thread test client or multi-threads? we couldn't
  expect too much if the requests are one by one.
 
  Third, could you provide more info about  your DN disk numbers and IO
  utils ?
 
  Thanks,
  Liang
  
  From: Ankit Jain [ankitjainc...@gmail.com]
  Sent: April 15, 2013 18:53
  To: user@hbase.apache.org
  Subject: Re: HBase random read performance
 
  Hi Anoop,
 
  Thanks for the reply..
 
  I tried setting the HFile block size to 4KB and also enabled the bloom
  filter (ROW). The maximum read performance that I was able to achieve is
  10,000 records in 14 secs (size of record is 1.6KB).
 
  Please suggest some tuning..
 
  Thanks,
  Ankit Jain
 
 
 
  On Mon, Apr 15, 2013 at 4:12 PM, Rishabh Agrawal 
  rishabh.agra...@impetus.co.in wrote:
 
   Interesting. Can you explain why this happens?
  
   -Original Message-
   From: Anoop Sam John [mailto:anoo...@huawei.com]
   Sent: Monday, April 15, 2013 3:47 PM
   To: user@hbase.apache.org
   Subject: RE: HBase random read performance
  
   Ankit
I guess you might be having default HFile block size
   which is 64KB.
   For random gets a lower value will be better. Try with something like
  8KB
   and check the latency?
  
   Ya, of course blooms can help (if major compaction was not done at the
 time
   of testing)
  
   -Anoop-
   
   From: Ankit Jain [ankitjainc...@gmail.com]
   Sent: Saturday, April 13, 2013 11:01 AM
   To: user@hbase.apache.org
   Subject: HBase random read performance
  
   Hi All,
  
   We are using HBase 0.94.5 and Hadoop 1.0.4.
  
   We have HBase cluster of 5 nodes(5 regionservers and 1 master node).
 Each
   regionserver has 8 GB RAM.
  
   We have loaded 25 million records in the HBase table; regions are
 pre-split
   into 16 regions and all the regions are equally loaded.
  
   We are getting very low random read performance while performing
 multi
  get
   from HBase.
  
   We are passing 10,000 random row-keys as 

Re: Reply: HBase random read performance

2013-04-15 Thread Ted Yu
This is a related JIRA which should provide noticeable speed up:

HBASE-1935 Scan in parallel

Cheers
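
[For reference, a rough client-side sketch in the spirit of HBASE-1935 (not the
JIRA's actual patch): open one scanner per region via HTable.getStartEndKeys()
and drain them concurrently. The table name and the simple row counting are
illustrative; the 0.94-era client API is assumed.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Pair;

public class ParallelScan {
  public static void main(String[] args) throws Exception {
    final Configuration conf = HBaseConfiguration.create();
    final String tableName = "t1";                 // illustrative table name
    HTable table = new HTable(conf, tableName);
    // one [start, stop) key pair per region
    Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
    table.close();

    int regions = keys.getFirst().length;
    ExecutorService pool = Executors.newFixedThreadPool(regions);
    List<Future<Long>> counts = new ArrayList<Future<Long>>();
    for (int i = 0; i < regions; i++) {
      final byte[] start = keys.getFirst()[i];
      final byte[] stop = keys.getSecond()[i];
      counts.add(pool.submit(new Callable<Long>() {
        public Long call() throws Exception {
          HTable t = new HTable(conf, tableName);  // one HTable per thread
          ResultScanner scanner = t.getScanner(new Scan(start, stop));
          long rows = 0;
          try {
            for (Result r : scanner) {
              rows++;                              // rows within a region stay sorted
            }
          } finally {
            scanner.close();
            t.close();
          }
          return rows;
        }
      }));
    }

    long total = 0;
    for (Future<Long> f : counts) {
      total += f.get();
    }
    pool.shutdown();
    System.out.println("scanned " + total + " rows across " + regions + " regions");
  }
}

Note that this trades the scanner's global sort order for throughput; rows stay
ordered only within each region's slice.]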

On Mon, Apr 15, 2013 at 7:13 AM, Ted Yu yuzhih...@gmail.com wrote:

 I looked
 at src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java in
 0.94

 In processBatchCallback(), starting line 1538,

 // step 1: break up into regionserver-sized chunks and build the data structs
 Map<HRegionLocation, MultiAction<R>> actionsByServer =
   new HashMap<HRegionLocation, MultiAction<R>>();
 for (int i = 0; i < workingList.size(); i++) {

 So we do group individual action by server.

 FYI

 On Mon, Apr 15, 2013 at 6:30 AM, Ted Yu yuzhih...@gmail.com wrote:

 Doug made a good point.

 Take a look at the performance gain for parallel scan (bottom chart
 compared to top chart):
 https://issues.apache.org/jira/secure/attachment/12578083/FDencode.png

 See
 https://issues.apache.org/jira/browse/HBASE-8316?focusedCommentId=13628300&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13628300
 for explanation of the two methods.

 Cheers

 On Mon, Apr 15, 2013 at 6:21 AM, Doug Meil doug.m...@explorysmedical.com
  wrote:


 Hi there, regarding this...

  We are passing 10,000 random row-keys as input, while HBase is taking
 around
  17 secs to return 10,000 records.


 ….  Given that you are generating 10,000 random keys, your multi-get is
 very likely hitting all 5 nodes of your cluster.


 Historically, multi-Get used to first sort the requests by RS and then
 *serially* go the RS to process the multi-Get.  I'm not sure of the
 current (0.94.x) behavior if it multi-threads or not.

 One thing you might want to consider is confirming that client behavior,
 and if it's not multi-threading then perform a test that does the same RS
 sorting via...


 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#getRegionLocation%28byte[]%29

 …. and then spin up your own threads (one per target RS) and see what
 happens.



 On 4/15/13 9:04 AM, Ankit Jain ankitjainc...@gmail.com wrote:

 Hi Liang,
 
 Thanks Liang for reply..
 
 Ans1:
 I tried using an HFile block size of 32 KB with the bloom filter enabled.
 The random read performance is 10,000 records in 23 secs.
 
 Ans2:
 We are retrieving all the 10,000 rows in one call.
 
 Ans3:
 Disk detail:
 Model Number:   ST2000DM001-1CH164
 Serial Number:  Z1E276YF
 
 Please suggest some more optimization
 
 Thanks,
 Ankit Jain
 
 On Mon, Apr 15, 2013 at 5:11 PM, 谢良 xieli...@xiaomi.com wrote:
 
  First, it's probably helpless to set block size to 4KB, please refer
 to
  the beginning of HFile.java:
 
   Smaller blocks are good
   * for random access, but require more memory to hold the block index,
 and
  may
   * be slower to create (because we must flush the compressor stream at
 the
   * conclusion of each data block, which leads to an FS I/O flush).
  Further, due
   * to the internal caching in Compression codec, the smallest possible
  block
   * size would be around 20KB-30KB.
 
  Second, is it a single-thread test client or multi-threads? we
 couldn't
  expect too much if the requests are one by one.
 
  Third, could you provide more info about  your DN disk numbers and IO
  utils ?
 
  Thanks,
  Liang
  
  From: Ankit Jain [ankitjainc...@gmail.com]
  Sent: April 15, 2013 18:53
  To: user@hbase.apache.org
  Subject: Re: HBase random read performance
 
  Hi Anoop,
 
  Thanks for the reply..
 
  I tried setting the HFile block size to 4KB and also enabled the bloom
  filter (ROW). The maximum read performance that I was able to achieve
 is
  10,000 records in 14 secs (size of record is 1.6KB).
 
  Please suggest some tuning..
 
  Thanks,
  Ankit Jain
 
 
 
  On Mon, Apr 15, 2013 at 4:12 PM, Rishabh Agrawal 
  rishabh.agra...@impetus.co.in wrote:
 
   Interesting. Can you explain why this happens?
  
   -Original Message-
   From: Anoop Sam John [mailto:anoo...@huawei.com]
   Sent: Monday, April 15, 2013 3:47 PM
   To: user@hbase.apache.org
   Subject: RE: HBase random read performance
  
   Ankit
I guess you might be having default HFile block
 size
   which is 64KB.
   For random gets a lower value will be better. Try with something
 like
  8KB
   and check the latency?
  
   Ya, of course blooms can help (if major compaction was not done at the
 time
   of testing)
  
   -Anoop-
   
   From: Ankit Jain [ankitjainc...@gmail.com]
   Sent: Saturday, April 13, 2013 11:01 AM
   To: user@hbase.apache.org
   Subject: HBase random read performance
  
   Hi All,
  
   We are using HBase 0.94.5 and Hadoop 1.0.4.
  
   We have HBase cluster of 5 nodes(5 regionservers and 1 master node).
 Each
   regionserver has 8 GB RAM.
  
   We have loaded 25 million records in the HBase table; regions are
 pre-split
   into 16 regions