[ 
https://issues.apache.org/jira/browse/HBASE-29107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931890#comment-17931890
 ] 

Hudson commented on HBASE-29107:
--------------------------------

Results for branch branch-2
        [build #1240 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1240/]: 
(x) *{color:red}-1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1240/General_20Nightly_20Build_20Report/]


(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1240/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1240/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(x) {color:red}-1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1240/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(x) {color:red}-1 jdk17 hadoop3 checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1240/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(x) {color:red}-1 jdk17 hadoop ${HADOOP_THREE_VERSION} backward compatibility 
checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1240/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test for HBase 2 {color}
(/) {color:green}+1 client integration test for 3.3.5 {color}
(/) {color:green}+1 client integration test for 3.3.6 {color}
(/) {color:green}+1 client integration test for 3.4.0 {color}
(/) {color:green}+1 client integration test for 3.4.1 {color}


> shell: Improve 'count' performance
> ----------------------------------
>
>                 Key: HBASE-29107
>                 URL: https://issues.apache.org/jira/browse/HBASE-29107
>             Project: HBase
>          Issue Type: Improvement
>          Components: shell
>            Reporter: Junegunn Choi
>            Assignee: Junegunn Choi
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.6.3, 2.5.12
>
>
> I propose two changes to the HBase shell's 'count' command to improve its 
> performance.
> h2. Not setting scanner caching
> The command currently sets the scanner caching to 10 rows by default, and 
> instructs users to increase it if necessary. According to HBASE-2331, the 
> default was deliberately kept that low in case the table has large rows:
> {quote}Default value of 10 is really slow, but should be kept as low for 
> clients with huge rows.
> {quote}
> However, current versions of HBase have a better mechanism, 
> {{hbase.client.scanner.max.result.size}}, which limits the amount of data 
> returned per scanner RPC and defaults to 2MB. So just by not setting the 
> scanner caching, we automatically get better performance, and we don't have 
> to worry about huge rows.
> h3. Test
> {code:java}
> # Create table
> create 't', 'd', {NUMREGIONS => 4, SPLITALGO => 'HexStringSplit'}
> # Insert data
> data = '_' * 1024
> bm = @hbase.connection.getBufferedMutator(TableName.valueOf('t'))
> (8 * 1024 * 1024).times do |i|
>   row = format('%010x', i).reverse.to_java_bytes
>   p = org.apache.hadoop.hbase.client.Put.new(row)
>   p.addColumn('d'.to_java_bytes, ''.to_java_bytes, data.to_java_bytes)
>   bm.mutate(p)
> end
> bm.close
> # Before patch
> count 't', INTERVAL => 100000
>   # 8388608 row(s)
>   # Took 53.5826 seconds
> # Before patch with custom 'CACHE'
> count 't', INTERVAL => 100000, CACHE => 2000
>   # 8388608 row(s)
>   # Took 13.6717 seconds
> # After patch
> count 't', INTERVAL => 100000
>   # 8388608 row(s)
>   # Took 14.0911 seconds
> {code}
> The test was performed locally on my machine, so the difference in 
> performance on a real cluster should be larger.
> h2. KeyOnlyFilter
> Another thing we can do is to apply {{KeyOnlyFilter}} as well because we're 
> not interested in the values. This helps when the records are large.
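> For illustration, a minimal sketch (not the exact patch) that combines 
> {{KeyOnlyFilter}} with a {{FirstKeyOnlyFilter}} in a {{FilterList}} so the 
> counting scan returns only row keys:
> {code:java}
> # Hypothetical sketch: return only row keys from the counting scan.
> # FirstKeyOnlyFilter keeps a single cell per row; KeyOnlyFilter strips
> # cell values server-side, so large values never cross the network.
> filters = org.apache.hadoop.hbase.filter.FilterList.new(
>   org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter.new,
>   org.apache.hadoop.hbase.filter.KeyOnlyFilter.new
> )
> scan = org.apache.hadoop.hbase.client.Scan.new
> scan.setFilter(filters)
> {code}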
> h3. Test
> {code:java}
> create 't2', 'd', {NUMREGIONS => 4, SPLITALGO => 'HexStringSplit'}
> data = '_' * 1024 * 1024
> bm = @hbase.connection.getBufferedMutator(TableName.valueOf('t2'))
> (8 * 1024).times do |i|
>   row = format('%010x', i).reverse.to_java_bytes
>   p = org.apache.hadoop.hbase.client.Put.new(row)
>   p.addColumn('d'.to_java_bytes, ''.to_java_bytes, data.to_java_bytes)
>   bm.mutate(p)
> end
> bm.close
> # Before patch
> count 't2'
>   # 8192 row(s)
>   # Took 8.8952 seconds
> # After patch
> count 't2'
>   # 8192 row(s)
>   # Took 3.4052 seconds
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
