[
https://issues.apache.org/jira/browse/HBASE-27904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17748408#comment-17748408
]
Hudson commented on HBASE-27904:
--------------------------------
Results for branch branch-2
[build #852 on
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/852/]:
(x) *{color:red}-1 overall{color}*
----
details (if available):
(/) {color:green}+1 general checks{color}
-- For more information [see general
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/852/General_20Nightly_20Build_20Report/]
(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2)
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/852/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]
(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3)
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/852/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/852/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(/) {color:green}+1 client integration test{color}
> A random data generator tool leveraging bulk load.
> --------------------------------------------------
>
> Key: HBASE-27904
> URL: https://issues.apache.org/jira/browse/HBASE-27904
> Project: HBase
> Issue Type: New Feature
> Components: util
> Reporter: Himanshu Gwalani
> Assignee: Himanshu Gwalani
> Priority: Major
> Fix For: 2.6.0, 3.0.0-beta-1
>
>
> As of now, there is no data generator tool in HBase leveraging bulk load.
> Since bulk load skips client writes path, it's much faster to generate data
> and use of for load/performance tests where client writes are not a mandate.
> {*}Example{*}: Any tooling over HBase that need x TBs of HBase Table for load
> testing.
> {*}Requirements{*}:
> 1. Tooling should generate RANDOM data on the fly and should not require any
> pre-generated data as CSV/XML files as input.
> 2. Tooling should support pre-splited tables (number of splits to be taken as
> input).
> 3. Data should be UNIFORMLY distributed across all regions of the table.
> *High-level Steps*
> 1. A table will be created (pre-splited with number of splits as input)
> 2. The mapper of a custom Map Reduce job will generate random key-value pair
> and ensure that those are equally distributed across all regions of the table.
> 3.
> [HFileOutputFormat2|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java]
> will be used to add reducer to the MR job and create HFiles based on key
> value pairs generated by mapper.
> 4. Bulk load those HFiles to the respective regions of the table using
> [LoadIncrementalFiles|https://hbase.apache.org/2.2/devapidocs/org/apache/hadoop/hbase/tool/LoadIncrementalHFiles.html]
> *Results*
> We had POC for this tool in our organization, tested this tool with a 11
> nodes HBase cluster (having HBase + Hadoop services running). The tool
> generated:
> 1. *100* *GB* of data in *6 minutes*
> 2. *340 GB* of data in *13 minutes*
> 3. *3.5 TB* of data in *3 hours and 10 minutes*
> *Usage*
> hbase org.apache.hadoop.hbase.util.bulkdatagenerator.BulkDataGeneratorTool
> -mapper-count 100 -table TEST_TABLE_1 -rows-per-mapper 1000000 -split-count
> 100 -delete-if-exist -table-options "NORMALIZATION_ENABLED=false"
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)