[
https://issues.apache.org/jira/browse/HBASE-27904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Himanshu Gwalani updated HBASE-27904:
-------------------------------------
Description:
As of now, there is no data generator tool in HBase that leverages bulk load. Since
bulk load skips the client write path, it is much faster to generate data and use
it for load/performance tests where client writes are not a mandate.
{*}Example{*}: Any tooling over HBase that needs x TBs of an HBase table for load
testing.
{*}Requirements{*}:
1. Tooling should generate RANDOM data on the fly and should not require any
pre-generated data such as CSV/XML files.
2. Tooling should support pre-split tables (number of splits to be taken as
input).
3. Data should be UNIFORMLY distributed across all regions of the table.
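The uniform-distribution requirement can be sketched with a small stdlib-only Java snippet. The two-hex-character key prefix, the split-point scheme, and the class/method names below are illustrative assumptions, not the tool's actual design; the point is only that a uniformly random fixed-width prefix, with split points placed at equal intervals of that prefix space, spreads rows evenly across all regions:

```java
import java.util.*;

public class UniformKeys {
    // Hypothetical sketch: pre-split the table on fixed-width hex prefixes
    // so each of numSplits regions owns an equal slice of the 0..255
    // first-byte space; a uniformly random prefix then lands each row in a
    // uniformly random region.
    static byte[][] splitPoints(int numSplits) {
        byte[][] splits = new byte[numSplits - 1][];
        for (int i = 1; i < numSplits; i++) {
            // boundary i starts the i-th equal slice of the prefix space
            splits[i - 1] = String.format("%02x", i * 256 / numSplits).getBytes();
        }
        return splits;
    }

    static String randomRowKey(Random rnd) {
        // uniform two-hex-char prefix + random suffix (assumed key format)
        return String.format("%02x-%016x", rnd.nextInt(256), rnd.nextLong());
    }

    static int regionFor(String rowKey, byte[][] splits) {
        // same ordering HBase uses: lexicographic comparison of row keys;
        // the region index is the number of split points <= the row key
        int region = 0;
        for (byte[] s : splits) {
            if (rowKey.compareTo(new String(s)) >= 0) region++;
        }
        return region;
    }
}
```

With lowercase hex, lexicographic order of the two-character prefixes matches their numeric order, so the per-region row counts converge to rows/numSplits.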
*High-level Steps*
1. Generate HFiles with random data (using custom Mapper and
[HFileOutputFormat2|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java])
2. Bulk load those HFiles into the respective regions of the table using
[LoadIncrementalHFiles|https://hbase.apache.org/2.2/devapidocs/org/apache/hadoop/hbase/tool/LoadIncrementalHFiles.html]
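The two steps above hinge on a property of bulk load: each generated file must hold a single region's rows in sorted order, so that the load step can hand each file to exactly one region. A minimal stdlib-only simulation of that grouping and sorting follows; the class name and key format are hypothetical stand-ins for what the custom Mapper plus HFileOutputFormat2 would produce:

```java
import java.util.*;

public class HFileSimulation {
    // Hypothetical sketch of the invariant the MapReduce job must produce:
    // one bucket per region, each holding only that region's rows, sorted
    // lexicographically (HFileOutputFormat2 gets this via a total-order
    // sort; we emulate it with an explicit partition + sort).
    static Map<Integer, List<String>> generate(int rows, int regions, long seed) {
        Random rnd = new Random(seed);
        Map<Integer, List<String>> perRegion = new TreeMap<>();
        for (int i = 0; i < rows; i++) {
            // assumed key format: uniform hex prefix + random suffix
            String key = String.format("%02x-%016x", rnd.nextInt(256), rnd.nextLong());
            // region index from the prefix, with equal-width split ranges
            int region = (Integer.parseInt(key.substring(0, 2), 16) * regions) / 256;
            perRegion.computeIfAbsent(region, r -> new ArrayList<>()).add(key);
        }
        // sort each region's rows, as each HFile must be internally sorted
        perRegion.values().forEach(Collections::sort);
        return perRegion;
    }
}
```

Because every bucket is pre-sorted and region-local, the load step reduces to moving files into place rather than replaying writes, which is where the speedup over the client write path comes from.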
*Results*
We built a POC of this tool in our organization and tested it in a test
environment: an 11-node cluster running HBase and Hadoop services. The tool
generated:
1. *100 GB* of data in *6 minutes*
2. *340 GB* of data in *13 minutes*
3. *3.5 TB* of data in *3 hours 10 minutes*
was:
As of now, there is no data generator tool in HBase that leverages bulk load. Since
bulk load skips the client write path, it is much faster to generate data and use
it for load/performance tests where client writes are not a mandate.
{*}Example{*}: Any tooling over HBase that needs x TBs of an HBase table for load
testing.
{*}Requirements{*}:
1. Tooling should support pre-split tables (number of splits to be taken as
input).
2. Data should be UNIFORMLY distributed across all regions of the table.
*High-level Steps*
1. Generate HFiles with random data (using custom Mapper and
[HFileOutputFormat2|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java])
2. Bulk load those HFiles into the respective regions of the table using
[LoadIncrementalHFiles|https://hbase.apache.org/2.2/devapidocs/org/apache/hadoop/hbase/tool/LoadIncrementalHFiles.html]
> A random data generator tool leveraging bulk load.
> --------------------------------------------------
>
> Key: HBASE-27904
> URL: https://issues.apache.org/jira/browse/HBASE-27904
> Project: HBase
> Issue Type: New Feature
> Components: util
> Reporter: Himanshu Gwalani
> Priority: Minor
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)