[jira] [Updated] (HBASE-29013) Make PerformanceEvaluation support larger data sets

Junegunn Choi (Jira) Mon, 02 Dec 2024 19:47:07 -0800


     [ https://issues.apache.org/jira/browse/HBASE-29013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Junegunn Choi updated HBASE-29013:
----------------------------------
    Description: 
The use of 4-byte integers in PerformanceEvaluation can be limiting when you
want to test with larger data sets. For example, to generate 10TB of data with
the default value size of 1KB, you would need 10G rows.
{code:java}
bin/hbase pe --nomapred --presplit=21 --compress=LZ4 --rows=10737418240 randomWrite 1
{code}
But you can't do it because {{--rows}} expects a number that can be represented
as a signed 4-byte integer (at most 2147483647).
{noformat}
java.lang.NumberFormatException: For input string: "10737418240"
{noformat}
We can instead increase the value size and decrease the number of rows to
circumvent the limitation, but I don't see a good reason to have the limitation
in the first place.
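
To make the arithmetic above concrete, here is a minimal, self-contained sketch. The class and helper names ({{RowCountParsing}}, {{fitsInInt}}, {{parseRows}}) are made up for illustration and are not part of PerformanceEvaluation; the sketch only shows that the row count for 10TB at 1KB per value does not fit in an {{int}} but parses fine as a {{long}}.

```java
public class RowCountParsing {
    // Hypothetical helper: true if the text parses as a signed 4-byte int.
    static boolean fitsInInt(String text) {
        try {
            Integer.parseInt(text);
            return true;
        } catch (NumberFormatException e) {
            // Same exception type as reported for --rows=10737418240.
            return false;
        }
    }

    // Hypothetical helper: parse the row count as a signed 8-byte long.
    static long parseRows(String text) {
        return Long.parseLong(text);
    }

    public static void main(String[] args) {
        long targetBytes = 10L * 1024 * 1024 * 1024 * 1024; // 10TB (binary units)
        long valueSize = 1024;                              // 1KB per value
        long rows = targetBytes / valueSize;                // 10G rows

        System.out.println("rows needed:      " + rows);
        System.out.println("exceeds int max:  " + (rows > Integer.MAX_VALUE));
        System.out.println("parses as int:    " + fitsInInt(Long.toString(rows)));
        System.out.println("parsed as long:   " + parseRows(Long.toString(rows)));
    }
}
```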

And even if we use a smaller value for {{--rows}}, we can accidentally cause
integer overflow as we increase the number of clients.
{code:java}
bin/hbase pe --nomapred --compress=LZ4 --rows=1073741824 randomWrite 20
{code}
{noformat}
2024-12-03T12:21:10,333 INFO  [main {}] hbase.PerformanceEvaluation: Created 20 connections for 20 threads
2024-12-03T12:21:10,337 INFO  [TestClient-5 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO  [TestClient-1 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO  [TestClient-3 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset -1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO  [TestClient-4 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 0 for 1073741824 rows
2024-12-03T12:21:10,337 INFO  [TestClient-7 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset -1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO  [TestClient-8 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 0 for 1073741824 rows

...

2024-12-03T12:21:10,338 INFO  [TestClient-17 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO  [TestClient-16 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO  [TestClient-6 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO  [TestClient-4 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.

...

java.io.IOException: java.lang.ArithmeticException: / by zero
        at org.apache.hadoop.hbase.PerformanceEvaluation.doLocalClients(PerformanceEvaluation.java:540)
        at org.apache.hadoop.hbase.PerformanceEvaluation.runTest(PerformanceEvaluation.java:2674)
        at org.apache.hadoop.hbase.PerformanceEvaluation.run(PerformanceEvaluation.java:3216)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:97)
        at org.apache.hadoop.hbase.PerformanceEvaluation.main(PerformanceEvaluation.java:3250)
{noformat}
So I think it's best that we just use 8-byte long integers throughout the code.
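
The offsets in the log above are what 32-bit wraparound would produce if each client's start offset were computed as per-client rows times client index, and the total as per-client rows times the number of clients. That formula is an assumption inferred from the logged values, not a quote of the PerformanceEvaluation source, and the class and method names below are hypothetical. A minimal sketch comparing {{int}} and {{long}} arithmetic under that assumption:

```java
public class OffsetOverflow {
    // Assumed formula, 4-byte arithmetic: the product silently wraps around.
    static int intOffset(int perClientRows, int clientIndex) {
        return perClientRows * clientIndex;
    }

    // Same formula with 8-byte arithmetic: no wraparound at these sizes.
    static long longOffset(long perClientRows, int clientIndex) {
        return perClientRows * clientIndex;
    }

    // Assumed total row count across all clients, 4-byte arithmetic.
    static int intTotal(int perClientRows, int clients) {
        return perClientRows * clients;
    }

    // Total row count across all clients, 8-byte arithmetic.
    static long longTotal(long perClientRows, int clients) {
        return perClientRows * clients;
    }

    public static void main(String[] args) {
        int rows = 1073741824; // --rows=1073741824, i.e. 2^30
        int clients = 20;

        // Client indices that appear in the log excerpt.
        for (int i : new int[] {1, 3, 4, 5, 7, 8}) {
            System.out.println("client " + i
                + " int offset=" + intOffset(rows, i)
                + " long offset=" + longOffset(rows, i));
        }
        System.out.println("int total=" + intTotal(rows, clients)
            + " long total=" + longTotal(rows, clients));
    }
}
```

With 2^30 rows per client and 20 clients, the {{int}} total wraps to 0, which would be consistent with the "Sampling 1 every 0" lines and the division by zero, while the {{long}} total is 21474836480.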
 

  was:
The use of 4-byte integers in PerformanceEvaluation can be limiting when you
want to test with larger data sets. For example, to generate 10TB of data with
the default value size of 1KB, you would need 10G rows.
{code:java}
bin/hbase pe --nomapred --presplit=21 --compress=LZ4 --rows=10737418240 randomWrite 1
{code}
But you can't do it because {{--rows}} expects a number that can be represented
as a signed 4-byte integer (at most 2147483647).
{noformat}
java.lang.NumberFormatException: For input string: "10737418240"
{noformat}
We can instead increase the value size and decrease the number of rows to
circumvent the limitation, but I don't see a good reason to have the limitation
in the first place.

And even if we use a smaller value for {{--rows}}, we can accidentally cause
integer overflow as we increase the number of clients.
{code:java}
bin/hbase pe --nomapred --compress=LZ4 --rows=1073741824 --nomapred randomWrite 20
{code}
{noformat}
2024-12-03T12:21:10,333 INFO  [main {}] hbase.PerformanceEvaluation: Created 20 connections for 20 threads
2024-12-03T12:21:10,337 INFO  [TestClient-5 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO  [TestClient-1 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO  [TestClient-3 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset -1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO  [TestClient-4 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 0 for 1073741824 rows
2024-12-03T12:21:10,337 INFO  [TestClient-7 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset -1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO  [TestClient-8 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 0 for 1073741824 rows

...

2024-12-03T12:21:10,338 INFO  [TestClient-17 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO  [TestClient-16 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO  [TestClient-6 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO  [TestClient-4 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.

...

java.io.IOException: java.lang.ArithmeticException: / by zero
        at org.apache.hadoop.hbase.PerformanceEvaluation.doLocalClients(PerformanceEvaluation.java:540)
        at org.apache.hadoop.hbase.PerformanceEvaluation.runTest(PerformanceEvaluation.java:2674)
        at org.apache.hadoop.hbase.PerformanceEvaluation.run(PerformanceEvaluation.java:3216)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:97)
        at org.apache.hadoop.hbase.PerformanceEvaluation.main(PerformanceEvaluation.java:3250)
{noformat}
So I think it's best that we just use 8-byte long integers throughout the code.


> Make PerformanceEvaluation support larger data sets
> ---------------------------------------------------
>
>                 Key: HBASE-29013
>                 URL: https://issues.apache.org/jira/browse/HBASE-29013
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Junegunn Choi
>            Assignee: Junegunn Choi
>            Priority: Minor
>              Labels: pull-request-available
>
> The use of 4-byte integers in PerformanceEvaluation can be limiting when you
> want to test with larger data sets. For example, to generate 10TB of data with
> the default value size of 1KB, you would need 10G rows.
> {code:java}
> bin/hbase pe --nomapred --presplit=21 --compress=LZ4 --rows=10737418240 randomWrite 1
> {code}
> But you can't do it because {{--rows}} expects a number that can be
> represented as a signed 4-byte integer (at most 2147483647).
> {noformat}
> java.lang.NumberFormatException: For input string: "10737418240"
> {noformat}
> We can instead increase the value size and decrease the number of rows to
> circumvent the limitation, but I don't see a good reason to have the
> limitation in the first place.
> And even if we use a smaller value for {{--rows}}, we can accidentally
> cause integer overflow as we increase the number of clients.
> {code:java}
> bin/hbase pe --nomapred --compress=LZ4 --rows=1073741824 randomWrite 20
> {code}
> {noformat}
> 2024-12-03T12:21:10,333 INFO  [main {}] hbase.PerformanceEvaluation: Created 20 connections for 20 threads
> 2024-12-03T12:21:10,337 INFO  [TestClient-5 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 1073741824 for 1073741824 rows
> 2024-12-03T12:21:10,337 INFO  [TestClient-1 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 1073741824 for 1073741824 rows
> 2024-12-03T12:21:10,337 INFO  [TestClient-3 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset -1073741824 for 1073741824 rows
> 2024-12-03T12:21:10,337 INFO  [TestClient-4 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 0 for 1073741824 rows
> 2024-12-03T12:21:10,337 INFO  [TestClient-7 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset -1073741824 for 1073741824 rows
> 2024-12-03T12:21:10,337 INFO  [TestClient-8 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 0 for 1073741824 rows
> ...
> 2024-12-03T12:21:10,338 INFO  [TestClient-17 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
> 2024-12-03T12:21:10,338 INFO  [TestClient-16 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
> 2024-12-03T12:21:10,338 INFO  [TestClient-6 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
> 2024-12-03T12:21:10,338 INFO  [TestClient-4 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
> ...
> java.io.IOException: java.lang.ArithmeticException: / by zero
>         at org.apache.hadoop.hbase.PerformanceEvaluation.doLocalClients(PerformanceEvaluation.java:540)
>         at org.apache.hadoop.hbase.PerformanceEvaluation.runTest(PerformanceEvaluation.java:2674)
>         at org.apache.hadoop.hbase.PerformanceEvaluation.run(PerformanceEvaluation.java:3216)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:97)
>         at org.apache.hadoop.hbase.PerformanceEvaluation.main(PerformanceEvaluation.java:3250)
> {noformat}
> So I think it's best that we just use 8-byte long integers throughout the 
> code.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
