[ https://issues.apache.org/jira/browse/HBASE-29013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Junegunn Choi updated HBASE-29013:
----------------------------------
Description:
The use of 4-byte integers in PerformanceEvaluation can be limiting when you
want to test with larger data sets. For example, generating 10TB of data with
the default value size of 1KB requires 10G (10 * 2^30 = 10737418240) rows.
{code:java}
bin/hbase pe --nomapred --presplit=21 --compress=LZ4 --rows=10737418240 randomWrite 1
{code}
But you can't do it because {{--rows}} expects a number that can be represented
with 4 bytes.
{noformat}
java.lang.NumberFormatException: For input string: "10737418240"
{noformat}
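As a minimal illustration of the failure above (a sketch, not the actual PerformanceEvaluation option parser; the class and method names are hypothetical), parsing the same string as an int fails while parsing it as a long succeeds:

```java
class RowCountParsing {
    // Hypothetical helper: reports whether a --rows value fits in a 4-byte int.
    static boolean fitsInInt(String value) {
        try {
            Integer.parseInt(value);
            return true;
        } catch (NumberFormatException e) {
            // Integer.parseInt rejects anything above Integer.MAX_VALUE (2147483647).
            return false;
        }
    }

    // Hypothetical helper: parses a --rows value as an 8-byte long.
    static long parseRows(String value) {
        return Long.parseLong(value);
    }

    public static void main(String[] args) {
        String rows = "10737418240"; // 10 * 2^30 rows, i.e. 10TB at 1KB per value
        System.out.println("fits in int: " + fitsInInt(rows));
        System.out.println("parsed as long: " + parseRows(rows));
    }
}
```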
We can instead increase the value size and decrease the number of rows to
circumvent the limitation, but I don't see a good reason to have the limitation
in the first place.
And even if we use a smaller value for {{--rows}}, we can accidentally cause
integer overflow as we increase the number of clients.
{code:java}
bin/hbase pe --nomapred --compress=LZ4 --rows=1073741824 randomWrite 20
{code}
{noformat}
2024-12-03T12:21:10,333 INFO [main {}] hbase.PerformanceEvaluation: Created 20 connections for 20 threads
2024-12-03T12:21:10,337 INFO [TestClient-5 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-1 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-3 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset -1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-4 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 0 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-7 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset -1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-8 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 0 for 1073741824 rows
...
2024-12-03T12:21:10,338 INFO [TestClient-17 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO [TestClient-16 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO [TestClient-6 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO [TestClient-4 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
...
java.io.IOException: java.lang.ArithmeticException: / by zero
    at org.apache.hadoop.hbase.PerformanceEvaluation.doLocalClients(PerformanceEvaluation.java:540)
    at org.apache.hadoop.hbase.PerformanceEvaluation.runTest(PerformanceEvaluation.java:2674)
    at org.apache.hadoop.hbase.PerformanceEvaluation.run(PerformanceEvaluation.java:3216)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:97)
    at org.apache.hadoop.hbase.PerformanceEvaluation.main(PerformanceEvaluation.java:3250)
{noformat}
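The offsets in the log are what you would get if each client's start offset were computed as client index times rows per client in 4-byte int arithmetic. That formula is an assumption made for illustration, not a quote from the PerformanceEvaluation source, and the class and method names below are hypothetical. A rough sketch of the wraparound, next to the same arithmetic done with longs:

```java
class OffsetOverflow {
    // Assumed formula: start offset = client index * rows per client, in int arithmetic.
    // Java int multiplication silently wraps around on overflow.
    static int intOffset(int clientIndex, int rowsPerClient) {
        return clientIndex * rowsPerClient;
    }

    // Same formula with a long operand, so the product is computed in 8 bytes.
    static long longOffset(int clientIndex, long rowsPerClient) {
        return clientIndex * rowsPerClient;
    }

    // Assumed formula: total rows = number of clients * rows per client, in int arithmetic.
    static int intTotalRows(int clients, int rowsPerClient) {
        return clients * rowsPerClient;
    }

    public static void main(String[] args) {
        int rows = 1073741824; // 2^30, the --rows value from the command above
        for (int client : new int[] {1, 3, 4, 5, 7, 8}) {
            System.out.println("client " + client
                + ": int offset " + intOffset(client, rows)
                + ", long offset " + longOffset(client, rows));
        }
        System.out.println("int total for 20 clients: " + intTotalRows(20, rows));
    }
}
```

Under this assumption the int offsets for clients 1, 3, 4, 5, 7 and 8 match the logged values, and the int total for 20 clients wraps to 0, which would be consistent with the "every 0" sampling interval and the division by zero, though the exact code path is not traced here.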
So I think it's best that we just use 8-byte long integers throughout the code.
was:
The use of 4-byte integers in PerformanceEvaluation can be limiting when you
want to test with larger data sets. For example, generating 10TB of data with
the default value size of 1KB requires 10G (10 * 2^30 = 10737418240) rows.
{code:java}
bin/hbase pe --nomapred --presplit=21 --compress=LZ4 --rows=10737418240 randomWrite 1
{code}
But you can't do it because {{--rows}} expects a number that can be represented
with 4 bytes.
{noformat}
java.lang.NumberFormatException: For input string: "10737418240"
{noformat}
We can instead increase the value size and decrease the number of rows to
circumvent the limitation, but I don't see a good reason to have the limitation
in the first place.
And even if we use a smaller value for {{--rows}}, we can accidentally cause
integer overflow as we increase the number of clients.
{code:java}
bin/hbase pe --nomapred --compress=LZ4 --rows=1073741824 --nomapred randomWrite 20
{code}
{noformat}
2024-12-03T12:21:10,333 INFO [main {}] hbase.PerformanceEvaluation: Created 20 connections for 20 threads
2024-12-03T12:21:10,337 INFO [TestClient-5 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-1 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-3 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset -1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-4 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 0 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-7 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset -1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-8 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 0 for 1073741824 rows
...
2024-12-03T12:21:10,338 INFO [TestClient-17 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO [TestClient-16 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO [TestClient-6 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO [TestClient-4 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
...
java.io.IOException: java.lang.ArithmeticException: / by zero
    at org.apache.hadoop.hbase.PerformanceEvaluation.doLocalClients(PerformanceEvaluation.java:540)
    at org.apache.hadoop.hbase.PerformanceEvaluation.runTest(PerformanceEvaluation.java:2674)
    at org.apache.hadoop.hbase.PerformanceEvaluation.run(PerformanceEvaluation.java:3216)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:97)
    at org.apache.hadoop.hbase.PerformanceEvaluation.main(PerformanceEvaluation.java:3250)
{noformat}
So I think it's best that we just use 8-byte long integers throughout the code.
> Make PerformanceEvaluation support larger data sets
> ---------------------------------------------------
>
> Key: HBASE-29013
> URL: https://issues.apache.org/jira/browse/HBASE-29013
> Project: HBase
> Issue Type: Improvement
> Reporter: Junegunn Choi
> Assignee: Junegunn Choi
> Priority: Minor
> Labels: pull-request-available
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)