Junegunn Choi created HBASE-29013:
-------------------------------------

             Summary: Make PerformanceEvaluation support larger data sets
                 Key: HBASE-29013
                 URL: https://issues.apache.org/jira/browse/HBASE-29013
             Project: HBase
          Issue Type: Improvement
            Reporter: Junegunn Choi
The use of 4-byte integers in PerformanceEvaluation can be limiting when you want to test with larger data sets. Suppose you want to generate 10TB of data with the default value size of 1KB; you would need 10G rows.

{code:java}
bin/hbase pe --nomapred --presplit=21 --compress=LZ4 --rows=10737418240 randomWrite 1
{code}

But you can't, because {{--rows}} expects a number that can be represented in 4 bytes.

{noformat}
java.lang.NumberFormatException: For input string: "10737418240"
{noformat}

We could instead increase the value size and decrease the number of rows to circumvent the limitation, but I don't see a good reason to have the limitation in the first place. And even if we use a smaller value for {{--rows}}, we can accidentally cause integer overflow as we increase the number of clients.

{code:java}
bin/hbase pe --nomapred --compress=LZ4 --rows=1073741824 randomWrite 20
{code}

{noformat}
2024-12-03T12:21:10,333 INFO [main {}] hbase.PerformanceEvaluation: Created 20 connections for 20 threads
2024-12-03T12:21:10,337 INFO [TestClient-5 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-1 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-3 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset -1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-4 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 0 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-7 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset -1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-8 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 0 for 1073741824 rows
...
2024-12-03T12:21:10,338 INFO [TestClient-17 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO [TestClient-16 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO [TestClient-6 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO [TestClient-4 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
...
java.io.IOException: java.lang.ArithmeticException: / by zero
  at org.apache.hadoop.hbase.PerformanceEvaluation.doLocalClients(PerformanceEvaluation.java:540)
  at org.apache.hadoop.hbase.PerformanceEvaluation.runTest(PerformanceEvaluation.java:2674)
  at org.apache.hadoop.hbase.PerformanceEvaluation.run(PerformanceEvaluation.java:3216)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:97)
  at org.apache.hadoop.hbase.PerformanceEvaluation.main(PerformanceEvaluation.java:3250)
{noformat}

So I think it's best that we just use 8-byte long integers throughout the code.
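For reference, the offset pattern in the log above is what you get when a client index is multiplied by the per-client row count in {{int}} arithmetic. Below is a minimal, self-contained sketch of the two symptoms; the class and variable names are hypothetical and not taken from PerformanceEvaluation.

{code:java}
// Hypothetical sketch, not the actual PerformanceEvaluation code.
public class LongRowsSketch {
  public static void main(String[] args) {
    // 1. Parsing: 10737418240 does not fit in an int, so Integer.parseInt
    //    throws NumberFormatException; Long.parseLong handles it fine.
    long rows = Long.parseLong("10737418240");
    System.out.println("rows = " + rows);

    // 2. Offsets: clientIndex * rowsPerClient wraps around in int arithmetic,
    //    producing the 1073741824 / -1073741824 / 0 pattern seen in the log.
    int rowsPerClientInt = 1073741824;     // 2^30, as in the example above
    long rowsPerClientLong = 1073741824L;
    for (int client = 0; client < 8; client++) {
      int badOffset = client * rowsPerClientInt;      // overflows for client >= 2
      long goodOffset = client * rowsPerClientLong;   // widened to long before multiplying
      System.out.printf("client %d: int offset = %d, long offset = %d%n",
          client, badOffset, goodOffset);
    }
  }
}
{code}

Parsing the {{--rows}} value with {{Long.parseLong}} instead of {{Integer.parseInt}} would likewise lift the {{NumberFormatException}} shown above, and keeping the row-related arithmetic in {{long}} end to end should avoid the wrap-around (and, presumably, the degenerate sampling divisor) without changing behaviour for small data sets.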