Andrew Kyle Purtell created HBASE-30062:
-------------------------------------------
Summary: Device layer simulator for MiniDFSCluster-based tests
Key: HBASE-30062
URL: https://issues.apache.org/jira/browse/HBASE-30062
Project: HBase
Issue Type: New Feature
Components: HFile, integration tests, test, wal
Reporter: Andrew Kyle Purtell
Assignee: Andrew Kyle Purtell
On EBS-backed deployments in AWS, or equivalents in other cloud infrastructure
providers, HBase compaction and replication throughput can be constrained by
per-volume IOPS limits rather than bandwidth. A faithful device-level simulator
within the test harness allows developers to reproduce, analyze, and validate
fixes for such performance issues without requiring actual cloud infrastructure.
This proposed change adds a test-only EBS device layer that operates at the
DataNode storage level within {{MiniDFSCluster}} by replacing the
{{FsDatasetSpi}} implementation via Hadoop's pluggable factory mechanism. This
allows HBase integration tests to simulate realistic cloud block storage
characteristics, such as per-volume bandwidth budgets, IOPS limits, sequential
IO coalescing, and per-IO device latency, enabling identification and
reproduction of IO bottlenecks.
The simulator wraps the real {{FsDatasetImpl}} with a reflection proxy that
intercepts the three SPI methods where DataNode local IO actually engages the
underlying block device, without compile-time coupling to internal Hadoop
classes.
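The interception pattern can be sketched with a JDK dynamic proxy. The {{DatasetSpi}} stand-in interface, the counter, and the {{wrap}} helper below are illustrative inventions, not the actual HBASE-30062 classes; only the dispatch-by-method-name idea mirrors the description:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicLong;

public class ProxySketch {
  // Stand-in for FsDatasetSpi; the real interface has many more methods.
  interface DatasetSpi {
    String getBlockInputStream(long blockId);
  }

  static final AtomicLong intercepted = new AtomicLong();

  // Wraps any SPI implementation without compile-time coupling: methods are
  // matched by name inside the InvocationHandler, as the description suggests.
  @SuppressWarnings("unchecked")
  static <T> T wrap(T delegate, Class<T> spi) {
    InvocationHandler handler = (proxy, method, args) -> {
      Object result = method.invoke(delegate, args);
      if ("getBlockInputStream".equals(method.getName())) {
        // The real proxy would wrap the returned InputStream in a throttled
        // decorator here; this sketch only counts the interception.
        intercepted.incrementAndGet();
      }
      return result;
    };
    return (T) Proxy.newProxyInstance(spi.getClassLoader(),
        new Class<?>[] { spi }, handler);
  }

  public static void main(String[] args) {
    DatasetSpi real = blockId -> "stream-" + blockId;
    DatasetSpi proxied = wrap(real, DatasetSpi.class);
    System.out.println("result=" + proxied.getBlockInputStream(42));
    System.out.println("intercepted=" + intercepted.get());
  }
}
```

Because the handler sees only {{java.lang.reflect.Method}} objects, the wrapper needs no compile-time dependency on Hadoop-internal types such as {{FsDatasetImpl}}.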
On the read path, {{getBlockInputStream}} wraps the returned {{InputStream}}
with {{ThrottledBlockInputStream}}, charging every byte against the volume's
bandwidth and IOPS budgets with sequential IO coalescing. On the write path,
{{submitBackgroundSyncFileRangeRequest}} charges {{nbytes}} against the
bandwidth and IOPS budgets, modeling the asynchronous
{{sync_file_range(SYNC_FILE_RANGE_WRITE)}} call the DataNode issues to flush
dirty pages from the operating system's page cache to the block device, and
{{finalizeBlock}} charges the remaining unflushed delta (minus bytes already
charged via {{sync_file_range}}) against the budgets, modeling the {{fsync()}}
at block finalization.
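The read-path decorator might look like the following sketch. {{TokenBucket}} and {{ThrottledInputStream}} are hypothetical stand-ins, not the patch's {{ThrottledBlockInputStream}}; IOPS accounting and single-byte {{read()}} are omitted for brevity:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ThrottleSketch {
  // Hypothetical bandwidth budget: tokens refill at ratePerSec; acquire()
  // blocks until enough tokens have accumulated.
  static final class TokenBucket {
    private final double ratePerSec;
    private double tokens;
    private long lastNanos = System.nanoTime();

    TokenBucket(double ratePerSec, double burst) {
      this.ratePerSec = ratePerSec;
      this.tokens = burst;
    }

    synchronized void acquire(double n) throws InterruptedException {
      while (true) {
        long now = System.nanoTime();
        // Refill since the last call, capped at one second of budget.
        tokens = Math.min(tokens + (now - lastNanos) / 1e9 * ratePerSec, ratePerSec);
        lastNanos = now;
        if (tokens >= n) {
          tokens -= n;
          return;
        }
        Thread.sleep(1);
      }
    }
  }

  // Charges every byte read against the bandwidth budget, in the spirit of
  // the ThrottledBlockInputStream described above.
  static final class ThrottledInputStream extends FilterInputStream {
    private final TokenBucket bwBudget;

    ThrottledInputStream(InputStream in, TokenBucket bwBudget) {
      super(in);
      this.bwBudget = bwBudget;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
      int n = in.read(b, off, len);
      if (n > 0) {
        try {
          bwBudget.acquire(n);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IOException(e);
        }
      }
      return n;
    }
  }

  public static void main(String[] args) throws Exception {
    byte[] data = new byte[64 * 1024];
    // ~1 MiB/s budget with a 32 KiB initial burst: the second half of the
    // 64 KiB read must wait on refills, so it takes at least ~31 ms.
    TokenBucket bw = new TokenBucket(1 << 20, 32 * 1024);
    long start = System.nanoTime();
    try (InputStream in = new ThrottledInputStream(new ByteArrayInputStream(data), bw)) {
      byte[] buf = new byte[8192];
      while (in.read(buf, 0, buf.length) != -1) { }
    }
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("elapsedMs=" + elapsedMs);
    System.out.println(elapsedMs >= 20 ? "THROTTLED" : "FAST");
  }
}
```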
Each proxy gets its own set of {{EBSVolumeDevice}} instances with independent
budgets. Block-to-volume resolution uses {{delegate.getVolume(block)}}, so
throttling follows real HDFS placement decisions. A single configuration
applies to all volumes, but each volume maintains its own token buckets. This
matches production, where all block devices attached to a host share the same
SKU but have independent throughput budgets, and where the host itself caps
maximum aggregate throughput.
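The budget topology can be illustrated as follows. Class and field names are hypothetical, and the per-volume token buckets are simplified here to fixed-capacity budgets; the point is that a charge must fit under both the volume's own budget and the host-wide aggregate cap:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class BudgetTopologySketch {
  static final class Budget {
    final AtomicLong remaining;
    Budget(long capacity) { remaining = new AtomicLong(capacity); }
    boolean tryCharge(long n) {
      long cur;
      do {
        cur = remaining.get();
        if (cur < n) return false;
      } while (!remaining.compareAndSet(cur, cur - n));
      return true;
    }
  }

  final long perVolumeBytes;                 // one SKU shared by every volume
  final Budget instanceBudget;               // host-level aggregate cap
  final Map<String, Budget> volumes = new ConcurrentHashMap<>();

  BudgetTopologySketch(long perVolumeBytes, long instanceBytes) {
    this.perVolumeBytes = perVolumeBytes;
    this.instanceBudget = new Budget(instanceBytes);
  }

  // Every volume lazily gets its own budget from the single shared config;
  // a charge succeeds only if both the volume and the host cap allow it.
  boolean charge(String volume, long n) {
    Budget v = volumes.computeIfAbsent(volume, k -> new Budget(perVolumeBytes));
    if (!v.tryCharge(n)) return false;
    if (!instanceBudget.tryCharge(n)) {
      v.remaining.addAndGet(n);              // refund: host cap rejected it
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    // 100 bytes per volume, 150 bytes aggregate for the host.
    BudgetTopologySketch b = new BudgetTopologySketch(100, 150);
    System.out.println("a=" + b.charge("vol1", 100)); // vol1 exhausted
    System.out.println("b=" + b.charge("vol1", 1));   // vol1 has nothing left
    System.out.println("c=" + b.charge("vol2", 100)); // vol2 ok, host cap not
    System.out.println("d=" + b.charge("vol2", 50));  // fits both budgets
  }
}
```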
EBS merges sequential IOs up to 1 MiB before counting them as a single IOPS
token. The simulator tracks read streams and write streams independently.
After each IOPS token consumption, the simulator sleeps for a configurable
duration (default 1 ms), modeling physical device service time.
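The coalescing rule can be sketched as follows. The class and the {{onIo}} hook are hypothetical; the real simulator would additionally sleep per consumed token to model device service time, as described above:

```java
public class CoalescingSketch {
  static final long MAX_IO_SIZE = 1 << 20; // 1 MiB merge limit, as in EBS

  long nextOffset = -1;   // end offset of this stream's current merged IO
  long mergedBytes = 0;   // bytes accumulated in the current merged IO

  // Returns the number of IOPS tokens this IO costs. Consecutive IOs that
  // continue at the previous end offset merge into one device IO until the
  // merged size reaches MAX_IO_SIZE, at which point one token is consumed.
  long onIo(long offset, long len) {
    long tokens = 0;
    if (offset != nextOffset) {        // not sequential: new device IO
      if (mergedBytes > 0) tokens++;   // close out the previous merged IO
      mergedBytes = 0;
    }
    mergedBytes += len;
    while (mergedBytes >= MAX_IO_SIZE) { // merged IO is full: charge a token
      mergedBytes -= MAX_IO_SIZE;
      tokens++;
    }
    nextOffset = offset + len;
    return tokens;
  }

  public static void main(String[] args) {
    CoalescingSketch s = new CoalescingSketch();
    long tokens = 0;
    for (int i = 0; i < 16; i++) {     // 16 sequential 64 KiB reads = 1 MiB
      tokens += s.onIo(i * 65536L, 65536);
    }
    System.out.println("sequential tokens=" + tokens); // one merged device IO

    CoalescingSketch s2 = new CoalescingSketch();
    s2.onIo(0, 4096);                  // leaves a 4 KiB merged IO open
    long t = s2.onIo(1 << 20, 4096);   // non-sequential: closes the open IO
    System.out.println("close-out tokens=" + t);
  }
}
```

Tracking read and write streams through separate instances of such state is what lets the simulator coalesce each direction independently.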
Some naming and concepts heavily favor Amazon's EBS, but these naming issues
can be addressed during review.
Test integration looks like:
{noformat}
Configuration conf = HBaseConfiguration.create();
// Sets dfs.datanode.fsdataset.factory so that each DataNode started by
// MiniDFSCluster wraps its real FsDatasetImpl with a throttling proxy that
// intercepts block-level IO.
EBSDevice.configure(conf, /*budgetMbps=*/500, /*budgetIops=*/500,
/*deviceLatencyUs=*/1000, /*maxIoSizeKb=*/1024, /*instanceMbps=*/1250);
HBaseTestingUtility util = new HBaseTestingUtility(conf);
util.startMiniZKCluster();
MiniDFSCluster dfsCluster = new MiniDFSCluster.Builder(conf)
.numDataNodes(1)
.storagesPerDatanode(6)
.build();
dfsCluster.waitClusterUp();
util.setDFSCluster(dfsCluster);
util.startMiniCluster(1);
// ... run workload ...
long bytesRead = EBSDevice.getTotalBytesRead();
long deviceIops = EBSDevice.getDeviceReadOps();
String perVolume = EBSDevice.getPerVolumeStats();
EBSDevice.shutdown();
{noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)