[
https://issues.apache.org/jira/browse/HBASE-29665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jaehui Lee updated HBASE-29665:
-------------------------------
Description:
h2. Problem
When performing a bulkload on one of two clusters configured with bidirectional
replication, the cluster executing the bulkload experiences unexpectedly high
network usage.
h2. Root Cause
HBASE-22380 prevented circular bulkload replication by having
{{SecureBulkloadManager}} check whether the current clusterId already exists in
{{clusterIds}}. If it does, the manager assumes replication has already occurred
and stops further processing.
However, {{SecureBulkloadManager}} is invoked by the {{HFileReplicator}},
which copies the target HFiles to a staging directory in the local HDFS
_before_ checking whether replication should proceed. This premature copying
causes unnecessary network and disk usage.
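The cost of that ordering can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: {{CopyBeforeCheckSketch}}, {{BulkLoad}}, {{copyThenCheck}} and {{checkThenCopy}} are hypothetical names invented for illustration, cluster IDs are plain strings, and none of this is HBase code. It only counts how many bytes would be staged when the clusterId check runs after the copy versus before it.

```java
import java.util.List;
import java.util.Set;

/** Illustrative model only; these are not HBase classes. */
final class CopyBeforeCheckSketch {

  /** Hypothetical stand-in for a replicated bulkload request. */
  static final class BulkLoad {
    final long hfileBytes;          // total size of the HFiles to stage
    final Set<String> clusterIds;   // clusters that have already applied this bulkload

    BulkLoad(long hfileBytes, Set<String> clusterIds) {
      this.hfileBytes = hfileBytes;
      this.clusterIds = clusterIds;
    }
  }

  /** Ordering described above: stage the files first, check clusterIds afterwards. */
  static long copyThenCheck(List<BulkLoad> incoming, String localClusterId) {
    long copiedBytes = 0;
    for (BulkLoad b : incoming) {
      copiedBytes += b.hfileBytes;                 // copy happens unconditionally
      if (b.clusterIds.contains(localClusterId)) {
        continue;                                  // rejected; the copy was wasted
      }
      // the actual load would proceed here
    }
    return copiedBytes;
  }

  /** Alternative ordering: check clusterIds first, copy only what will be loaded. */
  static long checkThenCopy(List<BulkLoad> incoming, String localClusterId) {
    long copiedBytes = 0;
    for (BulkLoad b : incoming) {
      if (b.clusterIds.contains(localClusterId)) {
        continue;                                  // skipped before any copy
      }
      copiedBytes += b.hfileBytes;
    }
    return copiedBytes;
  }

  public static void main(String[] args) {
    List<BulkLoad> incoming = List.of(
        new BulkLoad(100L, Set.of("A", "B")),      // originated on A, echoed back by B
        new BulkLoad(50L, Set.of("B")));           // originated on B
    System.out.println("copy-then-check bytes: " + copyThenCheck(incoming, "A"));
    System.out.println("check-then-copy bytes: " + checkThenCopy(incoming, "A"));
  }
}
```

In this model both orderings load the same data; they differ only in how much is copied for requests that end up rejected.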
h2. Solution
Unlike {{clusterIds}} used in regular mutation replication (which are included
in {{WALKey}}), the {{clusterIds}} for bulkload replication are managed in
a separate class called {{BulkloadDescriptor}}. As a result, they are not
filtered by {{ClusterMarkingEntryFilter}}, and filtering logic only runs
after the bulkload request is received.
The solution is to include {{clusterIds}} in the {{WALKey}} for bulkload
operations, just like regular mutations. This allows filtering to occur before
the bulkload request is processed, preventing unnecessary file copying.
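The idea of filtering on the source side can be sketched as follows. This is a simplified model, not the actual patch or the real {{ClusterMarkingEntryFilter}}: {{ClusterIdFilterSketch}}, {{Entry}} and {{filterForPeer}} are hypothetical names, and the sketch assumes each entry carries the set of cluster IDs that have already applied it. An entry is dropped before shipping if the peer's cluster ID is already in that set, so the peer never receives a request it would later reject.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.UUID;

/** Illustrative model only; these are not HBase classes. */
final class ClusterIdFilterSketch {

  /** Hypothetical stand-in for a WAL entry, reduced to what the sketch needs. */
  static final class Entry {
    final String description;
    final Set<UUID> clusterIds;   // clusters that have already applied this entry

    Entry(String description, Set<UUID> clusterIds) {
      this.description = description;
      this.clusterIds = clusterIds;
    }
  }

  /** Returns only the entries that the given peer has not applied yet. */
  static List<Entry> filterForPeer(List<Entry> entries, UUID peerClusterId) {
    List<Entry> toShip = new ArrayList<>();
    for (Entry e : entries) {
      if (!e.clusterIds.contains(peerClusterId)) {
        toShip.add(e);            // peer has not seen it; replicate
      }
      // otherwise drop it here, before any request reaches the peer
    }
    return toShip;
  }

  public static void main(String[] args) {
    UUID clusterA = new UUID(0L, 1L);
    UUID clusterB = new UUID(0L, 2L);

    // On Cluster B: one bulkload that arrived from A, one that started on B.
    Entry fromA = new Entry("bulkload replicated from A", Set.of(clusterA, clusterB));
    Entry localToB = new Entry("bulkload executed on B", Set.of(clusterB));

    for (Entry e : filterForPeer(List.of(fromA, localToB), clusterA)) {
      System.out.println("ship to A: " + e.description);
    }
  }
}
```

With the cluster IDs available at this stage, the bulkload that originated on Cluster A is never sent back to it, so Cluster A has nothing to copy into its staging directory.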
h2. Test
Setup
* Two clusters (Cluster A and Cluster B) running HBase 2.6.3
* HBase and HDFS clusters are separated (compute-storage separation
architecture)
* Bulkload replication and bidirectional replication enabled
* Bulkload executed on Cluster A only
!image-2025-10-16-21-59-13-156.png|width=682,height=580!
Since the bulkload is executed only on Cluster A, resource usage should be
identical between scenarios 1 and 2. However, as shown in the metrics above,
scenario 1 consumes significantly more resources. This is due to the
unnecessary copying of HFiles to the staging directory, as explained in the
root cause section.
After applying the patch, scenario 3 shows resource usage identical to scenario
2, confirming that the unnecessary file copying has been eliminated.
was:
h2. Problem
When performing a bulkload on one of two clusters configured with bidirectional
replication, the cluster executing the bulkload experiences unexpectedly high
network usage.
h2. Root Cause
HBASE-22380 prevented circular bulkload replication by having
{{SecureBulkloadManager}} check whether the current clusterId already exists in
{{clusterIds}}. If it does, the manager assumes replication has already occurred
and stops further processing.
However, {{SecureBulkloadManager}} is invoked by the {{LoadIncrementalHFiles}}
tool, which copies the target HFiles to a staging directory in the local HDFS
_before_ checking whether replication should proceed. This premature copying
causes unnecessary network and disk usage.
h2. Solution
Unlike {{clusterIds}} used in regular mutation replication (which are included
in {{WALKey}}), the {{clusterIds}} for bulkload replication are managed in
a separate class called {{BulkloadDescriptor}}. As a result, they are not
filtered by {{ClusterMarkingEntryFilter}}, and filtering logic only runs
after the bulkload request is received.
The solution is to include {{clusterIds}} in the {{WALKey}} for bulkload
operations, just like regular mutations. This allows filtering to occur before
the bulkload request is processed, preventing unnecessary file copying.
h2. Test
Setup
* Two clusters (Cluster A and Cluster B) running HBase 2.6.3
* HBase and HDFS clusters are separated (compute-storage separation
architecture)
* Bulkload replication and bidirectional replication enabled
* Bulkload executed on Cluster A only
!image-2025-10-16-21-59-13-156.png|width=682,height=580!
Since the bulkload is executed only on Cluster A, resource usage should be
identical between scenarios 1 and 2. However, as shown in the metrics above,
scenario 1 consumes significantly more resources. This is due to the
unnecessary copying of HFiles to the staging directory, as explained in the
root cause section.
After applying the patch, scenario 3 shows resource usage identical to scenario
2, confirming that the unnecessary file copying has been eliminated.
> Bidirectional bulkload replication causes excessive network traffic
> -------------------------------------------------------------------
>
> Key: HBASE-29665
> URL: https://issues.apache.org/jira/browse/HBASE-29665
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 2.6.3, 2.5.12
> Reporter: Jaehui Lee
> Assignee: Jaehui Lee
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2025-10-16-21-59-13-156.png