[jira] [Updated] (HBASE-29665) Bidirectional bulkload replication causes excessive network traffic

Jaehui Lee (Jira) Mon, 20 Oct 2025 18:46:50 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-29665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jaehui Lee updated HBASE-29665:
-------------------------------
    Description: 
h2. Problem

When performing a bulkload on one of two clusters configured with bidirectional 
replication, the cluster executing the bulkload experiences unexpectedly high 
network usage.
h2. Root Cause

HBASE-22380 prevented circular bulkload replication by having 
{{SecureBulkloadManager}} check whether the current clusterId already exists in 
{{clusterIds}}. If it does, the manager assumes the bulkload has already been 
replicated to this cluster and stops further processing.

However, {{SecureBulkloadManager}} is invoked by {{HFileReplicator}}, 
which copies the target HFiles to a staging directory in the local HDFS 
_before_ that check runs. Whenever the check rejects the bulkload, the copy 
has already been made, causing unnecessary network and disk usage.
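
The ordering problem can be illustrated with a small toy model. This is only a sketch and not HBase code: {{SinkOrderingSketch}}, {{copyThenCheck}}, {{alreadyReplicatedHere}} and the byte counter are hypothetical names chosen for illustration, and the HFile copy is reduced to incrementing a counter.

```java
import java.util.List;

/** Toy model of the sink-side ordering described above. Not HBase code. */
final class SinkOrderingSketch {
    /** Bytes written to the simulated staging directory. */
    static long bytesCopied = 0;

    /** Hypothetical stand-in for the circular-replication check. */
    static boolean alreadyReplicatedHere(String localClusterId, List<String> clusterIds) {
        return clusterIds.contains(localClusterId);
    }

    /** Copy first, check second. Returns true if the bulkload would be applied. */
    static boolean copyThenCheck(String localClusterId, List<String> clusterIds, long hfileBytes) {
        bytesCopied += hfileBytes; // the staging copy happens unconditionally
        return !alreadyReplicatedHere(localClusterId, clusterIds);
    }

    public static void main(String[] args) {
        // Cluster A ran the bulkload; B replicated it and now sends it back to A.
        boolean applied = copyThenCheck("A", List.of("A", "B"), 1_000_000L);
        System.out.println("applied=" + applied + ", bytesCopied=" + bytesCopied);
    }
}
```

In this model the bulkload is rejected on Cluster A, yet the full file size has already been counted as copied, which mirrors the wasted traffic described above.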
h2. Solution

Unlike the {{clusterIds}} used in regular mutation replication (which are 
included in {{WALKey}}), the {{clusterIds}} for bulkload replication are 
carried in a separate class called {{BulkloadDescriptor}}. As a result, 
bulkload entries are not filtered by {{ClusterMarkingEntryFilter}}, and the 
circular-replication check only runs after the bulkload request is received.

The solution is to include {{clusterIds}} in the {{WALKey}} for bulkload 
operations, just like regular mutations. This allows filtering to occur before 
the bulkload request is processed, preventing unnecessary file copying.
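
The effect of moving the ids into the key can be sketched as follows. This is a simplified model under stated assumptions, not the actual patch or the real {{ClusterMarkingEntryFilter}}: {{KeyFilterSketch}}, {{Entry}} and {{entriesToShip}} are hypothetical names, and an entry is reduced to the set of cluster ids in its key plus a label.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy model of key-based filtering on the source side. Not HBase code. */
final class KeyFilterSketch {
    /** Hypothetical, simplified WAL entry: cluster ids in the key, plus a label. */
    static final class Entry {
        final Set<String> keyClusterIds;
        final String label;

        Entry(Set<String> keyClusterIds, String label) {
            this.keyClusterIds = keyClusterIds;
            this.label = label;
        }
    }

    /** Keep only entries whose key does not already list the peer cluster. */
    static List<Entry> entriesToShip(String peerClusterId, List<Entry> entries) {
        List<Entry> out = new ArrayList<>();
        for (Entry e : entries) {
            if (!e.keyClusterIds.contains(peerClusterId)) {
                out.add(e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // On cluster B, about to ship to peer A. All entries originated on A.
        Entry put = new Entry(Set.of("A"), "put");
        Entry bulkloadBefore = new Entry(Set.of(), "bulkload, ids only in descriptor");
        Entry bulkloadAfter = new Entry(Set.of("A"), "bulkload, ids also in key");
        for (Entry e : entriesToShip("A", List.of(put, bulkloadBefore, bulkloadAfter))) {
            System.out.println("shipped: " + e.label);
        }
    }
}
```

Only the entry whose key carries no cluster ids passes the filter in this model. Once the key lists Cluster A, the entry is dropped before anything is sent, so the sink never starts copying files.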
h2. Test

Setup
 * Two clusters (Cluster A and Cluster B) running HBase 2.6.3
 * HBase and HDFS run on separate clusters (compute-storage separation 
architecture)
 * Bulkload replication and bidirectional replication enabled
 * Bulkload executed on Cluster A only

!image-2025-10-16-21-59-13-156.png|width=682,height=580!

Since the bulkload is executed only on Cluster A, resource usage should be 
identical between scenarios 1 and 2. However, as shown in the metrics above, 
scenario 1 consumes significantly more resources. This is due to the 
unnecessary copying of HFiles to the staging directory, as explained in the 
root cause section.

After applying the patch, scenario 3 shows resource usage identical to scenario 
2, confirming that the unnecessary file copying has been eliminated.

  was:
h2. Problem

When performing a bulkload on one of two clusters configured with bidirectional 
replication, the cluster executing the bulkload experiences unexpectedly high 
network usage.
h2. Root Cause

HBASE-22380 prevented circle bulkload replication by having 
{{SecureBulkloadManager}} check if the current clusterId already exists in 
{{{}clusterIds{}}}. If present, it assumes replication has already occurred and 
stops further processing.

However, {{SecureBulkloadManager}} is invoked by the {{LoadIncrementalHFiles}} 
tool, which copies the target HFiles to a staging directory in the local HDFS 
_before_ checking whether replication should proceed. This premature copying 
causes unnecessary network and disk usage.
h2. Solution

Unlike {{clusterIds}} used in regular mutation replication (which are included 
in {{{}WALKey{}}}), the {{clusterIds}} for bulkload replication are managed in 
a separate class called {{{}BulkloadDescriptor{}}}. As a result, they are not 
filtered by {{{}ClusterMarkingEntryFilter{}}}, and filtering logic only runs 
after the bulkload request is received.

The solution is to include {{clusterIds}} in the {{WALKey}} for bulkload 
operations, just like regular mutations. This allows filtering to occur before 
the bulkload request is processed, preventing unnecessary file copying.
h2. Test

Setup
 * Two clusters (Cluster A and Cluster B) running HBase 2.6.3
 * HBase and HDFS clusters are separated (compute-storage separation 
architecture)
 * Bulkload replication and bidirectional replication enabled
 * Bulkload executed on Cluster A only

!image-2025-10-16-21-59-13-156.png|width=682,height=580!

Since the bulkload is executed only on Cluster A, resource usage should be 
identical between scenarios 1 and 2. However, as shown in the metrics above, 
scenario 1 consumes significantly more resources. This is due to the 
unnecessary copying of HFiles to the staging directory, as explained in the 
root cause section.

After applying the patch, scenario 3 shows resource usage identical to scenario 
2, confirming that the unnecessary file copying has been eliminated.


> Bidirectional bulkload replication causes excessive network traffic
> -------------------------------------------------------------------
>
>                 Key: HBASE-29665
>                 URL: https://issues.apache.org/jira/browse/HBASE-29665
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.6.3, 2.5.12
>            Reporter: Jaehui Lee
>            Assignee: Jaehui Lee
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2025-10-16-21-59-13-156.png
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)