[ https://issues.apache.org/jira/browse/HBASE-30058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
JeongMin Ju reassigned HBASE-30058:
-----------------------------------
Assignee: JeongMin Ju
Description:
In a Kerberos-secured HBase cluster, each snapshot operation triggers two
unnecessary {{ConnectionFactory.createConnection(conf)}} calls in
{{SnapshotDescriptionUtils.validate()}}, which is invoked from
{{MasterRpcServices.snapshot()}}. These short-lived connections are created and
immediately closed, but each creation involves establishing a new ZooKeeper
session with GSSAPI authentication, resulting in KDC requests for service
tickets.
When batch snapshot jobs process many tables in a short period, this generates
a large volume of KDC requests. The KDC may interpret this traffic as a
brute-force or DDoS attack and block the HBase Master's IP. Once blocked, the
Master can no longer authenticate any Kerberos operations, effectively
rendering it non-functional and eventually causing it to fail.
h3. Root Cause
{{SnapshotDescriptionUtils.validate()}} calls:
1. {{isSecurityAvailable(conf)}} — creates a full {{Connection}} + {{Admin}}
just to check if the {{hbase:acl}} table exists
2. {{writeAclToSnapshotDescription()}} — calls
{{PermissionStorage.getTablePermissions(conf, tableName)}} which calls
{{getPermissions()}} with {{Table t = null}}, creating another {{Connection}}
to read from {{hbase:acl}}
Each {{ConnectionFactory.createConnection(conf)}} with the default
{{ZKConnectionRegistry}} creates a new {{ReadOnlyZKClient}}, which establishes
a ZK session with GSSAPI (Kerberos) SASL authentication. Since each connection
gets a new JAAS {{LoginContext}} with a new {{Subject}} (in
{{org.apache.zookeeper.Login}}), service tickets are not cached across
connections, and every connection triggers a TGS request to the KDC.
{{isSecurityAvailable()}} is also called from {{RestoreSnapshotProcedure}} and
{{CloneSnapshotProcedure}}, so the same issue affects snapshot restore/clone
operations.
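The ticket-caching behaviour described above can be illustrated with a small standalone simulation. This is only a sketch under stated assumptions: {{FakeKdc}} and {{FakeSubject}} are hypothetical stand-ins (they are not HBase, ZooKeeper, or JAAS classes), and the service name is made up. It contrasts a fresh {{Subject}} per connection with a single shared {{Subject}} and counts the resulting TGS requests.

```java
import java.util.HashMap;
import java.util.Map;

public class TicketCacheSketch {

    /** Hypothetical stand-in for a KDC; it only counts TGS requests. */
    static final class FakeKdc {
        int tgsRequests = 0;

        String issueServiceTicket(String service) {
            tgsRequests++;
            return "ticket-for-" + service + "-" + tgsRequests;
        }
    }

    /** Hypothetical stand-in for a JAAS Subject holding cached service tickets. */
    static final class FakeSubject {
        private final Map<String, String> tickets = new HashMap<>();

        String ticketFor(String service, FakeKdc kdc) {
            // The KDC is contacted only when this Subject has no cached ticket.
            return tickets.computeIfAbsent(service, kdc::issueServiceTicket);
        }
    }

    /** Every connection gets a new Subject, so its ticket cache starts empty. */
    static int connectWithFreshSubjects(int connections) {
        FakeKdc kdc = new FakeKdc();
        for (int i = 0; i < connections; i++) {
            new FakeSubject().ticketFor("service/host", kdc);
        }
        return kdc.tgsRequests;
    }

    /** All connections run under one shared Subject, so the ticket is reused. */
    static int connectWithSharedSubject(int connections) {
        FakeKdc kdc = new FakeKdc();
        FakeSubject shared = new FakeSubject();
        for (int i = 0; i < connections; i++) {
            shared.ticketFor("service/host", kdc);
        }
        return kdc.tgsRequests;
    }

    public static void main(String[] args) {
        // Assumed workload: 100 snapshots, two short-lived connections each.
        System.out.println("fresh subjects: " + connectWithFreshSubjects(200));
        System.out.println("shared subject: " + connectWithSharedSubject(200));
    }
}
```

In this model the number of KDC requests grows linearly with the number of connections when each one has its own {{Subject}}, and stays constant when the {{Subject}} is shared.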
h3. Workaround
Setting
{{hbase.client.registry.impl=org.apache.hadoop.hbase.client.RpcConnectionRegistry}}
mitigates the issue. With {{RpcConnectionRegistry}}, new connections use
RPC-based SASL authentication, which runs under the server's shared UGI
{{Subject}} (via {{ugi.doAs()}}). Service tickets are therefore cached in the
shared {{Subject}} and reused across connections, eliminating repeated KDC
requests after the initial authentication.
By contrast, {{ZKConnectionRegistry}} creates a new JAAS {{LoginContext}} with a
new {{Subject}} per ZK session (in {{org.apache.zookeeper.Login}}), so service
tickets are never shared. Note that the workaround only avoids the repeated KDC
requests; it does not address the unnecessary connection creation itself.
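As a configuration fragment, the workaround might look like the following. This is a sketch that assumes the property is set in the {{hbase-site.xml}} read by the HBase Master, since that is the process creating the short-lived connections; the property name and value are the ones given above.

```xml
<!-- hbase-site.xml (assumed location: the HBase Master's configuration) -->
<property>
  <name>hbase.client.registry.impl</name>
  <value>org.apache.hadoop.hbase.client.RpcConnectionRegistry</value>
</property>
```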
h3. Proposed Fix
1. Replace {{isSecurityAvailable(conf)}} with
{{AccessChecker.isAuthorizationSupported(conf)}} — this checks the
{{hbase.security.authorization}} configuration value instead of creating a
connection to verify {{hbase:acl}} table existence. When authorization is
enabled, the {{hbase:acl}} table is guaranteed to exist because the
{{AccessController}} coprocessor creates it.
2. For {{writeAclToSnapshotDescription()}}, avoid creating a new {{Connection}}
by obtaining a {{Table}} instance from an existing connection (e.g., the
Master's shared connection) and passing it to
{{PermissionStorage.getTablePermissions()}}. Currently {{null}} is passed as
the {{Table}} parameter, which forces the method to create a new {{Connection}}
internally. Note that a similar pattern in {{PermissionStorage.loadAll()}}
already has a {{TODO}} comment acknowledging this issue: {{// TODO: Pass in a
Connection rather than create one each time.}}
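The two steps of the proposed fix can be sketched in plain Java. This is a simplified model, not the actual patch: {{FakeConnection}}, {{FakeTable}}, and {{ConnectionCounter}} are hypothetical stand-ins for the HBase types, a {{Map}} stands in for {{Configuration}}, and the {{false}} default for {{hbase.security.authorization}} is an assumption.

```java
import java.util.Map;

public class SnapshotAclSketch {

    /** Counts how many stand-in connections were opened. */
    static final class ConnectionCounter {
        int opened = 0;
    }

    /** Hypothetical stand-in for an HBase Connection. */
    static final class FakeConnection implements AutoCloseable {
        FakeConnection(ConnectionCounter counter) {
            counter.opened++;
        }

        FakeTable getTable(String name) {
            return new FakeTable(name);
        }

        @Override
        public void close() {
            // Nothing to release in this model.
        }
    }

    /** Hypothetical stand-in for an HBase Table. */
    static final class FakeTable {
        final String name;

        FakeTable(String name) {
            this.name = name;
        }
    }

    /** Step 1: decide from configuration alone, without opening a connection. */
    static boolean isAuthorizationSupported(Map<String, String> conf) {
        // Assumption: the flag defaults to false when it is not set.
        return Boolean.parseBoolean(
            conf.getOrDefault("hbase.security.authorization", "false"));
    }

    /** Models the current pattern: a null table forces a new connection. */
    static String readPermissions(ConnectionCounter counter, FakeTable aclTable,
                                  String tableName) {
        if (aclTable == null) {
            try (FakeConnection conn = new FakeConnection(counter)) {
                return readFrom(conn.getTable("hbase:acl"), tableName);
            }
        }
        return readFrom(aclTable, tableName);
    }

    static String readFrom(FakeTable acl, String tableName) {
        return "permissions(" + tableName + ") from " + acl.name;
    }

    /** Step 2: compare passing null with reusing one long-lived connection. */
    static int connectionsForSnapshots(int snapshots, boolean reuseSharedTable) {
        ConnectionCounter counter = new ConnectionCounter();
        // Stand-in for the Master's shared connection, opened once.
        FakeConnection shared = reuseSharedTable ? new FakeConnection(counter) : null;
        for (int i = 0; i < snapshots; i++) {
            FakeTable acl = reuseSharedTable ? shared.getTable("hbase:acl") : null;
            readPermissions(counter, acl, "table" + i);
        }
        return counter.opened;
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of("hbase.security.authorization", "true");
        System.out.println("authorization: " + isAuthorizationSupported(conf));
        System.out.println("null table:    " + connectionsForSnapshots(100, false));
        System.out.println("shared table:  " + connectionsForSnapshots(100, true));
    }
}
```

The design point is that the caller, which already holds a long-lived connection, supplies the {{Table}}; the callee then never needs to create a connection of its own.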
was:
In a Kerberos-secured HBase cluster, each snapshot operation triggers two
unnecessary {{ConnectionFactory.createConnection(conf)}} calls in
{{SnapshotDescriptionUtils.validate()}}, which is invoked from
{{MasterRpcServices.snapshot()}}. These short-lived connections are created and
immediately closed, but each creation involves establishing a new ZooKeeper
session with GSSAPI authentication, resulting in KDC requests for service
tickets.
When batch snapshot jobs process many tables in a short period, this generates
a large volume of KDC requests. The KDC may interpret this traffic as a
brute-force or DDoS attack and block the HBase Master's IP. Once blocked, the
Master can no longer authenticate any Kerberos operations, effectively
rendering it non-functional and eventually causing it to fail.
h3. Root Cause
{{SnapshotDescriptionUtils.validate()}} calls:
1. {{isSecurityAvailable(conf)}} — creates a full {{Connection}} + {{Admin}}
just to check if the {{hbase:acl}} table exists
2. {{writeAclToSnapshotDescription()}} — calls
{{PermissionStorage.getTablePermissions(conf, tableName)}} which calls
{{getPermissions()}} with {{Table t = null}}, creating another {{Connection}}
to read from {{hbase:acl}}
Each {{ConnectionFactory.createConnection(conf)}} with the default
{{ZKConnectionRegistry}} creates a new {{ReadOnlyZKClient}}, which establishes
a ZK session with GSSAPI (Kerberos) SASL authentication. Since each connection
gets a new JAAS {{LoginContext}} with a new {{Subject}} (in
{{org.apache.zookeeper.Login}}), service tickets are not cached across
connections, and every connection triggers a TGS request to the KDC.
{{isSecurityAvailable()}} is also called from {{RestoreSnapshotProcedure}} and
{{CloneSnapshotProcedure}}, so the same issue affects snapshot restore/clone
operations.
h3. Workaround
Setting
{{hbase.client.registry.impl=org.apache.hadoop.hbase.client.RpcConnectionRegistry}}
mitigates the issue. With {{RpcConnectionRegistry}}, new connections use
RPC-based SASL authentication which runs under the server's shared UGI
{{Subject}} (via {{ugi.doAs()}}). This allows service tickets to be cached in
the shared {{Subject}} and reused across connections, eliminating repeated KDC
requests after the initial authentication.
However, {{ZKConnectionRegistry}} creates a new JAAS {{LoginContext}} with a
new {{Subject}} per ZK session (in {{org.apache.zookeeper.Login}}), so service
tickets are never shared. This workaround does not address the unnecessary
connection creation itself.
h3. Proposed Fix
1. Replace {{isSecurityAvailable(conf)}} with
{{User.isHBaseSecurityEnabled(conf)}} — this checks the
{{hbase.security.authentication}} configuration value instead of creating a
connection to verify {{hbase:acl}} table existence. In a properly configured
cluster, security enabled implies {{hbase:acl}} exists.
2. For {{writeAclToSnapshotDescription()}}, avoid creating a new {{Connection}}
by obtaining a {{Table}} instance from an existing connection (e.g., the
Master's shared connection) and passing it to
{{PermissionStorage.getPermissions()}}. Currently {{null}} is passed as the
{{Table}} parameter, which forces the method to create a new {{Connection}}
internally. Note that a similar pattern in {{PermissionStorage.loadAll()}}
already has a {{TODO}} comment acknowledging this issue: {{// TODO: Pass in a
Connection rather than create one each time.}}
> Snapshot operations create unnecessary short-lived connections causing
> excessive KDC requests in Kerberos environments
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-30058
> URL: https://issues.apache.org/jira/browse/HBASE-30058
> Project: HBase
> Issue Type: Bug
> Components: security, snapshots
> Reporter: JeongMin Ju
> Assignee: JeongMin Ju
> Priority: Major