[ https://issues.apache.org/jira/browse/HIVE-29584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

soowan4147 updated HIVE-29584:
------------------------------
    Description: 
  ## Summary

  After a brief MySQL outage (e.g., 4-second network glitch from a planned
  DBA operation), one HMS Thrift worker thread can permanently retain a broken
  HikariProxyConnection in its ObjectStore.pm ThreadLocal cache, leading to
  indefinite reuse of the same broken wrapper for hours until HMS is restarted.

  ## Root Cause

  `MetaStoreDirectSql.prepareTxn()` executes `SET @@session.sql_mode=ANSI_QUOTES`
  on every transaction. When this fails on a broken connection (e.g.,
  "Connection is closed" SQLException after MySQL transient outage),
  `ObjectStore.handleDirectSqlError()` falls back to ORM mode but does NOT
  invalidate the PersistenceManager. As a result:

  1. Same `ObjectStore.pm` is reused on next RPC
  2. Same broken HikariProxyConnection wrapper is reused (held via ThreadLocal)
  3. HikariCP cannot evict the in-use connection per design
  4. Only HMS restart releases the wrapper
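
  The chain above can be illustrated with a toy model. This is only a sketch
  under stated assumptions: `FakeConnection`, `FakeStore` and
  `fallbacksAfterOutage` are hypothetical names invented for this example and
  are not Hive, DataNucleus or HikariCP classes. It shows why a cached handle
  that is never invalidated fails on every later call, while dropping it on the
  first connection error lets the next call recover.

  ```java
  import java.sql.SQLException;

  /** Toy model of the leak; every name here is hypothetical, not Hive code. */
  final class StaleHandleDemo {

      /** Stands in for a pooled JDBC wrapper that breaks when the database drops the session. */
      static final class FakeConnection {
          private boolean closed;

          void breakIt() {
              closed = true;
          }

          void execute(String sql) throws SQLException {
              if (closed) {
                  // SQLState class 08 is the standard "connection exception" class.
                  throw new SQLException("Connection is closed", "08003");
              }
          }
      }

      /** Stands in for a per-thread store that caches one handle across RPCs. */
      static final class FakeStore {
          FakeConnection cached = new FakeConnection();
          final boolean invalidateOnError;
          int fallbacks;

          FakeStore(boolean invalidateOnError) {
              this.invalidateOnError = invalidateOnError;
          }

          /** One RPC: try the fast path and count a fallback if it fails. */
          void rpc() {
              if (cached == null) {
                  cached = new FakeConnection(); // fresh handle after invalidation
              }
              try {
                  cached.execute("SET @@session.sql_mode=ANSI_QUOTES");
              } catch (SQLException e) {
                  fallbacks++;
                  if (invalidateOnError) {
                      cached = null; // drop the broken handle so the next RPC gets a new one
                  }
              }
          }
      }

      /** Breaks the cached handle once, runs the given number of RPCs, returns the fallback count. */
      static int fallbacksAfterOutage(boolean invalidateOnError, int rpcs) {
          FakeStore store = new FakeStore(invalidateOnError);
          store.cached.breakIt();
          for (int i = 0; i < rpcs; i++) {
              store.rpc();
          }
          return store.fallbacks;
      }

      public static void main(String[] args) {
          System.out.println("without invalidation: " + fallbacksAfterOutage(false, 5)); // prints 5
          System.out.println("with invalidation:    " + fallbacksAfterOutage(true, 5));  // prints 1
      }
  }
  ```

  The model leaves out the ThreadLocal and the pool itself; it only captures
  the reuse-without-invalidation pattern described in the list.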

  ## Production Incident Evidence (2026-04-26)

  - pool-6-thread-93872 retained broken wrapper for ~5 hours
  - 41,302 audit RPCs all from same client IP, all failing with same error
  - 6,047 "Falling back to ORM" + 14,309 ERROR logs in 5 hours
  - master02 normal threads' RPC throughput dropped 90%+ during incident
  - catalogd's ALTER_TABLE processing stalled for 1 to 3 minutes per event
  - Resolved only by master HMS restart (3h 10m total impact)

  ## Source References

  Verified the defect exists in:
  - 3.1.3: `ObjectStore.java#L3646-L3697` (handleDirectSqlError)
  - 4.0.0: `ObjectStore.java#L4449-L4495` (same defect, no PM cleanup)
  - `MetaStoreDirectSql.java#L2026-L2034` (prepareTxn trigger)

  ## Steps to Reproduce

  1. Set up HMS 3.1.3+ with HikariCP backed by MySQL
  2. Create a long-lived metastore client that maps permanently to one HMS
     worker thread (e.g., Apache Amoro pod, Spark Thrift Server)
  3. Briefly disconnect MySQL (4 seconds via iptables drop or KILL CONNECTION)
  4. Observe: one worker thread continues to reuse the broken wrapper indefinitely
  5. Verify: log shows continuous "Falling back to ORM path due to direct SQL
     failure: Error setting ansi quotes: Connection is closed" from same thread
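
  For step 5, a small helper can confirm that the fallback messages come from
  a single thread. The following is a rough sketch, not a Hive tool:
  `FallbackLogCounter` is a hypothetical name, and it assumes the thread name
  appears somewhere in each log line in the `pool-N-thread-M` form seen in the
  incident. Adjust the pattern to the actual log layout.

  ```java
  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.List;
  import java.util.Map;
  import java.util.TreeMap;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  /** Counts "Falling back to ORM" log lines per worker thread (hypothetical helper). */
  final class FallbackLogCounter {
      private static final String MARKER =
          "Falling back to ORM path due to direct SQL failure";

      // Assumption: the thread name is embedded in the line as pool-N-thread-M.
      private static final Pattern THREAD = Pattern.compile("pool-\\d+-thread-\\d+");

      static Map<String, Integer> countByThread(List<String> lines) {
          Map<String, Integer> counts = new TreeMap<>();
          for (String line : lines) {
              if (!line.contains(MARKER)) {
                  continue; // not a fallback message
              }
              Matcher m = THREAD.matcher(line);
              String thread = m.find() ? m.group() : "(unknown)";
              counts.merge(thread, 1, Integer::sum);
          }
          return counts;
      }

      public static void main(String[] args) throws IOException {
          if (args.length != 1) {
              System.err.println("usage: FallbackLogCounter <path-to-hms-log>");
              return;
          }
          // Reads the whole file into memory; fine for a sketch, not for very large logs.
          System.out.println(countByThread(Files.readAllLines(Paths.get(args[0]))));
      }
  }
  ```

  If the defect is present, one thread name should dominate the counts.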

  ## Proposed Fix

  In `ObjectStore.handleDirectSqlError()`, when the cause is a connection-level
  SQLException, invalidate the PM:

  ```java
  if (isConnectionLevelError(ex)) {
      if (pm != null) {
          try {
              if (pm.currentTransaction().isActive()) {
                  pm.currentTransaction().rollback();
              }
              pm.close();   // releases HikariCP wrapper to pool
          } catch (Exception e) {
              // best effort
          }
          pm = null;
          directSql = null;
      }
  }
  ```

  This forces a fresh PM (and thus a fresh connection) on the next RPC,
  allowing the broken connection to be properly evicted by HikariCP.
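
  The proposed fix calls `isConnectionLevelError(ex)` without defining it. One
  possible shape for that check is sketched below, using only `java.sql`
  types. The class name `ConnectionErrors` and the depth limit are assumptions
  made for this example. It also assumes the original `SQLException` is still
  reachable through the cause chain; if HMS wraps the failure in an exception
  that carries only the message text, the check would have to inspect the
  message instead.

  ```java
  import java.sql.SQLException;
  import java.sql.SQLNonTransientConnectionException;
  import java.sql.SQLRecoverableException;
  import java.sql.SQLTransientConnectionException;

  /** Hypothetical helper for the proposed fix; not part of Hive. */
  final class ConnectionErrors {
      private static final int MAX_DEPTH = 32; // guards against cyclic cause chains

      private ConnectionErrors() {
      }

      /** Returns true if any exception in the cause chain signals a broken connection. */
      static boolean isConnectionLevelError(Throwable t) {
          for (int depth = 0; t != null && depth < MAX_DEPTH; t = t.getCause(), depth++) {
              if (t instanceof SQLNonTransientConnectionException
                      || t instanceof SQLTransientConnectionException
                      || t instanceof SQLRecoverableException) {
                  return true;
              }
              if (t instanceof SQLException) {
                  // SQLState class "08" is the standard connection-exception class.
                  String state = ((SQLException) t).getSQLState();
                  if (state != null && state.startsWith("08")) {
                      return true;
                  }
              }
          }
          return false;
      }
  }
  ```

  Keeping the check narrow matters here: invalidating the PM on ordinary SQL
  errors (for example constraint violations) would discard healthy connections
  for no benefit.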

  ## Workarounds (currently in use)

  - Client-side: shorten `hive.metastore.client.socket.timeout` on long-lived
    clients (e.g., Amoro) so they auto-reconnect every few minutes, breaking
    the permanent thread mapping
  - Operational: enable HikariCP `leakDetectionThreshold`, alarm on the
    "Connection leak detection triggered" log, and auto-restart the affected HMS

  ## Related JIRAs (none directly fix this)

  - HIVE-22804 (sessionVariables workaround) — does not prevent the leak
  - HIVE-20192 (PM cleanup at thread exit) — different mechanism
  - HIVE-28788 (commit failure → starvation) — different trigger
  - HIVE-28839 (DataNucleus connection starvation) — different code path

  To my knowledge, this specific defect (PM ThreadLocal retaining broken
  wrapper after SQLException in handleDirectSqlError) has not been reported
  before.



> ObjectStore.handleDirectSqlError() leaks broken JDBC connection wrapper in PM ThreadLocal cache after MySQL transient outage
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29584
>                 URL: https://issues.apache.org/jira/browse/HIVE-29584
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore, Standalone Metastore
>    Affects Versions: 3.1.3, 4.0.0, 4.2.0
>         Environment:   Hive: 3.1.3 (also verified affected: 4.0.0, 4.1.0)
>   Standalone Metastore: yes
>   Backend DB: MySQL 8.0
>   JDBC Driver: mysql-connector-j 8.0.30
>   Connection Pool: HikariCP 2.6.1
>   ORM: DataNucleus 4.1.19 + datanucleus-core 4.1.17 + datanucleus-api-jdo 4.2.4
>   JVM: OpenJDK 1.8.0_412
>   Affected Client: Apache Amoro (long-lived thrift connection client)
>            Reporter: soowan4147
>            Priority: Critical
>              Labels: datanucleus, hikaricp, hms, metastore, mysql, threadlocal
>



