[ https://issues.apache.org/jira/browse/UNOMI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18008990#comment-18008990 ]
Serge Huber commented on UNOMI-897: ----------------------------------- This fix has been committed, reviewed and merged in https://github.com/apache/unomi/pull/720 We now need to up-port it to Unomi 3 which has a different implementation of the service due to multi-tenancy. > Two separate issues in Groovy Actions implementation - Data contamination > race condition and severe performance problems > ------------------------------------------------------------------------------------------------------------------------ > > Key: UNOMI-897 > URL: https://issues.apache.org/jira/browse/UNOMI-897 > Project: Apache Unomi > Issue Type: Bug > Components: unomi(-core) > Affects Versions: unomi-2.4.0, unomi-2.5.0, unomi-2.6.0, unomi-2.6.1 > Environment: - Apache Unomi 2.0.0+ > - Groovy actions extension enabled > - Multi-threaded environments > - Production deployments with concurrent users > - High-concurrency scenarios (1000+ concurrent threads) > Reporter: Serge Huber > Priority: Major > Fix For: unomi-3.0.0, unomi-2.7.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > This ticket documents *two separate but related issues* in the > {code}GroovyActionDispatcher{code} that affect Unomi deployments using Groovy > actions: > # *ISSUE #1: Data Contamination Race Condition* - Security vulnerability > where user data can leak between different users > # *ISSUE #2: Performance and Memory Problems* - Scalability issues that > prevent handling concurrent users effectively > *Business Impact*: These issues can lead to data integrity problems, customer > trust issues, system reliability problems, and operational costs due to poor > performance. > *Note*: These two issues are reported together because they both stem from > the same architectural problems in the {code}GroovyActionDispatcher{code} and > will require coordinated fixes in the same code area. Addressing one issue > without considering the other could lead to incomplete solutions or introduce > new problems. > ---- > h2. ISSUE #1: Data Contamination Race Condition > h3. 🚨 Security Vulnerability - User Data Leakage > *What's happening*: The system has a bug that can cause user data from one > customer to appear in another customer's profile or event data. This is a > *serious privacy and security problem*. > *Why it matters*: > * Customer A's personal information could be visible to Customer B > * This can lead to loss of customer trust and data integrity issues > * Audit logs become unreliable > * System behavior becomes unpredictable > h4. Root Cause (Technical) > The implementation uses a *shared GroovyShell instance* across all threads in > {code}GroovyActionDispatcher{code}, which means: > * When multiple users are processed simultaneously, their data can get mixed > up > * Thread A sets variables for User A's action/event > * Thread B overwrites the same shared shell with User B's action/event data > * Both threads now see User B's data, causing contamination > h4. Race Condition Scenario (Simplified) > # *User A* logs in and their profile is being processed > # *User B* logs in at the same time and their profile is also being processed > # Due to the shared system component, User A's profile gets updated with User > B's information > # *Result*: User A now sees User B's data in their profile > h3. 🔒 Security Implications > h4. Data Integrity Violation > * *User profile data leaks* between different user profiles > * *Event data corruption* where event IDs get mixed up > * *Unauthorized data disclosure* - data from one user visible to another > * *Audit trail corruption* - incorrect data in logs > * *Data integrity compromise* - unreliable system behavior > h4. Example Data Contamination > {code} > User A's profile: { "profileId": "user-a", "actionId": "action-a", "eventId": > "event-a" } > User B's profile: { "profileId": "user-b", "actionId": "action-b", "eventId": > "event-b" } > After race condition: > User A's profile: { "profileId": "user-b", "actionId": "action-b", "eventId": > "event-b" } // CONTAMINATED! > User B's profile: { "profileId": "user-b", "actionId": "action-b", "eventId": > "event-b" } > {code} > h3. 📋 How to Reproduce the Data Contamination Issue > *⚠️ Important Note*: Race conditions are inherently difficult to reproduce > synthetically because they depend on precise timing between threads. The > following steps may not consistently reproduce the issue in controlled test > environments, but the underlying problem exists and can manifest in > production under the right conditions. > h4. Setup Environment > {code} > # Deploy Apache Unomi with Groovy actions extension enabled > # Ensure multi-threaded environment with concurrent user access > {code} > h4. Create Race Condition Sensitive Groovy Action > {code} > def execute(action, event) { > // Read from shared variables multiple times to increase race condition > probability > def profile1 = event.getProfile() > def action1 = action > def event1 = event > > // Simulate some processing that takes time > Thread.sleep(2) > > // Read again - this might now be different due to race condition! > def profile2 = event.getProfile() > def action2 = action > def event2 = event > > def profileId1 = profile1.getProfileId() > def profileId2 = profile2.getProfileId() > def actionId1 = action1.getActionId() > def actionId2 = action2.getActionId() > def eventId1 = event1.getEventId() > def eventId2 = event2.getEventId() > > // Return data for verification - this will show contamination if race > condition occurs > return [ > profileId: profileId1, > profileId2: profileId2, > actionId: actionId1, > actionId2: actionId2, > eventId: eventId1, > eventId2: eventId2, > visitCount: newValue, > timestamp: System.currentTimeMillis(), > threadId: Thread.currentThread().getId() > ] > } > {code} > h4. Generate Concurrent Load > {code} > # Create a load test that simulates 1000+ concurrent users > # Each user should trigger the Groovy action multiple times > # Use tools like Apache JMeter, Gatling, or custom load testing scripts > # Ensure high concurrency to increase race condition probability > {code} > h4. Observe Data Contamination > * Check for profileId, actionId, and eventId mismatches in action results > * Observe that data from one user appears in another user's result > * Verify audit logs show incorrect data associations > * Look for inconsistent data patterns that indicate cross-contamination > *Note*: The race condition may not be consistently reproducible in test > environments due to timing dependencies. However, the architectural flaw > exists and can cause data contamination in production environments with real > user load patterns. > h3. 🔍 Technical Evidence for Data Contamination > h4. Code Location > * *File*: > {code}src/main/java/org/apache/unomi/groovy/actions/GroovyActionDispatcher.java{code} > * *Lines*: 67-72 (race condition in execute method) > * *Method*: {code}execute(Action action, Event event, String actionName){code} > * *Shared Shell*: Retrieved via > {code}groovyActionsService.getGroovyShell(){code} > h4. Detection Challenges > * *Timing-dependent*: Race conditions require specific thread interleaving > patterns > * *Environment-specific*: May not manifest in development or test environments > * *Load-dependent*: More likely under high concurrent load conditions > * *Intermittent*: Can appear sporadically in production without consistent > reproduction > h4. Service Implementation Details > * *Shared GroovyShell*: Retrieved from GroovyActionsService as singleton > instance > * *Variable Setting*: action and event variables set in shared shell binding > * *Script Parsing*: Script parsed from shared shell instance > * *Concurrent Access*: Multiple threads access same shell simultaneously > * *Variable Overwriting*: Thread A's variables overwritten by Thread B's > variables > h4. Race Condition Characteristics > * *Timing-dependent*: May not be consistently reproducible in all environments > * *Concurrency-sensitive*: More likely with higher thread counts > * *Data-sensitive*: Affects profileId, actionId, and eventId > * *Processing-sensitive*: More likely with longer script execution times > * *Difficult to reproduce synthetically*: Race conditions are inherently > timing-dependent and may not manifest in controlled test environments > ---- > h2. ISSUE #2: Performance and Memory Problems > h3. 📊 Severe Performance Issues > *What's happening*: The system performs very poorly when multiple users are > active at the same time, using excessive memory and processing very slowly. > Scripts are recompiled on every execution instead of being cached. > *Why it matters*: > * The system can only handle 957 operations per second (very slow for > production) > * Memory usage grows to over 5GB, causing system instability > * Performance gets worse as more users connect simultaneously > * High garbage collection activity slows down the entire system > * Script compilation overhead adds significant latency to every action > execution > h4. Performance Problems > # *Poor Throughput*: Only 957 executions/second (unacceptable for production) > # *Thread Contention*: Shared GroovyShell causes blocking > # *Memory Leaks*: 5,151MB peak memory usage > # *High GC Pressure*: 58 garbage collections, 2,294ms GC time > # *Poor Scalability*: Performance degrades with concurrency > # *No Script Caching*: Scripts are recompiled on every execution instead of > being cached > h3. 📊 Performance Benchmark Results > h4. Thread Safety Test Results (1,000 threads, 50 iterations each) > {code} > Test Configuration: > - Concurrent Threads: 1,000 > - Executions per Thread: 50 > - Total Executions: 50,000 > - Test Duration: 52+ seconds > Results: > - Throughput: 957 executions/second > - Memory Peak: 5,155 MB > - Memory Increase: 5,151 MB > - GC Time: 2,294 ms > - GC Collections: 58 > - Thread Safety: FAILED (timing-dependent) > {code} > h4. Memory Usage Analysis > {code} > Initial Memory State: > Used: 4 MB, Free: 64 MB, Total: 68 MB, Max: 16384 MB > Peak Memory State: > Used: 5155 MB, Free: 1860 MB, Total: 7015 MB, Max: 16384 MB > Memory Growth: 5,151 MB (128,775% increase) > {code} > h4. GC Activity Analysis > {code} > GC Statistics for Original Implementation: > - GC Time: 2,294 ms > - GC Count: 58 collections > - Average GC Time: 39 ms per collection > - GC Pressure: High (affects overall system performance) > {code} > h3. 🏗️ Architecture Problems > h4. Current Architecture Issues > {code} > ┌─────────────────────────────────────────────────────────────┐ > │ Current Architecture │ > ├─────────────────────────────────────────────────────────────┤ > │ GroovyActionDispatcher │ > │ ├── Shared GroovyShell (SINGLETON) ← RACE CONDITION! │ > │ ├── No Script Caching ← PERFORMANCE ISSUE! │ > │ ├── Script Recompilation on Every Execution │ > │ ├── Variable Overwriting in Shared Binding │ > │ └── Thread-Unsafe Execution │ > └─────────────────────────────────────────────────────────────┘ > {code} > h4. Performance Issues Identified > * ❌ *Shared mutable state* across threads > * ❌ *Variable overwriting* in shared GroovyShell binding > * ❌ *No thread isolation* for script execution > * ❌ *Memory inefficiency* from shared instances > * ❌ *GC pressure* from object creation patterns > * ❌ *No script compilation caching* - scripts recompiled on every execution > ---- > h2. 🎯 Impact Assessment > h3. Business Impact > * *Data Integrity Issues*: Unreliable system behavior > * *Customer Trust*: Loss of confidence in data security > * *System Reliability*: Unpredictable data processing > * *Operational Costs*: High memory usage, poor performance > h3. Technical Impact > * *Scalability*: Cannot handle concurrent users effectively > * *Performance*: 957 executions/second (unacceptable) > * *Memory*: 5,155MB peak usage leads to OOM errors > * *Reliability*: Data corruption and inconsistencies > * *GC Pressure*: 58 collections impact overall system performance > h2. 📝 Additional Information > h3. Affected Components > * {code}GroovyActionDispatcher{code} > * {code}GroovyActionsService{code} (provides shared GroovyShell) > * All Groovy action executions > * Profile and event data processing > h3. Monitoring Recommendations > * Monitor for profile data inconsistencies > * Track memory usage patterns > * Monitor GC frequency and duration > * Alert on data contamination patterns > * Monitor script execution times > h3. Workarounds (Temporary) > * Reduce concurrent event processing > * Monitor for data contamination > * Implement data validation checks > * Consider disabling Groovy actions in production > * Increase memory allocation to handle GC pressure -- This message was sent by Atlassian Jira (v8.20.10#820010)