kfaraz opened a new pull request, #17959:
URL: https://github.com/apache/druid/pull/17959

   ### Description
   
   This patch adds tools that inject artificial faults into a Druid test cluster
   for scale testing and fault-tolerance testing.
   
   These tools can trigger faults such as segment allocation delay and segment
   publish delay, so that the cluster's ability to recover from them can be evaluated.
   
   The `ClusterTestingModule` can evolve over time to allow inducing more types 
of faults.
   
   ### Changes
   
   - Add `ClusterTestingModule` to bind faulty clients when testing is enabled
   - Enable cluster testing mode by loading extension `druid-testing-tools` and
   setting config `druid.unsafe.cluster.testing=true`
   - Bind `FaultyCoordinatorClient`, `FaultyRemoteTaskActionClient` on Peons if 
testing is enabled
   - Trigger specific faults on tasks by setting the corresponding task context 
param
   - Bind `FaultyMetadataStorageCoordinator` on Overlord if testing is enabled
   - Bind `FaultyLagAggregator` to a supervisor if specified in the supervisor 
spec
   
   ### Pending items
   
   - Write config to trigger faults on Overlord
   - Write ITs to trigger and test faults
   
   ### Supported faults
   
   |Fault type|Steps to trigger|
   |----------|----------------|
   |Slow segment allocation|Specify task context parameter in the supervisor spec:<br>`taskActionClientConfig={segmentAllocationDelay=PT5S}`|
   |Slow segment publish|Specify task context parameter in the supervisor spec:<br>`taskActionClientConfig={segmentPublishDelay=PT1S}`|
   |Slow segment handoff|Specify task context parameter in the supervisor spec:<br>`coordinatorClient={minSegmentHandoffDelay=PT10S}`|
   |Tasks do not finish within completion timeout|Reduce `completionTimeout` in the supervisor spec|
   |High ingestion lag|<ul><li>Reset the supervisor and read from the start of the stream</li><li>OR add long delays for segment handoff</li><li>OR add long delays for segment publish</li><li>OR use a transformSpec with the `sleep` function to delay reading each row (see the sleep example under "Examples to trigger faults")</li><li>OR specify a `lagAggregator` in the supervisor IO config to magnify lag using a multiplier. This option doesn’t increase the actual ingestion lag; it only causes the Overlord to perceive and report a high lag.</li></ul>|
   |Task goes OutOfMemory|Increase `maxBytesInMemory`|
   |Skew in distribution of data across tasks|Set `taskCount` in the supervisor spec to a value that does not evenly divide `numPartitions`|
   |High lag for a specific partition|Configure only the tasks with the corresponding `taskGroupId` to have high lag|
   |Late arriving data|Manipulate the timestamp of incoming data by applying a transform|
   |Future data|Manipulate the timestamp of incoming data by applying a transform|
   |Many used segments in metadata store|Reduce `maxRowsPerSegment` to increase the number of segments created by the tasks|
   |Many pending segments in metadata store||
   
   #### Examples to trigger faults
   
   <details>
   <summary>Use transforms to manipulate timestamps of incoming records</summary>
   
   ```json
   {
     "type": "expression",
     "name": "__time",
     "expression": "timestamp_shift(__time, 'P3D', 1)"
   }
   ```
   
   </details>
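   
   <details>
   <summary>Use a transform to simulate late arriving data</summary>
   
   A sketch of where such a transform might sit in a supervisor spec, and how the same function could produce late data instead of future data. It assumes the usual `dataSchema.transformSpec.transforms` layout and that a negative step in `timestamp_shift` moves the timestamp backwards; the 3-day period is only illustrative.
   
   ```json
   {
     "spec": {
       "dataSchema": {
         "transformSpec": {
           "transforms": [
             {
               "type": "expression",
               "name": "__time",
               "expression": "timestamp_shift(__time, 'P3D', -1)"
             }
           ]
         }
       }
     }
   }
   ```
   
   </details>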
   
   <details>
   <summary>Use sleep expressions to add delay while reading records</summary>
   
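   The expression below would add a 5s delay while reading each row. `sleep(5)` evaluates to null, so `coalesce` falls back to `'P3D'` and the timestamp is still shifted by 3 days.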
   
   ```json
   {
     "type": "expression",
     "name": "__time",
     "expression": "timestamp_shift(__time, coalesce(sleep(5), 'P3D'), 1)"
   }
   ```
   
   </details>
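   
   <details>
   <summary>Use task context to delay segment allocation, publish and handoff</summary>
   
   A sketch of how the delay parameters from the table might be combined in one supervisor spec. It assumes that `context` sits at the top level of the supervisor spec and that the `key={...}` notation used in the table maps to nested JSON objects; neither is spelled out in this description. The `kafka` type is only an illustrative choice.
   
   ```json
   {
     "type": "kafka",
     "context": {
       "taskActionClientConfig": {
         "segmentAllocationDelay": "PT5S",
         "segmentPublishDelay": "PT1S"
       },
       "coordinatorClient": {
         "minSegmentHandoffDelay": "PT10S"
       }
     }
   }
   ```
   
   </details>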
   
   <details>
   <summary>Use faulty lag aggregator to multiply lag</summary>
   
   Specify the following in the supervisor spec to multiply the max, average 
and total lag by 100.
   
   ```json
   {
     "spec": {
       "ioConfig": {
         "lagAggregator": {
           "type": "unsafe_cluster_testing",
           "multiplier": 100
         }
       }
     }
   }
   ```
   
   </details>
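   
   <details>
   <summary>Use existing supervisor configs to induce timeouts, skew and many segments</summary>
   
   Several faults in the table need only regular supervisor settings. The fragment below is a rough sketch with illustrative values: a short `completionTimeout` so that tasks fail to finish in time, a `taskCount` of 3 assuming a hypothetical stream with 10 partitions, and a small `maxRowsPerSegment` to create many segments. The placement of these fields under `ioConfig` and `tuningConfig` follows the usual streaming supervisor layout and is an assumption here.
   
   ```json
   {
     "spec": {
       "ioConfig": {
         "taskCount": 3,
         "completionTimeout": "PT1M"
       },
       "tuningConfig": {
         "maxRowsPerSegment": 1000
       }
     }
   }
   ```
   
   </details>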
   
   <hr>
   
   This PR has:
   
   - [ ] been self-reviewed.
      - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever it would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


