jihoonson opened a new pull request #11457:
URL: https://github.com/apache/druid/pull/11457


   ### Description
   
   A concurrency issue was recently found between `DruidSchema` and 
`BrokerServerView`: refreshing a `DruidTable` in `DruidSchema` can 
block query processing. This can happen in the scenario described below.
   
   1. The `DruidSchema-Cache` thread acquires `DruidSchema#lock` and calls 
`DruidSchema.buildDruidTable()` to rebuild the rowSignature of a `DruidTable`.
   2. A new segment is added to `BrokerServerView`. The `BrokerServerView` 
thread acquires `BrokerServerView#lock`, processes the newly added segment, and 
invokes the timelineCallbacks of `DruidSchema`.
   3. The timeline callback of `DruidSchema` runs on the same thread as in 
step 2. This thread must wait until `DruidSchema.buildDruidTable()` from step 1 
finishes and `DruidSchema#lock` is released.
   4. A new query is issued and the query thread calls 
`BrokerServerView.getTimeline()`. This call must wait until the timelineCallbacks 
from step 2 finish and `BrokerServerView#lock` is released.
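
   The chain above can be modeled with a small, hypothetical sketch. None of the names below are Druid's actual code: the two monitor objects stand in for `DruidSchema#lock` and `BrokerServerView#lock`, and the sleep stands in for a slow `buildDruidTable()`.

   ```java
   import java.util.concurrent.CountDownLatch;

   // Hypothetical model of the four steps above (not Druid's real classes).
   public class LockChainDemo {
       static final Object schemaLock = new Object(); // stands in for DruidSchema#lock
       static final Object viewLock = new Object();   // stands in for BrokerServerView#lock

       static long run() {
           try {
               CountDownLatch schemaLockHeld = new CountDownLatch(1);
               CountDownLatch viewLockHeld = new CountDownLatch(1);

               // Step 1: the DruidSchema-Cache thread holds schemaLock for a slow rebuild.
               Thread cacheThread = new Thread(() -> {
                   synchronized (schemaLock) {
                       schemaLockHeld.countDown();
                       try { Thread.sleep(300); } catch (InterruptedException ignored) {} // buildDruidTable()
                   }
               });

               // Steps 2-3: the BrokerServerView thread takes viewLock, then runs the
               // timeline callback, which needs schemaLock and therefore blocks.
               Thread viewThread = new Thread(() -> {
                   synchronized (viewLock) {
                       viewLockHeld.countDown();
                       synchronized (schemaLock) { /* DruidSchema.addSegment callback */ }
                   }
               });

               cacheThread.start();
               schemaLockHeld.await();
               viewThread.start();
               viewLockHeld.await();

               // Step 4: a query thread calling getTimeline() needs viewLock and must
               // wait for the whole chain above to unwind.
               long start = System.nanoTime();
               synchronized (viewLock) { /* BrokerServerView.getTimeline() */ }
               long waitedMs = (System.nanoTime() - start) / 1_000_000;

               cacheThread.join();
               viewThread.join();
               return waitedMs;
           } catch (InterruptedException e) {
               throw new RuntimeException(e);
           }
       }

       public static void main(String[] args) {
           System.out.println("query thread blocked for ~" + run() + " ms");
       }
   }
   ```

   The query thread blocks for roughly the full duration of the table rebuild, even though it never touches `DruidSchema` directly.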
   
   The following flame graphs show what those threads were doing when this 
happened in our cluster. The metrics for these flame graphs were collected for 
30 seconds.
   
   
![DruidSchema](https://user-images.githubusercontent.com/2322288/125991564-008eddf1-1bc8-435b-bc93-4b331589c0a8.png)
   
   `DruidSchema` was calling `refreshSegmentsForDataSource()` and 
`buildDruidTable()`. `refreshSegmentsForDataSource` issues a 
`SegmentMetadataQuery` per segment and acquires `DruidSchema#lock` per row in the 
result to update segment metadata in memory. `buildDruidTable` holds 
`DruidSchema#lock` _while it iterates over all columns in all segments_ of the 
datasource. When the datasource has 481200 segments with 774 columns each, 
`buildDruidTable` took about 25 seconds (!!) on my desktop.
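
   As a rough sanity check on that 25-second figure, iterating every column of every segment at this scale means hundreds of millions of iterations performed while the lock is held:

   ```java
   // Back-of-envelope count of column visits inside buildDruidTable's lock hold.
   public class BuildTableCost {
       public static long columnVisits(long segments, long columnsPerSegment) {
           return segments * columnsPerSegment;
       }

       public static void main(String[] args) {
           // 481200 segments x 774 columns each -> ~372M column visits
           System.out.println(columnVisits(481_200L, 774L)); // prints 372448800
       }
   }
   ```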
   
   
![BrokerServerView](https://user-images.githubusercontent.com/2322288/125991550-eb61f746-56f2-4ca0-9494-d3ad20a929e4.png)
   
   `BrokerServerView` was blocked in the `DruidSchema.addSegment` callback.
   
   ![2 timeseries 
queries](https://user-images.githubusercontent.com/2322288/125991541-e3b7972e-397f-4e9c-a012-66a2d7b32285.png)
   
   Finally, there were 2 timeseries queries that were blocked in 
`BrokerServerView.getTimeline()` for 30 seconds.
   
   Currently, this can happen whenever `DruidSchema` needs to refresh a 
`DruidTable`, that is, whenever a new segment is added to the cluster or a 
segment is completely removed from it. Moving segments should not 
cause this issue because a new segment is always loaded on the new server 
before it is removed from the previous one. As a result, moving segments 
does not require updating the `RowSignature` of the `DruidTable`.
   
   To fix this issue, this PR improves the concurrency of `DruidSchema` by not 
holding `DruidSchema#lock` during expensive operations such as refreshing 
`DruidTable`s.
   
   - A new thread, `DruidSchema-Callback`, is added in `DruidSchema` to 
asynchronously process the timeline callbacks.
   - The `segmentMetadataInfo` map is changed to `ConcurrentHashMap<String, 
ConcurrentSkipListMap<SegmentId, AvailableSegmentMetadata>>`, so that updating 
the map no longer requires holding `DruidSchema#lock`. Instead, concurrency 
control is delegated to the `ConcurrentMap`s. This could also make querying 
the segments table faster because `getSegmentMetadataSnapshot()` no longer 
needs to acquire `DruidSchema#lock`.
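
   A minimal sketch of the new map shape, with placeholder types (a `String` segment id and a `Long` value) standing in for `SegmentId` and `AvailableSegmentMetadata`, shows how per-datasource updates can rely on the maps' own concurrency control instead of `DruidSchema#lock`:

   ```java
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.ConcurrentSkipListMap;

   // Simplified stand-in for DruidSchema's segmentMetadataInfo; not the real class.
   public class SegmentMetadataStore {
       // datasource -> (segmentId -> metadata); both levels are thread-safe on
       // their own, so writers need no external lock for map updates.
       private final ConcurrentHashMap<String, ConcurrentSkipListMap<String, Long>> segmentMetadataInfo =
           new ConcurrentHashMap<>();

       void addSegment(String dataSource, String segmentId, long numRows) {
           segmentMetadataInfo
               .computeIfAbsent(dataSource, ds -> new ConcurrentSkipListMap<>())
               .put(segmentId, numRows);
       }

       void removeSegment(String dataSource, String segmentId) {
           segmentMetadataInfo.computeIfPresent(dataSource, (ds, segments) -> {
               segments.remove(segmentId);
               return segments.isEmpty() ? null : segments; // drop empty datasources atomically
           });
       }

       int segmentCount(String dataSource) {
           ConcurrentSkipListMap<String, Long> segments = segmentMetadataInfo.get(dataSource);
           return segments == null ? 0 : segments.size();
       }

       public static void main(String[] args) {
           SegmentMetadataStore store = new SegmentMetadataStore();
           store.addSegment("wiki", "seg1", 100L);
           System.out.println(store.segmentCount("wiki")); // prints 1
       }
   }
   ```

   The inner `ConcurrentSkipListMap` keeps segments sorted by key, so snapshot reads can iterate in order without ever taking `DruidSchema#lock`.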
   
   Finally, this PR does not fix the expensive logic in `buildDruidTable()` itself. 
Incremental updates of `DruidTable` would be better, but they require that, for 
each column, we can fall back to the column type in the second 
most recent segment when the most recent segment disappears. How to track 
those per-segment column types efficiently can be researched as a follow-up.
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `DruidSchema`
   
   <hr>
   
   This PR has:
   - [x] been self-reviewed.
      - [x] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
   - [ ] added documentation for new or modified features or behaviors.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [x] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
