findingrish opened a new pull request, #15475:
URL: https://github.com/apache/druid/pull/15475

   ## Description
   
   Issue: https://github.com/apache/druid/issues/14989
   
   The first step in optimizing segment metadata was to centralize the 
construction of table schema in the Coordinator 
(https://github.com/apache/druid/pull/14985). The next goal is to eliminate 
the need to periodically run queries to obtain segment schema information, 
for both realtime and finalized segments.
   
   This PR addresses realtime segments only. Tasks now periodically announce 
the schema of realtime segments as part of the segment announcement process. 
The Coordinator picks up the schema along with the segment announcement and 
updates the schema for realtime segments in the metadata cache.
   
   ## Design 
   
   ### Task
   - Periodically, the `StreamAppenderator.SinkSchemaAnnouncer` will compute 
sink schema changes and announce them to the `DataSegmentAnnouncer`.
   - New APIs have been introduced in `DataSegmentAnnouncer` to receive sink 
schema information and manage schema cleanup when a task is closed.
   - A new POJO named `SegmentSchemas` has been added to carry schema 
information for multiple segments.
   - A new implementation of `DataSegmentChangeRequest` has been introduced, 
named `SegmentSchemasChangeRequest`.
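
The "compute sink schema changes" step above can be pictured with a minimal sketch. It assumes a schema is a plain map from column name to type name and that only new or changed columns are announced. The class `SchemaDeltaSketch` and its methods are hypothetical illustrations, not the classes added in this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a periodic schema-change computation; not Druid's actual code. */
public class SchemaDeltaSketch {

    /**
     * Returns the columns in {@code current} that are new or whose type differs
     * from {@code previous}. An empty result means there is nothing to announce.
     */
    public static Map<String, String> columnDelta(
            Map<String, String> previous, Map<String, String> current) {
        Map<String, String> delta = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : current.entrySet()) {
            String oldType = previous.get(e.getKey()); // null if the column is new
            if (!e.getValue().equals(oldType)) {
                delta.put(e.getKey(), e.getValue());
            }
        }
        return delta;
    }

    /**
     * Compares the schema last announced for each segment with the current sink
     * schema and keeps only the segments that actually changed.
     */
    public static Map<String, Map<String, String>> computeChanges(
            Map<String, Map<String, String>> lastAnnounced,
            Map<String, Map<String, String>> currentSinks) {
        Map<String, Map<String, String>> changes = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> e : currentSinks.entrySet()) {
            Map<String, String> previous = lastAnnounced.getOrDefault(e.getKey(), Map.of());
            Map<String, String> delta = columnDelta(previous, e.getValue());
            if (!delta.isEmpty()) {
                changes.put(e.getKey(), delta);
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> last =
                Map.of("seg1", Map.of("__time", "LONG", "page", "STRING"));
        Map<String, Map<String, String>> now = new LinkedHashMap<>();
        now.put("seg1", Map.of("__time", "LONG", "page", "STRING", "added", "LONG"));
        now.put("seg2", Map.of("__time", "LONG"));
        // seg1 gained the column "added"; seg2 is entirely new.
        System.out.println(computeChanges(last, now));
    }
}
```

Announcing only the changed columns keeps periodic announcements small when sinks are stable. Whether the actual implementation sends deltas or full schemas in a given cycle is not specified in this description.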
   
   ### Coordinator
   - Modifications have been made to the `HttpServerInventoryView` to handle 
schema information.
   - The `CoordinatorSegmentMetadataCache` has been updated to incorporate 
announced schema changes. The refresh logic has also been changed so that 
segment metadata queries are no longer executed for realtime segments.
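
As a rough illustration of how a cache could absorb announced schemas instead of querying tasks, here is a minimal sketch. It assumes announcements arrive as column-name-to-type maps keyed by segment id. `RealtimeSchemaCacheSketch` and its methods are hypothetical and do not reflect the real Coordinator cache API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a cache that merges announced realtime schemas; not Druid's actual code. */
public class RealtimeSchemaCacheSketch {

    // segment id -> (column name -> type name)
    private final Map<String, Map<String, String>> schemas = new ConcurrentHashMap<>();

    /** Merges announced columns into the cached schema; announced types override cached ones. */
    public void applyAnnouncement(String segmentId, Map<String, String> announcedColumns) {
        schemas.merge(
                segmentId,
                new LinkedHashMap<>(announcedColumns),
                (cached, incoming) -> {
                    Map<String, String> merged = new LinkedHashMap<>(cached);
                    merged.putAll(incoming);
                    return merged;
                });
    }

    /** Drops the cached schema, for example when the owning task is closed. */
    public void removeSegment(String segmentId) {
        schemas.remove(segmentId);
    }

    /** Returns the cached schema, or an empty map if the segment is unknown. */
    public Map<String, String> schemaFor(String segmentId) {
        return schemas.getOrDefault(segmentId, Map.of());
    }
}
```

With a structure like this, a refresh cycle could read realtime schemas straight from the cache and reserve segment metadata queries for segments that have no announced schema.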
   
   ## Testing
   
   * Added unit tests. 
   * Tested locally with the Wikipedia dataset and Kafka-based ingestion. 
   
   ## Potential side effects 
   TBA
   
   ## Limitations 
   
   Currently, this feature does not work with ZooKeeper-based segment 
announcement. 
   
   ## Upgrade considerations
   
   The general upgrade order should be followed. The new code is behind a 
feature flag, so it is compatible with existing setups. If only centralized 
table schema building (https://github.com/apache/druid/pull/14985) is enabled, 
realtime segments will still be refreshed by issuing segment metadata queries 
to the Indexer/Task. 
   
   ## Release notes 
   
   This experimental feature aims to eliminate the need to periodically 
execute a SegmentMetadataQuery against the Indexer/Task to retrieve the 
schema of realtime segments. It is currently controlled by two feature flags 
and should only be enabled for proof-of-concept (PoC) or testing purposes. To 
activate it, set both of the following in the common configuration: 
`druid.coordinator.centralizedTableSchema.enabled` and 
`druid.coordinator.centralizedTableSchema.announceRealtimeSegmentSchema`. Note 
that the 
`druid.coordinator.centralizedTableSchema.announceRealtimeSegmentSchema` flag 
is temporary and will be removed in a subsequent update.
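
For illustration, enabling both flags might look like the fragment below in `common.runtime.properties`. The property names come from the description above; the `true` values are an assumption that both flags are booleans.

```properties
# Assumed boolean values; property names are taken from the release notes above.
druid.coordinator.centralizedTableSchema.enabled=true
druid.coordinator.centralizedTableSchema.announceRealtimeSegmentSchema=true
```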
   
   This PR has:
   
   - [ ] been self-reviewed.
      - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   

