findingrish opened a new pull request, #15475: URL: https://github.com/apache/druid/pull/15475
## Description

Issue: https://github.com/apache/druid/issues/14989

The first step in optimizing segment metadata was to centralize table schema construction in the Coordinator (https://github.com/apache/druid/pull/14985). The next goal is to eliminate the need to periodically run queries to obtain segment schema information, for both realtime and finalized segments. This change addresses realtime segments only.

Tasks now periodically announce the schema of realtime segments as part of the segment announcement process. The Coordinator reads the schema alongside the segment announcement and updates the schema for realtime segments in the metadata cache.

## Design

### Task

- Periodically, the `StreamAppenderator.SinkSchemaAnnouncer` computes sink schema changes and announces them to the `DataSegmentAnnouncer`.
- New APIs have been added to `DataSegmentAnnouncer` to receive sink schema information and to clean up schema state when a task is closed.
- A new POJO, `SegmentSchemas`, has been added to carry schema information for multiple segments.
- A new implementation of `DataSegmentChangeRequest`, named `SegmentSchemasChangeRequest`, has been added.

### Coordinator

- `HttpServerInventoryView` has been modified to handle schema information.
- The `CoordinatorSegmentMetadata` cache has been updated to incorporate schema changes. The refresh logic has also been changed so that segment metadata queries are no longer needed for realtime segments.

## Testing

* Added unit tests.
* Tested locally with the Wikipedia dataset and Kafka-based ingestion.

## Potential side effects

TBA

## Limitations

Currently, this feature does not work with ZooKeeper-based segment announcement.

## Upgrade considerations

The general upgrade order should be followed.
The new code is behind a feature flag, so it is compatible with existing setups. Even if centralized table schema building (https://github.com/apache/druid/pull/14985) is enabled, realtime segments are still refreshed by issuing segment metadata queries to the Indexer/Task unless this feature is also turned on.

## Release notes

This experimental feature aims to eliminate the need to periodically issue a SegmentMetadataQuery to the Indexer/Task to retrieve the schema of realtime segments. It is currently controlled by two feature flags and should only be enabled for proof-of-concept (PoC) or testing purposes. To activate it, set both `druid.coordinator.centralizedTableSchema.enabled` and `druid.coordinator.centralizedTableSchema.announceRealtimeSegmentSchema` in the common configurations. Note that the `druid.coordinator.centralizedTableSchema.announceRealtimeSegmentSchema` flag is temporary and will be removed in a subsequent update.

This PR has:

- [ ] been self-reviewed.
   - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
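
For reference, a minimal sketch of how the two flags from the release notes might be set. The `true` values and the `common.runtime.properties` file name are assumptions (the description only names the properties and says they belong in the common configurations). The `druid.serverview.type=http` line is likewise an assumption, included because the feature does not work with ZooKeeper-based segment announcement.

```properties
# common.runtime.properties (assumed location for the "common configurations")

# Enable centralized table schema building in the Coordinator.
druid.coordinator.centralizedTableSchema.enabled=true

# Temporary flag: have tasks announce realtime segment schemas.
# Expected to be removed in a subsequent update.
druid.coordinator.centralizedTableSchema.announceRealtimeSegmentSchema=true

# Assumption: use HTTP-based segment discovery, since ZooKeeper-based
# announcement is not supported by this feature.
druid.serverview.type=http
```

Leaving the second flag unset keeps the existing behavior, where realtime segments are refreshed through segment metadata queries.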
