findingrish opened a new pull request, #15817: URL: https://github.com/apache/druid/pull/15817
## Description

Issue: https://github.com/apache/druid/issues/14989

The initial step in optimizing segment metadata was to centralize the construction of datasource schema in the Coordinator (https://github.com/apache/druid/pull/14985). Thereafter, we addressed the problem of publishing schema for realtime segments (https://github.com/apache/druid/pull/15475). The remaining goal is to eliminate the need to regularly execute queries to obtain segment schema information. This is the final change: tasks publish the segment schema for finalized segments, and the Coordinator polls it periodically.

## Design

### Database

#### Schema Table

Table Name: `SegmentSchema`

Purpose: Store the unique schema for a segment.

Columns

| Column Name | Data Type | Description |
| -------------- | ---------- | ------------ |
| id | autoincrement | primary key |
| created_date | varchar | creation time, allows filtering schema created after a point |
| fingerprint | varchar | unique identifier for the schema |
| payload | blob | includes rowSignature, aggregatorFactories |

#### Segments Table

New columns will be added to the already existing `Segments` table.

Columns

| Column Name | Data Type | Description |
| -------------- | ---------- | ------------ |
| num_rows | long | number of rows in the segment |
| schema_id | long | foreign key, references id in the schema table |

### Task

Changes in the task to publish schema along with segment metadata.

#### Streaming

- Changes in `StreamAppenderator` to get the RowSignature, AggregatorFactories and numRows for the segment.

#### Batch

TBA

#### MSQ

TBA

### Coordinator

#### Schema Poll

#### Schema Caching

#### SegmentMetadataCache changes

#### Schema Cleanup

## Testing

TBA

## Potential side effects

TBA

## Limitations

TBA

## Upgrade considerations

TBA

This PR has:

- [ ] been self-reviewed.
   - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
