findingrish opened a new pull request, #15817:
URL: https://github.com/apache/druid/pull/15817

   ## Description
   
   Issue: https://github.com/apache/druid/issues/14989
   
   The first step in optimizing segment metadata was to centralize the 
construction of datasource schema in the Coordinator 
(https://github.com/apache/druid/pull/14985). We then addressed the 
problem of publishing schema for realtime segments 
(https://github.com/apache/druid/pull/15475). The remaining goal is to 
eliminate the need to regularly execute queries to obtain segment 
schema information. 
   
   This is the final change. It publishes the segment schema for 
finalized segments from tasks and periodically polls it in the Coordinator. 
   
   ## Design 
   
   ### Database
   
   #### Schema Table 
   
   Table Name: `SegmentSchema`
   Purpose: Store unique segment schemas. 
   
   Columns 
   
   | Column Name | Data Type | Description |
   | ----------- | --------- | ----------- |
   | id | autoincrement | primary key |
   | created_date | varchar | creation time, allows filtering schema created after a point |
   | fingerprint | varchar | unique identifier for the schema |
   | payload | blob | includes rowSignature, aggregatorFactories |
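
   One way to derive the `fingerprint` column is to hash a canonical form of 
the payload, so that identical schemas always produce the same identifier. The 
sketch below is only an illustration under assumptions: the class name 
`SchemaFingerprintSketch`, the use of SHA-256, and the simplified 
name-to-type maps standing in for rowSignature and aggregatorFactories are all 
hypothetical and are not taken from this PR.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: not Druid's actual fingerprinting code.
class SchemaFingerprintSketch {

    // Builds a canonical string from the schema and returns its SHA-256 hash as hex.
    // Assumption: column name -> type and aggregator name -> type are enough to
    // identify a schema; the real payload may carry more detail.
    static String fingerprint(Map<String, String> rowSignature, Map<String, String> aggregators) {
        StringBuilder canonical = new StringBuilder();
        // Iteration order matters: the same columns in a different order hash differently.
        for (Map.Entry<String, String> e : rowSignature.entrySet()) {
            canonical.append(e.getKey()).append(':').append(e.getValue()).append(';');
        }
        canonical.append('|');
        for (Map.Entry<String, String> e : aggregators.entrySet()) {
            canonical.append(e.getKey()).append(':').append(e.getValue()).append(';');
        }
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(canonical.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", ex);
        }
    }

    public static void main(String[] args) {
        Map<String, String> signature = new LinkedHashMap<>();
        signature.put("__time", "LONG");
        signature.put("page", "STRING");
        Map<String, String> aggregators = new LinkedHashMap<>();
        aggregators.put("added", "longSum");
        // Prints a 64-character hex string that is stable for this schema.
        System.out.println(fingerprint(signature, aggregators));
    }
}
```

   A content-derived fingerprint lets a unique constraint on the column reject 
duplicate schema rows without comparing payload blobs.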
   
   
   #### Segments Table 
   New columns will be added to the already existing `Segments` table. 
   
   Columns 
   
   | Column Name | Data Type | Description |
   | ----------- | --------- | ----------- |
   | num_rows | long | number of rows in the segment |
   | schema_id | long | foreign key, references id in the schema table |
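
   The relationship between the two tables can be sketched with an in-memory 
model: each distinct fingerprint gets one schema id, and every segment row 
stores `num_rows` plus the `schema_id` it references. This is a minimal sketch 
under assumptions; `SchemaStoreSketch`, `SegmentRow`, `upsertSchema` and 
`publishSegment` are hypothetical names, and the real change persists this in 
the metadata database rather than in maps.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the SegmentSchema and Segments tables.
class SchemaStoreSketch {

    // Mirrors the two new Segments columns: num_rows and schema_id.
    static final class SegmentRow {
        final String segmentId;
        final long numRows;
        final long schemaId;

        SegmentRow(String segmentId, long numRows, long schemaId) {
            this.segmentId = segmentId;
            this.numRows = numRows;
            this.schemaId = schemaId;
        }
    }

    private final Map<String, Long> schemaIdByFingerprint = new HashMap<>();
    private final Map<String, SegmentRow> segments = new HashMap<>();
    private long nextSchemaId = 1; // stands in for the autoincrement primary key

    // Returns the existing id for a fingerprint, or assigns a new one.
    long upsertSchema(String fingerprint) {
        Long existing = schemaIdByFingerprint.get(fingerprint);
        if (existing != null) {
            return existing;
        }
        long id = nextSchemaId++;
        schemaIdByFingerprint.put(fingerprint, id);
        return id;
    }

    // Stores a segment row that points at the (possibly shared) schema row.
    SegmentRow publishSegment(String segmentId, long numRows, String fingerprint) {
        SegmentRow row = new SegmentRow(segmentId, numRows, upsertSchema(fingerprint));
        segments.put(segmentId, row);
        return row;
    }

    int schemaCount() {
        return schemaIdByFingerprint.size();
    }

    public static void main(String[] args) {
        SchemaStoreSketch store = new SchemaStoreSketch();
        SegmentRow first = store.publishSegment("segment-1", 100L, "fp-a");
        SegmentRow second = store.publishSegment("segment-2", 250L, "fp-a");
        // Both segments share one schema row because their fingerprints match.
        System.out.println(first.schemaId == second.schemaId);
        System.out.println(store.schemaCount());
    }
}
```

   Because many segments of a datasource typically share a schema, storing 
only a `schema_id` per segment keeps the `Segments` table small.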
   
   ### Task 
   Tasks are changed to publish the schema along with segment metadata. 
   
   #### Streaming 
   - `StreamAppenderator` is changed to obtain the RowSignature, 
AggregatorFactories and numRows for each segment. 
   
   
   #### Batch 
   TBA
   
   #### MSQ
   TBA
   
   ### Coordinator
   
   #### Schema Poll  
   
   #### Schema Caching 
   
   #### SegmentMetadataCache changes 
   
   #### Schema Cleanup 
   
   
   
   ## Testing
   
   TBA 
   
   ## Potential side effects 
   
   TBA
   
   ## Limitations 
   
   TBA
   
   ## Upgrade considerations
   
   TBA
   
   This PR has:
   
   - [ ] been self-reviewed.
      - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

