asdf2014 opened a new issue, #17968:
URL: https://github.com/apache/druid/issues/17968

   ### Description
   
   This issue proposes enhancing `SegmentMetadataQuery` with a new optional 
parameter, `segmentIds`. The parameter lets users query metadata for specific 
segments directly by their `segmentId`, rather than relying solely on 
interval-based filtering.
   
   ### Motivation
   
   This feature will be particularly useful for use cases such as:
   
   - Debugging or inspecting individual segments;
   - Validating the state of a known segment after ingestion or compaction;
   - Programmatic access in custom tooling where segment IDs are already known.
   
   ### Proposed Changes
   
   1. **Query Definition Layer**
      - Extend `SegmentMetadataQuery` to include a `List<String> segmentIds` 
field.
      - Ensure proper serialization/deserialization with Jackson.
      - Update equality, hashCode, and toString logic accordingly.
   2. **Query Runner**
      - Modify `SegmentMetadataQueryRunner` to evaluate and skip segments whose 
`segmentId` is not in the provided list.
   3. **Query Planning / Timeline Resolution**
      - Update `CachingClusteredClient` (on the Broker) to support filtering 
segments by `segmentId` before dispatching queries.
      - Introduce a utility to map `segmentId` to `SegmentDescriptor`, or 
extend `VersionedIntervalTimeline` if appropriate.
   4. **Backward Compatibility**
      - The new parameter will be **optional** and non-intrusive: if not 
specified, current behavior is preserved.
   5. **Testing**
      - Add unit tests for query definition, runner logic, and broker-level 
filtering behavior.
      - Extend integration tests to cover mixed queries with and without 
`segmentIds`.
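
   The runner-side check in step 2 and the "optional" behavior in step 4 can be
   sketched roughly as follows. This is a minimal illustration, not Druid code:
   `SegmentIdFilterSketch` and `shouldAnalyze` are hypothetical names, and the
   choice that an empty list matches nothing (rather than meaning "no filter")
   is an assumption the proposal does not settle.

   ```java
   import java.util.Collections;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Set;

   // Hypothetical helper for illustration only; not an existing Druid class.
   final class SegmentIdFilterSketch
   {
     // null means the query omitted segmentIds, so no ID filtering applies.
     private final Set<String> requestedIds;

     SegmentIdFilterSketch(List<String> segmentIds)
     {
       this.requestedIds = segmentIds == null
                           ? null
                           : Collections.unmodifiableSet(new HashSet<>(segmentIds));
     }

     // True when the runner should analyze the segment with this ID.
     // Assumption: an empty list is treated as "match nothing".
     boolean shouldAnalyze(String segmentId)
     {
       return requestedIds == null || requestedIds.contains(segmentId);
     }
   }
   ```

   Copying the list into a set keeps the per-segment check constant-time, which
   matters if the runner is asked about many segments.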
   
   ### Impacted Classes
   
   The following classes are expected to be modified as part of this change:
   
   - `org.apache.druid.query.metadata.metadata.SegmentMetadataQuery`
   - `org.apache.druid.query.metadata.metadata.SegmentMetadataQueryRunner`
   - `org.apache.druid.client.CachingClusteredClient`
   - `org.apache.druid.query.SegmentDescriptor`
   - `org.apache.druid.timeline.VersionedIntervalTimeline` (if necessary to 
locate segments by ID)
   - `org.apache.druid.segment.ReferenceCountingSegment` (for ID exposure)
   - `org.apache.druid.query.QueryToolChest` (for caching or context changes)
   - `org.apache.druid.query.QueryRunnerTestHelper` (for test support)
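
   To locate a segment by ID, the Broker needs the interval, version, and
   partition number that the ID encodes. The sketch below shows one way to
   recover them with the standard library only. `DescriptorSketch` is a
   hypothetical stand-in for `SegmentDescriptor`, and the parsing rule is a
   simplification: it assumes the usual
   `dataSource_start_end_version[_partitionNum]` layout and treats a missing
   numeric suffix as partition 0. Druid's own `SegmentId` class already deals
   with this ambiguity and is probably the better starting point in a real
   patch.

   ```java
   import java.time.Instant;
   import java.time.format.DateTimeParseException;

   // Hypothetical stand-in for SegmentDescriptor; field names are illustrative.
   final class DescriptorSketch
   {
     final Instant start;
     final Instant end;
     final String version;
     final int partitionNumber;

     DescriptorSketch(Instant start, Instant end, String version, int partitionNumber)
     {
       this.start = start;
       this.end = end;
       this.version = version;
       this.partitionNumber = partitionNumber;
     }

     // Returns null when the ID does not belong to dataSource or cannot be parsed.
     // The data source is required because its name may itself contain underscores.
     static DescriptorSketch tryParse(String dataSource, String segmentId)
     {
       String prefix = dataSource + "_";
       if (!segmentId.startsWith(prefix)) {
         return null;
       }
       // Split into start, end, and "version[_partitionNum]".
       String[] parts = segmentId.substring(prefix.length()).split("_", 3);
       if (parts.length < 3) {
         return null;
       }
       try {
         Instant start = Instant.parse(parts[0]);
         Instant end = Instant.parse(parts[1]);
         String version = parts[2];
         int partition = 0;
         int lastUnderscore = version.lastIndexOf('_');
         if (lastUnderscore >= 0) {
           String suffix = version.substring(lastUnderscore + 1);
           // Only an all-digit suffix is read as a partition number.
           if (!suffix.isEmpty() && suffix.chars().allMatch(Character::isDigit)) {
             partition = Integer.parseInt(suffix);
             version = version.substring(0, lastUnderscore);
           }
         }
         return new DescriptorSketch(start, end, version, partition);
       }
       catch (DateTimeParseException | NumberFormatException e) {
         return null;
       }
     }
   }
   ```

   Under this rule, an ID whose last component is not numeric (such as a
   trailing `_v1`) keeps that component as part of the version string.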
   
   ### Example Usage
   
   #### Query part
   
   ```json
   {
     "queryType": "segmentMetadata",
     "dataSource": "sample_datasource",
     "segmentIds": [
       "sample_datasource_2025-12-01T00:00:00.000Z_2025-12-02T00:00:00.000Z_2025-12-02T00:00:00.000Z_v1"
     ]
   }
   ```
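
   For the custom-tooling use case, a client could assemble this query body and
   POST it to the Broker's native query endpoint (`/druid/v2`). The sketch
   below only builds the body and the request. `SegmentMetadataRequestSketch`
   and its methods are hypothetical, the Broker URL is a placeholder, and the
   `segmentIds` field is the one proposed here, so current Druid releases are
   not expected to honor it.

   ```java
   import java.net.URI;
   import java.net.http.HttpRequest;
   import java.util.List;
   import java.util.stream.Collectors;

   // Hypothetical client-side helper; requires Java 11+ for java.net.http.
   final class SegmentMetadataRequestSketch
   {
     // Minimal JSON string quoting: escapes backslashes and double quotes only.
     static String quote(String s)
     {
       return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
     }

     // Builds a segmentMetadata query body carrying the proposed segmentIds field.
     static String buildBody(String dataSource, List<String> segmentIds)
     {
       String ids = segmentIds.stream()
                              .map(SegmentMetadataRequestSketch::quote)
                              .collect(Collectors.joining(","));
       return "{\"queryType\":\"segmentMetadata\","
              + "\"dataSource\":" + quote(dataSource) + ","
              + "\"segmentIds\":[" + ids + "]}";
     }

     // Builds (but does not send) a POST to the Broker's native query endpoint.
     static HttpRequest buildRequest(String brokerUrl, String body)
     {
       return HttpRequest.newBuilder(URI.create(brokerUrl + "/druid/v2"))
                         .header("Content-Type", "application/json")
                         .POST(HttpRequest.BodyPublishers.ofString(body))
                         .build();
     }
   }
   ```

   The request can then be sent with
   `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`.
   A real tool would more likely serialize the body with a JSON library than
   concatenate strings.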
   
   #### Response part
   
   ```json
   [
     {
       "id": "sample_datasource_2025-12-01T00:00:00.000Z_2025-12-02T00:00:00.000Z_2025-12-02T00:00:00.000Z_v1",
       "intervals": ["2025-12-01T00:00:00.000Z/2025-12-02T00:00:00.000Z"],
       "columns": {
         "__time": {
           "type": "LONG",
           "typeSignature": "LONG",
           "hasMultipleValues": false,
           "hasNulls": false,
           "size": 800000,
           "cardinality": null,
           "errorMessage": null
         },
         "user_id": {
           "type": "STRING",
           "typeSignature": "STRING",
           "hasMultipleValues": false,
           "hasNulls": false,
           "size": 2000000,
           "cardinality": 135000,
           "errorMessage": null
         },
         "event_type": {
           "type": "STRING",
           "typeSignature": "STRING",
           "hasMultipleValues": false,
           "hasNulls": true,
           "size": 500000,
           "cardinality": 25,
           "errorMessage": null
         },
         "metric_clicks": {
           "type": "FLOAT",
           "typeSignature": "FLOAT",
           "hasMultipleValues": false,
           "hasNulls": false,
           "size": 1000000,
           "cardinality": null,
           "errorMessage": null
         }
       },
       "aggregators": {
         "metric_clicks": {
           "type": "floatSum",
           "name": "metric_clicks",
           "fieldName": "metric_clicks"
         }
       },
       "queryGranularity": {
         "type": "minute"
       },
       "size": 4500000,
       "numRows": 1000000,
       "rollup": false
     }
   ]
   ```
   
   ### Testing
   
   #### Unit Tests
   
     - Add tests in `SegmentMetadataQueryTest` to validate correct behavior 
when `segmentIds` is provided or omitted.
     - Extend `SegmentMetadataQueryRunnerTest` to ensure only the specified 
segments are queried.
     - Add test coverage for edge cases, such as empty or non-existent 
`segmentIds`.
   
   #### Integration Tests
   
     - Update or extend `ITSegmentMetadataTest` to include scenarios using the 
new `segmentIds` parameter.
     - Add new tests that:
       - Query metadata for a single known segment.
       - Query with multiple segment IDs across intervals.
       - Query with a mix of valid and invalid segment IDs (expect partial 
results or error handling).
       - Validate compatibility with existing query parameters (e.g., 
`toInclude` and `merge`).
     - Verify that the query returns accurate and expected results without 
performance regressions.
   
   ### Alternatives Considered
   
   We also considered performing this filtering on the client side, but that 
requires querying irrelevant segments unnecessarily, which is inefficient for 
large datasources. Implementing the filter natively at the Broker and 
QueryRunner layers is more scalable and consistent.
   
   ### Backward Compatibility
   
   The `segmentIds` parameter is **optional** and will not break any existing 
functionality. If it is not provided in the query, the current behavior based 
on interval filtering remains unchanged.
   
   However, we recognize that this new feature might require certain 
modifications in existing systems or tooling, especially for users who rely on 
interval-based querying for segment metadata. To mitigate any potential 
compatibility issues:
   
   1. **Query Compatibility**:
      - If `segmentIds` is used alongside `intervals`, the query will return 
metadata only for segments whose `segmentId` matches the provided list, within 
the specified interval.
      - If no `segmentIds` are provided, the system will continue to use the 
interval-based filtering mechanism, ensuring seamless backward compatibility.
   
   2. **Documentation and Communication**:
      - Documentation will be updated to highlight this new optional parameter, 
with examples for both use cases, one with and one without the `segmentIds` 
parameter.
      - Users who have been using segment metadata queries with interval-based 
filtering will not experience any changes unless they explicitly choose to use 
the `segmentIds` parameter.
   
   3. **Feature Flagging**:
      - To ensure smooth rollout, this feature could be initially introduced 
behind a feature flag, allowing users to opt-in and test the new functionality 
before enabling it fully in production environments.
      
   4. **Fallback Mechanism**:
      - If a `segmentId` does not exist (e.g., due to a typo or missing 
segment), the query will gracefully handle the error, either by returning an 
empty result for the invalid `segmentId` or providing an appropriate error 
message, depending on the desired behavior.
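
   The combination rule in item 1 and the missing-ID handling in item 4 can be
   sketched together. This is an illustration under stated assumptions, not
   the planned implementation: `SegmentSelectionSketch`, its `Interval` type,
   and the map of known segments are hypothetical, and intervals are treated
   as half-open `[start, end)`. The sketch reports unknown IDs separately so
   the caller can choose between a partial result and an error, since the
   proposal leaves that choice open.

   ```java
   import java.time.Instant;
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Map;

   // Hypothetical sketch; none of these types exist in Druid.
   final class SegmentSelectionSketch
   {
     // Half-open interval [start, end).
     static final class Interval
     {
       final Instant start;
       final Instant end;

       Interval(String start, String end)
       {
         this.start = Instant.parse(start);
         this.end = Instant.parse(end);
       }

       boolean overlaps(Interval other)
       {
         return start.isBefore(other.end) && other.start.isBefore(end);
       }
     }

     // IDs selected for analysis.
     final List<String> matched = new ArrayList<>();
     // Requested IDs that are not known at all (typo or missing segment).
     final List<String> unknown = new ArrayList<>();

     static SegmentSelectionSketch select(
         Map<String, Interval> knownSegments,
         List<String> segmentIds,
         Interval queryInterval
     )
     {
       SegmentSelectionSketch result = new SegmentSelectionSketch();
       if (segmentIds == null) {
         // Parameter omitted: interval-based filtering only, as today.
         for (Map.Entry<String, Interval> e : knownSegments.entrySet()) {
           if (e.getValue().overlaps(queryInterval)) {
             result.matched.add(e.getKey());
           }
         }
         return result;
       }
       for (String id : segmentIds) {
         Interval interval = knownSegments.get(id);
         if (interval == null) {
           result.unknown.add(id);
         } else if (interval.overlaps(queryInterval)) {
           // Listed and inside the query interval.
           result.matched.add(id);
         }
         // Listed but outside the interval: silently excluded in this sketch.
       }
       return result;
     }
   }
   ```

   A caller that prefers strict behavior could fail the query whenever
   `unknown` is non-empty, while a lenient caller could return results for
   `matched` only.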
   
   By implementing this optional parameter in a non-intrusive manner, the 
overall system remains compatible with existing workloads and users are given 
the flexibility to adopt the new feature at their discretion.
   

