asdf2014 opened a new issue, #17968:
URL: https://github.com/apache/druid/issues/17968
### Description
This issue proposes enhancing `SegmentMetadataQuery` with a new optional
parameter, `segmentIds`. The parameter lets users query metadata for specific
segments directly by their `segmentId`, rather than relying solely on
interval-based filtering.
### Motivation
This feature will be particularly useful for use cases such as:
- Debugging or inspecting individual segments;
- Validating the state of a known segment after ingestion or compaction;
- Programmatic access in custom tooling where segment IDs are already known.
### Proposed Changes
1. **Query Definition Layer**
- Extend `SegmentMetadataQuery` to include a `List<String> segmentIds`
field.
- Ensure proper serialization/deserialization with Jackson.
- Update equality, hashCode, and toString logic accordingly.
2. **Query Runner**
- Modify `SegmentMetadataQueryRunner` to evaluate and skip segments whose
`segmentId` is not in the provided list.
3. **Query Planning / Timeline Resolution**
- Update `CachingClusteredClient` (on the Broker) to support filtering
segments by `segmentId` before dispatching queries.
- Introduce a utility to map `segmentId` to `SegmentDescriptor`, or
extend `VersionedIntervalTimeline` if appropriate.
4. **Backward Compatibility**
- The new parameter will be **optional** and non-intrusive: if not
specified, current behavior is preserved.
5. **Testing**
- Add unit tests for query definition, runner logic, and broker-level
filtering behavior.
- Extend integration tests to cover mixed queries with and without
`segmentIds`.
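
A minimal sketch of steps 1, 2, and 4 above, written against plain Java rather than Druid's real classes. `SegmentMetadataQuerySketch`, `matches`, and `filter` are hypothetical names used only for illustration, and Jackson annotations are omitted to keep the example dependency-free. It also assumes that an absent (`null`) list means "no filter" while an empty list matches nothing; the proposal leaves that choice open.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical, simplified stand-in for SegmentMetadataQuery; not Druid's actual class.
final class SegmentMetadataQuerySketch {
    private final String dataSource;
    // null means "no segmentIds filter", i.e. the existing interval-based behavior.
    private final Set<String> segmentIds;

    SegmentMetadataQuerySketch(String dataSource, List<String> segmentIds) {
        this.dataSource = Objects.requireNonNull(dataSource, "dataSource");
        this.segmentIds = segmentIds == null
            ? null
            : Collections.unmodifiableSet(new LinkedHashSet<>(segmentIds));
    }

    // Runner-side check: true when the given segment should be analyzed.
    boolean matches(String segmentId) {
        return segmentIds == null || segmentIds.contains(segmentId);
    }

    // Keeps only the candidate segments that pass the filter, preserving their order.
    List<String> filter(List<String> candidateSegmentIds) {
        return candidateSegmentIds.stream()
            .filter(this::matches)
            .collect(Collectors.toList());
    }

    // The new field takes part in equality so that two queries with different
    // segmentIds are never treated as the same query.
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof SegmentMetadataQuerySketch)) {
            return false;
        }
        SegmentMetadataQuerySketch that = (SegmentMetadataQuerySketch) o;
        return dataSource.equals(that.dataSource)
            && Objects.equals(segmentIds, that.segmentIds);
    }

    @Override
    public int hashCode() {
        return Objects.hash(dataSource, segmentIds);
    }

    @Override
    public String toString() {
        return "SegmentMetadataQuerySketch{dataSource=" + dataSource
            + ", segmentIds=" + segmentIds + "}";
    }
}
```

Storing the IDs as a set makes the per-segment check a constant-time lookup and makes equality independent of the order in which IDs were listed, which may or may not be the desired semantics for cache keys.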
### Impacted Classes
The following classes are expected to be modified as part of this change:
- `org.apache.druid.query.metadata.metadata.SegmentMetadataQuery`
- `org.apache.druid.query.metadata.metadata.SegmentMetadataQueryRunner`
- `org.apache.druid.client.CachingClusteredClient`
- `org.apache.druid.query.SegmentDescriptor`
- `org.apache.druid.timeline.VersionedIntervalTimeline` (if necessary to
locate segments by ID)
- `org.apache.druid.segment.ReferenceCountingSegment` (for ID exposure)
- `org.apache.druid.query.QueryToolChest` (for caching or context changes)
- `org.apache.druid.query.QueryRunnerTestHelper` (for test support)
### Example Usage
#### Query part
```json
{
"queryType": "segmentMetadata",
"dataSource": "sample_datasource",
"segmentIds": [
"sample_datasource_2025-12-01T00:00:00.000Z_2025-12-02T00:00:00.000Z_2025-12-02T00:00:00.000Z_v1"
]
}
```
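
For custom tooling, the query above could be assembled and posted to the Broker programmatically. The following is a rough sketch under stated assumptions: `SegmentIdsQueryRequest` and its methods are hypothetical helpers, the Broker address is supplied by the caller, and `segmentIds` is the parameter proposed in this issue, so current Druid releases will not honor it.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of Druid.
final class SegmentIdsQueryRequest {
    // Builds the JSON body by hand to stay dependency-free;
    // real tooling would more likely use a JSON library.
    static String body(String dataSource, List<String> segmentIds) {
        String ids = segmentIds.stream()
            .map(SegmentIdsQueryRequest::quote)
            .collect(Collectors.joining(","));
        return "{\"queryType\":\"segmentMetadata\",\"dataSource\":" + quote(dataSource)
            + ",\"segmentIds\":[" + ids + "]}";
    }

    // Minimal escaping of backslashes and double quotes only
    // (assumes the input contains no control characters).
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Prepares, but does not send, a POST to the Broker's native query endpoint.
    static HttpRequest request(String brokerUrl, String dataSource, List<String> segmentIds) {
        return HttpRequest.newBuilder(URI.create(brokerUrl + "/druid/v2/"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body(dataSource, segmentIds)))
            .build();
    }
}
```

The prepared request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`.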
#### Response part
```json
[
{
"id": "sample_datasource_2025-12-01T00:00:00.000Z_2025-12-02T00:00:00.000Z_2025-12-02T00:00:00.000Z_v1",
"intervals": ["2025-12-01T00:00:00.000Z/2025-12-02T00:00:00.000Z"],
"columns": {
"__time": {
"type": "LONG",
"typeSignature": "LONG",
"hasMultipleValues": false,
"hasNulls": false,
"size": 800000,
"cardinality": null,
"errorMessage": null
},
"user_id": {
"type": "STRING",
"typeSignature": "STRING",
"hasMultipleValues": false,
"hasNulls": false,
"size": 2000000,
"cardinality": 135000,
"errorMessage": null
},
"event_type": {
"type": "STRING",
"typeSignature": "STRING",
"hasMultipleValues": false,
"hasNulls": true,
"size": 500000,
"cardinality": 25,
"errorMessage": null
},
"metric_clicks": {
"type": "FLOAT",
"typeSignature": "FLOAT",
"hasMultipleValues": false,
"hasNulls": false,
"size": 1000000,
"cardinality": null,
"errorMessage": null
}
},
"aggregators": {
"metric_clicks": {
"type": "floatSum",
"name": "metric_clicks",
"fieldName": "metric_clicks"
}
},
"queryGranularity": {
"type": "minute"
},
"size": 4500000,
"numRows": 1000000,
"rollup": false
}
]
```
### Testing
#### Unit Tests
- Add tests in `SegmentMetadataQueryTest` to validate correct behavior
when `segmentIds` is provided or omitted.
- Extend `SegmentMetadataQueryRunnerTest` to ensure only the specified
segments are queried.
- Add test coverage for edge cases, such as empty or non-existent
`segmentIds`.
#### Integration Tests
- Update or extend `ITSegmentMetadataTest` to include scenarios using the
new `segmentIds` parameter.
- Add new tests that:
- Query metadata for a single known segment.
- Query with multiple segment IDs across intervals.
- Query with a mix of valid and invalid segment IDs (expect partial
results or error handling).
- Validate compatibility with existing query parameters (e.g.,
`toInclude` and `merge`).
- Verify that the query returns accurate and expected results without
performance regressions.
### Alternatives Considered
We considered performing this filtering on the client side, but that
requires querying irrelevant segments unnecessarily, which is inefficient for
large datasources. Implementing it natively at the Broker and QueryRunner
layers is more scalable and consistent.
### Backward Compatibility
The introduction of the `segmentIds` parameter will be designed to be
**optional** and will not break any existing functionality. If the `segmentIds`
parameter is not provided in the query, the current behavior based on interval
filtering will remain unchanged.
However, we recognize that this new feature might require certain
modifications in existing systems or tooling, especially for users who rely on
interval-based querying for segment metadata. To mitigate any potential
compatibility issues:
1. **Query Compatibility**:
- If `segmentIds` is used alongside `intervals`, the query will return
metadata only for segments whose `segmentId` matches the provided list, within
the specified interval.
- If no `segmentIds` are provided, the system will continue to use the
interval-based filtering mechanism, ensuring seamless backward compatibility.
2. **Documentation and Communication**:
- Documentation will be updated to highlight this new optional parameter,
with examples for both use cases, one with and one without the `segmentIds`
parameter.
- Users who have been using segment metadata queries with interval-based
filtering will not experience any changes unless they explicitly choose to use
the `segmentIds` parameter.
3. **Feature Flagging**:
- To ensure a smooth rollout, this feature could initially be introduced
behind a feature flag, allowing users to opt in and test the new functionality
before enabling it fully in production environments.
4. **Fallback Mechanism**:
- If a `segmentId` does not exist (e.g., due to a typo or missing
segment), the query will gracefully handle the error, either by returning an
empty result for the invalid `segmentId` or providing an appropriate error
message, depending on the desired behavior.
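
The interplay between `segmentIds`, `intervals`, and unknown IDs described in items 1 and 4 can be sketched as follows. This is only an illustration under assumptions: `SegmentIdResolutionSketch`, `Segment`, and `Resolution` are hypothetical types, intervals are modeled as half-open millisecond ranges, and a real implementation would resolve IDs against the Broker's timeline instead of a map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the proposed semantics; not Druid code.
final class SegmentIdResolutionSketch {
    // Minimal segment model: an ID plus a half-open [startMillis, endMillis) interval.
    static final class Segment {
        final String id;
        final long startMillis;
        final long endMillis;

        Segment(String id, long startMillis, long endMillis) {
            this.id = id;
            this.startMillis = startMillis;
            this.endMillis = endMillis;
        }
    }

    // Outcome of resolving the requested IDs.
    static final class Resolution {
        final List<Segment> matched = new ArrayList<>();
        final List<String> missing = new ArrayList<>();
    }

    // Resolves requested IDs against the known segments, keeping only those
    // that overlap the query interval [queryStart, queryEnd).
    static Resolution resolve(Map<String, Segment> known, List<String> requestedIds,
                              long queryStart, long queryEnd) {
        Resolution result = new Resolution();
        for (String id : requestedIds) {
            Segment segment = known.get(id);
            if (segment == null) {
                // Unknown ID (typo or dropped segment): record it, do not fail here.
                result.missing.add(id);
            } else if (segment.startMillis < queryEnd && queryStart < segment.endMillis) {
                result.matched.add(segment);
            }
            // Known segments outside the query interval are excluded silently.
        }
        return result;
    }
}
```

Collecting the missing IDs separately defers the policy decision: the caller can either return results for `matched` only or reject the query with an error that lists `missing`.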
By implementing this optional parameter in a non-intrusive manner, the
overall system remains compatible with existing workloads and users are given
the flexibility to adopt the new feature at their discretion.