himanshug opened a new issue #8627: make segmentID available in segment metadata and a rowID virtual column
URL: https://github.com/apache/incubator-druid/issues/8627

### Motivation

We have use cases that are equivalent to the following standard SQL query:

`select * from T where <some filters> order by x,y,z`

- A "select" or "scan" query can't do it, because it can sort only on `__time` and not on arbitrary columns.
- A "topN" query can't do it, because it supports sorting on a single column only (and it is not appropriate in situations where accurate results are required).

So we decided to use the "groupBy" query, which supports all the requirements except that it "groups" the rows. There is no natural "row-id" in our data, so we created a `VirtualColumn` that returns `concat(uuid,row-offset-in-segment)` and use that as a dimension in the "groupBy" query. However, the `uuid` is generated at query time, so it can differ across multiple runs of the same query and can produce non-deterministic results when sorting.

The idea is to make the "segmentId" available to the `VirtualColumn` interface so that it can be used instead of a generated uuid. This will produce a deterministic `row-id` for each row in the segment.

However, there is one slight wrinkle. In the realtime ingestion case, there might not be a natural `segmentId` (when data is in the in-memory index or in intermediate-persisted segments). But we could try to have as stable an id as possible for that case.

### Proposed changes

Add a new field to the `Metadata` class, like:

```
// Note: it may not be a real segmentId for segments used during realtime indexing
// This is not persisted in segment metadata on disk
@JsonIgnore
private String segmentID;
```

`IndexLoader.load(..)` and various other places would be updated to set `segmentID` in the metadata object. `Metadata` is already accessible to the `VirtualColumn` interface, so no changes are needed there.
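To illustrate why a segment-based prefix gives stable row ids while a query-time uuid does not, here is a minimal, self-contained Java sketch. It is not Druid code: the class `RowIdSketch`, the helpers `rowId` and `rowIds`, the `_` separator, and the sample segment id string are all hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class RowIdSketch {
    // Hypothetical helper (not a Druid API): builds a row id from a prefix
    // and the row's offset within the segment.
    static String rowId(String prefix, int rowOffset) {
        return prefix + "_" + rowOffset;
    }

    // Hypothetical helper: row ids for the first numRows rows of one segment.
    static List<String> rowIds(String prefix, int numRows) {
        List<String> ids = new ArrayList<>();
        for (int offset = 0; offset < numRows; offset++) {
            ids.add(rowId(prefix, offset));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up segment id, used only as a stable prefix for this example.
        String segmentId = "T_2019-01-01_2019-01-02_v1";

        // Two "query runs" using the segment id yield identical row ids.
        boolean stable = rowIds(segmentId, 3).equals(rowIds(segmentId, 3));
        System.out.println("segmentId-based ids equal across runs: " + stable);

        // Two "query runs" that each generate a fresh uuid yield different
        // row ids (equal only if two random UUIDs happened to collide).
        List<String> run1 = rowIds(UUID.randomUUID().toString(), 3);
        List<String> run2 = rowIds(UUID.randomUUID().toString(), 3);
        System.out.println("uuid-based ids equal across runs: " + run1.equals(run2));
    }
}
```

Because the uuid-based ids change between runs, any ordering that falls back on the row id to break ties can differ from one run to the next, which is the non-determinism described above.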

### Rationale

One alternative is to generate a rowID column in the segment at indexing time, but that would be a high-cardinality column that consumes too much space unnecessarily.

Another alternative would be to modify the `GroupBy` query implementation to support a flag that tells it not to "group" rows. That might be harder to implement and would require a lot more surgery, and I'm not sure we even want that for the groupBy query.

### Operational impact

None

### Test plan (optional)

Unit tests would be written.
