himanshug opened a new issue #8627: make segmentID available in segment 
metadata and a rowID virtual column
URL: https://github.com/apache/incubator-druid/issues/8627
 
 
   ### Motivation
   
   We have use cases that are equivalent to the following standard SQL query:
   `select * from T where <some filters> order by x,y,z`
   
   -  A "select" or "scan" query can't do it, as it can only sort on __time,
      not on arbitrary columns.
   -  A "topN" query can't do it, as it only supports sorting on a single column
      (and is also not appropriate in situations where accurate results are
      required).
   
   So we decided to use the "groupBy" query, which supports all of the
   requirements except that it "groups" the rows. There is no natural "row-id"
   in our data, so we created a `VirtualColumn` that returns
   `concat(uuid,row-offset-in-segment)` and use it as a dimension in the
   "groupBy" query.
   However, the `uuid` is generated at query time, so it can differ across
   multiple runs of the same query and can produce non-deterministic results
   when sorting.
   
   The idea is to make the "segmentId" available to the `VirtualColumn`
   interface so that it can be used instead of a generated uuid. This would
   produce a deterministic `row-id` for each row in the segment.
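
To make the difference concrete, here is a minimal sketch of the two ways of building a row ID. The class `RowIdSketch`, its method names, the `_` separator, and the segment identifier string are all hypothetical and chosen for illustration; none of this is Druid code.

```java
import java.util.UUID;

// Hypothetical helper, not part of Druid: builds a row ID from a prefix
// and the row's offset within its segment.
final class RowIdSketch {
    private RowIdSketch() {}

    // Proposed behaviour: the prefix is the segment's identifier, so the same
    // row always maps to the same ID across query runs.
    static String stableRowId(String segmentId, int rowOffset) {
        return segmentId + "_" + rowOffset;
    }

    // Current workaround: the prefix is a UUID created at query time, so the
    // ID of the same row changes from one run to the next.
    static String perQueryRowId(UUID queryUuid, int rowOffset) {
        return queryUuid + "_" + rowOffset;
    }

    public static void main(String[] args) {
        String segmentId = "seg-A"; // made-up identifier for illustration

        // Two "runs" with a segment-based prefix agree with each other.
        System.out.println(stableRowId(segmentId, 42).equals(stableRowId(segmentId, 42)));

        // Two "runs" with a fresh UUID each produce different IDs.
        String run1 = perQueryRowId(UUID.randomUUID(), 42);
        String run2 = perQueryRowId(UUID.randomUUID(), 42);
        System.out.println(run1.equals(run2));
    }
}
```

Because the sort order in "groupBy" depends on the dimension values, only the first variant gives the same ordering of tied rows on every run.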
   
   However, there is one slight wrinkle. In the realtime ingestion case,
   there might not be a natural `segmentId` (when the data is in an in-memory
   index or in intermediate-persisted segments). But we could try to provide
   an id that is as stable as possible for that case.
   
   ### Proposed changes
   
   Add a new field to the `Metadata` class, like:
   ```
     // Note: it may not be a real segmentId for segments used during realtime
     // indexing
     // This is not persisted in segment metadata on disk
     @JsonIgnore
     private String segmentID;
   ```
   
   `IndexLoader.load(..)` and various other places would be updated to set
   `segmentID` in the metadata object.
   
   `Metadata` is already accessible to the `VirtualColumn` interface, so no
   changes are needed there.
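
As a rough illustration of the shape of this change, the sketch below uses a simplified stand-in for `Metadata`. Everything in it is an assumption made for the example: `MetadataSketch`, `persistedView()`, and the key/value container are hypothetical, and the real class would rely on `@JsonIgnore` rather than a hand-written view to keep the field out of the persisted form.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for Druid's Metadata class, reduced to what the
// proposal touches. It is not the real class.
final class MetadataSketch {
    // Entries that would be written to disk with the segment.
    private final Map<String, Object> container = new LinkedHashMap<>();

    // Runtime-only field: set by the loader, never persisted.
    private String segmentID;

    void put(String key, Object value) {
        container.put(key, value);
    }

    String getSegmentID() {
        return segmentID;
    }

    void setSegmentID(String segmentID) {
        this.segmentID = segmentID;
    }

    // Mimics the effect of @JsonIgnore: the persisted form is a copy of the
    // container only, so segmentID never appears in it.
    Map<String, Object> persistedView() {
        return new LinkedHashMap<>(container);
    }

    public static void main(String[] args) {
        MetadataSketch metadata = new MetadataSketch();
        metadata.put("queryGranularity", "NONE");

        // A loader would set this once it knows which segment it is loading.
        metadata.setSegmentID("seg-A"); // made-up identifier

        System.out.println(metadata.getSegmentID());
        System.out.println(metadata.persistedView().containsKey("segmentID"));
    }
}
```

A virtual column that already receives the metadata object could then read `getSegmentID()` and combine it with the row offset.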
   
   
   ### Rationale
   
   One alternative is to generate a rowID column in the segment at indexing
   time, but that would be a high-cardinality column that consumes too much
   space unnecessarily.
   
   Another alternative would be to modify the `GroupBy` query implementation
   to support a flag that tells it not to "group" rows. That might be harder to
   implement and would require a lot more surgery, and I'm not sure we even
   want that for the groupBy query.
   
   ### Operational impact
   
   None
   
   ### Test plan (optional)
   
   Unit tests will be written.
   
