gianm commented on issue #6105: Allow sorting segments on some dims before time
URL: https://github.com/apache/incubator-druid/issues/6105#issuecomment-414080582

Fwiw, some situations where the ability to sort by other-than-time would be expected to be useful:

1. Timeseries data (ironically). Timeseries data is usually modeled as "series" of "points" where each series has a "metric" (like its name) and "tags" that can be used to differentiate it from other series with the same "metric". In Druid you'd model this by making each point into a row, and making the metric and tags into dimensions. It's best to store these rows sorted by metric, so the rows for a particular metric compress better (since they are likely to have a lot of the same tag values), and so we can retrieve all the points for a series faster (since they will have better locality of storage).

2. Clickstream data when you want to do session analyses on it. The idea would be to partition by day first, then by session id (we do already support this: segmentGranularity DAY, and single-dimension partitioning via Hadoop indexing). Then within a segment, sort by session id. That makes it possible to do queries like "count the number of sessions where X, then Y, then Z happened" in linear time and constant memory.

3. Multi-tenant datasets, where you store data for different tenants in the same Druid dataSource. In this case you'd want to both partition and sort by tenant_id. It should improve both compression ratio and query time.
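
To illustrate the session case, here is a minimal sketch of why sorting by session id allows a single-pass count. It is plain Python, not Druid code. The function name `count_funnel_sessions`, the `(session_id, event)` row shape, and the reading of "X, then Y, then Z" as "in that order, not necessarily adjacent" are all assumptions made for the example. It also assumes rows are ordered by time within each session.

```python
from typing import Iterable, Sequence, Tuple


def count_funnel_sessions(rows: Iterable[Tuple[str, str]], steps: Sequence[str]) -> int:
    """Count sessions in which the events in `steps` occur in order.

    Hypothetical helper: `rows` must be sorted by session id, and by time
    within each session. Only two small pieces of state are kept, so memory
    use does not grow with the number of rows or sessions.
    """
    count = 0
    current = None  # session id currently being scanned
    progress = 0    # number of funnel steps matched so far in this session
    for session_id, event in rows:
        if session_id != current:
            # Sorted input guarantees the previous session is finished.
            current = session_id
            progress = 0
        if progress < len(steps) and event == steps[progress]:
            progress += 1
            if progress == len(steps):
                count += 1  # counted once; later rows of this session are ignored
    return count


rows = [
    ("a", "X"), ("a", "Y"), ("a", "Z"),              # matches
    ("b", "X"), ("b", "Z"), ("b", "Y"),              # Z comes before Y: no match
    ("c", "X"), ("c", "Q"), ("c", "Y"), ("c", "Z"),  # matches, Q is skipped
]
print(count_funnel_sessions(rows, ["X", "Y", "Z"]))  # 2
```

Without the sort, rows from one session could be interleaved with rows from others, so the scan would need per-session state for every open session instead of a single counter.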
