gianm commented on issue #6105: Allow sorting segments on some dims before time
URL: 
https://github.com/apache/incubator-druid/issues/6105#issuecomment-410538507
 
 
   I changed the title a bit, since we do allow specifying the sort order for 
every column except __time (it's the order from dimensionsSpec) but we do not 
let you put time in the middle of them. So this issue may lead to adding a 
sortOrder parameter, but I wanted the title to capture the problem we're trying 
to solve.
   
   For anyone reading, check out #6066 for motivation and what workarounds 
people use today.
   
   As to removing the assumption that `__time` is sorted, these are the places 
I can think of where that exists:
   
   1. The `makeCursors` function in QueryableIndexStorageAdapter is responsible 
for creating a sequence of time-granular cursors. This will probably be the 
toughest thing to adjust, since many query engines depend on the time-ordered 
and time-granularized nature of these cursors (including the timeseries, topN, 
groupBy, and timeBoundary engines). Probably for timeseries the best thing to 
do is call through to the groupBy engine, which can handle its grouping keys 
being out of order, and is capable of grouping on time. For topN and groupBy 
I'm not so sure. For groupBy in particular keep in mind #1926; the behavior of 
"granularity" is not equivalent to the behavior of adding a time-floored 
dimension.
   2. makeCursors again: its time filtering functionality assumes that the time 
column is sorted.
   3. SingleScanTimeDimSelector is used for extractionFns on the `__time` 
column, and it also assumes the column is sorted.
   4. SingleLongInputCachingExpressionColumnValueSelector is used for 
expressions on the `__time` column. It doesn't assume sortedness, but the 
optimization it does may not make as much sense if the column is sorted.
   5. The `getMinTime()` and `getMaxTime()` functions in 
QueryableIndexStorageAdapter assume that they should return the first and last 
row, respectively.
   
   Some other considerations:
   
   - Given (1) we will probably want to translate some `granularity` based 
queries into queries that use the `timestamp_floor` expression. For example, a 
timeseries with granularity less than the segment interval should get executed 
the same way as a `timestamp_floor` groupBy, and we could make that translation 
under the hood. Therefore we should look into optimizing `timestamp_floor`. We 
should expect at least _some_ localized sortedness in the time column and the 
function should take advantage of that. Currently, it doesn't.
   - If it makes life easier for a first version of this patch, we could 
consider _not_ supporting "granularity" based queries for topN and groupBy on 
non-time-sorted segments at first. I believe they are less commonly used than 
granular timeseries, and could be added in later.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to