paul-rogers commented on PR #13168: URL: https://github.com/apache/druid/pull/13168#issuecomment-1279187078
@imply-cheddar suggested some additional optimizations, though these can come later. Recall that an earlier cut of the PR attempted to do the sort in several phases to minimize the memory footprint: * Read just the sort keys into "rows" used for sorting * Sort the key-only rows * Use random seeks into the cursor to fetch the full set of columns when returning rows The idea was to read only those columns under the limit. @imply-cheddar explained that this is possible, just not using `Cursor`. It can be done using `QueryableIndex` which provides access to the individual columns and would allow random reads. The other suggestion was to exploit indexes for string columns. The notion here is that column indexes are stored sorted. Each index entry consists of a `(dictionaryId, string)` pair. Since the dictionaries are sorted, the `dictionaryId` are in the same order as the keys. A sort can thus compare `dictionaryId`s rather than the actual strings, which should increase performance. The sort code can either store only the `dictionaryId`s in the sort key, sort the data, and fetch the string later, or it can store both the `dictionaryId` and string for each row, but discard the `dictionaryId` when returning the row downstream. It is fine if a first cut at the sort just stores the strings. Because of the dictionary, the strings will be "interned", there will be only one copy of each distinct value. This means that the rows stored in the priority queue will take somewhat less memory than they would if we stored copies of each string value per row. Again, a suggestion is to get the basics working first. Then, once we can declare victory on that, there can be an optional second PR that uses the above tricks to further optimize the sort. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
