paul-rogers commented on PR #13168:
URL: https://github.com/apache/druid/pull/13168#issuecomment-1279187078

   @imply-cheddar suggested some additional optimizations, though these can 
come later. Recall that an earlier cut of the PR attempted to do the sort in 
several phases to minimize the memory footprint:
   
   * Read just the sort keys into "rows" used for sorting
   * Sort the key-only rows
   * Use random seeks into the cursor to fetch the full set of columns when 
returning rows
   
   The idea was to read only those columns under the limit. @imply-cheddar 
explained that this is possible, just not using `Cursor`. It can be done using 
`QueryableIndex` which provides access to the individual columns and would 
allow random reads.
   
   The other suggestion was to exploit indexes for string columns. The notion 
here is that column indexes are stored sorted. Each index entry consists of a 
`(dictionaryId, string)` pair. Since the dictionaries are sorted, the 
`dictionaryId` are in the same order as the keys. A sort can thus compare 
`dictionaryId`s rather than the actual strings, which should increase 
performance.
   
   The sort code can either store only the `dictionaryId`s in the sort key, 
sort the data, and fetch the string later, or it can store both the 
`dictionaryId` and string for each row, but discard the `dictionaryId` when 
returning the row downstream.
   
   It is fine if a first cut at the sort just stores the strings. Because of 
the dictionary, the strings will be "interned", there will be only one copy of 
each distinct value. This means that the rows stored in the priority queue will 
take somewhat less memory than they would if we stored copies of each string 
value per row.
   
   Again, a suggestion is to get the basics working first. Then, once we can 
declare victory on that, there can be an optional second PR that uses the above 
tricks to further optimize the sort.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to