gaodayue opened a new pull request #2547: [Segment V2] Support lazy materialization read
URL: https://github.com/apache/incubator-doris/pull/2547

Fixes #2545

Current read path of SegmentIterator
----
1. apply the short key index and the various column indexes to get the row ranges (ordinals of rows) to scan
2. read all return columns according to the row ranges
3. evaluate column predicates on the RowBlockV2 to further prune rows

Problem
----
When the column predicates at step 3 filter out a large proportion of the rows in RowBlockV2, most values of the non-predicate columns read at step 2 are thrown away, i.e., we do a lot of useless work and I/O at step 2.

Lazy materialization read
----
With lazy materialization, the read path changes to:

1. apply the short key index and the various column indexes to get the row ranges (ordinals of rows) to scan (unchanged)
2. **read only predicate columns** according to the row ranges
3. evaluate column predicates on the RowBlockV2 to further prune rows; a selection vector is maintained to indicate the selected rows
4. **read the remaining columns** based on the *selection vector* of RowBlockV2

In this way, we avoid reading values of non-predicate columns for rows that fail the predicates.

Example
----
```
function: seek(ordinal), read(block_offset, count)

(step 1) row ranges: [0,2),[4,8),[10,11),[15,20)
(step 1) row ordinals: [0 1 4 5 6 7 10 15 16 17 18 19]
(step 2) read of predicate columns: seek(0),read(0,2),seek(4),read(2,4),seek(10),read(6,1),seek(15),read(7,5)
(step 3) selection vector: [3 4 5 6]
(step 3) selected ordinals: [5 6 7 10]
(step 4) read of remaining columns: seek(5),read(3,3),seek(10),read(6,1)
```

Performance evaluation
----
Lazy materialization is particularly useful when column predicates filter out many rows and large metric columns (e.g., hll and bitmap type columns) are queried. In our internal test cases on bitmap columns, queries run 20% to 120% faster when using lazy materialization.
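To make the example trace easier to follow, here is a minimal Python sketch of the four steps. It is not the C++ code in this pull request: the helper names (`ranges_to_ordinals`, `plan_reads`, `lazy_read`) are hypothetical, columns are modeled as plain Python lists indexed by ordinal, and each planned read is a `(seek_ordinal, block_offset, count)` tuple standing in for `seek(ordinal)` followed by `read(block_offset, count)`.

```python
def ranges_to_ordinals(row_ranges):
    """Expand half-open row ranges [start, end) into a flat list of ordinals."""
    return [o for start, end in row_ranges for o in range(start, end)]


def plan_reads(ordinals, block_offsets):
    """Coalesce consecutive ordinals into (seek_ordinal, block_offset, count) runs.

    block_offsets[i] is the position in the row block where ordinals[i] lands.
    """
    plan = []
    for ordinal, offset in zip(ordinals, block_offsets):
        if plan:
            seek, first_offset, count = plan[-1]
            # Extend the current run only if both the ordinal and the
            # block offset continue it without a gap.
            if seek + count == ordinal and first_offset + count == offset:
                plan[-1] = (seek, first_offset, count + 1)
                continue
        plan.append((ordinal, offset, 1))
    return plan


def lazy_read(row_ranges, pred_col, other_col, predicate):
    """Illustrative lazy materialization over two in-memory columns."""
    # Step 1: row ranges (assumed to come from the indexes) -> ordinals.
    ordinals = ranges_to_ordinals(row_ranges)
    block_size = len(ordinals)

    # Step 2: read only the predicate column into the block.
    pred_plan = plan_reads(ordinals, range(block_size))
    pred_block = [None] * block_size
    for seek, offset, count in pred_plan:
        pred_block[offset:offset + count] = pred_col[seek:seek + count]

    # Step 3: evaluate the predicate and build the selection vector.
    selection = [i for i, value in enumerate(pred_block) if predicate(value)]
    selected = [ordinals[i] for i in selection]

    # Step 4: read the remaining column only for the selected rows.
    rest_plan = plan_reads(selected, selection)
    other_block = [None] * block_size
    for seek, offset, count in rest_plan:
        other_block[offset:offset + count] = other_col[seek:seek + count]

    rows = [(pred_block[i], other_block[i]) for i in selection]
    return pred_plan, selection, selected, rest_plan, rows


if __name__ == "__main__":
    ranges = [(0, 2), (4, 8), (10, 11), (15, 20)]
    # Made-up data: only ordinals 5, 6, 7 and 10 satisfy the predicate.
    pred_col = [1 if o in (5, 6, 7, 10) else 0 for o in range(20)]
    other_col = [f"v{o}" for o in range(20)]
    for part in lazy_read(ranges, pred_col, other_col, lambda v: v == 1):
        print(part)
```

With the row ranges from the example and a predicate that only ordinals 5, 6, 7 and 10 satisfy, the two read plans correspond to the step 2 and step 4 lines of the trace. The positions of the non-predicate block that are not selected are never filled, which is the I/O that lazy materialization saves.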
