gaodayue opened a new pull request #2547: [Segment V2] Support lazy 
materialization read
URL: https://github.com/apache/incubator-doris/pull/2547
 
 
   Fixes #2545 
   
   Current read path of SegmentIterator
   ----
   
   1. apply short key index and various column indexes to get the row ranges 
(ordinals of rows) to scan
   2. read all return columns according to the row ranges
   3. evaluate column predicates on the RowBlockV2 to further prune rows
   
   Problem
   ----
   
   When the column predicates at step 3 filter out a large proportion of rows 
in RowBlockV2, most values of the non-predicate columns read at step 2 are 
thrown away, i.e., step 2 does a lot of useless work and I/O.
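   
   To put a rough number on the waste, here is a minimal sketch that counts how many non-predicate values the current read path reads and then discards. The function name, its parameters, and the value-count cost model are assumptions for illustration only; real I/O happens at page granularity, so actual savings will differ.

```python
def wasted_values(num_rows, num_other_cols, pass_fraction):
    """Count non-predicate values read at step 2 and then discarded at step 3.

    Simplified, hypothetical model: every candidate row costs one value per
    non-predicate column, and only `pass_fraction` of rows survive step 3.
    """
    rows_kept = round(num_rows * pass_fraction)
    return (num_rows - rows_kept) * num_other_cols


# 1000 candidate rows, 4 non-predicate columns, 5% of rows pass the predicates.
total_read = 1000 * 4
wasted = wasted_values(1000, 4, 0.05)
print(f"{wasted} of {total_read} non-predicate values are thrown away")
```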
   
   Lazy materialization read
   ----
   
   With lazy materialization, the read path changes to:
   1. apply short key index and various column indexes to get the row ranges 
(ordinals of rows) to scan (unchanged)
   2. **read only predicate columns** according to the row ranges
   3. evaluate column predicates on the RowBlockV2 to further prune rows, a 
selection vector is maintained to indicate the selected rows
   4. **read the remaining columns** based on the *selection vector* of 
RowBlockV2
   
   In this way, we avoid reading the values of non-predicate columns for rows 
that fail the predicates.
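   
   Steps 2 to 4 can be sketched with a toy in-memory model. This is not the SegmentIterator code: `lazy_read`, the dict-of-lists column layout, and the callable predicates are all hypothetical stand-ins for the real column iterators and RowBlockV2.

```python
def lazy_read(columns, row_ordinals, predicates, return_columns):
    """Toy lazy materialization read (hypothetical helper, not the Doris API).

    columns:        dict of column name -> list of values, indexed by ordinal
    row_ordinals:   ordinals produced by step 1 (index pruning)
    predicates:     dict of column name -> callable(value) -> bool
    return_columns: names of the columns the query returns
    """
    # Step 2: read only the predicate columns for the candidate rows.
    block = {name: [columns[name][o] for o in row_ordinals] for name in predicates}

    # Step 3: evaluate the predicates; the selection vector holds the block
    # offsets of the rows that pass every predicate.
    selection = [
        i for i in range(len(row_ordinals))
        if all(pred(block[name][i]) for name, pred in predicates.items())
    ]

    # Step 4: read the remaining columns only at the selected ordinals.
    selected_ordinals = [row_ordinals[i] for i in selection]
    result = {}
    for name in return_columns:
        if name in block:
            result[name] = [block[name][i] for i in selection]
        else:
            result[name] = [columns[name][o] for o in selected_ordinals]
    return selection, selected_ordinals, result


columns = {"k": list(range(20)), "v": [f"big-{i}" for i in range(20)]}
row_ordinals = [0, 1, 4, 5, 6, 7, 10, 15, 16, 17, 18, 19]
selection, selected, rows = lazy_read(
    columns, row_ordinals, {"k": lambda x: 5 <= x <= 10}, ["k", "v"]
)
print(selection, selected, rows["v"])
```

   In this sketch the large column `v` is touched for only 4 of the 12 candidate rows.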
   
   Example
   ----
   ```
   functions: seek(ordinal), read(block_offset, count)
   
   (step 1) row ranges: [0,2),[4,8),[10,11),[15,20)
   (step 1) row ordinals: [0 1 4 5 6 7 10 15 16 17 18 19]
   (step 2) read of predicate columns: seek(0),read(0,2),seek(4),read(2,4),seek(10),read(6,1),seek(15),read(7,5)
   (step 3) selection vector: [3 4 5 6]
   (step 3) selected ordinals: [5 6 7 10]
   (step 4) read of remaining columns: seek(5),read(3,3),seek(10),read(6,1)
   ```
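   
   The seek/read sequences in the example can be reproduced with a short sketch. The helper names below are hypothetical, and the run-merging rule in step 4 (merge rows only when they are consecutive in both ordinal and block offset) is an assumption inferred from the example, not taken from the patch.

```python
def plan_predicate_reads(row_ranges):
    """Step 2: one seek per row range; block offsets advance contiguously."""
    calls, offset = [], 0
    for start, end in row_ranges:
        calls.append(f"seek({start})")
        calls.append(f"read({offset},{end - start})")
        offset += end - start
    return calls


def plan_remaining_reads(row_ordinals, selection):
    """Step 4: one seek/read per run of selected rows that is consecutive
    in both ordinal and block offset."""
    calls, i = [], 0
    while i < len(selection):
        j = i
        while (
            j + 1 < len(selection)
            and selection[j + 1] == selection[j] + 1
            and row_ordinals[selection[j + 1]] == row_ordinals[selection[j]] + 1
        ):
            j += 1
        calls.append(f"seek({row_ordinals[selection[i]]})")
        calls.append(f"read({selection[i]},{j - i + 1})")
        i = j + 1
    return calls


row_ranges = [(0, 2), (4, 8), (10, 11), (15, 20)]
row_ordinals = [o for start, end in row_ranges for o in range(start, end)]
predicate_calls = plan_predicate_reads(row_ranges)
remaining_calls = plan_remaining_reads(row_ordinals, [3, 4, 5, 6])
print(",".join(predicate_calls))
print(",".join(remaining_calls))
```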
   
   Performance evaluation
   ----
   Lazy materialization is particularly useful when the column predicates 
filter out many rows and the query returns large metric columns (e.g., HLL and 
bitmap type columns). In our internal test cases on bitmap columns, queries run 
20%~120% faster with lazy materialization.
   
   

