[GitHub] [orc] pavibhai opened a new pull request #635: ORC-742: LazyIO for non-filter columns

GitBox Tue, 26 Jan 2021 10:59:49 -0800


pavibhai opened a new pull request #635:
URL: https://github.com/apache/orc/pull/635



   ### What changes were proposed in this pull request?
   Added lazy IO for FOLLOW columns.
           * In the presence of a filter, columns are classified as LEAD or FOLLOW columns
                   * LEAD columns are read first
                   * FOLLOW columns are read only if the filter selects at least one output row
                   * This evaluation is reset on every stripe change
           * RecordReaderImpl.nextBatch keeps reading until a batch has values or the file is exhausted, instead of returning empty batches as it did previously
           * IO of FOLLOW columns happens the same way as partial RowGroup selections during read
           * In the presence of filters, batches respect row group boundaries
           * Filter is now defined as Consumer<FilterContext> instead of Consumer<VectorizedRowBatch>
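
The LEAD/FOLLOW read order described above can be illustrated with a minimal, self-contained sketch. It is not the code from this patch: the class `LazyReadSketch`, the `IoStats` counter, and the array-based "row groups" are hypothetical stand-ins, and it assumes a single LEAD column and a single FOLLOW column within one stripe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class LazyReadSketch {
    /** Hypothetical counter for how many row groups were read per column kind. */
    public static final class IoStats {
        public int leadReads;
        public int followReads;
    }

    /**
     * Simulates reading one stripe. The LEAD column is read for every row group;
     * the FOLLOW column is read only for row groups where the filter selected
     * at least one row.
     */
    public static List<String> readStripe(int[][] leadRowGroups,
                                          String[][] followRowGroups,
                                          IntPredicate filter,
                                          IoStats stats) {
        List<String> out = new ArrayList<>();
        for (int rg = 0; rg < leadRowGroups.length; rg++) {
            int[] lead = leadRowGroups[rg];
            stats.leadReads++;                      // LEAD columns are always read first

            List<Integer> selected = new ArrayList<>();
            for (int row = 0; row < lead.length; row++) {
                if (filter.test(lead[row])) {
                    selected.add(row);
                }
            }
            if (selected.isEmpty()) {
                continue;                           // no hits: skip the FOLLOW read entirely
            }

            String[] follow = followRowGroups[rg];
            stats.followReads++;                    // FOLLOW read happens only on a hit
            for (int row : selected) {
                out.add(lead[row] + ":" + follow[row]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] lead = {{1, 2, 3}, {7, 8, 20}, {4, 5, 6}};
        String[][] follow = {{"a", "b", "c"}, {"d", "e", "f"}, {"g", "h", "i"}};
        IoStats stats = new IoStats();
        List<String> rows = readStripe(lead, follow, v -> v > 10, stats);
        System.out.println(rows + " leadReads=" + stats.leadReads
                + " followReads=" + stats.followReads);
    }
}
```

In this sketch only one of the three row groups triggers a FOLLOW read, which is the kind of saving the change targets for selective filters.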
   
   ### Why are the changes needed?
   * The code changes allow for lazy evaluation of FOLLOW columns, which gives substantial savings in both IO and CPU for reads with few hits.
   * The filter is changed to Consumer<FilterContext> to offer a convenience method, `FilterContext.findVector`, for retrieving a ColumnVector by name
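
As a rough illustration of a filter written against a context object with name-based lookup, consider the sketch below. The `FilterContext` class here is a hypothetical stand-in, not the ORC class from this patch: its fields, the `addVector` helper, and the use of `long[]` in place of a ColumnVector are all assumptions made to keep the example self-contained. Only the `findVector` name and the `Consumer<FilterContext>` shape come from the description above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

public class FilterContextSketch {
    /** Hypothetical stand-in for FilterContext; long[] stands in for a ColumnVector. */
    public static final class FilterContext {
        private final Map<String, long[]> vectors = new HashMap<>();
        public final int size;          // number of rows in the batch
        public final int[] selected;    // row indices that passed the filter
        public int selectedSize;

        public FilterContext(int size) {
            this.size = size;
            this.selected = new int[size];
        }

        /** Test helper (assumption): registers a column under a name. */
        public void addVector(String name, long[] values) {
            vectors.put(name, values);
        }

        /** Looks up a column by name, mirroring the convenience method named in the PR. */
        public long[] findVector(String name) {
            long[] v = vectors.get(name);
            if (v == null) {
                throw new IllegalArgumentException("Unknown column: " + name);
            }
            return v;
        }
    }

    /** Builds a filter that selects rows where the named column exceeds the threshold. */
    public static Consumer<FilterContext> greaterThan(String column, long threshold) {
        return ctx -> {
            long[] values = ctx.findVector(column);
            int n = 0;
            for (int row = 0; row < ctx.size; row++) {
                if (values[row] > threshold) {
                    ctx.selected[n++] = row;
                }
            }
            ctx.selectedSize = n;
        };
    }

    public static void main(String[] args) {
        FilterContext ctx = new FilterContext(4);
        ctx.addVector("id", new long[] {5, 42, 7, 100});
        greaterThan("id", 10).accept(ctx);
        System.out.println("selected rows: " + ctx.selectedSize);
    }
}
```

The filter only needs the column name, so it does not have to know the position of the column inside the batch.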
   
   
   ### How was this patch tested?
   * This patch includes unit tests that verify the IO savings achieved by this change.
   * Given the interface and behavior change of the filters, some of the existing unit tests were updated to reflect the new API as well as the new behavior of not reading FOLLOW columns unless required.
   


