pavibhai opened a new pull request #635:
URL: https://github.com/apache/orc/pull/635
### What changes were proposed in this pull request?
Added lazy IO for FOLLOW columns.
* In the presence of a filter, columns are classified as LEAD or FOLLOW
columns
* LEAD columns are read first
* FOLLOW columns are read only if the filter selects at least one
row
* This evaluation is reset on every stripe change
* `RecordReaderImpl.nextBatch` keeps reading until a batch contains values
or the file is exhausted, instead of returning empty batches as was the case
previously
* IO for FOLLOW columns is performed the same way as for partial row group
selections during a read
* In the presence of filters, batches respect row group boundaries
* Filter is now defined as `Consumer<FilterContext>` instead of
`Consumer<VectorizedRowBatch>`
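
The read path described in the list above can be sketched roughly as follows. This is a simplified model and not the code in this patch: the class `LazyFollowSketch`, its fields, and the use of plain `long[]` arrays as row groups are all hypothetical, chosen only to show LEAD columns being read first, FOLLOW IO being skipped for row groups without hits, and `nextBatch` continuing until it has rows or the data is exhausted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

// Hypothetical model, not ORC's reader: one LEAD column and one FOLLOW column,
// each stored as row groups of long values.
final class LazyFollowSketch {
    final long[][] leadGroups;
    final long[][] followGroups;
    int followGroupsRead = 0; // counts simulated IO on the FOLLOW column
    private int nextGroup = 0;

    LazyFollowSketch(long[][] leadGroups, long[][] followGroups) {
        this.leadGroups = leadGroups;
        this.followGroups = followGroups;
    }

    // Returns the selected rows of the next row group that has at least one
    // hit, as {lead, follow} pairs, or null once all row groups are consumed.
    List<long[]> nextBatch(LongPredicate filter) {
        while (nextGroup < leadGroups.length) {
            int g = nextGroup++;
            long[] lead = leadGroups[g]; // the LEAD column is always read
            List<Integer> selected = new ArrayList<>();
            for (int r = 0; r < lead.length; r++) {
                if (filter.test(lead[r])) {
                    selected.add(r);
                }
            }
            if (selected.isEmpty()) {
                continue; // no hits: skip FOLLOW IO and move to the next group
            }
            long[] follow = followGroups[g]; // FOLLOW is read only now
            followGroupsRead++;
            List<long[]> batch = new ArrayList<>();
            for (int r : selected) {
                batch.add(new long[] {lead[r], follow[r]});
            }
            return batch; // a batch never spans two row groups
        }
        return null;
    }

    public static void main(String[] args) {
        long[][] lead = {{1, 2}, {3, 4}, {50, 6}};
        long[][] follow = {{10, 20}, {30, 40}, {500, 60}};
        LazyFollowSketch reader = new LazyFollowSketch(lead, follow);
        List<long[]> batch = reader.nextBatch(v -> v > 10);
        System.out.println(batch.size() + " row(s) selected, FOLLOW groups read: "
            + reader.followGroupsRead);
    }
}
```

In this model the first two row groups contain no matching LEAD values, so their FOLLOW data is never touched; only the third group triggers a FOLLOW read.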
### Why are the changes needed?
* The code changes allow lazy evaluation of FOLLOW columns, which gives
substantial savings in both IO and CPU for reads with few matching rows.
* The filter is changed to `Consumer<FilterContext>` to offer a convenience
method for retrieving a ColumnVector by name, `FilterContext.findVector`
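
To illustrate why a name-based lookup is convenient for filter authors, here is a minimal sketch. `SketchContext`, its `with` helper, the `selected` field, and the `long[]` column representation are hypothetical stand-ins and do not reflect the actual `FilterContext` or `ColumnVector` signatures; only the idea of a `Consumer` over a context that resolves columns by name comes from the description above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for a filter context; ORC's real types differ.
final class SketchContext {
    private final Map<String, long[]> columns = new HashMap<>();
    final int rowCount;
    int[] selected = new int[0]; // indices of rows that pass the filter

    SketchContext(int rowCount) {
        this.rowCount = rowCount;
    }

    // Registers a column under a name (illustrative helper only).
    SketchContext with(String name, long[] values) {
        columns.put(name, values);
        return this;
    }

    // Looks a column up by name, so filters need not know column positions.
    long[] findVector(String name) {
        long[] values = columns.get(name);
        if (values == null) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        return values;
    }
}

final class FilterSketch {
    // A filter that keeps only the rows whose "id" value is even.
    static final Consumer<SketchContext> EVEN_ID = ctx -> {
        long[] ids = ctx.findVector("id");
        int[] sel = new int[ctx.rowCount];
        int n = 0;
        for (int r = 0; r < ctx.rowCount; r++) {
            if (ids[r] % 2 == 0) {
                sel[n++] = r;
            }
        }
        ctx.selected = Arrays.copyOf(sel, n);
    };

    public static void main(String[] args) {
        SketchContext ctx = new SketchContext(4).with("id", new long[] {1, 2, 3, 4});
        EVEN_ID.accept(ctx);
        System.out.println(Arrays.toString(ctx.selected));
    }
}
```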
### How was this patch tested?
* This patch includes unit tests that verify the IO savings achieved as
a result of this change.
* Given the interface and behavior change of the filters, some of the
existing unit tests were updated to reflect the new API as well as the new
behavior of not reading FOLLOW columns unless required.