[GitHub] [flink] StephanEwen commented on pull request #13724: [FLINK-19581][orc] Introduce Orc ColumnarRow File source bulk Format

GitBox Thu, 05 Nov 2020 05:04:39 -0800


StephanEwen commented on pull request #13724:
URL: https://github.com/apache/flink/pull/13724#issuecomment-722365033



   This looks pretty good to me, we could merge it like it is.
   
   One idea for an improvement would be to not use the "skipRecordsCount" at 
all here. I fear this can lead to surprises with ORC due to pushed-down 
predicates. If, after an update, the predicate were more selective, then 
ORC itself would filter more rows and we would skip too many later.
   
   What we could do is the following: The `VectorizedRowBatch` has the `int[] 
selected` array, which holds the positions of the selected rows. We could also 
pass that array to the `ColumnarRowIterator`, instead of the `startingOffset`.
   When returning the next record, it would attach that position to the result, 
rather than incrementing the skipCount.
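   
   A rough sketch of the idea, not the actual `ColumnarRowIterator`: the class 
and method names below (`SelectedPositionIterator`, `nextRowId`) are made up 
for illustration, and the batch is reduced to a plain `int[] selected` plus a 
`size`. The point is only that each returned record carries its real position 
in the batch, so nothing depends on how many rows were returned before it.

```java
import java.util.NoSuchElementException;

/**
 * Illustrative stand-in (hypothetical name) for an iterator that reports the
 * batch position of each row instead of counting how many rows were emitted.
 */
final class SelectedPositionIterator {

    // Positions of the rows that survived filtering; null is taken here to
    // mean "no selection vector", i.e. rows 0..size-1 are all valid.
    private final int[] selected;

    // Number of valid rows in the batch.
    private final int size;

    // Index of the next entry to return.
    private int next;

    SelectedPositionIterator(int[] selected, int size) {
        if (size < 0 || (selected != null && size > selected.length)) {
            throw new IllegalArgumentException("size does not fit the selected array");
        }
        this.selected = selected;
        this.size = size;
    }

    boolean hasNext() {
        return next < size;
    }

    /** Returns the position in the batch of the next row. */
    int nextRowId() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int i = next++;
        return selected == null ? i : selected[i];
    }
}
```

   With `selected = {1, 4, 7}` and `size = 3`, the iterator would hand out the 
positions 1, 4 and 7, independent of how many rows the predicate dropped in 
between.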
   
   What do you think?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
