rizaon opened a new pull request #990:
URL: https://github.com/apache/orc/pull/990


   ### What changes were proposed in this pull request?
   The ORC C++ library doesn't have a type id for the index field of a list
   type. To get only the list indices, we have to select the type id of the
   whole array, which unnecessarily materializes the array elements. Because
   the offset stream is stored separately from the content stream, we can
   materialize the list indices alone.
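
   As a rough illustration of why the indices can be produced on their own,
   the sketch below builds cumulative list offsets from per-row list lengths
   without reading any element values. `lengthsToOffsets` is a hypothetical
   helper written for this description, not a function in the ORC library, and
   it assumes the lengths have already been decoded into a vector.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the ORC API): turn per-row list lengths
// into cumulative offsets. Row i then spans [offsets[i], offsets[i + 1]).
// No element values are needed to compute this.
std::vector<int64_t> lengthsToOffsets(const std::vector<int64_t>& lengths) {
  std::vector<int64_t> offsets(lengths.size() + 1, 0);
  for (std::size_t i = 0; i < lengths.size(); ++i) {
    offsets[i + 1] = offsets[i] + lengths[i];
  }
  return offsets;
}
```

   For example, lengths {2, 0, 3} give offsets {0, 2, 2, 5}.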
   
   This patch adds a fourth option to the ORC C++ library for selecting
   columns from an ORC file, namely RowReaderOptions::includeTypesAndIntents.
   It is similar to RowReaderOptions::includeTypes, but takes an additional
   set of ReadIntent values for each type id. ListColumnReader can then
   consult this ReadIntent set to read the list elements, the indices, or
   both. ReadIntent_DATA is the default for every type id whose selection
   does not specify any ReadIntent.
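
   The selection logic described above could be modeled roughly as follows.
   This is a standalone sketch, not the code in this patch: the enum and the
   `TypeIntents`, `ListReadPlan`, and `planListRead` names are stand-ins
   invented for illustration (only the ReadIntent_DATA and ReadIntent_POS
   names come from this description), and the mapping from intents to what
   gets read is an assumption.

```cpp
#include <cstdint>
#include <map>
#include <set>

// Stand-in declarations for illustration only; these are NOT the real ORC types.
enum ReadIntent { ReadIntent_DATA, ReadIntent_POS };

// Hypothetical selection: type id -> set of read intents for that type.
using TypeIntents = std::map<uint64_t, std::set<ReadIntent>>;

// Hypothetical result describing what a list reader would materialize.
struct ListReadPlan {
  bool readIndices;
  bool readElements;
};

// Assumed behavior: an unselected type reads nothing, an empty intent set
// falls back to ReadIntent_DATA, and each intent maps directly to one flag.
ListReadPlan planListRead(const TypeIntents& selection, uint64_t listTypeId) {
  auto it = selection.find(listTypeId);
  if (it == selection.end()) {
    return {false, false};
  }
  const std::set<ReadIntent>& intents = it->second;
  bool wantsData = intents.empty() || intents.count(ReadIntent_DATA) > 0;
  bool wantsPos = intents.count(ReadIntent_POS) > 0;
  return {wantsPos, wantsData};
}
```

   With this model, selecting a list type id with only ReadIntent_POS yields a
   plan that reads indices and skips the elements.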
   
   Adding read intents avoids introducing a fake type id just to refer to the
   list indices. Thus, the expected type ids for an ORC file stay the same
   after this patch.
   
   
   ### Why are the changes needed?
   This is needed to selectively avoid materialization of array elements.
   
   
   ### How was this patch tested?
   Declared ReadIntent_POS in TestColumnReaderEncoded.testList and verified
   that the resulting indices are correct.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

