XinyuZeng commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1469904603

   We can refer to the FileScan utility: 
https://github.com/apache/orc/blob/main/tools/src/FileScan.cc#L32. When the 
batch_size is set large (e.g., to the number of rows in the whole file), the 
scan time increases from ~0.5 to ~0.7, and the additional time is spent in 
ColumnVectorBatch creation, specifically in the `new (buf + i) T()` 
placement-new loop, which is not necessary here.
   
   I am pursuing this because scanning into a ColumnVectorBatch and then 
transforming it into another in-memory format (e.g., Arrow) is not zero-copy. 
There is an opportunity to transfer memory ownership from the 
ColumnVectorBatch to Arrow 
([link](https://github.com/apache/arrow/issues/21238)), although this is hard 
given ColumnVectorBatch's API, and it requires the allocation step of the 
ColumnVectorBatch to be efficient.

