XinyuZeng commented on issue #1430: URL: https://github.com/apache/orc/issues/1430#issuecomment-1469904603
We can refer to the FileScan utility: https://github.com/apache/orc/blob/main/tools/src/FileScan.cc#L32. When `batch_size` is set large (e.g., to the number of rows in the whole file), the scan time increases from ~0.5 to ~0.7, and the additional time is spent on ColumnVectorBatch creation, specifically the `new (buf + i) T()` operation, which is not necessary.

I am looking at this because scanning into a ColumnVectorBatch first and then transforming it into another in-memory format (e.g., Arrow) is not zero-copy. There is an opportunity to transfer the memory ownership of the ColumnVectorBatch to Arrow ([link](https://github.com/apache/arrow/issues/21238)) — although that is hard given the ColumnVectorBatch API — but it requires the allocation step of the ColumnVectorBatch to be efficient.
