[
https://issues.apache.org/jira/browse/ARROW-4713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche updated ARROW-4713:
-----------------------------------------
Labels: orc pull-request-available (was: pull-request-available)
> [C++] Improve C++ Orc Adapter performance and memory footprint
> --------------------------------------------------------------
>
> Key: ARROW-4713
> URL: https://issues.apache.org/jira/browse/ARROW-4713
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Yurui Zhou
> Assignee: Yurui Zhou
> Priority: Major
> Labels: orc, pull-request-available
> Fix For: 3.0.0
>
> Time Spent: 7h 20m
> Remaining Estimate: 0h
>
> Currently, Arrow C++ provides a naive adapter implementation that allows
> users to read ORC files into Arrow RecordBatches. However, this
> implementation has several drawbacks:
> * Inefficient conversion that incurs huge memcpy overhead
> ** Currently the ORC adapter performs a byte-by-byte memcpy to move data
> from the ORC VectorBatch to the Arrow RecordBatch, even though the ORC
> VectorBatch shares the same memory layout with Arrow for most data types.
> * Huge memory footprint because of the lack of a TableReader implementation
> ** The ORC adapter currently only allows users to read data in units of
> stripes. However, because ORC is a columnar format with a high compression
> ratio, data read from a single ORC stripe can potentially take up gigabytes
> of memory, which makes the ORC adapter hard to use in production environments.
> Here we propose a new ORC adapter implementation to fix the issues mentioned
> above:
> * To reduce conversion overhead, instead of performing a naive data copy, the
> new adapter would take full advantage of the memory layout
> similarity between ORC VectorBatch and Arrow RecordBatch. Namely, the new
> adapter will use pointer manipulation to transfer memory ownership
> from the VectorBatch to the Arrow RecordBatch whenever possible.
> * The new ORC adapter would provide row-level granularity
> when reading data from an ORC file. The user should be able to specify how many
> rows to expect in each output RecordBatch, and the ORC adapter should make
> sure that no more than the requested number of rows is returned.
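
The two proposals above can be illustrated with a small, self-contained sketch. It does not use the real Arrow or ORC APIs: `SourceColumn`, `ColumnView`, `CopyColumn`, `AdoptColumn`, `PlanBatches`, and `Slice` are hypothetical names invented for this example, and the ORC batch is modeled as a plain `std::vector<int64_t>`. It only shows the general idea of handing over buffer ownership instead of copying, and of capping the number of rows per output batch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a decoded ORC column (not orc::ColumnVectorBatch).
struct SourceColumn {
  std::vector<int64_t> values;
};

// Hypothetical stand-in for an Arrow buffer: a view over the values plus a
// handle that keeps the underlying storage alive.
struct ColumnView {
  const int64_t* data = nullptr;
  std::size_t length = 0;
  std::shared_ptr<const void> owner;
};

// Baseline: duplicates every value, analogous to the memcpy-based conversion.
ColumnView CopyColumn(const SourceColumn& src) {
  auto storage = std::make_shared<std::vector<int64_t>>(src.values);
  ColumnView view;
  view.data = storage->data();
  view.length = storage->size();
  view.owner = storage;
  return view;
}

// Ownership transfer: the source's storage is moved into a shared owner, so
// the view points at the same allocation and no element is copied.
ColumnView AdoptColumn(SourceColumn&& src) {
  auto storage =
      std::make_shared<std::vector<int64_t>>(std::move(src.values));
  ColumnView view;
  view.data = storage->data();
  view.length = storage->size();
  view.owner = storage;
  return view;
}

// A contiguous range of rows inside a column.
struct RowRange {
  std::size_t offset;
  std::size_t count;
};

// Splits total_rows into consecutive ranges of at most max_rows rows each.
std::vector<RowRange> PlanBatches(std::size_t total_rows,
                                  std::size_t max_rows) {
  assert(max_rows > 0);
  std::vector<RowRange> ranges;
  for (std::size_t offset = 0; offset < total_rows; offset += max_rows) {
    ranges.push_back({offset, std::min(max_rows, total_rows - offset)});
  }
  return ranges;
}

// Returns a view over one range; it shares the owner, so slicing copies nothing.
ColumnView Slice(const ColumnView& view, RowRange range) {
  assert(range.offset + range.count <= view.length);
  ColumnView out;
  out.data = view.data + range.offset;
  out.length = range.count;
  out.owner = view.owner;
  return out;
}
```

Under these assumptions, a caller would adopt a column once with `AdoptColumn(std::move(column))`, call `PlanBatches(view.length, requested_rows)`, and emit one `Slice` per range, so each output batch holds at most `requested_rows` rows while all batches share a single allocation. Whether the real ORC `VectorBatch` allows its buffers to be released this way depends on the ORC library and is not established by this sketch.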
--
This message was sent by Atlassian Jira
(v8.3.4#803005)