[jira] [Updated] (ARROW-4713) [C++] Improve C++ Orc Adapter performance and memory footprint

Joris Van den Bossche (Jira) Wed, 18 Nov 2020 06:28:35 -0800


     [ 
https://issues.apache.org/jira/browse/ARROW-4713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Joris Van den Bossche updated ARROW-4713:
-----------------------------------------
    Labels: orc pull-request-available  (was: pull-request-available)

> [C++] Improve C++ Orc Adapter performance and memory footprint
> --------------------------------------------------------------
>
>                 Key: ARROW-4713
>                 URL: https://issues.apache.org/jira/browse/ARROW-4713
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Yurui Zhou
>            Assignee: Yurui Zhou
>            Priority: Major
>              Labels: orc, pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> Currently, Arrow C++ provides a naive adapter implementation that allows 
> users to read an ORC file into an Arrow RecordBatch. However, this implementation has 
> several drawbacks:
>  * Inefficient conversion that incurs huge memcpy overhead
>  ** Currently the ORC adapter performs a byte-by-byte memcpy to move data 
> from the ORC VectorBatch to the Arrow RecordBatch, even though the ORC 
> VectorBatch shares the same memory layout as Arrow for most data types.
>  * Huge memory footprint because of the lack of a TableReader implementation
>  ** The ORC adapter currently only allows users to read data in units of 
> stripes. However, because ORC is a columnar format with a high compression ratio, data read 
> from a single ORC stripe can potentially take gigabytes of memory, which makes 
> the ORC adapter not quite usable in production environments.
> Here we propose a new ORC adapter implementation to fix the issues mentioned 
> above:
>  * To reduce conversion overhead, instead of performing a naive data copy, the 
> new adapter would take full advantage of the memory layout 
> similarity between ORC VectorBatch and Arrow RecordBatch. Namely, the new 
> adapter will perform pointer manipulation to transfer memory ownership 
> from the VectorBatch to the Arrow RecordBatch whenever possible.
>  * The new ORC adapter would provide users with row-level granularity 
> when reading data from an ORC file. The user should be able to specify how many 
> rows to expect in each output RecordBatch, and the ORC adapter should make 
> sure that no more than the requested number of rows is returned.
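
The two proposals above can be sketched roughly as follows. This is only an illustration using the C++ standard library, not the actual adapter code: SourceColumn, TargetColumn, AdoptColumn and RowLimitedReader are hypothetical names standing in for an ORC VectorBatch column, an Arrow RecordBatch column, the ownership transfer, and a row-limited reader. No real Arrow or ORC API is used.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>

// Hypothetical stand-in for one int64 column of an ORC VectorBatch.
struct SourceColumn {
  std::unique_ptr<int64_t[]> data;
  std::size_t length = 0;
};

// Hypothetical stand-in for one int64 column of an Arrow RecordBatch.
struct TargetColumn {
  std::unique_ptr<int64_t[]> data;
  std::size_t length = 0;
};

// Hand the buffer over instead of copying it element by element.
// This only works because both sides are assumed to share one layout.
TargetColumn AdoptColumn(SourceColumn&& src) {
  TargetColumn out;
  out.data = std::move(src.data);  // pointer moves, bytes stay in place
  out.length = src.length;
  src.length = 0;
  return out;
}

// Hypothetical reader that never returns more rows than requested.
class RowLimitedReader {
 public:
  RowLimitedReader(std::size_t total_rows, std::size_t max_rows_per_batch)
      : total_rows_(total_rows),
        // Guard against a zero batch size, which would never make progress.
        max_rows_(std::max<std::size_t>(1, max_rows_per_batch)) {}

  // Fills *out with the next batch; returns false once all rows are read.
  bool Next(TargetColumn* out) {
    if (next_row_ >= total_rows_) return false;
    const std::size_t n = std::min(max_rows_, total_rows_ - next_row_);

    // Simulate decoding n rows from the file into a source batch.
    SourceColumn src;
    src.data = std::make_unique<int64_t[]>(n);
    src.length = n;
    for (std::size_t i = 0; i < n; ++i) {
      src.data[i] = static_cast<int64_t>(next_row_ + i);
    }
    next_row_ += n;

    *out = AdoptColumn(std::move(src));
    return true;
  }

 private:
  std::size_t total_rows_;
  std::size_t max_rows_;
  std::size_t next_row_ = 0;
};
```

In this sketch, peak memory is bounded by the requested batch size rather than by the stripe size, and each batch changes hands without a memcpy. Columns whose layouts differ between the two formats would still need a real conversion.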



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
