Thank you for opening a JIRA issue. I have added you to the Contributor role in JIRA and assigned the issue to you.
On Thu, Feb 28, 2019 at 3:13 AM 周宇睿(闻拙) <yurui....@alibaba-inc.com> wrote:
>
> Hi,
>
> Currently, Arrow C++ provides a naive adapter implementation that allows users to read ORC files into Arrow RecordBatches. However, this implementation has several drawbacks:
>
> - Inefficient conversion that incurs large memcpy overhead.
>   The ORC adapter currently performs a byte-by-byte memcpy to move data from the ORC VectorBatch to the Arrow RecordBatch, despite the fact that the ORC VectorBatch shares the same memory layout as Arrow for most data types.
>
> - Large memory footprint due to the lack of a TableReader implementation.
>   The ORC adapter currently only allows users to read data at the granularity of a stripe. However, because ORC is a columnar format with a high compression ratio, the data read from a single ORC stripe can take gigabytes of memory, which makes the ORC adapter hardly usable in a production environment.
>
> Here we propose a new ORC adapter implementation to fix the issues mentioned above:
>
> To reduce conversion overhead, instead of performing a naive data copy, the new adapter will take full advantage of the memory-layout similarity between the ORC VectorBatch and the Arrow RecordBatch. Specifically, the new adapter will use pointer manipulation to transfer memory ownership from the VectorBatch to the Arrow RecordBatch whenever possible (a rough sketch of this idea follows below this message).
>
> The new ORC adapter will give users row-level granularity when reading data from an ORC file. The user will be able to specify how many rows are expected in each output RecordBatch, and the ORC adapter will ensure that no more than the requested number of rows is returned (see the second sketch below).
>
> I opened a JIRA here to track the issue. Any advice would be appreciated.
>
> BTW, I tried to assign the JIRA to myself but it looks like I am unable to do that. Any idea how I could obtain the permission to perform that operation?
>
> Best regards
>
> Yurui
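For illustration only, here is a minimal sketch of the zero-copy idea described in the proposal, not the proposed implementation itself. It wraps the values buffer of an ORC LongVectorBatch in an Arrow Int64Array without copying the data; the function name WrapLongColumn and the column name "col0" are made up for the example, and the sketch assumes the VectorBatch outlives the RecordBatch (a real adapter would transfer ownership of the underlying allocation instead of merely borrowing it), and ignores null bitmaps.

// Sketch: wrap an ORC int64 column in an Arrow array without a byte copy.
#include <arrow/api.h>
#include <orc/Vector.hh>

std::shared_ptr<arrow::RecordBatch> WrapLongColumn(
    const orc::LongVectorBatch& batch) {
  int64_t num_rows = static_cast<int64_t>(batch.numElements);
  // Wrap the existing int64 values buffer; no memcpy of the values themselves.
  // The resulting Buffer does NOT own the memory, hence the lifetime caveat above.
  std::shared_ptr<arrow::Buffer> values =
      arrow::Buffer::Wrap(batch.data.data(), num_rows);
  auto array = std::make_shared<arrow::Int64Array>(
      num_rows, values, /*null_bitmap=*/nullptr, /*null_count=*/0);
  auto schema = arrow::schema({arrow::field("col0", arrow::int64())});
  return arrow::RecordBatch::Make(schema, num_rows, {array});
}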
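And a rough sketch of the row-level reading behaviour the proposal asks for, expressed with the plain ORC C++ reader rather than the (not yet existing) Arrow TableReader: the row batch is created with the requested capacity, so each call to next() fills at most batch_size rows and each emitted RecordBatch stays bounded in size. The function name ReadFileInBatches is illustrative only.

// Sketch: read an ORC file in bounded, row-sized chunks.
#include <orc/OrcFile.hh>

void ReadFileInBatches(const std::string& path, uint64_t batch_size) {
  orc::ReaderOptions reader_opts;
  std::unique_ptr<orc::Reader> reader =
      orc::createReader(orc::readLocalFile(path), reader_opts);
  orc::RowReaderOptions row_opts;
  std::unique_ptr<orc::RowReader> row_reader = reader->createRowReader(row_opts);

  // The batch is allocated once with the requested row capacity;
  // next() never fills more than that many rows.
  std::unique_ptr<orc::ColumnVectorBatch> batch =
      row_reader->createRowBatch(batch_size);
  while (row_reader->next(*batch)) {
    // batch->numElements <= batch_size; convert this slice to an Arrow
    // RecordBatch here, ideally via the zero-copy path sketched above.
  }
}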