Re: Re: [DISCUSS][C++] Add Support For INT/BYTE vector batch

Owen O'Malley Tue, 02 Apr 2019 11:38:03 -0700

If it makes the integration between ORC C++ and Arrow easier, that is a
good thing. Please file an ORC jira and create a pull request when the work
is ready.


Thank you,
   Owen

On Tue, Apr 2, 2019 at 7:29 AM Yurui Zhou <[email protected]> wrote:

> Hi Owen,
>
> Thank you for the response. Yes, you are right: in general, going from int64
> to int16 doesn't save much memory. But for vectorized computation, such a
> change may make a big difference to the CPU L1 cache.
>
> Another motivation for me to drive this change is that I am currently working
> on a copy-free Arrow adapter implementation for Apache Arrow to boost the
> performance of reading ORC files into Arrow RecordBatches. The Arrow
> RecordBatch has a strict mapping between type and data size. Currently, in the
> C++ ORC reader, because the data type does not actually match the underlying
> data size, we need to perform a memory copy to finish the conversion, which
> involves unnecessary overhead.
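
A minimal sketch of the conversion copy described above, assuming the reader hands back 64-bit values and the consumer needs a contiguous 32-bit buffer. The helper name `narrowToInt32` is hypothetical and plain `std::vector` stands in for the real ORC and Arrow buffers; this is not the adapter's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of ORC or Arrow: narrows a batch of 64-bit
// values into a freshly allocated 32-bit buffer, which is the width an
// int32 column needs. This per-value copy is the overhead a native INT
// vector batch would avoid.
std::vector<int32_t> narrowToInt32(const std::vector<int64_t>& wide) {
  std::vector<int32_t> narrow;
  narrow.reserve(wide.size());
  for (int64_t v : wide) {
    // Assumption: every value fits in int32, as it would for an INT column.
    narrow.push_back(static_cast<int32_t>(v));
  }
  return narrow;
}
```

If the reader filled a 32-bit buffer directly, that buffer could in principle be handed to the consumer without this extra pass over the data.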
>
> Regarding your concern about backward compatibility, we can certainly add a
> flag to make sure current users do not suffer from any API breakage.
>
> Thanks
> Yurui
>
> from Alimail macOS <https://mail.alibaba-inc.com>
>
> ------------------------------------------------------------------
> From: Owen O'Malley<[email protected]>
> Date: 2019-04-02 01:02:02
> To: <[email protected]>
> Subject: Re: [DISCUSS][C++] Add Support For INT/BYTE vector batch
>
> From the ORC library side, it isn't hard to support the additional vector
> types, although you'll need to make it API compatible for users that don't
> want it. For applications, I don't see a lot of advantages. For 1024 rows,
> the savings in memory between int64, int32, int16, and byte isn't that much
> (8k is still pretty small). However, for the application, having to have
> different code paths for each of the four integer types is a big hassle.
> Certainly, Hive does not want the other vector types and therefore, I don't
> think we should make the change on the Java side. If an application has a
> compelling use case on the C++ side, we can do it. Another concern is that
> the C++ side doesn't do automatic schema evolution and therefore reading a
> file with int64 when you were expecting an int32 would currently work, but
> won't if you make the new types.
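
The 8k figure above can be checked with a little arithmetic. This is a sketch only: `batchBytes` is a hypothetical helper that counts the value buffer alone and ignores null masks and other per-batch overhead.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bytes needed for the value buffer of one batch
// holding `rows` values of type T.
template <typename T>
constexpr std::size_t batchBytes(std::size_t rows) {
  return rows * sizeof(T);
}

// For a 1024-row batch:
static_assert(batchBytes<int64_t>(1024) == 8192, "int64: 8 KiB");
static_assert(batchBytes<int32_t>(1024) == 4096, "int32: 4 KiB");
static_assert(batchBytes<int16_t>(1024) == 2048, "int16: 2 KiB");
static_assert(batchBytes<int8_t>(1024) == 1024, "byte: 1 KiB");
```

So the largest possible saving per column per batch is 7 KiB, which is the sense in which the difference is small.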
>
> .. Owen
>
> On Mon, Apr 1, 2019 at 6:30 PM Yurui Zhou <[email protected]> wrote:
>
> > Hi guys:
> >
> > Currently ORC has LongVectorBatch as the only representation for
> > primitive integer types like boolean, byte, int and long. This is not very
> > beneficial for memory usage and computation efficiency. I would like to
> > introduce INT and BYTE vector batches in the ORC C++ version for types like
> > boolean, byte and int to improve memory efficiency. This change could
> > also benefit data consumers that use SIMD computation.
> > Let me know if you have any thoughts/suggestions.
> >
> > Thanks
> > Yurui
> >
> > from Alimail macOS
>
>
