If it makes the integration between ORC C++ and Arrow easier, that is a good thing. Please file an ORC JIRA issue and create a pull request when the work is ready.
Thank you,
Owen

On Tue, Apr 2, 2019 at 7:29 AM Yurui Zhou <[email protected]> wrote:

> Hi Owen,
>
> Thank you for the response. Yes, you are right, generally it doesn't save
> much memory to go from int64 to int16. But when it comes to vectorized
> computation, such a change may make a big difference to the CPU L1 cache.
>
> Another motivation for me to drive this change is that I am currently
> working on a copy-free Arrow adapter implementation for Apache Arrow to
> boost the performance of reading ORC files into Arrow RecordBatches. The
> Arrow RecordBatch has a strict mapping between type and data size.
> Currently, in the C++ ORC reader, because the data type does not actually
> align with the underlying data size, we need to perform a memory copy to
> finish the conversion, which involves unnecessary overhead.
>
> Regarding your concern about backward compatibility, we can certainly add
> a flag to make sure current users do not suffer from any API breakage.
>
> Thanks
> Yurui
>
> from Alimail macOS <https://mail.alibaba-inc.com>
>
> ------------------------------------------------------------------
> From: Owen O'Malley <[email protected]>
> Date: 2019-04-02 01:02:02
> To: <[email protected]>
> Subject: Re: [DISCUSS][C++] Add Support For INT/BYTE vector batch
>
> From the ORC library side, it isn't hard to support the additional vector
> types, although you'll need to make it API compatible for users that don't
> want it. For applications, I don't see a lot of advantages. For 1024 rows,
> the savings in memory between int64, int32, int16, and byte aren't that
> much (8k is still pretty small). However, for the application, having to
> have different code paths for each of the four integer types is a big
> hassle. Certainly, Hive does not want the other vector types and
> therefore, I don't think we should make the change on the Java side. If an
> application has a compelling use case on the C++ side, we can do it.
> Another concern is that the C++ side doesn't do automatic schema evolution
> and therefore reading a file with int64 when you were expecting an int32
> would currently work, but won't if you make the new types.
>
> .. Owen
>
> On Mon, Apr 1, 2019 at 6:30 PM Yurui Zhou <[email protected]> wrote:
>
> > Hi guys:
> >
> > Currently ORC has LongVectorBatch as the only representation for
> > primitive integer types like boolean, byte, int and long. This is not
> > very beneficial for memory usage and computation efficiency. I would
> > like to introduce INT and BYTE vector batches in the ORC C++ version for
> > types like boolean, byte and int to improve the memory efficiency. This
> > change would also potentially benefit data consumers in the case of SIMD
> > computation. Let me know if you have any thoughts/suggestions.
> >
> > Thanks
> > Yurui
> >
> > from Alimail macOS
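To make the two points debated in this thread concrete (the per-batch memory footprint, and the extra copy a consumer with strict type widths has to perform), here is a minimal sketch. It does not use the ORC or Arrow APIs: `WideBatch`, `batchBytes`, and `narrowCopy` are hypothetical stand-ins, with `WideBatch` assumed to play the role of a batch that stores every integer type as 64-bit values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for a batch that holds every integer column as
// 64-bit values, regardless of the declared column type.
struct WideBatch {
  std::vector<int64_t> data;
};

// Bytes needed to hold `rows` values of type T.
template <typename T>
constexpr std::size_t batchBytes(std::size_t rows) {
  return rows * sizeof(T);
}

// Narrowing copy: the extra pass needed when the consumer requires the
// buffer width to match the logical type (for example, 4 bytes per int).
template <typename T>
std::vector<T> narrowCopy(const WideBatch& batch) {
  std::vector<T> out;
  out.reserve(batch.data.size());
  for (int64_t v : batch.data) {
    // The value must fit in the narrower type, otherwise the cast loses data.
    assert(v >= std::numeric_limits<T>::min() &&
           v <= std::numeric_limits<T>::max());
    out.push_back(static_cast<T>(v));
  }
  return out;
}

// 1024 rows: 8 KiB as int64 versus 1 KiB as single bytes.
static_assert(batchBytes<int64_t>(1024) == 8192, "int64 batch size");
static_assert(batchBytes<int8_t>(1024) == 1024, "byte batch size");
```

Calling `narrowCopy<int32_t>(batch)` on a batch of int columns yields a 4-byte-per-value buffer at the cost of one full pass over the data. A reader that filled a natively sized batch would skip that pass, which is the overhead the proposal aims to remove.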
