Hi,

This is Gang from Alibaba, working on Alibaba's big data platform, MaxCompute. We have developed our own columnar storage format within MaxCompute to support MapReduce and other batch processing workloads. As Apache ORC is becoming popular in the industry, we are actively looking at integrating the ORC format into MaxCompute.

Over the past few months, Xiening (cc'ed) and I have been working on enhancing ORC C++ to provide a full-featured C++ reader and writer. Our work mainly involves adding a C++ writer that supports all data types and statistics, and adding index support to both the reader and the writer. As of today, we have finished development and testing, and we plan to contribute this work back to the Apache ORC project.

We have communicated with Owen via email and have created an umbrella JIRA, ORC-179, for the plan. In brief, we plan to do the following:

1. Refactor common classes for writer and reader
   -- extract common classes and functions for the writer and reader to share
2. OutputStream interface for writer
   -- implement several output streams for writing to memory, file, etc.
   -- implement ByteRleEncoder, RleEncoder, BooleanRleEncoder, etc.
   -- support zlib compression
3. ORC Writer
   -- write ORC file header, file footer, postscript, etc.
   -- write columns of all types
   -- write column statistics
   -- write the index stream in the writer, and have the reader seek to a row based on the index information
4. Other
   -- some minor bug fixes to the current code base

Should you have any questions, please feel free to contact us. Any feedback and suggestions are welcome.

Thanks!

Gang Wu
Senior Engineer
Alibaba Group
