ORC contribution from Alibaba

Gang Wu Tue, 25 Apr 2017 23:15:04 -0700

Hi,
This is Gang from Alibaba, working on Alibaba's big data platform, MaxCompute.
We have developed our own columnar storage format within MaxCompute to support
MapReduce and other batch processing workloads. But as Apache ORC is getting
popular in the industry, we are actively looking at integrating the ORC format
into MaxCompute.
In the past few months, Xiening (cc'ed) and I have been working on enhancing
ORC C++ to provide a full-featured C++ reader and writer. Our work mainly
involves adding a C++ writer that supports all data types and statistics, and
supporting indexes in both the reader and the writer. As of today, we have
finished development and testing and plan to contribute this work back to the
Apache ORC project. We have communicated with Owen via email and have created
an umbrella JIRA, ORC-179, for the plan. In brief, we plan to do the following:
  1. Refactor common classes for writer and reader
    -- extract common classes and functions for writer and reader to share
  2. OutputStream interface for writer
    -- implement several output streams for writing to memory, file, etc.
    -- implement ByteRleEncoder, RleEncoder, BooleanRleEncoder, etc.
    -- support zlib compression
  3. ORC Writer
    -- write ORC file header, file footer, postscript, etc.
    -- write columns of all types 
    -- write column statistics
    -- write the index stream in the writer, and have the reader seek to a row
       based on the index information
  4. Other
    -- some minor bug fixes to the current code base
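
To make item 2 a bit more concrete, here is a minimal sketch of how a byte
run-length encoder could sit on top of an output stream abstraction. It is
only an illustration, not the code being contributed: the class names
OutputStream and MemoryOutputStream and the function encodeByteRle are
hypothetical, and the encoding follows the byte RLE layout described in the
ORC specification (a run of 3 to 130 equal bytes is written as a control byte
of length minus 3 followed by the value; up to 128 literal bytes are written
as a negated count followed by the bytes).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sink interface; the names here are assumptions for illustration.
class OutputStream {
 public:
  virtual ~OutputStream() = default;
  virtual void write(const uint8_t* data, size_t length) = 0;
};

// Hypothetical in-memory implementation that appends to a growable buffer.
class MemoryOutputStream : public OutputStream {
 public:
  void write(const uint8_t* data, size_t length) override {
    buffer_.insert(buffer_.end(), data, data + length);
  }
  const std::vector<uint8_t>& data() const { return buffer_; }

 private:
  std::vector<uint8_t> buffer_;
};

constexpr size_t kMinRepeat = 3;     // shortest run worth encoding as a run
constexpr size_t kMaxRepeat = 130;   // longest run one control byte can hold
constexpr size_t kMaxLiteral = 128;  // longest literal group

// Encodes n bytes from `in` with byte RLE and writes the result to `out`.
void encodeByteRle(const uint8_t* in, size_t n, OutputStream& out) {
  size_t i = 0;
  while (i < n) {
    // Measure the run of identical bytes starting at position i.
    size_t run = 1;
    while (i + run < n && run < kMaxRepeat && in[i + run] == in[i]) {
      ++run;
    }
    if (run >= kMinRepeat) {
      // Run: control byte is (run - 3), followed by the repeated value.
      const uint8_t header[2] = {static_cast<uint8_t>(run - kMinRepeat), in[i]};
      out.write(header, 2);
      i += run;
      continue;
    }
    // Literals: collect bytes until a run of at least 3 begins or the
    // literal group is full.
    const size_t start = i;
    size_t count = 0;
    while (i < n && count < kMaxLiteral) {
      if (i + 2 < n && in[i] == in[i + 1] && in[i] == in[i + 2]) {
        break;
      }
      ++i;
      ++count;
    }
    // Control byte is the negated literal count (two's complement).
    const uint8_t control = static_cast<uint8_t>(-static_cast<int>(count));
    out.write(&control, 1);
    out.write(in + start, count);
  }
}
```

With this sketch, encoding the bytes 7 7 7 7 7 1 2 into a MemoryOutputStream
yields 0x02 0x07 0xFE 0x01 0x02: one run of five 7s followed by two literals.
A file-backed stream would only need to implement the same write() method.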


Should you have any questions, please feel free to contact us. Any feedback and
suggestions are welcome. Thanks!
Gang Wu
Senior Engineer
Alibaba Group
