On Fri, Jun 17, 2016 at 3:02 AM, Ming Li <[email protected]> wrote:
> Hi Guys,
>
> ORC (Optimized Row Columnar) is a very popular open-source format adopted
> by some major components in the Hadoop ecosystem, and it is used by many
> users. The advantages of supporting ORC storage in HAWQ are twofold:
> first, it makes HAWQ more Hadoop-native, so it interacts with other
> components more easily; second, ORC stores metadata useful for query
> optimization, so it might potentially outperform the two native formats
> (i.e., AO and Parquet).
>
> Since there are many popular formats in the HDFS community, and more
> advanced formats emerge frequently, it is a good option for HAWQ to
> design a general framework that supports pluggable C/C++ formats such as
> ORC, as well as native formats such as AO and Parquet. In designing this
> framework, we also need to support data stored in different file systems:
> HDFS, local disk, Amazon S3, etc. Thus, it is better to offer a framework
> that supports both pluggable formats and pluggable file systems.
>
> We are proposing ORC support in JIRA (
> https://issues.apache.org/jira/browse/HAWQ-786). Please see the design
> spec in the JIRA.
>
> Your comments are appreciated!
This sounds reasonable, but I'd like to understand the trade-offs between supporting something like ORC in PXF vs. implementing it natively in C/C++. Is there any hard performance (or other) data that you could share to illuminate the trade-offs between these two approaches?

Thanks,
Roman.
