On Wed, Jun 22, 2016 at 1:39 AM, Goden Yao <[email protected]> wrote:
> This is not comparable as native vs. external.
> The design doc attached in HAWQ-786
> <https://issues.apache.org/jira/browse/HAWQ-786>, as some community
> responses in the JIRA point out, is mixing up an External Table data
> access framework with file format support.
>
> If the JIRA is merely about using ORC as a native file format, as we see
> its popularity in the Hadoop community and potentially want to replace
> Parquet with ORC as the default for its benefits and advantages, this
> JIRA should focus on the native file format part and how to integrate
> with the C library from the Apache ORC project.

As described in the JIRA, the framework is designed as a general
framework; it can also potentially be used for external data. There is an
example in the design spec showing the usage.
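To make the general-framework point concrete, here is a minimal sketch of
what a pluggable-format / pluggable-file-system split could look like in
C++. All names below (FileSystem, RandomAccessFile, FormatReader,
TupleBatch) are illustrative assumptions for this thread, not interfaces
from the HAWQ-786 design spec:

    // Hypothetical sketch only; not the actual HAWQ-786 interfaces.
    #include <cstddef>
    #include <cstdint>
    #include <memory>
    #include <string>
    #include <vector>

    // One batch of rows handed back to the executor; deliberately simplified.
    struct TupleBatch {
        std::vector<std::string> column_names;
        std::vector<std::vector<std::string>> columns;  // values as strings for brevity
        size_t num_rows = 0;
    };

    // A file opened for positional reads, whatever storage it lives on.
    class RandomAccessFile {
    public:
        virtual ~RandomAccessFile() = default;
        virtual int64_t Size() const = 0;
        // Read up to `length` bytes at `offset` into `buffer`; returns bytes read.
        virtual int64_t ReadAt(int64_t offset, void* buffer, int64_t length) = 0;
    };

    // Pluggable storage: HDFS, local disk, Amazon S3, ... behind one interface.
    class FileSystem {
    public:
        virtual ~FileSystem() = default;
        virtual std::unique_ptr<RandomAccessFile> Open(const std::string& path) = 0;
    };

    // Pluggable format: ORC, AO, Parquet, ... each reads through any FileSystem.
    class FormatReader {
    public:
        virtual ~FormatReader() = default;
        virtual void Open(FileSystem& fs, const std::string& path) = 0;
        // Fill `batch` with the next chunk of rows; false once the file is done.
        virtual bool NextBatch(TupleBatch* batch) = 0;
        virtual void Close() = 0;
    };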
> To answer Roman's questions, I think we first need to understand the
> user scenario with external tables (with ORC format), in which users:
> 1) already have ORC files landed in HDFS (or stored as Hive tables);
> 2) want to query them from HAWQ, so they may get a performance gain from
> the MPP architecture provided by HAWQ, instead of from MR jobs;
> 3) want to avoid data duplication, which means they don't want to load
> the data into a HAWQ native format (so it doesn't matter what native
> format HAWQ uses to store the table).
>
> Given that, I think it's worth further discussion under the theme of
> improving external data source access/query performance.
>
> Thanks
> -Goden
>
> On Mon, Jun 20, 2016 at 5:55 PM Lei Chang <[email protected]> wrote:
>
> > On Tue, Jun 21, 2016 at 8:38 AM, Roman Shaposhnik <[email protected]>
> > wrote:
> >
> > > On Fri, Jun 17, 2016 at 3:02 AM, Ming Li <[email protected]> wrote:
> > > > Hi Guys,
> > > >
> > > > ORC (Optimized Row Columnar) is a very popular open-source format
> > > > adopted by some major components in the Hadoop ecosystem. It is
> > > > also used by a lot of users. The advantages of supporting ORC
> > > > storage in HAWQ are twofold: firstly, it makes HAWQ more Hadoop
> > > > native and lets it interact with other components more easily;
> > > > secondly, ORC stores some metadata for query optimization, so it
> > > > might potentially outperform the two native formats (i.e., AO,
> > > > Parquet) if it is available.
> > > >
> > > > Since there are lots of popular formats available in the HDFS
> > > > community, and more advanced formats are emerging frequently, it
> > > > is a good option for HAWQ to design a general framework that
> > > > supports pluggable C/C++ formats such as ORC, as well as native
> > > > formats such as AO and Parquet. In designing this framework, we
> > > > also need to support data stored in different file systems: HDFS,
> > > > local disk, Amazon S3, etc. Thus, it is better to offer a
> > > > framework that supports pluggable formats and pluggable file
> > > > systems.
> > > >
> > > > We are proposing ORC support in JIRA
> > > > (https://issues.apache.org/jira/browse/HAWQ-786). Please see the
> > > > design spec in the JIRA.
> > > >
> > > > Your comments are appreciated!
> > >
> > > This sounds reasonable, but I'd like to understand the trade-offs
> > > between supporting something like ORC in PXF vs. implementing it
> > > natively in C/C++.
> > >
> > > Is there any hard performance/etc. data that you could share to
> > > illuminate the trade-offs between these two approaches?
> >
> > Implementing it natively in C/C++ will get at least comparable
> > performance with the current native AO and Parquet formats.
> >
> > And we know that AO and Parquet are faster than PXF, so we are
> > expecting better performance here.
> >
> > Cheers
> > Lei
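To illustrate the pluggable-file-system point in Ming's proposal quoted
above, here is how a scan could be driven through the hypothetical
interfaces sketched earlier in this mail. HdfsFileSystem and
OrcFormatReader are assumed implementations, not code from the design
spec:

    // Hypothetical usage sketch; HdfsFileSystem and OrcFormatReader are
    // assumed implementations of the interfaces sketched above.
    #include <memory>
    #include <string>

    void ScanOrcTable() {
        // Swapping HdfsFileSystem for a LocalFileSystem or S3FileSystem
        // would not touch this loop: the executor only sees the two
        // abstract interfaces.
        std::unique_ptr<FileSystem> fs =
            std::make_unique<HdfsFileSystem>("hdfs://namenode:8020");
        std::unique_ptr<FormatReader> reader =
            std::make_unique<OrcFormatReader>();

        reader->Open(*fs, "/warehouse/sales/part-00000.orc");

        TupleBatch batch;
        while (reader->NextBatch(&batch)) {
            // Hand each batch to the executor. An ORC implementation
            // could also use the file's embedded statistics here to
            // skip stripes that cannot match a predicate.
        }
        reader->Close();
    }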

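For reference, reading a file with the Apache ORC project's C++ library
looks roughly like the sketch below. This follows the library's current
reader API (createReader / createRowReader), which may differ from the
version available when this thread was written:

    // Rough sketch against the current Apache ORC C++ reader API.
    #include <orc/OrcFile.hh>

    #include <iostream>
    #include <memory>

    int main() {
        // Open the file and read the footer metadata.
        orc::ReaderOptions readerOpts;
        std::unique_ptr<orc::Reader> reader = orc::createReader(
            orc::readLocalFile("/tmp/example.orc"), readerOpts);

        // Row counts and per-stripe statistics come from the file
        // footer; this is the metadata a query optimizer can exploit.
        std::cout << "rows in file: " << reader->getNumberOfRows() << "\n";

        // Iterate over the rows in vectorized batches.
        orc::RowReaderOptions rowOpts;
        std::unique_ptr<orc::RowReader> rowReader =
            reader->createRowReader(rowOpts);
        std::unique_ptr<orc::ColumnVectorBatch> batch =
            rowReader->createRowBatch(1024);
        while (rowReader->next(*batch)) {
            std::cout << "read a batch of " << batch->numElements
                      << " rows\n";
        }
        return 0;
    }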