On Thu, Jun 23, 2016 at 6:17 AM, Ting(Goden) Yao <[email protected]> wrote:
> 1) the framework is not designed by HAWQ community - it was from Postgres
>

That is not correct. FDW is an SQL standard, and we are trying to follow the
standard. Postgres has an implementation, and HAWQ can have one too. From the
spec, you can see that this one is designed for HAWQ by us.

> 2) the JIRA itself is titled as "ORC as native format" which has nothing to
> do with this framework
>

Thanks for pointing this out. The title is somewhat confusing; I will change
it to a more general one, or split the work into two umbrella JIRAs.

>
> We should not try to lump multiple features, ideas in one JIRA
>
>
> On Wed, Jun 22, 2016 at 12:28 AM Lei Chang <[email protected]> wrote:
>
> > On Wed, Jun 22, 2016 at 1:39 AM, Goden Yao <[email protected]> wrote:
> >
> > > This is not comparable as native vs. external.
> > > The design doc attached in HAWQ-786
> > > <https://issues.apache.org/jira/browse/HAWQ-786>, as some community
> > > responses in the JIRA note, is mixing up an External Table data access
> > > framework with file format support.
> > >
> > > If the JIRA is merely about using ORC as a native file format, since we
> > > see its popularity in the Hadoop community and may want to replace
> > > Parquet with ORC as the default for its benefits and advantages, this
> > > JIRA should focus on the native file format part and how to integrate
> > > with the C library from the Apache ORC project.
> > >
> >
> > As described in the JIRA, the framework is designed as a general
> > framework.
> >
> > It can also potentially be used for external data. There is an example
> > showing the usage.
> >
> > >
> > > To answer Roman's questions, I think we first need to understand the
> > > user scenario with external tables (with ORC format), which is that
> > > users:
> > > 1) already have ORC files landed in HDFS (or stored as Hive tables)
> > > 2) want to query from HAWQ, so they may get a performance gain from the
> > > MPP architecture provided by HAWQ, instead of MR jobs.
> > > 3) want to avoid data duplication, which means they don't want to load
> > > data into HAWQ native format (so it doesn't matter what native format
> > > HAWQ uses to store the table)
> > >
> > > Given that, I think it's worth a further discussion on the theme of
> > > improving external data source access/query performance.
> > >
> > > Thanks
> > > -Goden
> > >
> > >
> > > On Mon, Jun 20, 2016 at 5:55 PM Lei Chang <[email protected]> wrote:
> > >
> > > > On Tue, Jun 21, 2016 at 8:38 AM, Roman Shaposhnik <[email protected]>
> > > > wrote:
> > > >
> > > > > On Fri, Jun 17, 2016 at 3:02 AM, Ming Li <[email protected]> wrote:
> > > > > > Hi Guys,
> > > > > >
> > > > > > ORC (Optimized Row Columnar) is a very popular open source format
> > > > > > adopted in some major components in the Hadoop ecosystem. It is
> > > > > > also used by a lot of users. The advantages of supporting ORC
> > > > > > storage in HAWQ are twofold: firstly, it makes HAWQ more Hadoop
> > > > > > native, so it interacts with other components more easily;
> > > > > > secondly, ORC stores some meta info for query optimization, so it
> > > > > > might potentially outperform the two native formats (i.e., AO,
> > > > > > Parquet) if it is available.
> > > > > >
> > > > > > Since there are lots of popular formats available in the HDFS
> > > > > > community, and more advanced formats are emerging frequently, it
> > > > > > is a good option for HAWQ to design a general framework that
> > > > > > supports pluggable C/C++ formats such as ORC, as well as native
> > > > > > formats such as AO and Parquet. In designing this framework, we
> > > > > > also need to support data stored in different file systems: HDFS,
> > > > > > local disk, Amazon S3, etc. Thus, it is better to offer a
> > > > > > framework to support pluggable formats and pluggable file systems.
> > > > > >
> > > > > > We are proposing to support ORC in JIRA
> > > > > > (https://issues.apache.org/jira/browse/HAWQ-786). Please see the
> > > > > > design spec in the JIRA.
> > > > > >
> > > > > > Your comments are appreciated!
> > > > >
> > > > > This sounds reasonable, but I'd like to understand the trade-offs
> > > > > between supporting something like ORC in PXF vs. implementing it
> > > > > natively in C/C++.
> > > > >
> > > > > Is there any hard performance/etc. data that you could share to
> > > > > illuminate the trade-offs between these two approaches?
> > > > >
> > > >
> > > > Implementing it natively in C/C++ will get at least comparable
> > > > performance with the current native AO and Parquet formats.
> > > >
> > > > And we know that AO and Parquet are faster than PXF, so we are
> > > > expecting better performance here.
> > > >
> > > > Cheers
> > > > Lei
> > > >
> > >
> >
>
