Re: Support orc format

Lei Chang Wed, 22 Jun 2016 22:47:04 -0700

On Thu, Jun 23, 2016 at 9:39 AM, Shivram Mani <[email protected]>
wrote:


> Yes. Two separate Jiras would be more apt.
>

Thanks for the comments. A new JIRA (
https://issues.apache.org/jira/browse/HAWQ-864) has been created, and the
original JIRA has been changed to cover framework support.


> FDW came from SQL/MED standard and is now part of SQL. The contradiction
> here is, part of the standard is 'Foreign table' which overlaps with HAWQ's
> 'External Table' (along with protocols introduced such as PXF, gpfdist).
>
> I don't think we should have any more discussions about the framework in
> this thread as the subject is strictly ORC support.
>

Further discussions can go to individual JIRAs.


>
> On Wed, Jun 22, 2016 at 5:02 PM, Lei Chang <[email protected]> wrote:
>
> > On Thu, Jun 23, 2016 at 6:17 AM, Ting(Goden) Yao <[email protected]> wrote:
> >
> > > 1) the framework is not designed by HAWQ community - it was from
> > > Postgres
> > >
> >
> > That is not correct. FDW is part of the SQL standard, and we are trying to
> > follow the standard. Postgres has an implementation, and HAWQ can also have
> > one; from the spec, you can see that it is designed for HAWQ by us.
> >
> >
> > > 2) the JIRA itself is titled as "ORC as native format" which has
> > > nothing to do with this framework
> > >
> >
> > Thanks for pointing this out. The title is somewhat confusing; I will
> > change it to a more general one, or separate the two into two umbrella
> > JIRAs.
> >
> >
> > >
> > > We should not try to lump multiple features, ideas in one JIRA
> > >
> > >
> > > On Wed, Jun 22, 2016 at 12:28 AM Lei Chang <[email protected]> wrote:
> > >
> > > > On Wed, Jun 22, 2016 at 1:39 AM, Goden Yao <[email protected]> wrote:
> > > >
> > > > > This is not comparable as native vs. external.
> > > > > The design doc attached in HAWQ-786
> > > > > <https://issues.apache.org/jira/browse/HAWQ-786>, as some community
> > > > > responses in the JIRA, is mixing up an External Table data access
> > > > > framework with a file format support.
> > > > >
> > > > > If the JIRA is merely about using ORC as native file format as we see
> > > > > its popularity in the Hadoop community and potentially want to replace
> > > > > parquet with ORC as default for its benefits and advantages, this JIRA
> > > > > should be focusing on the native file format part and how to integrate
> > > > > with C library from Apache ORC project.
> > > > >
> > > >
> > > >
> > > > As described in the JIRA, the framework is designed as a general
> > > > framework.
> > > >
> > > > It can also potentially be used for external data; there is an example
> > > > showing the usage.
> > > >
> > > >
> > > > >
> > > > > To answer Roman's questions, I think we first need to understand the
> > > > > user scenario for external tables (with ORC format), in which users:
> > > > > 1) already have ORC files landed in HDFS (or stored as Hive tables)
> > > > > 2) want to query from HAWQ, so they may get a performance gain from
> > > > > the MPP architecture provided by HAWQ, instead of MR jobs.
> > > > > 3) want to avoid data duplication, which means they don't want to
> > > > > load data into HAWQ native format (so it doesn't matter what native
> > > > > format HAWQ uses to store the table)
> > > > >
> > > > > Given that, I think it's worth a further discussion in the theme of
> > > > > improving external data source access/query performance.
> > > > >
> > > > > Thanks
> > > > > -Goden
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jun 20, 2016 at 5:55 PM Lei Chang <[email protected]> wrote:
> > > > >
> > > > > > On Tue, Jun 21, 2016 at 8:38 AM, Roman Shaposhnik <[email protected]> wrote:
> > > > > >
> > > > > > > On Fri, Jun 17, 2016 at 3:02 AM, Ming Li <[email protected]> wrote:
> > > > > > > > Hi Guys,
> > > > > > > >
> > > > > > > > ORC (Optimized Row Columnar) is a very popular open source
> > > > > > > > format adopted in some major components in the Hadoop
> > > > > > > > ecosystem. It is also used by a lot of users. The advantages
> > > > > > > > of supporting ORC storage in HAWQ are twofold: firstly, it
> > > > > > > > makes HAWQ more Hadoop native, so that it interacts with other
> > > > > > > > components more easily; secondly, ORC stores some meta info
> > > > > > > > for query optimization, so it might potentially outperform the
> > > > > > > > two native formats (i.e., AO, Parquet) if it is available.
> > > > > > > >
> > > > > > > > Since there are lots of popular formats available in the HDFS
> > > > > > > > community, and more advanced formats are emerging frequently,
> > > > > > > > it is a good option for HAWQ to design a general framework
> > > > > > > > that supports pluggable C/C++ formats such as ORC, as well as
> > > > > > > > native formats such as AO and Parquet. In designing this
> > > > > > > > framework, we also need to support data stored in different
> > > > > > > > file systems: HDFS, local disk, Amazon S3, etc. Thus, it is
> > > > > > > > better to offer a framework to support pluggable formats and
> > > > > > > > pluggable file systems.
> > > > > > > >
> > > > > > > > We are proposing ORC support in JIRA (
> > > > > > > > https://issues.apache.org/jira/browse/HAWQ-786). Please see
> > > > > > > > the design spec in the JIRA.
> > > > > > > >
> > > > > > > > Your comments are appreciated!
> > > > > > >
> > > > > > > This sounds reasonable, but I'd like to understand the
> > > > > > > trade-offs between supporting something like ORC in PXF vs.
> > > > > > > implementing it natively in C/C++.
> > > > > > >
> > > > > > > Is there any hard performance/etc. data that you could share to
> > > > > > > illuminate the trade-offs between these two approaches?
> > > > > > >
> > > > > >
> > > > > > Implementing it natively in C/C++ will get at least comparable
> > > > > > performance to the current native AO and Parquet formats.
> > > > > >
> > > > > > And we know that AO and Parquet are faster than PXF, so we are
> > > > > > expecting better performance here.
> > > > > >
> > > > > > Cheers
> > > > > > Lei
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
>
> --
> shivram mani
>
