Re: Predicate pushdown in columnar file formats

Aniket Mokashi Mon, 28 Apr 2014 10:47:42 -0700

There is already a JIRA for this:
https://issues.apache.org/jira/browse/PIG-3760. Let's start a discussion
on how we want to do this.


Parquet uses protobuf-style schema definitions (for example, a list is a
repeated group), so it does support pushdown filters on any column,
including nested columns and complex data structures.
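
As a rough illustration of the row group/stripe skipping discussed in the
thread quoted below, here is a minimal sketch in Python. It is not Pig,
Parquet, or ORC API: RowGroup, may_contain_eq, scan_eq, and the "user.age"
column path are all hypothetical names, and it assumes the file format keeps
per-group min/max statistics keyed by (possibly nested) column path. It shows
why the filter condition has to stay in the plan after pushdown: statistics
can only rule a group out, they cannot confirm that every row in it matches.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group or an ORC stripe."""
    stats: Dict[str, Tuple[Any, Any]]  # column path -> (min, max)
    rows: List[Dict[str, Any]]


def may_contain_eq(group: RowGroup, column: str, value: Any) -> bool:
    """Return False only when the statistics prove that no row can match."""
    if column not in group.stats:
        return True  # no statistics for this column: the group must be read
    lo, hi = group.stats[column]
    return lo <= value <= hi


def scan_eq(groups: List[RowGroup], column: str, value: Any):
    """Skip groups using statistics, then re-apply the filter per record."""
    matched = []
    groups_read = 0
    for group in groups:
        if not may_contain_eq(group, column, value):
            continue  # pushed-down predicate lets us skip the whole group
        groups_read += 1
        # The original filter is still needed: a group that survives the
        # statistics check may contain rows that do not match.
        matched.extend(r for r in group.rows if r[column] == value)
    return matched, groups_read


# Illustrative data only; "user.age" stands for a nested column path.
GROUPS = [
    RowGroup({"user.age": (10, 20)}, [{"user.age": 10}, {"user.age": 20}]),
    RowGroup({"user.age": (25, 40)},
             [{"user.age": 25}, {"user.age": 30}, {"user.age": 40}]),
    RowGroup({"user.age": (28, 35)}, [{"user.age": 28}, {"user.age": 35}]),
]

rows, groups_read = scan_eq(GROUPS, "user.age", 30)
```

In this made-up data the first group is skipped outright, while the third
passes the min/max check yet contributes no rows. That second case is the
reason an optimizer for columnar formats should push the predicate down
without removing it from the filter.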

Let me create a wiki page to document/discuss this in detail.

Thanks,
Aniket


On Thu, Apr 24, 2014 at 4:20 PM, Daniel Dai <[email protected]> wrote:

> We need a new interface for predicate pushdown. The current interface only
> supports partition pushdown, which consumes the filter condition into the
> loader. AFAIK, ORC predicate pushdown only supports simple types. I
> discussed this briefly with Aniket before, and we are open to the choice of
> interface design. There is no Jira yet; we do need to create one.
>
> Thanks,
> Daniel
>
> On Thu, Apr 24, 2014 at 3:21 PM, Rohini Palaniswamy
> <[email protected]> wrote:
> > Hi,
> >    Both Parquet and ORC support predicate pushdown. I was looking at
> > whether we can reuse the existing PartitionFilterOptimizer by reporting
> > columns that support predicate pushdown as partition columns. Dmitriy was
> > talking about the PartitionFilterOptimizer pushing down the filter
> > conditions to the LoadFunc but not removing them from the actual filter
> > condition. But even the new FilterExtractor (and old PColFilterExtractor)
> > that Aniket wrote removes the filter condition that is pushed down. In a
> > way that makes sense for HCat: when you filter a lot of partitions, you
> > don't want each record filtered again on the partition condition,
> > wasting CPU. But in the case of columnar file formats, the pushed-down
> > predicates are used only for selecting/skipping row groups/stripes, not
> > for answering the actual query. So we need a new optimizer for pushing
> > down predicates to file formats, one that does not remove the filter
> > condition, and a new Load interface.
> >
> >  There are no JIRAs filed for this yet. I will file one soon. Has anyone
> > already given thought to this or have an API design in mind? We are
> > planning to work on this and the main focus is on ORCFile, but we want to
> > ensure that we address all cases of Parquet as well. Julien/Aniket, could
> > you help with any questions on the Parquet front?
> >
> > ORCFile pushes down filter predicates using indexes/column sorting,
> > dictionary sorting or bloom filters according to
> > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC. I
> > don't think it can push down filters for complex data structures like
> > lists or maps. Daniel, can you confirm?
> >
> > Julien,
> >    Can you explain how predicate pushdown works with Parquet? Does it
> > support map columns? I could not find much documentation on it.
> >
> > Regards,
> > Rohini
>



-- 
"...:::Aniket:::... Quetzalco@tl"
