There is already a Jira for this: https://issues.apache.org/jira/browse/PIG-3760. Let's start a discussion on how we want to do this.
Parquet uses protobuf-style schema definitions (a list is a repeated group,
etc.), so it does support pushdown filters on any column, including nested
columns and complex data structures. Let me create a wiki page to
document/discuss this in detail.

Thanks,
Aniket

On Thu, Apr 24, 2014 at 4:20 PM, Daniel Dai <[email protected]> wrote:
> We need a new interface for predicate pushdown. The current interface only
> supports partition pushdown, which consumes the filter condition into the
> loader. AFAIK, ORC predicate pushdown only supports simple types. I
> discussed this briefly with Aniket before, and we are open to the choice
> of interface design. There is no Jira yet; we do need to create one.
>
> Thanks,
> Daniel
>
> On Thu, Apr 24, 2014 at 3:21 PM, Rohini Palaniswamy
> <[email protected]> wrote:
> > Hi,
> >    Parquet and ORC both support predicate pushdown. I was looking at
> > whether we can make use of the existing PartitionFilterOptimizer, and
> > whether columns supported for predicate pushdown can be reported as
> > partition columns. Dmitriy was talking about the
> > PartitionFilterOptimizer pushing down the filter conditions to the
> > LoadFunc but not removing them from the actual filter condition. But
> > even the new FilterExtractor (and the old PColFilterExtractor) that
> > Aniket wrote removes the filter condition that is pushed down. In a way
> > that makes sense for HCat: when you filter out a lot of partitions, you
> > don't want each record filtered again for the partition condition,
> > wasting CPU. But in the case of columnar file formats, the predicates
> > pushed down are only used for selecting/skipping row groups/stripes,
> > not for answering the actual query. So we need a new optimizer for
> > pushing down predicates to file formats, one which does not remove the
> > filter condition, and a new Load interface.
> >
> > There are no Jiras filed for this yet. Will file one soon. Has anyone
> > already given thought to this and have any API design in mind? We are
> > planning to work on this and the main focus is on ORCFile, but we want
> > to ensure that we address all cases of Parquet as well. Julien/Aniket,
> > could you help with any questions on the Parquet front?
> >
> > ORCFile pushes down filter predicates using indexes/column sorting,
> > dictionary sorting or bloom filters according to
> > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC. I
> > don't think it can push down filters for complex data structures like
> > lists or maps. Daniel, can you confirm?
> >
> > Julien,
> >    Can you tell how predicate pushdown works with Parquet? Does it
> > support map columns? I could not find much documentation on it.
> >
> > Regards,
> > Rohini

-- 
"...:::Aniket:::... Quetzalco@tl"
