Hi,

Both Parquet and ORC support predicate pushdown. I was looking at whether we can make use of the existing PartitionFilterOptimizer and report columns that support predicate pushdown as partition columns. Dmitriy mentioned that the PartitionFilterOptimizer pushes the filter conditions down to the LoadFunc without removing them from the actual filter condition. However, even the new FilterExtractor (and the old PColFilterExtractor) that Aniket wrote removes the filter condition that was pushed down. In a way that makes sense for HCat: when you filter out a lot of partitions, you don't want every record filtered again on the partition condition, wasting CPU. But in the case of columnar file formats, the pushed-down predicates are only used to select/skip row groups or stripes, not to answer the actual query. So we need a new optimizer that pushes predicates down to file formats without removing the filter condition, and a new Load interface.
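To make the idea concrete, here is a rough sketch of what such a Load interface might look like, plus a toy model of stripe skipping via min/max statistics. All names here (PredicatePushdownLoader, setPushdownPredicate, etc.) are hypothetical, not an existing Pig API, and the predicate is simplified to "col > constant" instead of a real expression tree:

```java
import java.util.ArrayList;
import java.util.List;

class PushdownSketch {

    // Hypothetical loader-side contract. Unlike PartitionFilterOptimizer,
    // the new optimizer would NOT remove the pushed condition from the
    // plan's FILTER, since the loader only uses it to skip stripes.
    interface PredicatePushdownLoader {
        // Columns the underlying format keeps statistics/indexes for.
        List<String> getPredicateFields();
        // Hand the predicate to the loader as a hint; a simple
        // "field > constant" pair stands in for a real expression tree.
        void setPushdownPredicate(String field, long greaterThan);
    }

    // Toy stripe: min/max statistics plus the actual values.
    static class Stripe {
        final long min, max;
        final long[] values;
        Stripe(long min, long max, long[] values) {
            this.min = min; this.max = max; this.values = values;
        }
    }

    static class ToyColumnarLoader implements PredicatePushdownLoader {
        long threshold = Long.MIN_VALUE;

        public List<String> getPredicateFields() {
            List<String> f = new ArrayList<>();
            f.add("id");  // assume stats exist only for this column
            return f;
        }

        public void setPushdownPredicate(String field, long greaterThan) {
            this.threshold = greaterThan;
        }

        // Return values from stripes whose max passes the predicate.
        // Records inside a surviving stripe are NOT filtered here,
        // which is why the plan must keep the original FILTER.
        List<Long> read(List<Stripe> stripes) {
            List<Long> out = new ArrayList<>();
            for (Stripe s : stripes) {
                if (s.max <= threshold) continue;  // skip whole stripe
                for (long v : s.values) out.add(v);
            }
            return out;
        }
    }
}
```

With a predicate like id > 10, a stripe whose max is 10 is skipped entirely, but a surviving stripe can still emit records with id <= 10, so the downstream filter has to stay in the plan.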
There are no jiras filed for this yet; will file one soon. Has anyone already given thought to this and have an API design in mind? We are planning to work on this with the main focus on ORCFile, but we want to ensure that we address all the Parquet cases as well. Julien/Aniket, could you help with any questions on the Parquet front?

ORCFile pushes down filter predicates using indexes/column sorting, dictionary sorting, or bloom filters, according to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC. I don't think it can push down filters on complex data structures like lists or maps. Daniel, can you confirm? Julien, can you explain how predicate pushdown works with Parquet? Does it support map columns? I could not find much documentation on it.

Regards,
Rohini
