Oleksiy Sayankin created HIVE-22980:
---------------------------------------

             Summary: Support custom path filter for ORC tables
                 Key: HIVE-22980
                 URL: https://issues.apache.org/jira/browse/HIVE-22980
             Project: Hive
          Issue Type: New Feature
          Components: ORC
            Reporter: Oleksiy Sayankin
            Assignee: Oleksiy Sayankin


The customer is looking for an option to specify a custom path filter for ORC 
tables. Please find the details below, taken from the customer's requirements.

Problem statement and approach, in the customer's words:

{quote} 
Currently, Orc file input format does not take in path filters set in the 
property "mapreduce.input.pathfilter.class" OR "mapred.input.pathfilter.class". 
So, we cannot use custom filters with Orc files. 

AcidUtils class has a static filter called "hiddenFilters" which is used by ORC 
to filter input paths. If we can pass the custom filter classes(set in the 
property mentioned above) to AcidUtils and replace hiddenFilter with a filter 
that does an "and" operation over hiddenFilter+customFilters, the filters would 
work well.

On local testing, mapreduce.input.pathfilter.class seems to be working for Text 
tables but not for ORC tables.
{quote}
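
The "and" composition the customer proposes can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Hive implementation: the {{PathFilter}} interface below is a local stand-in for {{org.apache.hadoop.fs.PathFilter}} (it takes a String instead of a Hadoop {{Path}} so that it compiles without Hadoop on the classpath), {{HIDDEN_FILTER}} is assumed to follow the usual Hadoop convention of hiding names that start with "_" or ".", and {{CombinedPathFilterSketch}} and {{and}} are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Local stand-in for org.apache.hadoop.fs.PathFilter (assumption: String paths
// instead of org.apache.hadoop.fs.Path, so no Hadoop dependency is needed).
interface PathFilter {
    boolean accept(String path);
}

final class CombinedPathFilterSketch {
    // Assumed behaviour of a "hidden file" filter: reject file names that
    // start with '_' or '.', accept everything else.
    static final PathFilter HIDDEN_FILTER = path -> {
        String name = path.substring(path.lastIndexOf('/') + 1);
        return !name.startsWith("_") && !name.startsWith(".");
    };

    // Hypothetical helper: a path is accepted only if the base filter
    // AND every custom filter accept it.
    static PathFilter and(PathFilter base, List<PathFilter> customFilters) {
        return path -> {
            if (!base.accept(path)) {
                return false;
            }
            for (PathFilter custom : customFilters) {
                if (!custom.accept(path)) {
                    return false;
                }
            }
            return true;
        };
    }

    public static void main(String[] args) {
        // Example custom filter supplied by the user: keep only *.orc files.
        PathFilter onlyOrc = path -> path.endsWith(".orc");
        PathFilter combined = and(HIDDEN_FILTER, Arrays.asList(onlyOrc));

        System.out.println(combined.accept("/warehouse/t/part-0.orc")); // true
        System.out.println(combined.accept("/warehouse/t/_tmp.orc"));   // false (hidden)
        System.out.println(combined.accept("/warehouse/t/part-0.txt")); // false (custom)
    }
}
```

With an empty list of custom filters the combined filter behaves exactly like the base filter, so existing behaviour would be unchanged when the property is not set.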

Our analysis:

{{OrcInputFormat}} and {{FileInputFormat}} are different implementations of the 
{{InputFormat}} interface. The property "{{mapreduce.input.pathfilter.class}}" is 
respected only by {{FileInputFormat}} (and its subclasses), not by other 
implementations of {{InputFormat}} such as {{OrcInputFormat}}. The customer wants 
the ability to filter out rows based on paths/file names, because current ORC 
features such as bloom filters and indexes are not sufficient for them to 
minimize the number of disk read operations.
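
For illustration, the kind of lookup an input format would need to perform in order to honor the property might look like the sketch below. It is not the Hadoop or Hive code: {{java.util.Properties}} stands in for the Hadoop {{Configuration}}, the {{PathFilter}} interface is a local stand-in for {{org.apache.hadoop.fs.PathFilter}}, and {{OrcSuffixFilter}}, {{PathFilterLookupSketch}} and {{loadConfiguredFilter}} are hypothetical names. Only the property key is taken from this issue.

```java
import java.util.Properties;

// Local stand-in for org.apache.hadoop.fs.PathFilter (assumption: String paths).
interface PathFilter {
    boolean accept(String path);
}

// Hypothetical user-supplied filter. It needs a no-arg constructor so that it
// can be instantiated reflectively from a class name.
class OrcSuffixFilter implements PathFilter {
    public OrcSuffixFilter() {
    }

    @Override
    public boolean accept(String path) {
        return path.endsWith(".orc");
    }
}

final class PathFilterLookupSketch {
    // Property key named in this issue.
    static final String PATHFILTER_CLASS = "mapreduce.input.pathfilter.class";

    // Returns an instance of the configured filter class,
    // or null when the property is not set.
    static PathFilter loadConfiguredFilter(Properties conf)
            throws ReflectiveOperationException {
        String className = conf.getProperty(PATHFILTER_CLASS);
        if (className == null || className.trim().isEmpty()) {
            return null;
        }
        Class<? extends PathFilter> filterClass =
                Class.forName(className.trim()).asSubclass(PathFilter.class);
        return filterClass.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        Properties conf = new Properties();
        conf.setProperty(PATHFILTER_CLASS, OrcSuffixFilter.class.getName());

        PathFilter filter = loadConfiguredFilter(conf);
        System.out.println(filter.accept("/warehouse/t/part-0.orc")); // true
        System.out.println(filter.accept("/warehouse/t/part-0.txt")); // false
    }
}
```

An input format that never performs such a lookup simply ignores the property, which matches the behaviour reported for ORC tables.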




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
