I have been writing some test code that creates a simple ORC writer and reader with bloom filters enabled. The issue I have is that when the SearchArgument column matches the first column name passed to the Options searchArgument method (https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/Reader.java#L197), the bloom filter doesn't seem to get applied.
The test program creates an ORC file with 2 string columns. It then populates the file with 1 million records: each row holds the same UUID in both columns, and every row gets a different UUID. Finally it performs a series of reads on the file, counts the number of batches read, and displays the output.

Test program: https://gist.github.com/amccurry/a25a9dad1e657da5f4a1d8aec5e49118

NOTE: I'm assuming that the column names passed to the searchArgument method (https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/Reader.java#L197) tell the ORC reader which indexes it should read to perform the search operations.

High Level Output:

where a1 == literal
  colNames : ["a1"] reads 977 batches
  colNames : ["a1", "a2"] reads 977 batches
  colNames : ["a2", "a1"] reads 90 batches

where a2 == literal
  colNames : ["a2"] reads 977 batches
  colNames : ["a1", "a2"] reads 90 batches
  colNames : ["a2", "a1"] reads 977 batches

where a1 == literal AND where a2 == literal
  colNames : ["a1", "a2"] reads 90 batches
  colNames : ["a2", "a1"] reads 90 batches

where a1 == literal AND where a1 == literal
  colNames : ["a1"] reads 977 batches
  colNames : ["a1", "a2"] reads 977 batches
  colNames : ["a2", "a1"] reads 90 batches

where a2 == literal AND where a2 == literal
  colNames : ["a2"] reads 977 batches
  colNames : ["a1", "a2"] reads 90 batches
  colNames : ["a2", "a1"] reads 977 batches

Given that every row has the same value in both columns a1 and a2, I would assume that every one of these test runs would yield the same number of batches read, which should be 90.
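For reference, the 977 figure is what a full scan would produce if the reader uses 1024-row batches (the value of VectorizedRowBatch.DEFAULT_SIZE; that the test program uses this default is an assumption here). A minimal sketch of the arithmetic, with a hypothetical class and method name:

```java
public class BatchCountSketch {
    // Number of batches needed to read every row when nothing is skipped
    // (ceiling division of rows by batch size).
    static long batchesForFullScan(long rows, int batchSize) {
        return (rows + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        // Assumption: 1,000,000 rows read in batches of 1024 rows.
        System.out.println(batchesForFullScan(1_000_000L, 1024)); // prints 977
    }
}
```

So the runs that report 977 batches appear to read the whole file, with no row groups skipped at all.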
Raw Output: https://gist.github.com/amccurry/962744f35b19bd013ec48c9bcbfb15e4

I think the issue comes from the mapSargColumnsToOrcInternalColIdx method, where the rootColumn value is hard coded to '0': https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L713

The mapSargColumnsToOrcInternalColIdx method checks each provided column against the columns in the ORC schema. For each one it calls findColumns (https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L104), which, if the column name matches one of the values in the columnNames array, returns the sum of that index and rootColumn. After mapSargColumnsToOrcInternalColIdx returns, each value in the filterColumns array is checked to make sure it is greater than '0'. If the column is the first column and rootColumn is '0', the returned value is '0' and the filter on that column gets omitted.

I think the rootColumn literal should be '1' instead of '0' (https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L713).

Thoughts?

Thanks,
Aaron
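P.S. To make the suspected off-by-one concrete, here is a minimal standalone sketch of the mapping as I understand it. It is a simplified model, not the actual ORC code: the class name, method names, and the countKept helper are hypothetical, and the only behavior it models is "index plus rootColumn" followed by a "greater than 0" check.

```java
import java.util.Arrays;

public class SargColumnSketch {
    // Simplified stand-in for findColumns: position of the name in
    // columnNames plus rootColumn, or -1 when the name is not found.
    static int findColumn(String[] columnNames, String columnName, int rootColumn) {
        for (int i = 0; i < columnNames.length; i++) {
            if (columnName.equals(columnNames[i])) {
                return i + rootColumn;
            }
        }
        return -1;
    }

    // Simplified stand-in for mapSargColumnsToOrcInternalColIdx:
    // maps every predicate column to its computed index.
    static int[] mapColumns(String[] predicateColumns, String[] columnNames, int rootColumn) {
        int[] result = new int[predicateColumns.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = findColumn(columnNames, predicateColumns[i], rootColumn);
        }
        return result;
    }

    // Models the "> 0" check: counts the predicate columns that survive it.
    static int countKept(int[] filterColumns) {
        int kept = 0;
        for (int idx : filterColumns) {
            if (idx > 0) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] predicate = {"a1"};
        String[][] orders = {{"a1"}, {"a1", "a2"}, {"a2", "a1"}};
        for (int rootColumn = 0; rootColumn <= 1; rootColumn++) {
            for (String[] colNames : orders) {
                int[] mapped = mapColumns(predicate, colNames, rootColumn);
                System.out.println("rootColumn=" + rootColumn
                    + " colNames=" + Arrays.toString(colNames)
                    + " mapped=" + Arrays.toString(mapped)
                    + " kept=" + countKept(mapped));
            }
        }
    }
}
```

In this model, with rootColumn 0 the predicate on a1 is dropped whenever a1 is listed first and kept only for ["a2", "a1"], which lines up with the 977 / 977 / 90 pattern above. With rootColumn 1 it is kept in all three cases.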
