Yeah, I will work on a patch with some test cases.

Thanks,
Aaron

On Mon, Aug 15, 2016 at 9:59 PM, Prasanth J <[email protected]> wrote:

> Hi Aaron
>
> Thanks a lot for reporting the issue and providing a test case!
>
> I looked at the test case and I think your solution to offset the
> rootColumn by 1 is correct. It would be good to have this tested with ACID
> as well, since the root column for ACID will be different.
>
> Would you be willing to put up a patch for this issue? I will help with the
> review and commit.
>
> Thanks
> Prasanth
>
>
> On Aug 15, 2016, at 1:08 PM, Aaron McCurry <[email protected]> wrote:
> >
> > I have been writing some test code that creates a simple orc writer and
> > reader with bloom filters enabled. The issue I have is that when the
> > SearchArgument matches the first column name provided in the Options
> > searchArgument method (
> > https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/Reader.java#L197)
> > the bloom filter doesn't seem to get applied.
> >
> > The test program creates an orc file with 2 string columns. Then it
> > populates the orc file with 1 million records with the same UUID in both
> > columns, but different values for each row. Then it performs a series of
> > reads on the file, counts the number of batches read, and displays the
> > output.
> >
> > Test program:
> > https://gist.github.com/amccurry/a25a9dad1e657da5f4a1d8aec5e49118
> >
> > NOTE: I'm assuming the column names passed to the searchArgument method (
> > https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/Reader.java#L197)
> > are there to inform the orc reader which indexes it should read to
> > perform the search operations.
> >
> > High Level Output:
> >
> > where a1 == literal
> > colNames : ["a1"] reads 977 batches
> > colNames : ["a1", "a2"] reads 977 batches
> > colNames : ["a2", "a1"] reads 90 batches
> >
> > where a2 == literal
> > colNames : ["a2"] reads 977 batches
> > colNames : ["a1", "a2"] reads 90 batches
> > colNames : ["a2", "a1"] reads 977 batches
> >
> > where a1 == literal AND where a2 == literal
> > colNames : ["a1", "a2"] reads 90 batches
> > colNames : ["a2", "a1"] reads 90 batches
> >
> > where a1 == literal AND where a1 == literal
> > colNames : ["a1"] reads 977 batches
> > colNames : ["a1", "a2"] reads 977 batches
> > colNames : ["a2", "a1"] reads 90 batches
> >
> > where a2 == literal AND where a2 == literal
> > colNames : ["a2"] reads 977 batches
> > colNames : ["a1", "a2"] reads 90 batches
> > colNames : ["a2", "a1"] reads 977 batches
> >
> > Given that every row has the same value in both columns a1 and a2, I would
> > assume that every one of these test runs would yield the same number of
> > batches read, which should be 90.
> >
> > Raw Output:
> > https://gist.github.com/amccurry/962744f35b19bd013ec48c9bcbfb15e4
> >
> > I think the issue comes from the mapSargColumnsToOrcInternalColIdx method,
> > where the rootColumn value is hard coded to '0':
> > https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L713
> >
> > The mapSargColumnsToOrcInternalColIdx method checks each provided column
> > against the columns in the orc schema. During this it calls findColumns (
> > https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L104)
> > where, if the column name matches one of the values in the columnNames
> > array, the index and rootColumn are added together and returned.
> >
> > Then, when mapSargColumnsToOrcInternalColIdx returns, each value in the
> > filterColumns array is checked to make sure its value is greater than '0'.
> > If the column index is the first column and the rootColumn is '0', then
> > its return value is '0' and the logical column filter gets omitted.
> >
> > I think the rootColumn literal should be '1' instead of '0' (
> > https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L713
> > ).
> >
> > Thoughts?
> >
> > Thanks,
> >
> > Aaron
> >
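
For anyone following the thread, here is a minimal, self-contained sketch of the behaviour described above. It is a simplified model based only on the description in this thread, not the actual ORC source: the class name, the method names findColumn, mapSargColumns and markSargColumns, and the use of plain strings in place of predicate leaves are all assumptions made for illustration.

```java
import java.util.Arrays;

public class SargColumnMappingSketch {

    // Hypothetical stand-in for findColumns: returns the position of
    // columnName in columnNames plus rootColumn, or -1 if it is not found.
    static int findColumn(String[] columnNames, String columnName, int rootColumn) {
        for (int i = 0; i < columnNames.length; i++) {
            if (columnName.equals(columnNames[i])) {
                return i + rootColumn;
            }
        }
        return -1;
    }

    // Hypothetical stand-in for mapSargColumnsToOrcInternalColIdx: maps the
    // column named by each search-argument leaf to a column id.
    static int[] mapSargColumns(String[] sargLeafColumns, String[] columnNames, int rootColumn) {
        int[] result = new int[sargLeafColumns.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = findColumn(columnNames, sargLeafColumns[i], rootColumn);
        }
        return result;
    }

    // Models the "greater than 0" check described in the thread: only ids
    // above 0 are marked as columns whose indexes should be read.
    static boolean[] markSargColumns(int[] filterColumns, int columnCount) {
        boolean[] sargColumns = new boolean[columnCount];
        for (int id : filterColumns) {
            if (id > 0) {
                sargColumns[id] = true;
            }
        }
        return sargColumns;
    }

    public static void main(String[] args) {
        String[] colNames = {"a1", "a2"};
        String[] leaves = {"a1"};   // where a1 == literal
        int columnCount = 3;        // assumed layout: root struct, a1, a2

        for (int rootColumn = 0; rootColumn <= 1; rootColumn++) {
            int[] ids = mapSargColumns(leaves, colNames, rootColumn);
            boolean[] marked = markSargColumns(ids, columnCount);
            System.out.println("rootColumn=" + rootColumn
                + " ids=" + Arrays.toString(ids)
                + " marked=" + Arrays.toString(marked));
        }
    }
}
```

With rootColumn set to 0, the first column name maps to id 0 and nothing gets marked, so in this model no index would be consulted for a1. With rootColumn set to 1, a1 maps to id 1 and is marked. As Prasanth notes, the correct offset may differ for ACID files, which this sketch does not attempt to model.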
