Re: Issue bloom filters with orc?

Prasanth J Mon, 15 Aug 2016 19:00:07 -0700

Hi Aaron

Thanks a lot for reporting the issue and providing a test case!


I looked at the test case, and I think your solution of offsetting rootColumn 
by 1 is correct. It would be good to have this tested with ACID as well, since 
the root column for ACID will be different. 

Would you be willing to put up a patch for this issue? I will help with the 
review and commit. 

Thanks
Prasanth

> On Aug 15, 2016, at 1:08 PM, Aaron McCurry <[email protected]> wrote:
> 
> I have been writing some test code that creates a simple orc writer and
> reader with bloom filters enabled.  The issue I have is that when the
> SearchArgument matches the first column name provided to the Options
> searchArgument method (
> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/Reader.java#L197),
> the bloom filter doesn't seem to get applied.
> 
> The test program creates an orc file with 2 string columns.  Then it
> populates the orc file with 1 million records with the same UUID in both
> columns, but different values for each row.  Then it performs a series of
> reads on the file, counts the number of batches read, and displays the
> output.
> 
> Test program:
> https://gist.github.com/amccurry/a25a9dad1e657da5f4a1d8aec5e49118
> 
> NOTE: I'm assuming the column names passed to the searchArgument (
> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/Reader.java#L197)
> method are there to inform the orc reader which indexes it should read to
> perform the search operations.
> 
> High Level Output:
> 
> where a1 == literal
> colNames : ["a1"] reads 977 batches
> colNames : ["a1", "a2"] reads 977 batches
> colNames : ["a2", "a1"] reads 90 batches
> 
> where a2 == literal
> colNames : ["a2"] reads 977 batches
> colNames : ["a1", "a2"] reads 90 batches
> colNames : ["a2", "a1"] reads 977 batches
> 
> where a1 == literal AND where a2 == literal
> colNames : ["a1", "a2"] reads 90 batches
> colNames : ["a2", "a1"] reads 90 batches
> 
> where a1 == literal AND where a1 == literal
> colNames : ["a1"] reads 977 batches
> colNames : ["a1", "a2"] reads 977 batches
> colNames : ["a2", "a1"] reads 90 batches
> 
> where a2 == literal AND where a2 == literal
> colNames : ["a2"] reads 977 batches
> colNames : ["a1", "a2"] reads 90 batches
> colNames : ["a2", "a1"] reads 977 batches
> 
> Given that every row has the same value in both columns a1 and a2 I would
> assume that every one of these test runs would yield the same number of
> batches read, which should be 90.
> 
> Raw Output:
> https://gist.github.com/amccurry/962744f35b19bd013ec48c9bcbfb15e4
> 
> I think the issue comes from the mapSargColumnsToOrcInternalColIdx method,
> where the rootColumn value is hard coded to '0':
> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L713
> 
> The mapSargColumnsToOrcInternalColIdx method checks each provided column
> against the columns in the orc schema.  While doing so it calls findColumns (
> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L104)
> where, if the column name matches one of the values in the columnNames
> array, the sum of the matching index and rootColumn is returned.
> 
> Then, when mapSargColumnsToOrcInternalColIdx returns, the code checks each
> value in the filterColumns array to make sure its value is greater than
> '0'.  If the column is the first column and the rootColumn is '0', then
> its return value is '0' and the logical column filter gets omitted.
> 
> I think the rootColumn literal should be '1' instead of '0' (
> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L713
> ).
> 
> Thoughts?
> 
> Thanks,
> 
> Aaron
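
The behaviour described in the quoted message can be modelled with a small,
self-contained sketch. This is not the ORC source: the class and method names
(SargColumnMapping, findColumn, filterApplied) are hypothetical, and the logic
only mirrors the description above (index in the columnNames array plus
rootColumn, with a filter kept only when the result is greater than 0).

```java
public class SargColumnMapping {

    // Hypothetical stand-in for the lookup described above: returns the
    // position of columnName in columnNames plus rootColumn, or -1 if absent.
    static int findColumn(String[] columnNames, String columnName, int rootColumn) {
        for (int i = 0; i < columnNames.length; i++) {
            if (columnNames[i].equals(columnName)) {
                return i + rootColumn;
            }
        }
        return -1;
    }

    // Models the check described in the message: only mapped indexes greater
    // than 0 are kept, so a mapped index of 0 drops the column's filter.
    static boolean filterApplied(String[] columnNames, String predicateColumn, int rootColumn) {
        return findColumn(columnNames, predicateColumn, rootColumn) > 0;
    }

    public static void main(String[] args) {
        String[][] orders = { {"a1"}, {"a1", "a2"}, {"a2", "a1"} };
        // Compare the hard-coded rootColumn of 0 with the proposed value of 1.
        for (int rootColumn = 0; rootColumn <= 1; rootColumn++) {
            for (String[] colNames : orders) {
                System.out.println("rootColumn=" + rootColumn
                    + " colNames=" + String.join(",", colNames)
                    + " a1 filter applied: "
                    + filterApplied(colNames, "a1", rootColumn));
            }
        }
    }
}
```

In this model, with rootColumn 0 the a1 filter is dropped whenever a1 is the
first entry in colNames, which lines up with the 977-batch runs reported
above. With rootColumn 1 the filter is kept for every ordering.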
