That's great!  Thanks for checking.

Aaron

On Wed, Aug 17, 2016 at 6:13 PM, Prasanth J <[email protected]> wrote:

> I can confirm that ORC-54 fixes the issue.
>
> I ran the test case initially provided by Aaron, and I am getting the
> expected test results.
> Total Batches Added [977]
> Total Batches Read [90] with columnNames [[a1]] for sarg [leaf-0 = (EQUALS
> a1 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
> Total Batches Read [90] with columnNames [[a1, a2]] for sarg [leaf-0 =
> (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
> Total Batches Read [90] with columnNames [[a2, a1]] for sarg [leaf-0 =
> (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
> Total Batches Read [90] with columnNames [[a2]] for sarg [leaf-0 = (EQUALS
> a2 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
> Total Batches Read [90] with columnNames [[a1, a2]] for sarg [leaf-0 =
> (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
> Total Batches Read [90] with columnNames [[a2, a1]] for sarg [leaf-0 =
> (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
> Total Batches Read [90] with columnNames [[a1, a2]] for sarg [leaf-0 =
> (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), leaf-1 = (EQUALS a2
> 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-1)].
> Total Batches Read [90] with columnNames [[a2, a1]] for sarg [leaf-0 =
> (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), leaf-1 = (EQUALS a2
> 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-1)].
> Total Batches Read [90] with columnNames [[a1]] for sarg [leaf-0 = (EQUALS
> a1 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-0)].
> Total Batches Read [90] with columnNames [[a1, a2]] for sarg [leaf-0 =
> (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0
> leaf-0)].
> Total Batches Read [90] with columnNames [[a2, a1]] for sarg [leaf-0 =
> (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0
> leaf-0)].
> Total Batches Read [90] with columnNames [[a2]] for sarg [leaf-0 = (EQUALS
> a2 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-0)].
> Total Batches Read [90] with columnNames [[a1, a2]] for sarg [leaf-0 =
> (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0
> leaf-0)].
> Total Batches Read [90] with columnNames [[a2, a1]] for sarg [leaf-0 =
> (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0
> leaf-0)].
>
> Thanks
> Prasanth
>
> > On Aug 17, 2016, at 3:09 PM, Owen O'Malley <[email protected]> wrote:
> >
> > This issue might have been fixed as part of ORC-54, which got committed
> > this morning. Do you have a testcase already?
> >
> > .. Owen
> >
> > On Mon, Aug 15, 2016 at 1:08 PM, Aaron McCurry <[email protected]> wrote:
> >
> >> I have been writing some test code that creates a simple ORC writer and
> >> reader with bloom filters enabled.  The issue I have is that when the
> >> SearchArgument references the first column name provided to the Options
> >> searchArgument method (
> >> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/Reader.java#L197)
> >> the bloom filter doesn't seem to get applied.
> >>
> >> The test program creates an ORC file with 2 string columns.  Then it
> >> populates the file with 1 million records, each with the same UUID in
> >> both columns but a different value for each row.  Then it performs a
> >> series of reads on the file, counts the number of batches read, and
> >> displays the output.
> >>
> >> Test program:
> >> https://gist.github.com/amccurry/a25a9dad1e657da5f4a1d8aec5e49118
> >>
> >> NOTE: I'm assuming the column names passed to the searchArgument method (
> >> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/Reader.java#L197)
> >> are there to inform the ORC reader which indexes it should read to
> >> perform the search operations.
> >>
> >> High Level Output:
> >>
> >> where a1 == literal
> >> colNames : ["a1"] reads 977 batches
> >> colNames : ["a1", "a2"] reads 977 batches
> >> colNames : ["a2", "a1"] reads 90 batches
> >>
> >> where a2 == literal
> >> colNames : ["a2"] reads 977 batches
> >> colNames : ["a1", "a2"] reads 90 batches
> >> colNames : ["a2", "a1"] reads 977 batches
> >>
> >> where a1 == literal AND where a2 == literal
> >> colNames : ["a1", "a2"] reads 90 batches
> >> colNames : ["a2", "a1"] reads 90 batches
> >>
> >> where a1 == literal AND where a1 == literal
> >> colNames : ["a1"] reads 977 batches
> >> colNames : ["a1", "a2"] reads 977 batches
> >> colNames : ["a2", "a1"] reads 90 batches
> >>
> >> where a2 == literal AND where a2 == literal
> >> colNames : ["a2"] reads 977 batches
> >> colNames : ["a1", "a2"] reads 90 batches
> >> colNames : ["a2", "a1"] reads 977 batches
> >>
> >> Given that every row has the same value in both columns a1 and a2, I
> >> would assume that every one of these test runs would yield the same
> >> number of batches read, which should be 90.
> >>
> >> Raw Output:
> >> https://gist.github.com/amccurry/962744f35b19bd013ec48c9bcbfb15e4
> >>
> >> I think the issue comes from the mapSargColumnsToOrcInternalColIdx
> >> method, where the rootColumn value is hard coded to '0':
> >> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L713
> >>
> >> The mapSargColumnsToOrcInternalColIdx method checks each provided
> >> column against the columns in the ORC schema.  During this it calls
> >> findColumns (
> >> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L104)
> >> where, if the column name matches one of the values in the columnNames
> >> array, the sum of the index and rootColumn is returned.
> >>
> >> Then, when mapSargColumnsToOrcInternalColIdx returns, the caller checks
> >> each value in the filterColumns array to make sure its value is greater
> >> than '0'.  If the column is the first column (index '0') and the
> >> rootColumn is '0', then the returned value is '0' and the filter for
> >> that column gets omitted.
> >>
> >> I think the rootColumn literal should be '1' instead of '0' (
> >> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L713).
> >>
> >> Thoughts?
> >>
> >> Thanks,
> >>
> >> Aaron
> >>
>
>
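
For anyone reading this thread later, the rootColumn behavior described in the original report can be sketched roughly as below. This is only an illustration of the reasoning in the report, not the ORC source and not the ORC-54 change: the class SargColumnSketch and the methods findColumn and isFilterApplied are hypothetical names, and the "greater than 0" check is taken from the description above rather than copied from RecordReaderImpl.

```java
public class SargColumnSketch {
    // Hypothetical stand-in for the lookup described in the report:
    // returns (position in columnNames) + rootColumn, or -1 if absent.
    static int findColumn(String[] columnNames, String name, int rootColumn) {
        for (int i = 0; i < columnNames.length; i++) {
            if (columnNames[i].equals(name)) {
                return i + rootColumn;
            }
        }
        return -1;
    }

    // Hypothetical stand-in for the "greater than 0" check on filterColumns:
    // a mapped index of 0 is treated the same as "no column", so no filter.
    static boolean isFilterApplied(String[] columnNames, String sargColumn,
                                   int rootColumn) {
        return findColumn(columnNames, sargColumn, rootColumn) > 0;
    }

    public static void main(String[] args) {
        String[][] orders = { {"a1"}, {"a1", "a2"}, {"a2", "a1"} };
        for (int rootColumn = 0; rootColumn <= 1; rootColumn++) {
            for (String[] names : orders) {
                System.out.println("rootColumn=" + rootColumn
                    + " colNames=" + String.join(",", names)
                    + " filter on a1 applied: "
                    + isFilterApplied(names, "a1", rootColumn));
            }
        }
    }
}
```

Under these assumptions, with rootColumn 0 the filter on a1 is skipped whenever a1 is listed first, which lines up with the 977-batch rows in the high level output; with rootColumn 1 it is applied in all three orderings.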
