Paul Rogers created DRILL-6034:
----------------------------------

             Summary: repeated_contains returns a count, not a boolean, subject to overflow
                 Key: DRILL-6034
                 URL: https://issues.apache.org/jira/browse/DRILL-6034
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.10.0
            Reporter: Paul Rogers


Consider the existing Drill unit test 
{{testJsonReader.testRepeatedContains()}}, which runs the following query:

{code}
select repeated_contains(str_list, 'asdf') from 
cp.`store/json/json_basic_repeated_varchar.json`
{code}

According to the 
[documentation|http://drill.apache.org/docs/repeated-contains/]:

bq. REPEATED_CONTAINS returns true if Drill finds a match; otherwise, the 
function returns false.

Run the above query and print the results:

{noformat}
select repeated_contains(str_list, 'asdf') from 
cp.`store/json/json_basic_repeated_varchar.json`
#: EXPR$0
0: 5
1: 0
2: 0
3: 0
{noformat}

Note that the first row has a value of 5, which is *not* a Boolean. Drill has no 
Boolean type and instead uses the traditional encoding to integers: {{TRUE}} = 
1, {{FALSE}} = 0. A value of 5 is not a valid Boolean value, so an expression 
such as the following may fail:

{code}
SELECT * FROM cp.`store/json/json_basic_repeated_varchar.json`
  WHERE repeated_contains(str_list, 'asdf') = TRUE
{code}

The schema of the returned count value uses the Drill {{BIT}} type. For various 
historical reasons, Drill implements {{BIT}} as "UInt1" -- an unsigned 8-bit 
integer.

Further, since the function seems to return a count, it is subject to overflow 
if the count is 256, 512, or any other multiple of 256. That is, if a list has 
256 occurrences of the pattern, {{repeated_contains}} will return 256 modulo 
256 = 0, which is the equivalent of SQL {{FALSE}}.
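
The wrap-around described above can be illustrated with a small standalone sketch. This is not Drill code: the class {{UInt1OverflowSketch}} and the helper {{storeInUnsignedByte}} are hypothetical names, and the sketch assumes that only the low 8 bits of the count survive when it is written to an unsigned 8-bit value.

```java
public class UInt1OverflowSketch {

  // Hypothetical helper (not Drill code): models writing a match count into
  // an unsigned 8-bit slot, keeping only the low 8 bits (count modulo 256).
  static int storeInUnsignedByte(int count) {
    return count & 0xFF;
  }

  public static void main(String[] args) {
    int[] counts = {5, 255, 256, 512};
    for (int count : counts) {
      System.out.println(count + " -> " + storeInUnsignedByte(count));
    }
    // prints:
    // 5 -> 5
    // 255 -> 255
    // 256 -> 0
    // 512 -> 0
  }
}
```

Under that assumption, a list with exactly 256 matches would be indistinguishable from a list with none.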

The recommendation is that the function be modified to return either 1 or 0. If 
there is a reason to have a count, use the existing {{repeated_count}} function.
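
As a rough illustration of the recommended behavior, the sketch below returns 1 on the first match and 0 otherwise, so no count is ever accumulated. It is not Drill's implementation: the class {{RepeatedContainsSketch}} and the method {{containsAsBit}} are hypothetical, and the sketch uses exact string equality as a simplification of whatever matching the real function performs.

```java
import java.util.Arrays;
import java.util.List;

public class RepeatedContainsSketch {

  // Hypothetical stand-in (not Drill code): returns 1 if any element equals
  // the target, else 0. Exact equality is a simplifying assumption.
  static int containsAsBit(List<String> values, String target) {
    for (String value : values) {
      if (target.equals(value)) {
        return 1; // stop at the first match; no count is accumulated
      }
    }
    return 0;
  }

  public static void main(String[] args) {
    List<String> fiveMatches =
        Arrays.asList("asdf", "asdf", "asdf", "asdf", "asdf");
    List<String> noMatches = Arrays.asList("qwer", "zxcv");
    System.out.println(containsAsBit(fiveMatches, "asdf")); // prints 1
    System.out.println(containsAsBit(noMatches, "asdf"));   // prints 0
  }
}
```

Because the result can only be 0 or 1, it fits the {{TRUE}}/{{FALSE}} encoding regardless of how many times the value occurs in the list.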

Note that the "test" never caught this because it simply ran the query, but did 
not verify results:

{code}
  @Test
  public void testRepeatedContains() throws Exception {
    test("select repeated_contains(str_list, 'asdf') from "
        + "cp.`store/json/json_basic_repeated_varchar.json`");
...
{code}

The issue was revealed when adding verification. For now, the new test verifies 
the incorrect results; it should be modified to match the documented results 
if/when the code is updated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
