[ 
https://issues.apache.org/jira/browse/DRILL-6034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-6034:
-------------------------------
    Summary: repeated_contains returns a count, not a Boolean, subject to 
overflow  (was: repeated_contains returns a count, not a boolean, subject to 
overflow)

> repeated_contains returns a count, not a Boolean, subject to overflow
> ---------------------------------------------------------------------
>
>                 Key: DRILL-6034
>                 URL: https://issues.apache.org/jira/browse/DRILL-6034
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>
> Consider the existing Drill unit test 
> {{testJsonReader.testRepeatedContains()}}, which runs the following query:
> {code}
> select repeated_contains(str_list, 'asdf') from 
> cp.`store/json/json_basic_repeated_varchar.json`
> {code}
> According to the 
> [documentation|http://drill.apache.org/docs/repeated-contains/]:
> bq. REPEATED_CONTAINS returns true if Drill finds a match; otherwise, the 
> function returns false.
> Run the above query and print the results:
> {noformat}
> select repeated_contains(str_list, 'asdf') from 
> cp.`store/json/json_basic_repeated_varchar.json`
> #: EXPR$0
> 0: 5
> 1: 0
> 2: 0
> 3: 0
> {noformat}
> Note that the first row has a value of 5, which is *not* a Boolean. Drill has 
> no Boolean type and instead uses the traditional encoding to integers: 
> {{TRUE}} = 1, {{FALSE}} = 0. A value of 5 is not a valid Boolean value, so the 
> following expression may fail:
> {code}
> SELECT * FROM cp.`store/json/json_basic_repeated_varchar.json`
>   WHERE repeated_contains(str_list, 'asdf') = TRUE
> {code}
> The schema of the returned count value uses the Drill {{BIT}} type. For 
> various historical reasons, Drill implements {{BIT}} as "UInt1": an 
> unsigned 8-bit integer.
> Further, since the function seems to return a count, it is subject to 
> overflow if the count is 256, 512 or any other multiple of 256. That is, if a 
> list has 256 occurrences of the pattern, {{repeated_contains}} will return 
> 256 modulo 256 = 0, which is the equivalent of SQL {{FALSE}}.
> The recommendation is that the function be modified to return either 1 or 0. 
> If there is a reason to have a count, use the existing {{repeated_count}} 
> function.
> Note that the "test" never caught this because it simply ran the query, but 
> did not verify results:
> {code}
>   @Test
>   public void testRepeatedContains() throws Exception {
>     test("select repeated_contains(str_list, 'asdf') from 
> cp.`store/json/json_basic_repeated_varchar.json`");
> ...
> {code}
> The issue was revealed when adding verification. For now, the new test 
> verifies the incorrect results; it should be modified to match the documented 
> results if/when the code is updated.
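
The overflow described in the issue can be illustrated with a small standalone Java sketch. This is not Drill's implementation: the class and method names are hypothetical, matching is simplified to exact string equality, and the unsigned 8-bit storage is only modeled by masking the count with 0xFF.

```java
import java.util.Arrays;

// Hypothetical sketch, not Drill code: models a match count stored in an
// unsigned 8-bit slot versus a result that is always 0 or 1.
final class RepeatedContainsSketch {

    // Models the reported behavior: count the matches, then keep only the
    // low 8 bits, as an unsigned 8-bit field would.
    static int countStoredAsUInt1(String[] list, String target) {
        int count = 0;
        for (String s : list) {
            if (s.equals(target)) {
                count++;
            }
        }
        return count & 0xFF;
    }

    // Models the recommended behavior: stop at the first match and return
    // 1 (TRUE) or 0 (FALSE), so the list length cannot affect the result.
    static int containsAsBit(String[] list, String target) {
        for (String s : list) {
            if (s.equals(target)) {
                return 1;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        String[] five = new String[5];
        Arrays.fill(five, "asdf");
        String[] many = new String[256];
        Arrays.fill(many, "asdf");

        System.out.println(countStoredAsUInt1(five, "asdf")); // 5: not a valid Boolean
        System.out.println(countStoredAsUInt1(many, "asdf")); // 0: wraps to FALSE
        System.out.println(containsAsBit(many, "asdf"));      // 1: TRUE
    }
}
```

With 256 matching entries the masked count wraps to 0, which reads as FALSE even though the list contains the value, while the 0-or-1 variant is unaffected by the number of matches.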



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
