[jira] [Commented] (DRILL-6359) All-text mode in JSON still reads missing column as Nullable Int

Paul Rogers (JIRA) Sun, 29 Apr 2018 13:05:17 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-6359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458148#comment-16458148
 ]


Paul Rogers commented on DRILL-6359:
------------------------------------

Suppose we have the following two files:

{noformat}
json/missing/
  file1.json:
    {a: 1}
  file2.json:
    {a: 2, b: "foo"}
{noformat}

If Drill reads both files in the same minor fragment, the query may work if 
Drill happens to read file2 before file1. (The order in which Drill reads files 
is random.) But the query will fail if Drill reads file1 before file2. How it 
fails depends on the query.
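
The ordering dependence can be sketched with a toy model. This is not Drill code. It assumes, purely for illustration, that a reader within one fragment keeps every column it has already seen; the function names are hypothetical.

```python
# Toy model only, not Drill code. Assumption (for illustration): within one
# fragment, the reader keeps every column it has already seen.
def batch_schemas(files):
    """Return the schema (ordered column names) emitted after each file."""
    seen = []       # columns discovered so far, in first-seen order
    schemas = []
    for record in files:
        for column in record:
            if column not in seen:
                seen.append(column)
        schemas.append(tuple(seen))
    return schemas


def has_schema_change(schemas):
    """True if any batch schema differs from the one before it."""
    return any(prev != cur for prev, cur in zip(schemas, schemas[1:]))


file1 = {"a": 1}
file2 = {"a": 2, "b": "foo"}

file2_first = batch_schemas([file2, file1])  # b is known from the start
file1_first = batch_schemas([file1, file2])  # b appears only in the second batch

print(has_schema_change(file2_first))
print(has_schema_change(file1_first))
```

Under this assumption, only the file1-then-file2 order presents a downstream operator with a schema that changes between batches.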

For example, in one test, the following query works:

{noformat}
ALTER SESSION SET `store.json.all_text_mode` = true;
SELECT a, b FROM `json/missing` ORDER BY a;
{noformat}

Then, rename {{file1.json}} to {{file3.json}}. The same query now produces an 
error:

{noformat}
Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
Sort. Please enable Union type.

Previous schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)]],
  selectionVector=NONE]
Incoming schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], 
  [`b` (VARCHAR:OPTIONAL)]], selectionVector=NONE] 
{noformat}

The key point is the error message above: the JSON reader appears not to have 
created the missing column {{b}}. As a result, some higher-level Project 
operator created the column. Since that operator does not know the source of 
the data, it guessed Nullable Int.

So, the problem described here seems to be that the JSON reader does not fill 
in a missing column when all-text mode is set, but it does so when all-text 
mode is off.
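
A minimal sketch of the difference, again as a toy model rather than Drill code. The function names are hypothetical, and the types are simplified to the {{VARCHAR:OPTIONAL}} that all-text mode is expected to produce; the sketch does not try to reproduce the exact types reported in the error message.

```python
# Toy model only, not Drill code. Types are simplified: every column the
# reader creates in all-text mode is shown as VARCHAR:OPTIONAL.
VARCHAR = "VARCHAR:OPTIONAL"


def observed_schema(record, projected):
    """Reader creates only the projected columns actually present in the data."""
    return [(col, VARCHAR) for col in projected if col in record]


def expected_schema(record, projected):
    """Reader creates every projected column, whether or not the data has it.

    The record argument is unused; it is kept so both functions share a signature.
    """
    return [(col, VARCHAR) for col in projected]


projected = ["a", "b"]
file1 = {"a": "1"}
file2 = {"a": "2", "b": "foo"}

# Observed: the two files yield different schemas, so read order matters.
print(observed_schema(file1, projected) == observed_schema(file2, projected))

# Expected: both files yield the same schema, so read order does not matter.
print(expected_schema(file1, projected) == expected_schema(file2, projected))
```

If the reader filled in {{b}} itself, as in the second function, no downstream operator would need to guess a type for it.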

> All-text mode in JSON still reads missing column as Nullable Int
> ----------------------------------------------------------------
>
>                 Key: DRILL-6359
>                 URL: https://issues.apache.org/jira/browse/DRILL-6359
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.13.0, 1.14.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> Suppose we have the following file:
> {noformat}
> {a: 0}
> {a: 1}
> ...
> {a: 70001, b: 10.5}
> {noformat}
> Where the "..." indicates another 70K records. (Chosen to force the 
> appearance of {{b}} into a second or later batch.)
> Suppose we execute the following query:
> {code}
> ALTER SESSION SET `store.json.all_text_mode` = true;
> SELECT a, b FROM `70Kmissing.json` WHERE b IS NOT NULL ORDER BY a;
> {code}
> The query should work. We have an explicit project for column {{b}} and we've 
> told JSON to always use text. So, JSON should have enough information to 
> create column {{b}} as {{Nullable VarChar}}.
> Yet, the result of the query in {{sqlline}} is:
> {noformat}
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
> Sort. Please enable Union type.
> Previous schema BatchSchema [fields=[[`a` (VARCHAR:OPTIONAL)], [`b` 
> (INT:OPTIONAL)]], selectionVector=NONE]
> Incoming schema BatchSchema [fields=[[`a` (VARCHAR:OPTIONAL)], [`b` 
> (VARCHAR:OPTIONAL)]], selectionVector=NONE]
> {noformat}
> The expected result is that the query works because even missing columns 
> should be subject to the "all text mode" setting because the JSON reader 
> handles projection push-down, and is responsible for filling in the missing 
> columns.
> This is with the shipping Drill 1.13 JSON reader. I *think* this is fixed in 
> the "batch size handling" JSON reader rewrite, but I've not tested it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
