[
https://issues.apache.org/jira/browse/DRILL-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872033#comment-16872033
]
Paul Rogers edited comment on DRILL-7308 at 6/30/19 1:54 AM:
-------------------------------------------------------------
Recall that Drill can return not only multiple batches, but multiple "result
sets": runs of batches with different schemas.
A more sophisticated REST solution would handle this case. I can't find any
Protobuf field that indicates the schema changed; instead, we'd have to reuse
code from elsewhere that compares the current schema to the previous one.
Ideally, in that case, we'd create a new JSON element for the second schema.
Something like:
{code:json}
{ "resultSets": [
    { "rows": ...,
      "schema": ...
    },
    { "rows": ...,
      "schema": ...
    }
  ]
}
{code}
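For concreteness, here is a minimal sketch of that batch-by-batch comparison.
The types below ({{Batch}}, {{ResultSetJson}}, {{split}}) are stand-ins made up
for illustration, not Drill's actual REST classes:
{code:java}
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: group consecutive batches into "result sets", starting a
// new one whenever a batch's schema differs from the previous batch's schema.
// Batch and ResultSetJson are illustrative stand-ins, not Drill classes.
public class SchemaSplitter {

  static class Batch {
    final List<String> schema;     // simplified: column "name:type" strings
    final List<Object[]> rows;
    Batch(List<String> schema, List<Object[]> rows) {
      this.schema = schema;
      this.rows = rows;
    }
  }

  static class ResultSetJson {
    final List<String> schema;                      // becomes the "schema" element
    final List<Object[]> rows = new ArrayList<>();  // becomes the "rows" element
    ResultSetJson(List<String> schema) {
      this.schema = schema;
    }
  }

  static List<ResultSetJson> split(List<Batch> batches) {
    List<ResultSetJson> resultSets = new ArrayList<>();
    List<String> priorSchema = null;
    for (Batch batch : batches) {
      if (priorSchema == null || !batch.schema.equals(priorSchema)) {
        // Schema changed: open a new entry in the "resultSets" array.
        resultSets.add(new ResultSetJson(batch.schema));
        priorSchema = batch.schema;
      }
      resultSets.get(resultSets.size() - 1).rows.addAll(batch.rows);
    }
    return resultSets;
  }
}
{code}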
It is easy to create such a case. Simply create two CSV files in the same
directory, one with two columns, the other with three (sketched below). Use
just a simple {{SELECT * FROM yourTable}} query against that directory. You
will get two data batches, each with a distinct schema.
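For example, two headered CSV files along these lines would do it (file and
column names are made up for illustration):
{code}
-- two_cols.csvh --
a,b
1,2

-- three_cols.csvh --
a,b,c
3,4,5
{code}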
The current implementation will report just the first schema along with all
rows, even though the schemas vary. (Actually, the current implementation will
list the two columns, then the three columns, duplicating the first two, but
we want to fix that...)
This is yet another reason to use a provisioned schema: with such a schema we
can guarantee that the entire query will return a single, consistent schema
regardless of the variation across files.
A quick & dirty solution is to clear and rebuild the schema objects on every
batch. That way, the value sent to the user will reflect the last schema
which, if you are lucky, will be valid for the earlier batches as well.
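A minimal sketch of that quick & dirty approach, again with stand-in names
rather than the actual REST listener classes:
{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch of the quick & dirty fix: rebuild "columns" and "metadata" from
// scratch on every batch, so the JSON response reflects the last batch's
// schema instead of accumulating duplicated columns. Stand-in types only.
public class QuickAndDirtyResults {
  final List<String> columns = new ArrayList<>();   // the "columns" JSON element
  final List<String> metadata = new ArrayList<>();  // type strings, e.g. "VARCHAR"
  final List<Object[]> rows = new ArrayList<>();

  void onBatch(List<String> names, List<String> types, List<Object[]> batchRows) {
    columns.clear();          // discard the schema captured from earlier batches...
    metadata.clear();
    columns.addAll(names);    // ...and rebuild it from the current batch
    metadata.addAll(types);
    rows.addAll(batchRows);   // rows still accumulate across all batches
  }
}
{code}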
It is a known, unresolved issue that Drill does not attempt to merge schema
changes, and that unmerged schema changes cannot be handled by ODBC or JDBC
clients. We can assume, however, that users of the REST API won't have messy
data and so won't run into this issue.
> Incorrect Metadata from text file queries
> -----------------------------------------
>
> Key: DRILL-7308
> URL: https://issues.apache.org/jira/browse/DRILL-7308
> Project: Apache Drill
> Issue Type: Bug
> Components: Metadata
> Affects Versions: 1.17.0
> Reporter: Charles Givre
> Priority: Major
> Attachments: Screen Shot 2019-06-24 at 3.16.40 PM.png, domains.csvh
>
>
> I'm noticing some strange behavior with the newest version of Drill. If you
> query a CSV file, you get the following metadata:
> {code:sql}
> SELECT * FROM dfs.test.`domains.csvh` LIMIT 1
> {code}
> {code:json}
> {
>   "queryId": "22eee85f-c02c-5878-9735-091d18788061",
>   "columns": [
>     "domain"
>   ],
>   "rows": [
>     { "domain": "thedataist.com" }
>   ],
>   "metadata": [
>     "VARCHAR(0, 0)",
>     "VARCHAR(0, 0)"
>   ],
>   "queryState": "COMPLETED",
>   "attemptedAutoLimit": 0
> }
> {code}
> There are two issues here:
> 1. VARCHAR now has precision
> 2. There are twice as many columns as there should be.
> Additionally, if you query a regular CSV, without the columns extracted, you
> get the following:
> {code:json}
> "rows": [
>   {
>     "columns": "[\"ACCT_NUM\",\"PRODUCT\",\"MONTH\",\"REVENUE\"]"
>   }
> ],
> "metadata": [
>   "VARCHAR(0, 0)",
>   "VARCHAR(0, 0)"
> ],
> {code}