[ https://issues.apache.org/jira/browse/DRILL-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872033#comment-16872033 ]
Paul Rogers commented on DRILL-7308:
------------------------------------

Recall that Drill can return not only multiple batches, but multiple "result sets": runs of batches with different schemas. A more sophisticated REST solution would handle this case. I can't find any ProtoBuf field that says the schema changed; instead, we'd have to reuse code from elsewhere that compares the current schema to the previous one. Ideally, in that case, we'd create a new JSON element for the second schema. Something like:

{code:json}
{
  "resultSets": [
    { "rows": ..., "schema": ... },
    { "rows": ..., "schema": ... }
  ]
}
{code}

It is easy to create such a case: simply create two CSV files, one with two columns, the other with three, and run a plain {{SELECT * FROM yourTable}} query (a minimal sketch of this setup is appended below the quoted description). You will get two data batches, each with a distinct schema. The current implementation returns only the first schema along with all the rows, even though the rows follow varying schemas. (Actually, the current implementation will list the two columns, then the three columns, duplicating the first two, but we want to fix that...)

This is yet another reason to use a provisioned schema: with such a schema we can guarantee that the entire query returns a single, consistent schema regardless of the variation across files.

> Incorrect Metadata from text file queries
> -----------------------------------------
>
>                 Key: DRILL-7308
>                 URL: https://issues.apache.org/jira/browse/DRILL-7308
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Metadata
>    Affects Versions: 1.17.0
>            Reporter: Charles Givre
>            Priority: Major
>         Attachments: Screen Shot 2019-06-24 at 3.16.40 PM.png, domains.csvh
>
> I'm noticing some strange behavior with the newest version of Drill. If you query a CSV file, you get the following metadata:
> {code:sql}
> SELECT * FROM dfs.test.`domains.csvh` LIMIT 1
> {code}
> {code:json}
> {
>   "queryId": "22eee85f-c02c-5878-9735-091d18788061",
>   "columns": [
>     "domain"
>   ],
>   "rows": [
>     { "domain": "thedataist.com" }
>   ],
>   "metadata": [
>     "VARCHAR(0, 0)",
>     "VARCHAR(0, 0)"
>   ],
>   "queryState": "COMPLETED",
>   "attemptedAutoLimit": 0
> }
> {code}
> There are two issues here:
> 1. VARCHAR now has a precision.
> 2. There are twice as many entries in {{metadata}} as there are columns.
> Additionally, if you query a regular CSV, without the columns extracted, you get the following:
> {code:json}
> "rows": [
>   {
>     "columns": "[\"ACCT_NUM\",\"PRODUCT\",\"MONTH\",\"REVENUE\"]"
>   }
> ],
> "metadata": [
>   "VARCHAR(0, 0)",
>   "VARCHAR(0, 0)"
> ],
> {code}
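For reference, a minimal sketch of the reproduction described in the comment above. The directory, file names, and values are illustrative only; the point is simply two delimited files with extracted headers (e.g. {{.csvh}}, assuming that format is configured in the {{dfs}} plugin) whose column counts differ, queried with a wildcard over the directory (assuming the default {{dfs.tmp}} workspace points at {{/tmp}}):

{code}
$ cat /tmp/schemaChange/two.csvh
a,b
1,2

$ cat /tmp/schemaChange/three.csvh
a,b,c
3,4,5
{code}

{code:sql}
SELECT * FROM dfs.tmp.`schemaChange`
{code}

Reading these two files should produce two batches with distinct schemas, which is exactly the case the {{resultSets}} structure above is meant to represent.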
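And a sketch of the provisioned-schema approach mentioned at the end of the comment, using the {{CREATE SCHEMA ... FOR TABLE}} syntax added in Drill 1.16. The column list is only an assumption matching the illustrative files above, and the exact options supported vary by version, so treat this as a sketch rather than the definitive fix:

{code:sql}
CREATE OR REPLACE SCHEMA (
    a VARCHAR,
    b VARCHAR,
    c VARCHAR
) FOR TABLE dfs.tmp.`schemaChange`
{code}

With such a schema in place, every file is projected to the same column set, so the REST response can report one consistent schema for the whole query.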