[
https://issues.apache.org/jira/browse/DRILL-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872033#comment-16872033
]
Paul Rogers edited comment on DRILL-7308 at 6/30/19 1:54 AM:
-------------------------------------------------------------
Recall that Drill can return not only multiple batches, but multiple "result
sets": runs of batches with different schemas.
A more sophisticated REST solution would handle this case. I can't find any
Protobuf field that indicates the schema changed; instead, we'd have to reuse
code from elsewhere that compares the current schema to the previous one.
Ideally, in that case, we'd create a new JSON element for the second schema.
Something like:
{code:json}
{ "resultSets": [
    { "rows": ...,
      "schema": ...
    },
    { "rows": ...,
      "schema": ...
    }
  ]
}
{code}
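For concreteness, here is a minimal sketch of that batch-by-batch comparison.
The types below ({{Batch}}, {{ResultSetJson}}, {{split}}) are stand-ins made up
for illustration, not Drill's actual REST classes:
{code:java}
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: group consecutive batches into "result sets", starting a
// new one whenever a batch's schema differs from the previous batch's schema.
// Batch and ResultSetJson are illustrative stand-ins, not Drill classes.
public class SchemaSplitter {

  static class Batch {
    final List<String> schema;     // simplified: column "name:type" strings
    final List<Object[]> rows;
    Batch(List<String> schema, List<Object[]> rows) {
      this.schema = schema;
      this.rows = rows;
    }
  }

  static class ResultSetJson {
    final List<String> schema;                      // becomes the "schema" element
    final List<Object[]> rows = new ArrayList<>();  // becomes the "rows" element
    ResultSetJson(List<String> schema) {
      this.schema = schema;
    }
  }

  static List<ResultSetJson> split(List<Batch> batches) {
    List<ResultSetJson> resultSets = new ArrayList<>();
    List<String> priorSchema = null;
    for (Batch batch : batches) {
      if (priorSchema == null || !batch.schema.equals(priorSchema)) {
        // Schema changed: open a new entry in the "resultSets" array.
        resultSets.add(new ResultSetJson(batch.schema));
        priorSchema = batch.schema;
      }
      resultSets.get(resultSets.size() - 1).rows.addAll(batch.rows);
    }
    return resultSets;
  }
}
{code}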
It is easy to create such a case. Simply create two CSV files in the same
directory, one with two columns, the other with three (sketched below). Use
just a simple {{SELECT * FROM yourTable}} query against that directory. You
will get two data batches, each with a distinct schema.
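For example, two headered CSV files along these lines would do it (file and
column names are made up for illustration):
{code}
-- two_cols.csvh --
a,b
1,2

-- three_cols.csvh --
a,b,c
3,4,5
{code}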
The current implementation will report just the first schema along with all
rows, even though the schemas vary. (Actually, the current implementation will
list the two columns, then the three columns, duplicating the first two, but
we want to fix that...)
This is yet another reason to use a provisioned schema: with such a schema we
can guarantee that the entire query will return a single, consistent schema
regardless of the variation across files.
A quick & dirty solution is to clear and rebuild the schema objects on every
batch. That way, the value sent to the user will reflect the last schema
which, if you are lucky, will be valid for the earlier batches as well.
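A minimal sketch of that quick & dirty approach, again with stand-in names
rather than the actual REST listener classes:
{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch of the quick & dirty fix: rebuild "columns" and "metadata" from
// scratch on every batch, so the JSON response reflects the last batch's
// schema instead of accumulating duplicated columns. Stand-in types only.
public class QuickAndDirtyResults {
  final List<String> columns = new ArrayList<>();   // the "columns" JSON element
  final List<String> metadata = new ArrayList<>();  // type strings, e.g. "VARCHAR"
  final List<Object[]> rows = new ArrayList<>();

  void onBatch(List<String> names, List<String> types, List<Object[]> batchRows) {
    columns.clear();          // discard the schema captured from earlier batches...
    metadata.clear();
    columns.addAll(names);    // ...and rebuild it from the current batch
    metadata.addAll(types);
    rows.addAll(batchRows);   // rows still accumulate across all batches
  }
}
{code}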
It is a known, unresolved issue that Drill does not attempt to merge schema
changes, and that unmerged schema changes cannot be handled by ODBC or JDBC
clients. We can assume, however, that users of the REST API won't have messy
data and so won't run into this issue.
> Incorrect Metadata from text file queries
> -----------------------------------------
>
> Key: DRILL-7308
> URL: https://issues.apache.org/jira/browse/DRILL-7308
> Project: Apache Drill
> Issue Type: Bug
> Components: Metadata
> Affects Versions: 1.17.0
> Reporter: Charles Givre
> Priority: Major
> Attachments: Screen Shot 2019-06-24 at 3.16.40 PM.png, domains.csvh
>
>
> I'm noticing some strange behavior with the newest version of Drill. If you
> query a CSV file, you get the following metadata:
> {code:sql}
> SELECT * FROM dfs.test.`domains.csvh` LIMIT 1
> {code}
> {code:json}
> {
>   "queryId": "22eee85f-c02c-5878-9735-091d18788061",
>   "columns": [
>     "domain"
>   ],
>   "rows": [
>     { "domain": "thedataist.com" }
>   ],
>   "metadata": [
>     "VARCHAR(0, 0)",
>     "VARCHAR(0, 0)"
>   ],
>   "queryState": "COMPLETED",
>   "attemptedAutoLimit": 0
> }
> {code}
> There are two issues here:
> 1. VARCHAR now has precision
> 2. There are twice as many columns as there should be.
> Additionally, if you query a regular CSV, without the columns extracted, you
> get the following:
> {code:json}
> "rows": [
>   {
>     "columns": "[\"ACCT_NUM\",\"PRODUCT\",\"MONTH\",\"REVENUE\"]"
>   }
> ],
> "metadata": [
>   "VARCHAR(0, 0)",
>   "VARCHAR(0, 0)"
> ],
> {code}