[ https://issues.apache.org/jira/browse/DRILL-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872033#comment-16872033 ]
Paul Rogers commented on DRILL-7308:
------------------------------------

Recall that Drill can return not only multiple batches, but multiple "result sets": runs of batches with different schemas. A more sophisticated REST solution would handle this case. I can't find any ProtoBuf field that says the schema changed; instead, we'd have to reuse code from elsewhere that compares the current schema to the previous one. Ideally, in that case, we'd create a new JSON element for the second schema. Something like:

{code:json}
{
  "resultSets": [
    { "rows": ..., "schema": ... },
    { "rows": ..., "schema": ... }
  ]
}
{code}

It is easy to create such a case: simply create two CSV files, one with two columns, the other with three, and run a plain {{SELECT * FROM yourTable}} query (a minimal sketch of this setup is appended below the quoted description). You will get two data batches, each with a distinct schema. The current implementation returns only the first schema along with all the rows, even though the rows follow varying schemas. (Actually, the current implementation will list the two columns, then the three columns, duplicating the first two, but we want to fix that...)

This is yet another reason to use a provisioned schema: with such a schema we can guarantee that the entire query returns a single, consistent schema regardless of the variation across files.

> Incorrect Metadata from text file queries
> -----------------------------------------
>
>                 Key: DRILL-7308
>                 URL: https://issues.apache.org/jira/browse/DRILL-7308
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Metadata
>    Affects Versions: 1.17.0
>            Reporter: Charles Givre
>            Priority: Major
>         Attachments: Screen Shot 2019-06-24 at 3.16.40 PM.png, domains.csvh
>
> I'm noticing some strange behavior with the newest version of Drill. If you query a CSV file, you get the following metadata:
> {code:sql}
> SELECT * FROM dfs.test.`domains.csvh` LIMIT 1
> {code}
> {code:json}
> {
>   "queryId": "22eee85f-c02c-5878-9735-091d18788061",
>   "columns": [
>     "domain"
>   ],
>   "rows": [
>     { "domain": "thedataist.com" }
>   ],
>   "metadata": [
>     "VARCHAR(0, 0)",
>     "VARCHAR(0, 0)"
>   ],
>   "queryState": "COMPLETED",
>   "attemptedAutoLimit": 0
> }
> {code}
> There are two issues here:
> 1. VARCHAR now has a precision.
> 2. There are twice as many entries in {{metadata}} as there are columns.
> Additionally, if you query a regular CSV, without the columns extracted, you get the following:
> {code:json}
> "rows": [
>   {
>     "columns": "[\"ACCT_NUM\",\"PRODUCT\",\"MONTH\",\"REVENUE\"]"
>   }
> ],
> "metadata": [
>   "VARCHAR(0, 0)",
>   "VARCHAR(0, 0)"
> ],
> {code}
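For reference, a minimal sketch of the reproduction described in the comment above. The directory, file names, and values are illustrative only; the point is simply two delimited files with extracted headers (e.g. {{.csvh}}, assuming that format is configured in the {{dfs}} plugin) whose column counts differ, queried with a wildcard over the directory (assuming the default {{dfs.tmp}} workspace points at {{/tmp}}):

{code}
$ cat /tmp/schemaChange/two.csvh
a,b
1,2

$ cat /tmp/schemaChange/three.csvh
a,b,c
3,4,5
{code}

{code:sql}
SELECT * FROM dfs.tmp.`schemaChange`
{code}

Reading these two files should produce two batches with distinct schemas, which is exactly the case the {{resultSets}} structure above is meant to represent.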
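And a sketch of the provisioned-schema approach mentioned at the end of the comment, using the {{CREATE SCHEMA ... FOR TABLE}} syntax added in Drill 1.16. The column list is only an assumption matching the illustrative files above, and the exact options supported vary by version, so treat this as a sketch rather than the definitive fix:

{code:sql}
CREATE OR REPLACE SCHEMA (
    a VARCHAR,
    b VARCHAR,
    c VARCHAR
) FOR TABLE dfs.tmp.`schemaChange`
{code}

With such a schema in place, every file is projected to the same column set, so the REST response can report one consistent schema for the whole query.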