[ https://issues.apache.org/jira/browse/DRILL-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872031#comment-16872031 ]
Paul Rogers commented on DRILL-7308:
------------------------------------

The problem with the duplicated schema is also due to a flaw in the DRILL-6847 code in {{WebUserConnection}}:

{code:java}
@Override
public void sendData(RpcOutcomeListener<Ack> listener, QueryWritableBatch result) {
  ...
  for (int i = 0; i < loader.getSchema().getFieldCount(); ++i) {
    // DRILL-6847: This section adds query metadata to the REST results
{code}

The {{sendData()}} method is called for *each* batch of data sent by the server. The manual test case was probably run against a short file that fit into a single batch. However, if the file is large, or if the query is distributed across multiple files, multiple batches will be sent. Also, with the recent "V3" text reader, the code sends an empty schema batch followed by one or more non-empty data batches. (This "feature" is being disabled in DRILL-7306.) So, each time a batch is received, the code adds another copy of the schema to the {{metadata}} list maintained in {{WebUserConnection}}.

A quick and dirty solution is to count the batches and set the schema only on the first.

> Incorrect Metadata from text file queries
> -----------------------------------------
>
> Key: DRILL-7308
> URL: https://issues.apache.org/jira/browse/DRILL-7308
> Project: Apache Drill
> Issue Type: Bug
> Components: Metadata
> Affects Versions: 1.17.0
> Reporter: Charles Givre
> Priority: Major
> Attachments: Screen Shot 2019-06-24 at 3.16.40 PM.png, domains.csvh
>
>
> I'm noticing some strange behavior with the newest version of Drill. If you
> query a CSV file, you get the following metadata:
> {code:sql}
> SELECT * FROM dfs.test.`domains.csvh` LIMIT 1
> {code}
> {code:json}
> {
>   "queryId": "22eee85f-c02c-5878-9735-091d18788061",
>   "columns": [
>     "domain"
>   ],
>   "rows": [
>     { "domain": "thedataist.com" }
>   ],
>   "metadata": [
>     "VARCHAR(0, 0)",
>     "VARCHAR(0, 0)"
>   ],
>   "queryState": "COMPLETED",
>   "attemptedAutoLimit": 0
> }
> {code}
> There are two issues here:
> 1. VARCHAR now has precision
> 2. There are twice as many columns as there should be.
> Additionally, if you query a regular CSV, without the columns extracted, you
> get the following:
> {code:json}
> "rows": [
>   {
>     "columns": "[\"ACCT_NUM\",\"PRODUCT\",\"MONTH\",\"REVENUE\"]"
>   }
> ],
> "metadata": [
>   "VARCHAR(0, 0)",
>   "VARCHAR(0, 0)"
> ],
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
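The batch-counting fix suggested in the comment above could be sketched roughly as follows. This is a self-contained illustration, not the actual Drill {{WebUserConnection}} code: the class name {{SchemaOncePerQuery}} and the simplified {{sendData()}} signature are hypothetical stand-ins, since the real method receives a {{QueryWritableBatch}} and walks {{loader.getSchema()}}.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for WebUserConnection: record column metadata
// once, on the first batch, instead of on every batch received.
public class SchemaOncePerQuery {

  private final List<String> metadata = new ArrayList<>();
  private int batchCount = 0;

  // Simplified stand-in for sendData(listener, result); here the batch's
  // column types are passed directly instead of being read from the loader.
  public void sendData(List<String> batchColumnTypes) {
    // Quick-and-dirty fix: only the first batch contributes metadata,
    // so later data batches do not append duplicate schema entries.
    if (batchCount == 0) {
      metadata.addAll(batchColumnTypes);
    }
    batchCount++;
  }

  public List<String> getMetadata() {
    return metadata;
  }

  public static void main(String[] args) {
    SchemaOncePerQuery conn = new SchemaOncePerQuery();
    // Simulate the V3 text reader: a schema batch followed by two data batches.
    conn.sendData(List.of("VARCHAR"));
    conn.sendData(List.of("VARCHAR"));
    conn.sendData(List.of("VARCHAR"));
    System.out.println(conn.getMetadata()); // prints [VARCHAR], not [VARCHAR, VARCHAR, VARCHAR]
  }
}
{code}

A cleaner variant of the same idea would be to check {{metadata.isEmpty()}} instead of keeping a separate counter, but the counter makes the "first batch only" intent explicit.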