[ https://issues.apache.org/jira/browse/DRILL-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872031#comment-16872031 ]
Paul Rogers commented on DRILL-7308:
------------------------------------

The problem with the duplicated schema is also due to a flaw in the DRILL-6847 code in {{WebUserConnection}}:

{code:java}
@Override
public void sendData(RpcOutcomeListener<Ack> listener, QueryWritableBatch result) {
  ...
  for (int i = 0; i < loader.getSchema().getFieldCount(); ++i) {
    // DRILL-6847: This section adds query metadata to the REST results
{code}

The {{sendData()}} method is called for *each* batch of data sent by the server. The manual test case was probably run against a short file that fit into a single batch. However, if the file is large, or if the query is distributed across multiple files, multiple batches will be sent. Also, with the recent "V3" text reader, the code sends an empty schema batch followed by one or more non-empty data batches. (This "feature" is being disabled in DRILL-7306.) So, each time a batch is received, the code adds another copy of the schema to the {{metadata}} list maintained in {{WebUserConnection}}.

A quick and dirty solution is to count the batches and set the schema only on the first.

> Incorrect Metadata from text file queries
> -----------------------------------------
>
> Key: DRILL-7308
> URL: https://issues.apache.org/jira/browse/DRILL-7308
> Project: Apache Drill
> Issue Type: Bug
> Components: Metadata
> Affects Versions: 1.17.0
> Reporter: Charles Givre
> Priority: Major
> Attachments: Screen Shot 2019-06-24 at 3.16.40 PM.png, domains.csvh
>
>
> I'm noticing some strange behavior with the newest version of Drill. If you
> query a CSV file, you get the following metadata:
> {code:sql}
> SELECT * FROM dfs.test.`domains.csvh` LIMIT 1
> {code}
> {code:json}
> {
>   "queryId": "22eee85f-c02c-5878-9735-091d18788061",
>   "columns": [
>     "domain"
>   ],
>   "rows": [
>     { "domain": "thedataist.com" }
>   ],
>   "metadata": [
>     "VARCHAR(0, 0)",
>     "VARCHAR(0, 0)"
>   ],
>   "queryState": "COMPLETED",
>   "attemptedAutoLimit": 0
> }
> {code}
> There are two issues here:
> 1. VARCHAR now has precision
> 2. There are twice as many columns as there should be.
> Additionally, if you query a regular CSV, without the columns extracted, you
> get the following:
> {code:json}
> "rows": [
>   {
>     "columns": "[\"ACCT_NUM\",\"PRODUCT\",\"MONTH\",\"REVENUE\"]"
>   }
> ],
> "metadata": [
>   "VARCHAR(0, 0)",
>   "VARCHAR(0, 0)"
> ],
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
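The batch-counting fix suggested in the comment above could be sketched roughly as follows. This is a self-contained illustration, not the actual Drill {{WebUserConnection}} code: the class name {{SchemaOncePerQuery}} and the simplified {{sendData()}} signature are hypothetical stand-ins, since the real method receives a {{QueryWritableBatch}} and walks {{loader.getSchema()}}.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for WebUserConnection: record column metadata
// once, on the first batch, instead of on every batch received.
public class SchemaOncePerQuery {

  private final List<String> metadata = new ArrayList<>();
  private int batchCount = 0;

  // Simplified stand-in for sendData(listener, result); here the batch's
  // column types are passed directly instead of being read from the loader.
  public void sendData(List<String> batchColumnTypes) {
    // Quick-and-dirty fix: only the first batch contributes metadata,
    // so later data batches do not append duplicate schema entries.
    if (batchCount == 0) {
      metadata.addAll(batchColumnTypes);
    }
    batchCount++;
  }

  public List<String> getMetadata() {
    return metadata;
  }

  public static void main(String[] args) {
    SchemaOncePerQuery conn = new SchemaOncePerQuery();
    // Simulate the V3 text reader: a schema batch followed by two data batches.
    conn.sendData(List.of("VARCHAR"));
    conn.sendData(List.of("VARCHAR"));
    conn.sendData(List.of("VARCHAR"));
    System.out.println(conn.getMetadata()); // prints [VARCHAR], not [VARCHAR, VARCHAR, VARCHAR]
  }
}
{code}

A cleaner variant of the same idea would be to check {{metadata.isEmpty()}} instead of keeping a separate counter, but the counter makes the "first batch only" intent explicit.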