[
https://issues.apache.org/jira/browse/DRILL-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872964#comment-16872964
]
Paul Rogers edited comment on DRILL-7308 at 6/29/19 6:13 PM:
-------------------------------------------------------------
[~cgivre], the problem here is that the code shown earlier is counting on a
Protobuf implementation detail that is not actually part of the Drill schema
specification (to the degree there is such a specification). For VarChar, a
precision of 0 means that the user requested {{VARCHAR}}, while a precision of,
say, 10 means the user requested {{VARCHAR(10)}}. The scale field is never
valid for {{VARCHAR}}.
The {{VARCHAR(0, 0)}} output is not a problem with the code that generated
the schema. Rather, it is a problem with the way the REST code attempts
to generate a type name from the schema structures. To be more precise, the
REST code incorrectly assumes that the {{isSet()}} methods are the right way
to check for a 0 value.
The Protobuf issue is that, unlike with a regular Java object, if we never
actually write to the precision field, the value is unset. If we write
anything, even 0, the value is set. We certainly don't want to litter our code
with things like:
{code:java}
if (precision != 0) { schemaBuilder.setPrecision(precision); }
{code}
So, code that uses the schema objects should do two things to determine
whether the value is other than the default: ask if the value is set, and if
so, ask if the value is non-zero. As it turns out, the unset value is also 0,
so there is actually no need to ask if the value is set in this case.
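To make that concrete, here is a minimal sketch of the check, written against
the Protobuf-generated {{MajorType}} accessors (the helper name is
hypothetical, and the {{hasPrecision()}}/{{getPrecision()}} names assume the
usual generated-accessor pattern):
{code:java}
// Hypothetical helper: treat precision as meaningful only when it is
// both set and non-zero. Since the Protobuf default for an unset int
// field is 0, the getPrecision() > 0 test alone would suffice;
// hasPrecision() merely documents the intent.
public static boolean hasNonDefaultPrecision(TypeProtos.MajorType type) {
  return type.hasPrecision() && type.getPrecision() > 0;
}
{code}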
Taking a step back, the type formatting code should not even be in the REST
API. The proper place for it is in {{Types}}. In fact, {{Types}} already has
the desired function: {{getExtendedSqlTypeName()}}. However, this function only
formats decimals; we need to add a case clause for VARCHAR.
Note that {{getExtendedSqlTypeName()}} exposes the *SQL name* for types. The
current REST implementation exposes the internal Drill name. That is,
{{getExtendedSqlTypeName()}} will report, say, {{DOUBLE}} while the REST code
will report {{Float8}}. This is probably a bug since the documentation explains
the SQL types, not the internal types.
That said, I actually have not seen any places in Drill where we set or use the
VARCHAR width, so there is no point in trying to format it. In this case, you
can just use {{getExtendedSqlTypeName()}} directly as-is. Or, if we want to
display the width, add the required code to that function, along the lines of
the sketch below.
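For illustration, the added VARCHAR case might look something like this (shown
as a standalone hypothetical helper; the real change would be a new case clause
inside the existing switch in {{getExtendedSqlTypeName()}}):
{code:java}
// Sketch of the VARCHAR branch that getExtendedSqlTypeName() could grow.
// Emits the width only when one was explicitly set, so an unset
// precision renders as plain VARCHAR rather than VARCHAR(0, 0).
public static String varcharTypeName(TypeProtos.MajorType type) {
  if (type.hasPrecision() && type.getPrecision() > 0) {
    return "VARCHAR(" + type.getPrecision() + ")";
  }
  return "VARCHAR";
}
{code}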
Please file a separate JIRA for the UDF issue. Please provide an attachment or
link to a sample UDF. I'll see if I can track down that CSV-specific issue in
case it relates to the EVF.
> Incorrect Metadata from text file queries
> -----------------------------------------
>
> Key: DRILL-7308
> URL: https://issues.apache.org/jira/browse/DRILL-7308
> Project: Apache Drill
> Issue Type: Bug
> Components: Metadata
> Affects Versions: 1.17.0
> Reporter: Charles Givre
> Priority: Major
> Attachments: Screen Shot 2019-06-24 at 3.16.40 PM.png, domains.csvh
>
>
> I'm noticing some strange behavior with the newest version of Drill. If you
> query a CSV file, you get the following metadata:
> {code:sql}
> SELECT * FROM dfs.test.`domains.csvh` LIMIT 1
> {code}
> {code:json}
> {
>   "queryId": "22eee85f-c02c-5878-9735-091d18788061",
>   "columns": [
>     "domain"
>   ],
>   "rows": [
>     { "domain": "thedataist.com" }
>   ],
>   "metadata": [
>     "VARCHAR(0, 0)",
>     "VARCHAR(0, 0)"
>   ],
>   "queryState": "COMPLETED",
>   "attemptedAutoLimit": 0
> }
> {code}
> There are two issues here:
> 1. VARCHAR now has precision
> 2. There are twice as many columns as there should be.
> Additionally, if you query a regular CSV, without the columns extracted, you
> get the following:
> {code:json}
> "rows": [
> {
> "columns": "[\"ACCT_NUM\",\"PRODUCT\",\"MONTH\",\"REVENUE\"]" }
> ],
> "metadata": [
> "VARCHAR(0, 0)",
> "VARCHAR(0, 0)"
> ],
> {code}