Paul Rogers created DRILL-7522:
----------------------------------
Summary: JSON reader (v1) omits null columns in SELECT *
Key: DRILL-7522
URL: https://issues.apache.org/jira/browse/DRILL-7522
Project: Apache Drill
Issue Type: Bug
Affects Versions: 1.17.0
Reporter: Paul Rogers
Run the following unit test: {{TestStarQueries.testSelStarOrderBy}}, runs the
following query:
{code:sql}
select * from cp.`employee.json` order by last_name
{code}
The query reads a Foodmart file {{customer.json}} that has records like this:
{code:json}
{"employee_id":53,...","end_date":null,"salary":...}
{code}
The field {{end_date}} turns out to be null for all records in
{{customer.json}}.
Then, look at the verification query. It carefully includes all fields *except*
{{end_date}}. That is, the test was written to expect that the JSON reader will
omit a column that has all NULL values.
While it might seem OK to omit all-NULL columns (they don't have any data), the
problem is that Drill is a distributed system. Suppose we query a directory of
50 such files, some of which have all-NULLs in one field, some of which have
all-NULLs in another. Although the files have the same schema, {{SELECT *}}
will return different schemas (depending on which file has which non-NULL
columns.)
A downstream operator will have to merge these schemas. And, since Drill fills
in a Nullable INT field for missing columns, we might end up with a schema
change exception because the actual field type is VARCHAR when it appears.
One can argue that {{SELECT *}} means "return all columns", not "return all
columns except those that happen to be null in the first batch." Yes, we have
the problem of not knowing the actual field type. Eventually, provided schemas
will resolve such issues.
Note that in the "V2" JSON reader, {{end_date}} is included in the query.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)