[
https://issues.apache.org/jira/browse/DRILL-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883165#comment-15883165
]
Paul Rogers commented on DRILL-5292:
------------------------------------
This is another instance of a general problem in Drill: we do not have
support for a "Null" data type. In Drill, all nulls must be null of some type.
By default, when Drill does not know the type, we choose Int. This is fine only
when the data eventually turns out to be integer. Otherwise, conflicts
occur.
The same issue arises in JSON: one might have a long series of null values
followed by a non-null. In JSON, null is its own type: not "null integer" or
"null string", just "null." Again, Drill has to have "null of some type" so we
guess integer, which may or may not be right.
If Drill had a Null type, we would then need type conversion rules. A "Null
vector" would be compatible with any other type, so a vector of nulls could
morph into a vector of strings or a vector of doubles once we see the type.
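As a rough illustration of that conversion rule (not Drill code; the Vector
class and read_column helper are hypothetical names made up for this sketch),
a column can stay untyped while it holds only nulls and take on a concrete
type at the first non-null value:

```python
from typing import Any, List, Optional


class Vector:
    """Toy column vector. A dtype of None stands in for the untyped "Null" type."""

    def __init__(self) -> None:
        self.dtype: Optional[type] = None
        self.values: List[Any] = []

    def append(self, value: Any) -> None:
        if value is not None:
            if self.dtype is None:
                # First non-null value: the all-null vector morphs into this type.
                self.dtype = type(value)
            elif type(value) is not self.dtype:
                # Once typed, a different type is a genuine conflict.
                raise TypeError(
                    f"column is {self.dtype.__name__}, got {type(value).__name__}"
                )
        self.values.append(value)


def read_column(raw: List[Any]) -> Vector:
    """Build a vector from a sequence of values, e.g. one JSON field across records."""
    vec = Vector()
    for value in raw:
        vec.append(value)
    return vec
```

Here read_column([None, None, "a"]) ends up as a string column, and no Int
guess is ever made for the leading nulls.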
Such a solution still does not help the client, however. A client such as
Tableau needs the schema immediately. In this case for Parquet, or the
suggested case for JSON, we don't know the types until we read some amount of
data. But by then, Drill has already had to predict the future and tell the
client what the type will eventually be. Since prediction is hard, there is no
good solution. Many workarounds have been proposed; this is another good suggestion.
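The suggestion in the issue description (use the type a column has in another
file, provided it does not conflict) might look roughly like the sketch below.
This is not Drill code: merge_schemas and type_for_missing are hypothetical
helpers, types are plain strings, and the sketch assumes the schemas of all
files are available before any data is read.

```python
from typing import Dict, List

Schema = Dict[str, str]  # column name -> type name, for one file


def merge_schemas(file_schemas: List[Schema]) -> Schema:
    """Union the per-file schemas, failing if two files disagree on a column."""
    merged: Schema = {}
    for schema in file_schemas:
        for column, dtype in schema.items():
            # setdefault records the first type seen and returns the known one.
            known = merged.setdefault(column, dtype)
            if known != dtype:
                raise ValueError(
                    f"column {column!r} is {known} in one file, {dtype} in another"
                )
    return merged


def type_for_missing(column: str, merged: Schema, fallback: str = "INT") -> str:
    """Type to use when a file lacks `column`: the defined type if any file
    has one, else the fallback guess (Int, as described above)."""
    return merged.get(column, fallback)
```

A column that appears in no file at all still falls back to the Int guess, so
this narrows the problem rather than removing it.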
> Better Parquet handling of sparse columns
> -----------------------------------------
>
> Key: DRILL-5292
> URL: https://issues.apache.org/jira/browse/DRILL-5292
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Parquet
> Affects Versions: 1.10.0
> Reporter: Nate Putnam
>
> It appears the current implementation of ParquetRecordReader will fill in
> missing columns between files as a NullableIntVector. It would be better if
> the code could determine if that column was defined in a different file (and
> didn't conflict) and use the defined data type.
>