[jira] [Commented] (DRILL-5292) Better Parquet handling of sparse columns

Paul Rogers (JIRA) Fri, 24 Feb 2017 09:40:09 -0800

    [ 
https://issues.apache.org/jira/browse/DRILL-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883165#comment-15883165
 ]


Paul Rogers commented on DRILL-5292:
------------------------------------

This is another instance of a general problem in Drill: we do not have 
support for a "Null" data type. In Drill, every null must be a null of some 
type. By default, when Drill does not know the type, we choose Int. This is 
fine only when the data eventually turns out to be integer. Otherwise, 
conflicts occur.
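
To make the conflict concrete, here is a minimal Python sketch. It is a toy 
model, not Drill code: the function names, the dictionary-based schemas, and 
the sample columns are all made up for illustration. It only shows how 
defaulting an unknown column to Int yields two batches that disagree on type.

```python
# Toy model only: names and structures here are hypothetical, not Drill APIs.
DEFAULT_TYPE = "INT"  # stand-in for the nullable-Int default described above


def batch_schema(file_schema, projected):
    """Schema a reader would report for one file.

    Columns the file does not define are filled in with DEFAULT_TYPE.
    """
    return {col: file_schema.get(col, DEFAULT_TYPE) for col in projected}


def find_conflicts(schemas):
    """Return the set of columns whose reported type differs between batches."""
    seen = {}
    conflicts = set()
    for schema in schemas:
        for col, typ in schema.items():
            # setdefault records the first type seen and returns it afterwards.
            if seen.setdefault(col, typ) != typ:
                conflicts.add(col)
    return conflicts


files = [
    {"id": "BIGINT"},                     # column "name" is missing here
    {"id": "BIGINT", "name": "VARCHAR"},  # column "name" is defined here
]
projected = ["id", "name"]
schemas = [batch_schema(f, projected) for f in files]
conflicting = find_conflicts(schemas)
```

In this sketch the first file reports "name" as INT and the second reports it 
as VARCHAR, so "name" ends up in the conflicting set even though no file ever 
contained an integer in that column.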

The same issue arises in JSON: one might have a long series of null values 
followed by a non-null. In JSON, null is its own type: not "null integer" or 
"null string", just "null." Again, Drill has to have "null of some type" so we 
guess integer, which may or may not be right.

What we need is a true Null type, along with type conversion rules. A "Null 
vector" is compatible with any other type. So, a vector of nulls can morph 
into a vector of strings or a vector of doubles once we see the type.
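
A rough Python sketch of that idea follows. It is an assumption about how 
such a rule could look, not how Drill's value vectors work: the classes 
NullVector and TypedVector, the helper functions, and the mapping from Python 
values to type names are all hypothetical.

```python
# Hypothetical sketch: these classes are illustrative, not Drill value vectors.
class NullVector:
    """Untyped column that has seen only nulls; it stores just a row count."""

    def __init__(self):
        self.count = 0

    def append_null(self):
        self.count += 1

    def promote(self, type_name):
        """Morph into a typed vector, keeping the nulls seen so far."""
        return TypedVector(type_name, [None] * self.count)


class TypedVector:
    """Nullable column of a single, known type."""

    def __init__(self, type_name, values):
        self.type_name = type_name
        self.values = list(values)

    def append(self, value):
        self.values.append(value)


def type_of(value):
    """Map a Python value to an illustrative type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "BIT"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "FLOAT8"
    if isinstance(value, str):
        return "VARCHAR"
    raise TypeError(f"unsupported value: {value!r}")


def read_column(values):
    """Build a column, deferring the type choice until the first non-null."""
    vector = NullVector()
    for value in values:
        if value is None:
            if isinstance(vector, NullVector):
                vector.append_null()
            else:
                vector.append(None)
            continue
        seen_type = type_of(value)
        if isinstance(vector, NullVector):
            vector = vector.promote(seen_type)  # nulls are compatible with any type
        elif vector.type_name != seen_type:
            raise TypeError(f"{vector.type_name} column cannot hold {seen_type}")
        vector.append(value)
    return vector
```

With this sketch, read_column([None, None, "a"]) produces a VARCHAR column 
whose first two entries are null, while read_column([None, None]) stays an 
untyped NullVector because no type has been seen yet.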

Such a solution still does not help the client, however. A client such as 
Tableau needs the schema immediately. In this case for Parquet, or the 
suggested case for JSON, we don't know the types until we read some amount of 
data. But, by then, Drill has already had to predict the future and tell the 
client what the type will eventually be. Since prediction is hard, there is no 
good solution. Many workarounds have been proposed; this is another good 
suggestion.


> Better Parquet handling of sparse columns
> -----------------------------------------
>
>                 Key: DRILL-5292
>                 URL: https://issues.apache.org/jira/browse/DRILL-5292
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.10.0
>            Reporter: Nate Putnam
>
> It appears the current implementation of ParquetRecordReader will fill in 
> missing columns between files as a NullableIntVector. It would be better if 
> the code could determine if that column was defined in a different file (and 
> didn't conflict) and use the defined data type. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
