Paul Rogers created DRILL-6037:
----------------------------------

             Summary: List vector can lose data when "promoting" to union
                 Key: DRILL-6037
                 URL: https://issues.apache.org/jira/browse/DRILL-6037
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.10.0
            Reporter: Paul Rogers


Drill provides a little-known {{ListVector}}, used in the JSON reader as an 
alternative to the {{REPEATED}} data mode. Unlike {{REPEATED}}, it allows the 
array value itself to be null. That is, the list vector allows the following:

{noformat}
{a: [10, 20]} {a: null}
{noformat}

(It is unclear if the rest of Drill can handle this extra null state, however.)

The list vector has another form of magic. It can be "promoted" to a list of 
(barely supported) unions. Promotion to union allows the following:

{noformat}
{a: [10, "twenty"]}
{noformat}

Promotion to union is done via a call to {{ListVector.promoteToUnion()}} which 
appears to be called only from {{PromotableWriter.promoteToUnion()}}.

The {{ListVector.promoteToUnion()}} call itself transforms the list from a list 
of something to a list of Union, with the something as the first union member. 
However, *it does not* go back and update the Union's type vector with the type 
of the prior values.

That work is done in {{PromotableWriter.promoteToUnion()}}, meaning that other 
uses (such as the size-aware writers) must duplicate that functionality or risk 
losing the values before the promotion. The code should be in the vector itself 
so that {{ListVector.promoteToUnion()}} "does the right thing" without clients 
needing to fill in part of the work.

Another feature of lists is that, unlike {{REPEATED}} types, lists allow nulls 
as list values. That is, a list can support the following:

{code}
{a: [10, null, 20]}
{code}

The code in {{PromotableWriter.promoteToUnion()}} is wrong: it sets every 
union entry to the prior type (such as BIGINT in the example above) without 
checking whether the value is null. As a result, after promotion to union, the 
above list will be:

{code}
{a: [10, 0, 20]}
{code}

The code should check the null flag on each value. If null, set the union's 
type vector to the null marker, else set it to the type of the prior vector.
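
The difference between the current and proposed behavior can be sketched with 
plain arrays standing in for the vectors. This is only an illustration under 
stated assumptions: the class, method names and type markers below are 
hypothetical and are not Drill APIs, and the real vectors store their null 
flags and type codes differently.

```java
import java.util.Arrays;

// Simplified stand-in for a list's data vector and a union's type vector.
// None of these names are Drill APIs; they are hypothetical, for illustration only.
final class UnionPromotionSketch {
    static final byte NULL_TYPE = 0;   // assumed marker for "no value" in the type vector
    static final byte BIGINT_TYPE = 1; // assumed marker for the pre-promotion type

    // Behavior described in this issue: every slot gets the prior type,
    // regardless of whether the slot held a value.
    static byte[] fillTypesIgnoringNulls(boolean[] isSet, byte priorType) {
        byte[] types = new byte[isSet.length];
        Arrays.fill(types, priorType);
        return types;
    }

    // Proposed behavior: consult the null flag of each value.
    static byte[] fillTypesNullAware(boolean[] isSet, byte priorType) {
        byte[] types = new byte[isSet.length];
        for (int i = 0; i < isSet.length; i++) {
            types[i] = isSet[i] ? priorType : NULL_TYPE;
        }
        return types;
    }

    // Reads slot i the way a union reader would: null only if the type vector says so.
    static Long read(long[] values, byte[] types, int i) {
        return types[i] == NULL_TYPE ? null : Long.valueOf(values[i]);
    }
}
```

With values {{10, 0, 20}} and null flags {{set, unset, set}}, the first method 
makes the middle slot read back as 0, while the second keeps it null.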

Note: a new version, {{ListVector.convertToUnion()}}, was created for use in the 
new size-aware writers. The old version should be fixed or deprecated to avoid 
data corruption.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
