[ 
https://issues.apache.org/jira/browse/DRILL-7434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned DRILL-7434:
----------------------------------

    Assignee:     (was: Paul Rogers)

> TopNBatch constructs Union vector incorrectly
> ---------------------------------------------
>
>                 Key: DRILL-7434
>                 URL: https://issues.apache.org/jira/browse/DRILL-7434
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Paul Rogers
>            Priority: Major
>
> The Union type is an "experimental" type that has never been completed. Yet, 
> we use it as if it works.
> Consider the test {{TestTopNSchemaChanges.testMissingColumn()}}. Run this 
> with the new batch validator enabled. This test creates a union vector. Here 
> is how the schema looks:
> {noformat}
> (UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
>   children=([`internal` (MAP:REQUIRED), children=([`types` 
> (UINT1:REQUIRED)])])
> {noformat}
> This is very hard to follow because the Union vector structure is complex 
> (and has many issues.) Let's work though it.
> We are looking at the {{MaterializedField}} for the union vector. It tells us 
> that this Union has two types: {{FLOAT8}} and {{INT}}. All good.
> The Union has a vector per type, stored in an "internal map".' That map shows 
> up as child, it is there on the {{children}} list as {{internal}}. However, 
> the metadata claims that only one vector exists in that map: the {{types}} 
> vector (the one that tells us what type to use for each row.)  The vectors 
> for {{FLOAT8}} and {{INT}} are missing.
> If, however, we use our debugger and inspect the actual contents of the 
> {{internal}} map, we get the following:
> {noformat}
> [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`float8` 
> (FLOAT8:OPTIONAL)], [`int` (INT:OPTIONAL)])]
> {noformat}
> That is, the internal map has the correct schema, but the Union vector itself 
> has the wrong (incomplete) schema.
> This is an inherent design flaw with Union vector: it requires two copies of 
> the schema to be in sync. Further {{MaterializedField}} was designed to be 
> immutable, but the map and Union types require mutation. If the Union simply 
> points to the actual Map vector {{MaterializedField}}, it will drift out of 
> date since the map vector creates a new schema each time we add fields; the 
> Union vector ends up pointing to the old one.
> This is not a simple bug to fix, but the result of the bug is that the 
> vectors end up corrupted, as detected by the Batch Validator. In fact, the 
> bug itself is subtle.
> The TopNBatch does pass vector validation. However, because of the incorrect 
> metadata, the downstream {{RemovingRecordBatch}} creates the derived Union 
> vector incorrectly: it fails to set the value count for the {{INT}} type.
> {noformat}
> Found one or more vector errors from RemovingRecordBatch
> kl-type-INT - NullableIntVector: Row count = 3, but value count = 0
> {noformat}
> Where {{kl-type-INT}} is an ad-hoc way of saying we are checking the {{INT}} 
> type vector for a Union named {{kl}}.
> The schema of Union out of the {{RemovingRecordBatch}} has been truncated. 
> The Union itself:
> {noformat}
> [`kl` (UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
>   children=([`internal` (MAP:REQUIRED), children=([`types` 
> (UINT1:REQUIRED)])])]
> {noformat}
> The internal map:
> {noformat}
> [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`int` 
> (INT:OPTIONAL)])]
> {noformat}
> Notice that the {{FLOAT8}} vector has disappeared: the Union vector metadata 
> claims we have such a vector, but the internal map does not actually contain 
> the vector.
> The root cause is that the vector checker (indeed, any client) will call 
> {{UnionVector.getMember(type)}} to get a vector for a type. This method 
> includes a switch statement to call, say, {{getIntVector()}}. That method, in 
> turn, creates the vector if does not exist.
> But, since we are reading, we have an existing data batch. When we create a 
> new vector, we create it as zero size. Thus, we think we have n records 
> (three in this case), but we actually have zero. This kinda-sorta works 
> because the type vector won't ever contain an entry for the "runt" vector, so 
> we won't actually access data. But, this is an inconsistent structure. It 
> breaks if we peer inside, as we are doing in the batch validator.
> If we check for this case, we now get:
> {noformat}
> Found one or more vector errors from RemovingRecordBatch
> kl - UnionVector: Union vector includes type INT, but the internal map has no 
> matching member
> {noformat}
> This is why Union is such a mess: is this a bug or just a very fragile 
> design? I claim bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to