[ 
https://issues.apache.org/jira/browse/DRILL-7434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7434:
-------------------------------
    Description: 
The Union type is an "experimental" type that has never been completed. Yet, we 
use it as if it works.

Consider the test {{TestTopNSchemaChanges.testMissingColumn()}}. Run this with 
the new batch validator enabled. This test creates a union vector. Here is how 
the schema looks:

{noformat}
(UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
  children=([`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)])])
{noformat}

This is very hard to follow because the Union vector structure is complex (and 
has many issues.) Let's work though it.

We are looking at the {{MaterializedField}} for the union vector. It tells us 
that this Union has two types: {{FLOAT8}} and {{INT}}. All good.

The Union has a vector per type, stored in an "internal map".' That map shows 
up as child, it is there on the {{children}} list as {{internal}}. However, the 
metadata claims that only one vector exists in that map: the {{types}} vector 
(the one that tells us what type to use for each row.)  The vectors for 
{{FLOAT8}} and {{INT}} are missing.

If, however, we use our debugger and inspect the actual contents of the 
{{internal}} map, we get the following:

{noformat}
[`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`float8` 
(FLOAT8:OPTIONAL)], [`int` (INT:OPTIONAL)])]
{noformat}

That is, the internal map has the correct schema, but the Union vector itself 
has the wrong (incomplete) schema.

This is an inherent design flaw with Union vector: it requires two copies of 
the schema to be in sync. Further {{MaterializedField}} was designed to be 
immutable, but the map and Union types require mutation. If the Union simply 
points to the actual Map vector {{MaterializedField}}, it will drift out of 
date since the map vector creates a new schema each time we add fields; the 
Union vector ends up pointing to the old one.

This is not a simple bug to fix, but the result of the bug is that the vectors 
end up corrupted, as detected by the Batch Validator. In fact, the bug itself 
is subtle.

The TopNBatch does pass vector validation. However, because of the incorrect 
metadata, the downstream {{RemovingRecordBatch}} creates the derived Union 
vector incorrectly: it fails to set the value count for the {{INT}} type.

{noformat}
Found one or more vector errors from RemovingRecordBatch
kl-type-INT - NullableIntVector: Row count = 3, but value count = 0
{noformat}

Where {{kl-type-INT}} is an ad-hoc way of saying we are checking the {{INT}} 
type vector for a Union named {{kl}}.

The schema of Union out of the {{RemovingRecordBatch}} has been truncated. The 
Union itself:

{noformat}
[`kl` (UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
  children=([`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)])])]
{noformat}

The internal map:

{noformat}
[`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`int` 
(INT:OPTIONAL)])]
{noformat}

Notice that the {{FLOAT8}} vector has disappeared: the Union vector metadata 
claims we have such a vector, but the internal map does not actually contain 
the vector.

The root cause is that the vector checker (indeed, any client) will call 
{{UnionVector.getMember(type)}} to get a vector for a type. This method 
includes a switch statement to call, say, {{getIntVector()}}. That method, in 
turn, creates the vector if does not exist.

But, since we are reading, we have an existing data batch. When we create a new 
vector, we create it as zero size. Thus, we think we have n records (three in 
this case), but we actually have zero. This kinda-sorta works because the type 
vector won't ever contain an entry for the "runt" vector, so we won't actually 
access data. But, this is an inconsistent structure. It breaks if we peer 
inside, as we are doing in the batch validator.

If we check for this case, we now get:

{noformat}
Found one or more vector errors from RemovingRecordBatch
kl - UnionVector: Union vector includes type INT, but the internal map has no 
matching member
{noformat}

This is why Union is such a mess: is this a bug or just a very fragile design? 
I claim bug.

  was:
The Union type is an "experimental" type that has never been completed. Yet, we 
use it as if it works.

Consider the test {{TestTopNSchemaChanges.testMissingColumn()}}. Run this with 
the new batch validator enabled. This test creates a union vector. Here is how 
the schema looks:

{noformat}
(UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
  children=([`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)])])
{noformat}

This is very hard to follow because the Union vector structure is complex (and 
has many issues.) Let's work though it.

We are looking at the {{MaterializedField}} for the union vector. It tells us 
that this Union has two types: {{FLOAT8}} and {{INT}}. All good.

The Union has a vector per type, stored in an "internal map".' That map shows 
up as child, it is there on the {{children}} list as {{internal}}. However, the 
metadata claims that only one vector exists in that map: the {{types}} vector 
(the one that tells us what type to use for each row.)  The vectors for 
{{FLOAT8}} and {{INT}} are missing.

If, however, we use our debugger and inspect the actual contents of the 
{{internal}} map, we get the following:

{noformat}
[`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`float8` 
(FLOAT8:OPTIONAL)], [`int` (INT:OPTIONAL)])]
{noformat}

That is, the internal map has the correct schema, but the Union vector itself 
has the wrong (incomplete) schema.

This is an inherent design flaw with Union vector: it requires two copies of 
the schema to be in sync. Further {{MaterializedField}} was designed to be 
immutable, but the map and Union types require mutation. If the Union simply 
points to the actual Map vector {{MaterializedField}}, it will drift out of 
date since the map vector creates a new schema each time we add fields; the 
Union vector ends up pointing to the old one.

This is not a simple bug to fix, but the result of the bug is that the vectors 
end up corrupted, as detected by the Batch Validator. In fact, the bug itself 
is subtle.

The TopNBatch does pass vector validation. However, because of the incorrect 
metadata, the downstream {{RemovingRecordBatch}} creates the derived Union 
vector incorrectly: it fails to set the value count for the {{INT}} type.

{noformat}
Found one or more vector errors from RemovingRecordBatch
kl-type-INT - NullableIntVector: Row count = 3, but value count = 0
{noformat}

Where {{kl-type-INT}} is an ad-hoc way of saying we are checking the {{INT}} 
type vector for a Union named {{kl}}.


> TopNBatch constructs Union vector incorrectly
> ---------------------------------------------
>
>                 Key: DRILL-7434
>                 URL: https://issues.apache.org/jira/browse/DRILL-7434
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Paul Rogers
>            Priority: Major
>
> The Union type is an "experimental" type that has never been completed. Yet, 
> we use it as if it works.
> Consider the test {{TestTopNSchemaChanges.testMissingColumn()}}. Run this 
> with the new batch validator enabled. This test creates a union vector. Here 
> is how the schema looks:
> {noformat}
> (UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
>   children=([`internal` (MAP:REQUIRED), children=([`types` 
> (UINT1:REQUIRED)])])
> {noformat}
> This is very hard to follow because the Union vector structure is complex 
> (and has many issues.) Let's work though it.
> We are looking at the {{MaterializedField}} for the union vector. It tells us 
> that this Union has two types: {{FLOAT8}} and {{INT}}. All good.
> The Union has a vector per type, stored in an "internal map".' That map shows 
> up as child, it is there on the {{children}} list as {{internal}}. However, 
> the metadata claims that only one vector exists in that map: the {{types}} 
> vector (the one that tells us what type to use for each row.)  The vectors 
> for {{FLOAT8}} and {{INT}} are missing.
> If, however, we use our debugger and inspect the actual contents of the 
> {{internal}} map, we get the following:
> {noformat}
> [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`float8` 
> (FLOAT8:OPTIONAL)], [`int` (INT:OPTIONAL)])]
> {noformat}
> That is, the internal map has the correct schema, but the Union vector itself 
> has the wrong (incomplete) schema.
> This is an inherent design flaw with Union vector: it requires two copies of 
> the schema to be in sync. Further {{MaterializedField}} was designed to be 
> immutable, but the map and Union types require mutation. If the Union simply 
> points to the actual Map vector {{MaterializedField}}, it will drift out of 
> date since the map vector creates a new schema each time we add fields; the 
> Union vector ends up pointing to the old one.
> This is not a simple bug to fix, but the result of the bug is that the 
> vectors end up corrupted, as detected by the Batch Validator. In fact, the 
> bug itself is subtle.
> The TopNBatch does pass vector validation. However, because of the incorrect 
> metadata, the downstream {{RemovingRecordBatch}} creates the derived Union 
> vector incorrectly: it fails to set the value count for the {{INT}} type.
> {noformat}
> Found one or more vector errors from RemovingRecordBatch
> kl-type-INT - NullableIntVector: Row count = 3, but value count = 0
> {noformat}
> Where {{kl-type-INT}} is an ad-hoc way of saying we are checking the {{INT}} 
> type vector for a Union named {{kl}}.
> The schema of Union out of the {{RemovingRecordBatch}} has been truncated. 
> The Union itself:
> {noformat}
> [`kl` (UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
>   children=([`internal` (MAP:REQUIRED), children=([`types` 
> (UINT1:REQUIRED)])])]
> {noformat}
> The internal map:
> {noformat}
> [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`int` 
> (INT:OPTIONAL)])]
> {noformat}
> Notice that the {{FLOAT8}} vector has disappeared: the Union vector metadata 
> claims we have such a vector, but the internal map does not actually contain 
> the vector.
> The root cause is that the vector checker (indeed, any client) will call 
> {{UnionVector.getMember(type)}} to get a vector for a type. This method 
> includes a switch statement to call, say, {{getIntVector()}}. That method, in 
> turn, creates the vector if does not exist.
> But, since we are reading, we have an existing data batch. When we create a 
> new vector, we create it as zero size. Thus, we think we have n records 
> (three in this case), but we actually have zero. This kinda-sorta works 
> because the type vector won't ever contain an entry for the "runt" vector, so 
> we won't actually access data. But, this is an inconsistent structure. It 
> breaks if we peer inside, as we are doing in the batch validator.
> If we check for this case, we now get:
> {noformat}
> Found one or more vector errors from RemovingRecordBatch
> kl - UnionVector: Union vector includes type INT, but the internal map has no 
> matching member
> {noformat}
> This is why Union is such a mess: is this a bug or just a very fragile 
> design? I claim bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to