[ https://issues.apache.org/jira/browse/DRILL-7434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Rogers reassigned DRILL-7434: ---------------------------------- Assignee: (was: Paul Rogers) > TopNBatch constructs Union vector incorrectly > --------------------------------------------- > > Key: DRILL-7434 > URL: https://issues.apache.org/jira/browse/DRILL-7434 > Project: Apache Drill > Issue Type: Bug > Reporter: Paul Rogers > Priority: Major > > The Union type is an "experimental" type that has never been completed. Yet, > we use it as if it works. > Consider the test {{TestTopNSchemaChanges.testMissingColumn()}}. Run this > with the new batch validator enabled. This test creates a union vector. Here > is how the schema looks: > {noformat} > (UNION:OPTIONAL), subtypes=([FLOAT8, INT]), > children=([`internal` (MAP:REQUIRED), children=([`types` > (UINT1:REQUIRED)])]) > {noformat} > This is very hard to follow because the Union vector structure is complex > (and has many issues.) Let's work though it. > We are looking at the {{MaterializedField}} for the union vector. It tells us > that this Union has two types: {{FLOAT8}} and {{INT}}. All good. > The Union has a vector per type, stored in an "internal map".' That map shows > up as child, it is there on the {{children}} list as {{internal}}. However, > the metadata claims that only one vector exists in that map: the {{types}} > vector (the one that tells us what type to use for each row.) The vectors > for {{FLOAT8}} and {{INT}} are missing. > If, however, we use our debugger and inspect the actual contents of the > {{internal}} map, we get the following: > {noformat} > [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`float8` > (FLOAT8:OPTIONAL)], [`int` (INT:OPTIONAL)])] > {noformat} > That is, the internal map has the correct schema, but the Union vector itself > has the wrong (incomplete) schema. > This is an inherent design flaw with Union vector: it requires two copies of > the schema to be in sync. Further {{MaterializedField}} was designed to be > immutable, but the map and Union types require mutation. If the Union simply > points to the actual Map vector {{MaterializedField}}, it will drift out of > date since the map vector creates a new schema each time we add fields; the > Union vector ends up pointing to the old one. > This is not a simple bug to fix, but the result of the bug is that the > vectors end up corrupted, as detected by the Batch Validator. In fact, the > bug itself is subtle. > The TopNBatch does pass vector validation. However, because of the incorrect > metadata, the downstream {{RemovingRecordBatch}} creates the derived Union > vector incorrectly: it fails to set the value count for the {{INT}} type. > {noformat} > Found one or more vector errors from RemovingRecordBatch > kl-type-INT - NullableIntVector: Row count = 3, but value count = 0 > {noformat} > Where {{kl-type-INT}} is an ad-hoc way of saying we are checking the {{INT}} > type vector for a Union named {{kl}}. > The schema of Union out of the {{RemovingRecordBatch}} has been truncated. > The Union itself: > {noformat} > [`kl` (UNION:OPTIONAL), subtypes=([FLOAT8, INT]), > children=([`internal` (MAP:REQUIRED), children=([`types` > (UINT1:REQUIRED)])])] > {noformat} > The internal map: > {noformat} > [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`int` > (INT:OPTIONAL)])] > {noformat} > Notice that the {{FLOAT8}} vector has disappeared: the Union vector metadata > claims we have such a vector, but the internal map does not actually contain > the vector. > The root cause is that the vector checker (indeed, any client) will call > {{UnionVector.getMember(type)}} to get a vector for a type. This method > includes a switch statement to call, say, {{getIntVector()}}. That method, in > turn, creates the vector if does not exist. > But, since we are reading, we have an existing data batch. When we create a > new vector, we create it as zero size. Thus, we think we have n records > (three in this case), but we actually have zero. This kinda-sorta works > because the type vector won't ever contain an entry for the "runt" vector, so > we won't actually access data. But, this is an inconsistent structure. It > breaks if we peer inside, as we are doing in the batch validator. > If we check for this case, we now get: > {noformat} > Found one or more vector errors from RemovingRecordBatch > kl - UnionVector: Union vector includes type INT, but the internal map has no > matching member > {noformat} > This is why Union is such a mess: is this a bug or just a very fragile > design? I claim bug. -- This message was sent by Atlassian Jira (v8.3.4#803005)