[
https://issues.apache.org/jira/browse/DRILL-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485573#comment-17485573
]
ASF GitHub Bot commented on DRILL-8037:
---------------------------------------
paul-rogers commented on pull request #2364:
URL: https://github.com/apache/drill/pull/2364#issuecomment-1027598044
@vdiravka reached out to me on this bug. His explanation:
> The issue is the schema is changed for the second batch and it is reported
by SchemaTracker#isSameSchema
I suppose V1 reader just compared the same field type, but the new one
compares vectors by identity. So schemaChange is right thing for that test
case. But Drill Hash aggregate doesn't support schema change.
Here is my analysis:
You are tight that the `SchemaTracker` is the thing in EVF which looks for
schema changes. The rule is pretty simple: if the the kind of vector differs
from the previous batch, then a schema change occurred. This is based, in part,
on the observation that the sort operator can't handle any schema changes at
all: not INT/BIGINT, not NOT NULL to NULL. Sort combines vectors, and if they
are of a different type, the values won't fit together. The same is probably
true of the hash agg.
OK, so we agree that a type change is bad. What you're describing is the
same type, but a different vector. This is a bug in several ways, and not the
way you might think.
First, it seems logical that if column x is a non-null INT in one batch, and
is a non-null INT in the next batch, that no schema change has occurred. But,
Drill will surprise you. Most operators assume that the vector itself has not
changed: only the data stored within the vector. This is a subtle point, so let
me expand it.
A vector is a permanent holder or a column. The vector holds one or more
data buffers. Those buffers change from batch to batch. Think of it as a bucket
brigade: the old-style way that people fought fires. A chain of people forms
(the vectors). They hand buckets one to the next (the data buffers).
Now, I was pretty surprised when I first discovered this. So was Boaz when
he wrote hash agg. But, most of Drill's code assumes that, in setup, the code
binds to a specific vector. In the next batch, that same binding is valid; only
the data changes. This shows up in code gen and other places. In fact, a goodly
part of EVF is concerned with this need for "vector continuity" even across
readers. (Let that sink in: two readers that have nothing to do with each other
must somehow still share common vectors. That alone should tell us the vector
design in Drill is a bit strange and needs rethinking. But, that's another
topic.)
So, now let's review the bug. On the one hand, `SchemaTracker` has noticed
the vectors changed, and is (correctly) telling the downstream operators that
they must redo their bindings to the new vector.
But, on the other hand, `SchemaTracker` has been presented with two
different vectors for the same column. That should never happen unless there is
an actual type change (non-null to null, INT to BIGINT, etc.) There are
elaborate mechanisms to reuse vectors from one reader to the next.
So, the first thing to check is if the types are at all different. (If the
"major types" differ.) If so, then we have a legitimate schema change that
Drill can't handle.
Now, if you find that the types are the same, then we have a bug in EVF,
specifically in the `ResultVectorCacheImpl` class, since that's the thing which
is supposed to do the magic.
Let me ask another question. When this occurs, is it in the first batch of
the second (or later) reader? If so, then that reader should have pulled the
vector from the cache. The question is, why didn't it?
If the error occurs within the first reader, then the bug is more subtle.
How could the JSON reader accomplish that? The `ResultSetLoader` hides all that
detail, and it defers to the `ResultVectorCacheImpl `for vector continuity.
Yet another question is why these tests didn't fail 3 months ago when the
tests first ran. And, why did the tests pass way back when I wrote this code?
Did anything change elsewhere that could trigger this? I don't see how, but we
should probably ask the question.
Fortunately, these are unit tests, so we should be able to sort out the
issues without too much trouble.
I'll also note that there were a few fixes to EVF V2 that I made in my PR,
but those were around a subtle but in implicit columns.
Suggestion: give the above a shot to see if you can see the problem.
Otherwise, tomorrow I'll try to squeeze in time to grab this branch and do some
debugging. After all, I wrote this stuff originally, so I should try to make it
work.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
> Add V2 JSON Format Plugin based on EVF
> --------------------------------------
>
> Key: DRILL-8037
> URL: https://issues.apache.org/jira/browse/DRILL-8037
> Project: Apache Drill
> Issue Type: Sub-task
> Reporter: Vitalii Diravka
> Assignee: Vitalii Diravka
> Priority: Major
>
> This adds new V2 beta JSON Format Plugin based on the "Extended Vector
> Framework".
> This is follow up DRILL-6953 (was closed with the decision to merge it by
> small pieces).
> So it is based on [https://github.com/apache/drill/pull/1913] and
> [https://github.com/paul-rogers/drill/tree/DRILL-6953-rev2] work.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)