Paul Rogers created DRILL-5525:
----------------------------------
Summary: Inconsistent, unhelpful semantics for batch, field schema
comparison
Key: DRILL-5525
URL: https://issues.apache.org/jira/browse/DRILL-5525
Project: Apache Drill
Issue Type: Bug
Affects Versions: 1.10.0
Reporter: Paul Rogers
Assignee: Paul Rogers
Priority: Minor
Fix For: 1.11.0
Drill provides two classes to define schema at execution time:
* {{MaterializedField}} - describes one column (or a collection of columns for
a map)
* {{BatchSchema}} - describes the set of columns that make up a row
Each class provides an {{isEqual()}} method. However it seems these methods may
not be used much because the contain a number of flaws and contradictions.
* {{MaterializedField}} has a comment that says it compares only names; but
then it turns around and compares the object field-by-field using Protobuf:
{code}
// DRILL-1872: Compute equals only on key. See also the comment
// in MapVector$MapTransferPair
return this.name.equalsIgnoreCase(other.name) &&
Objects.equals(this.type, other.type);
{code}
* {{MaterializedField}} ends up doing a physical comparison of fields rather
than a logical comparison. Varchar columns are defined, at the logical level as
Varchar. But, at the physical level, a Varchar is a two-part column: it has a
child that represents the {{$offsets$}} vector. As a result, a comparison
between logical and physical schema returns false, even if both is just a
Varchar.
* {{BatchSchema}} ends up being inconsistent because of the above confusion. It
first (seemingly) compares names, then tries to compare types. But, because the
{{MaterializedField}} method actually compares all fields, the type comparison
does nothing.
{code}
if (!fields.equals(other.fields)) { // Compare all fields
return false;
}
for (int i = 0; i < fields.size(); i++) { // So this loop is a no-op
MajorType t1 = fields.get(i).getType();
MajorType t2 = other.fields.get(i).getType();
if (t1 == null) {
if (t2 != null) {
return false;
}
} else {
if (!majorTypeEqual(t1, t2)) {
return false;
}
}
}
{code}
The result is that one is not quite sure what the two methods are supposed to
compare. Is {{MaterializedField}} supposed to:
* Do a physical compare (existing behavior)?
* Do a name-only compare (as the comment suggests)?
* Do a logical comparison (as unit tests would want)?
And for {{BatchSchema}}, should it:
* Do a physical compare (existing behavior)?
* Do a type-aware comparison (as in the loop that does nothing)?
Further note that neither of the methods do anything special for map fields:
they end up being compared as part of the protobuf comparison. But, we shold
probably apply the same rules to map fields as we apply to top-level fields (as
expressed in the second loop above.)
This ticket requests:
* Document current uses.
* Determine the semantics of the {{isEqual()}} method.
* Create additional methods as needed: {{isPhysicallyEqual()}},
{{isLogicallyEqual()}}, {{hasSameNames()}}, or whatever.
Not that this issue is classic: it seems that "equal" is well defined concept,
but as the Greek philosopher
[Heraclitus|https://en.wikipedia.org/wiki/Heraclitus] suggested with his famous
quote that "you can never step into the same river twice", the concept of
"sameness" is fluid and ad-hoc. It is up to us to define the semantics for
equality, and sometimes we need more than one concept.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)