paul-rogers commented on a change in pull request #1870: DRILL-7359: Add support for DICT type in RowSet Framework
URL: https://github.com/apache/drill/pull/1870#discussion_r354501095
##########
File path:
exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/reader/UnionReaderImpl.java
##########
@@ -268,4 +274,51 @@ public String getAsString() {
}
return requireReader(type).getAsString();
}
+
+ private UnionReaderImpl getNullReader() {
+ AbstractObjectReader[] nullVariants = new AbstractObjectReader[variants.length];
+ for (int i = 0; i < variants.length; i++) {
+ nullVariants[i] = variants[i].createNullReader();
+ }
+ return new NullUnionReader(schema(), unionAccessor, nullVariants);
+ }
+
+ private static class NullUnionReader extends UnionReaderImpl {
+
+ private NullUnionReader(ColumnMetadata metadata, VectorAccessor va, AbstractObjectReader[] variants) {
Review comment:
@KazydubB, first, thanks for understanding my hastily-written example; I've
edited it to be accurate.
We are defining an API. APIs exist to support use cases, so it is very
helpful to think about those. (My bad for not doing that earlier; I was focused
more on the plumbing aspects at first. This is, in fact, why I still like
design docs, even in the era of Agile: they let us visualize these issues early
on.)
In my view, there are two key use cases: tests and production.
In tests, we simply want to verify that, say, a reader correctly created (or
filtered, or altered) a `DICT`. If a key exists, we want to verify its value;
if a key does not exist, we simply need to verify that fact. So, a generic
`boolean findKey(Object key)` is perfectly fine, and the overhead of a
type-based switch on every call is acceptable. (This is why tests often
use the `addRow()` methods which, internally, have a big, slow `if` statement
based on the type of the value.)
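To make that trade-off concrete, here is a minimal sketch of such a generic lookup. This is not Drill's actual `DictReader`; the class and method names are hypothetical, and the maps stand in for typed key vectors:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Drill's actual DictReader): a single generic
// findKey() that does a type check on every call. Convenient for tests,
// too slow for generated production code.
class GenericFindSketch {

  // Stand-ins for the typed key vectors a real DICT column would hold.
  private final Map<Integer, String> intKeys = new HashMap<>();
  private final Map<String, String> varcharKeys = new HashMap<>();

  void putInt(int key, String value) { intKeys.put(key, value); }
  void putVarchar(String key, String value) { varcharKeys.put(key, value); }

  // One generic entry point; the per-call instanceof dispatch is exactly
  // the overhead that makes this acceptable only in test code.
  boolean findKey(Object key) {
    if (key instanceof Integer) {
      return intKeys.containsKey(key);
    }
    if (key instanceof String) {
      return varcharKeys.containsKey(key);
    }
    throw new IllegalArgumentException(
        "Unsupported key type: " + key.getClass().getName());
  }
}
```

A test would then simply assert `findKey("foo")` is true for a key it wrote and false for one it did not, with no "null value" to inspect.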
The second use case is production. Here, we can assume our goal is maximum
performance via generated code. I would argue that we do not want any extra
code in the access path. Instead, the code should be specific to each type.
This is why the readers are designed to go directly from, say, `getString()`
into the code that grabs the bytes from the underlying value vector.
For production code, we'd want a type-aware "find" operation. There are a
couple of ways to do that. One is by adding a "key accessor":
```
class DictReader ... {
  KeyAccessor keyAccessor();
  ...
}

interface KeyAccessor {
  boolean findInt(int key);
  boolean findString(String key);
  ...
}
```
The accessor code gen could then generate type-specific search code by
creating a key accessor class for each supported key type. This would be done
in the build-time code gen, using Freemarker, as part of the existing
`columnAccessors.java` template and code gen.
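To show what the generated, type-specific code might look like, here is a hypothetical hand-written version for an `INT` key. The names (`IntKeyAccessor`, `posn`, the `int[]` standing in for the key vector) are illustrative, not Drill's actual generated classes:

```java
// Hypothetical hand-written stand-in for what the Freemarker template
// might generate for INT keys; names are illustrative, not Drill's code.
interface KeyAccessor {
  boolean findInt(int key);
  boolean findString(String key);
}

class IntKeyAccessor implements KeyAccessor {

  // Stand-in for the underlying key vector of the current row's dict entry.
  private final int[] keys;

  // Index of the most recently found key; a real reader would position the
  // value reader here.
  private int posn = -1;

  IntKeyAccessor(int[] keys) { this.keys = keys; }

  @Override
  public boolean findInt(int key) {
    // Type-specific search: no boxing, no per-call type switch.
    for (int i = 0; i < keys.length; i++) {
      if (keys[i] == key) {
        posn = i;
        return true;
      }
    }
    return false;
  }

  @Override
  public boolean findString(String key) {
    // Wrong accessor for this key type; generated code would never call it.
    throw new IllegalStateException("Key type is INT, not VARCHAR");
  }

  int valueIndex() { return posn; }
}
```

The point is that each generated accessor handles exactly one key type, so the per-row search path carries no generic dispatch.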
Now things get interesting. We need to generate code for each SQL operation
that uses a `DICT`. Let's make up an example:
```
SELECT ... FROM myTableWithDict WHERE d['foo'] = 20
```
Here, the story splits into two paths. Today, code gen works directly
against value vectors: you will need to generate the find code inline for each
operator. Longer term, it is my hope that we can shift to using the column
readers. (Doing so will simplify code gen, and put us in a position to use
multiple value vector formats, such as Arrow, if we wanted.)
So, let's imagine we are using your `DICT` readers in runtime code
generation. We want to generate code for the above example.
In either case (old-school or reader-based), we don't know the type until
the first row appears. At that point, we generate run-time code that, in our
example, uses a `VARCHAR` key to look up an `INT` value and send it to the
equality function. Very crudely:
```
DictReader dr;
KeyAccessor ka;
ScalarReader vr;
String constKey;
int constArg;

void doSetup() {
  dr = rowReader.dict("d");
  ka = dr.keyAccessor();
  vr = dr.value().scalar();
  constKey = "foo";
  constArg = 20;
}

void doEval() {
  boolean isEquals;
  boolean found = ka.findString(constKey);
  if (found) {
    isEquals = vr.getInt() == constArg;
  } else {
    isEquals = false;
  }
  // do something with isEquals...
}
```
The above is written in "code gen style", but greatly simplified.
So, we've looked at two use cases: test and production. In neither did we
need to work with a missing value. Instead, we observed the rule that a
non-existent key has no value (rather than a non-existent key having a value
which is structurally the same as a real value, just null).
Think of Java. A `null` in Java is the absence of a value; it is not, say, a
`Foo` object, with all the same fields and methods, that somehow has special
`null` semantics. (Yes, this is a huge source of debate in CS. Some languages,
like Scala, try to use objects, like `Option`, instead of `null`. But `null`
is still the higher-performance solution.) Further, in SQL, `NULL` is defined as
the absence of a value, not a value (say, a map) with special `NULL` semantics.
So, I don't think you need to try to define null readers: a `NULL` is a
missing value: there is nothing there, no structure to explore, no value to get.
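The Java analogy can be made concrete with plain Java (not Drill code): a lookup for a missing key yields emptiness, and there is no structurally complete "null object" for the caller to probe:

```java
import java.util.Map;
import java.util.Optional;

// Plain-Java illustration of "absence, not a null-flavored value": a
// missing key produces Optional.empty(), so there is no dummy object
// with fields and methods for the caller to traverse.
class AbsenceDemo {
  static Optional<Integer> lookup(Map<String, Integer> dict, String key) {
    return Optional.ofNullable(dict.get(key));
  }
}
```

The caller branches on presence (`isPresent()`), mirroring the `findString()`-then-read pattern in the generated-code sketch above, rather than reading through a "null reader".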
Does this make sense?
In summary:
* Define the use cases you want to support.
* Define how those use cases will look up keys and how they will handle
non-existent keys.
* Refine the reader API to work well for those use cases.
The good news is that the core functionality is now looking pretty good;
we're just fine-tuning the API to best serve the potential users of
your new code.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services