paul-rogers commented on a change in pull request #1870: DRILL-7359: Add support for DICT type in RowSet Framework URL: https://github.com/apache/drill/pull/1870#discussion_r354501095
########## File path: exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/reader/UnionReaderImpl.java ##########

```
@@ -268,4 +274,51 @@ public String getAsString() {
     }
     return requireReader(type).getAsString();
   }
+
+  private UnionReaderImpl getNullReader() {
+    AbstractObjectReader[] nullVariants = new AbstractObjectReader[variants.length];
+    for (int i = 0; i < variants.length; i++) {
+      nullVariants[i] = variants[i].createNullReader();
+    }
+    return new NullUnionReader(schema(), unionAccessor, nullVariants);
+  }
+
+  private static class NullUnionReader extends UnionReaderImpl {
+
+    private NullUnionReader(ColumnMetadata metadata, VectorAccessor va, AbstractObjectReader[] variants) {
```

Review comment:

@KazydubB, first, thanks for understanding my hastily-written example; I've edited it to be accurate.

We are defining an API. APIs exist to support use cases, so it is very helpful to think about those. (My bad for not doing that earlier; I was focused more on the plumbing aspects early on. This is, in fact, why I still like design docs, even in the era of Agile: they let us visualize these issues early on.)

In my view, there are two key use cases: tests and production.

In tests, we simply want to verify that, say, a reader correctly created (or filtered, or altered) a `DICT`. If a key exists, we want to verify its value. If a key does not exist, we simply need to verify that fact. So, having a generic `boolean findKey(Object key)` is perfectly fine. The overhead of doing a type-based switch on every call is acceptable. (This is why tests often use the `addRow()` methods which, internally, have a big, slow `if` statement based on the type of the value.)

The second use case is production. Here, we can assume our goal is maximum performance via generated code. I would argue that we do not want any extra code in the access path. Instead, the code should be specific to each type.
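To make that trade-off concrete, here is a minimal, hypothetical sketch of such a generic `findKey(Object key)`. None of these names come from the actual Drill reader API, and a Java map stands in for the DICT column's key/value vectors; the point is only that every call pays for a type-based dispatch before the real lookup runs:

```java
import java.util.Map;

// Hypothetical sketch of the test-oriented, generic lookup discussed
// above. A Java map stands in for the DICT column's key/value vectors.
class GenericDictLookup {

  private final Map<?, ?> entries;

  GenericDictLookup(Map<?, ?> entries) {
    this.entries = entries;
  }

  // Generic entry point: convenient for tests, but every call pays
  // for a type-based dispatch before the actual lookup runs.
  boolean findKey(Object key) {
    if (key instanceof Integer) {
      return findInt((Integer) key);
    } else if (key instanceof String) {
      return findString((String) key);
    }
    throw new IllegalArgumentException(
        "Unsupported key type: " + key.getClass().getName());
  }

  // Type-specific lookups: what production code would bind to
  // directly, skipping the dispatch above.
  private boolean findInt(int key) {
    return entries.containsKey(key);
  }

  private boolean findString(String key) {
    return entries.containsKey(key);
  }
}
```

Tests can call `findKey` with whatever key literal is handy; production code would instead bind directly to the type-specific method once the key type is known.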
This is why the readers are designed to go directly from, say, `getString()` into the code that grabs the bytes from the underlying value vector.

For production code, we'd want a type-aware "find" operation. There are a couple of ways to do that. One is by adding a "key accessor":

```
class DictReader ...
  KeyAccessor keyAccessor();
  ...
}

interface KeyAccessor {
  boolean findInt(int key);
  boolean findString(String key);
  ...
}
```

The accessor code gen could then generate type-specific search code by creating a key accessor class for each supported key type. This would be done in the build-time code gen, using Freemarker, as part of the existing `columnAccessors.java` template and code gen.

Now things get interesting. We need to generate code for each SQL operation that uses a `DICT`. Let's make up an example:

```
SELECT ... FROM myTableWithDict WHERE d['foo'] = 20
```

Here, the story splits into two paths. Today, code gen works directly against value vectors: you will need to generate the find code inline for each operator. Longer term, it is my hope that we can shift to using the column readers. (Doing so will simplify code gen and put us in a position to use multiple value vector formats, such as Arrow, if we wanted.)

So, let's imagine we are using your `DICT` readers in runtime code generation. We want to generate code for the above example. In either case (old-school or reader-based), we don't know the type until the first row appears. At that point, we generate run-time code that, in our example, uses a `VARCHAR` key to look up an `INT` value and send it to the equality function.
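As a hypothetical illustration of the kind of type-specific search code such a build-time code gen might emit for a `VARCHAR` key (the interface mirrors the `KeyAccessor` shape sketched earlier; the `String[]` is a simplified stand-in for the key vector of one row's dict, and the remembered entry index shows how a value reader could be positioned on a successful find):

```java
// Trimmed version of the proposed interface; a real one would declare
// one find method per supported key type (findInt, findString, ...).
interface KeyAccessor {
  boolean findString(String key);
}

// Hypothetical generated accessor for a VARCHAR key: no per-call type
// dispatch, just a direct search. A real implementation would compare
// bytes in the underlying key vector rather than Java strings.
class VarCharKeyAccessor implements KeyAccessor {

  private final String[] keys;   // stand-in for the row's key vector
  private int entryIndex = -1;   // index of the last key found

  VarCharKeyAccessor(String[] keys) {
    this.keys = keys;
  }

  @Override
  public boolean findString(String key) {
    for (int i = 0; i < keys.length; i++) {
      if (keys[i].equals(key)) {
        entryIndex = i;          // value reader would be positioned here
        return true;
      }
    }
    entryIndex = -1;             // key absent: no value to position on
    return false;
  }

  int entryIndex() {
    return entryIndex;
  }
}
```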
Very crudely, the generated code might look like this:

```
DictReader dr;
KeyAccessor ka;
ScalarReader vr;
String constKey;
int constArg;

void doSetup() {
  dr = rowReader.dict("d");
  ka = dr.keyAccessor();
  vr = dr.value().scalar();
  constKey = "foo";
  constArg = 20;
}

void doEval() {
  boolean isEquals;
  boolean found = ka.findString(constKey);
  if (found) {
    isEquals = vr.getInt() == constArg;
  } else {
    isEquals = false;
  }
  // do something with isEquals...
}
```

The above is written in "code gen style", but greatly simplified.

So, we've looked at two use cases: test and production. In neither did we need to work with a missing value. Instead, we observed the rule that a non-existent key has no value (rather than a non-existent key having a value which is structurally the same as a real value, just null).

Think of Java. A `null` in Java is the absence of a value; it is not, say, a `Foo` object, with all the same fields and methods, that somehow has special `null` semantics. (Yes, this is a huge source of debate in CS. Some languages, like Scala, try to use objects, like `Option`, instead of `null`. But `null` is still a higher-performance solution.) Further, in SQL, `NULL` is defined as the absence of a value, not a value (say, a map) with special `NULL` semantics.

So, I don't think you need to try to define null readers: a `NULL` is a missing value. There is nothing there, no structure to explore, no value to get. Does this make sense?

In summary:

* Define the use cases you want to support.
* Define how those use cases will look up keys and how they will handle non-existent keys.
* Refine the reader API to work well for those use cases.

The good news is that the core functionality is now looking pretty good; we're just fine-tuning the functionality to best serve the potential users of your new code.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services