paul-rogers commented on a change in pull request #1870: DRILL-7359: Add support for DICT type in RowSet Framework URL: https://github.com/apache/drill/pull/1870#discussion_r354501095
########## File path: exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/reader/UnionReaderImpl.java ##########

```
@@ -268,4 +274,51 @@ public String getAsString() {
     }
     return requireReader(type).getAsString();
   }
+
+  private UnionReaderImpl getNullReader() {
+    AbstractObjectReader[] nullVariants = new AbstractObjectReader[variants.length];
+    for (int i = 0; i < variants.length; i++) {
+      nullVariants[i] = variants[i].createNullReader();
+    }
+    return new NullUnionReader(schema(), unionAccessor, nullVariants);
+  }
+
+  private static class NullUnionReader extends UnionReaderImpl {
+
+    private NullUnionReader(ColumnMetadata metadata, VectorAccessor va, AbstractObjectReader[] variants) {
```

Review comment:

@KazydubB, first, thanks for understanding my hastily-written example; I've edited it to be accurate.

We are defining an API. APIs exist to support use cases, so it is very helpful to think about those. (My bad for not doing that earlier; I was focused more on the plumbing aspects early on. This is, in fact, why I still like design docs, even in the era of Agile: they let us visualize these issues early on.)

In my view, there are two key use cases: tests and production.

In tests, we simply want to verify that, say, a reader correctly created (or filtered, or altered) a `DICT`. If a key exists, we want to verify its value. If a key does not exist, we simply need to verify that fact. So, having a generic `boolean findKey(Object key)` is perfectly fine. The overhead of doing a type-based switch on every call is acceptable. (This is why tests often use the `addRow()` methods which, internally, have a big, slow `if` statement based on the type of the value.)

The second use case is production. Here, we can assume our goal is maximum performance via generated code. I would argue that we do not want any extra code in the access path. Instead, the code should be specific to each type.
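To make that trade-off concrete, here is a minimal, hypothetical sketch of such a generic `findKey(Object key)`. None of these names come from the actual Drill reader API, and a Java map stands in for the DICT column's key/value vectors; the point is only that every call pays for a type-based dispatch before the real lookup runs:

```java
import java.util.Map;

// Hypothetical sketch of the test-oriented, generic lookup discussed
// above. A Java map stands in for the DICT column's key/value vectors.
class GenericDictLookup {

  private final Map<?, ?> entries;

  GenericDictLookup(Map<?, ?> entries) {
    this.entries = entries;
  }

  // Generic entry point: convenient for tests, but every call pays
  // for a type-based dispatch before the actual lookup runs.
  boolean findKey(Object key) {
    if (key instanceof Integer) {
      return findInt((Integer) key);
    } else if (key instanceof String) {
      return findString((String) key);
    }
    throw new IllegalArgumentException(
        "Unsupported key type: " + key.getClass().getName());
  }

  // Type-specific lookups: what production code would bind to
  // directly, skipping the dispatch above.
  private boolean findInt(int key) {
    return entries.containsKey(key);
  }

  private boolean findString(String key) {
    return entries.containsKey(key);
  }
}
```

Tests can call `findKey` with whatever key literal is handy; production code would instead bind directly to the type-specific method once the key type is known.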
This is why the readers are designed to go directly from, say, `getString()` into the code that grabs the bytes from the underlying value vector.

For production code, we'd want a type-aware "find" operation. There are a couple of ways to do that. One is by adding a "key accessor":

```
class DictReader ...
  KeyAccessor keyAccessor();
  ...
}

interface KeyAccessor {
  boolean findInt(int key);
  boolean findString(String key);
  ...
}
```

The accessor code gen could then generate type-specific search code by creating a key accessor class for each supported key type. This would be done in the build-time code gen, using Freemarker, as part of the existing `columnAccessors.java` template and code gen.

Now things get interesting. We need to generate code for each SQL operation that uses a `DICT`. Let's make up an example:

```
SELECT ... FROM myTableWithDict WHERE d['foo'] = 20
```

Here, the story splits into two paths. Today, code gen works directly against value vectors: you will need to generate the find code inline for each operator. Longer term, it is my hope that we can shift to using the column readers. (Doing so will simplify code gen and put us in a position to use multiple value vector formats, such as Arrow, if we wanted.)

So, let's imagine we are using your `DICT` readers in runtime code generation. We want to generate code for the above example. In either case (old-school or reader-based), we don't know the type until the first row appears. At that point, we generate run-time code that, in our example, uses a `VARCHAR` key to look up an `INT` value and send it to the equality function.
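As a hypothetical illustration of the kind of type-specific search code such a build-time code gen might emit for a `VARCHAR` key (the interface mirrors the `KeyAccessor` shape sketched earlier; the `String[]` is a simplified stand-in for the key vector of one row's dict, and the remembered entry index shows how a value reader could be positioned on a successful find):

```java
// Trimmed version of the proposed interface; a real one would declare
// one find method per supported key type (findInt, findString, ...).
interface KeyAccessor {
  boolean findString(String key);
}

// Hypothetical generated accessor for a VARCHAR key: no per-call type
// dispatch, just a direct search. A real implementation would compare
// bytes in the underlying key vector rather than Java strings.
class VarCharKeyAccessor implements KeyAccessor {

  private final String[] keys;   // stand-in for the row's key vector
  private int entryIndex = -1;   // index of the last key found

  VarCharKeyAccessor(String[] keys) {
    this.keys = keys;
  }

  @Override
  public boolean findString(String key) {
    for (int i = 0; i < keys.length; i++) {
      if (keys[i].equals(key)) {
        entryIndex = i;          // value reader would be positioned here
        return true;
      }
    }
    entryIndex = -1;             // key absent: no value to position on
    return false;
  }

  int entryIndex() {
    return entryIndex;
  }
}
```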
Very crudely, the generated code might look like this:

```
DictReader dr;
KeyAccessor ka;
ScalarReader vr;
String constKey;
int constArg;

void doSetup() {
  dr = rowReader.dict("d");
  ka = dr.keyAccessor();
  vr = dr.value().scalar();
  constKey = "foo";
  constArg = 20;
}

void doEval() {
  boolean isEquals;
  boolean found = ka.findString(constKey);
  if (found) {
    isEquals = vr.getInt() == constArg;
  } else {
    isEquals = false;
  }
  // do something with isEquals...
}
```

The above is written in "code gen style", but greatly simplified.

So, we've looked at two use cases: test and production. In neither did we need to work with a missing value. Instead, we observed the rule that a non-existent key has no value (rather than a non-existent key having a value which is structurally the same as a real value, just null).

Think of Java. A `null` in Java is the absence of a value; it is not, say, a `Foo` object, with all the same fields and methods, that somehow has special `null` semantics. (Yes, this is a huge source of debate in CS. Some languages, like Scala, try to use objects, like `Option`, instead of `null`. But `null` is still a higher-performance solution.) Further, in SQL, `NULL` is defined as the absence of a value, not a value (say, a map) with special `NULL` semantics.

So, I don't think you need to try to define null readers: a `NULL` is a missing value. There is nothing there, no structure to explore, no value to get. Does this make sense?

In summary:

* Define the use cases you want to support.
* Define how those use cases will look up keys and how they will handle non-existent keys.
* Refine the reader API to work well for those use cases.

The good news is that the core functionality is now looking pretty good; we're just fine-tuning the functionality to best serve the potential users of your new code.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services