I'm working on the execwork stuff, and if someone would like to help out, here are a couple of things that need doing. I figured I'd drop them here and see if anyone wants to work on them in the next couple of days. If so, let me know; otherwise I'll be picking them up soon.
*RPC*
- RPC Layer Handshakes: Currently, I haven't implemented the handshake that should happen in either the User <> Bit or the Bit <> Bit layer. The plan was to use an additional inserted event handler that removes itself from the event pipeline after a successful handshake, or disconnects the channel on a failed handshake (with appropriate logging). The main validation at this point will simply be confirming that both endpoints are running the same protocol version. The only other information currently needed is that, in Bit <> Bit communication, the client should inform the server of its DrillEndpoint so that the server can map it for future communication in the other direction. (A rough handler sketch follows below.)

*DataTypes*
- General Expansion: Currently, we have a hodgepodge of datatypes within org.apache.drill.common.expression.types.DataType. We need to clean this up. There should be types that map to standard SQL types. My thinking is that we should actually have separate types for nullable, non-nullable and repeated values (required, optional and repeated in protobuf vernacular), since we'll generally operate on those values completely differently (and each type should reveal which it is). We should also have a relationship mapping from each to the others (e.g. how to convert a signed 32-bit int into a nullable signed 32-bit int). (A small sketch of this follows below.)
- Map Types: We don't need a nullable map, but we will need different map types: inline and fieldwise. I think these will be useful for the execution engine and will be leveraged depending on the particular needs -- for example, fieldwise will be a natural fit where we're operating on columnar data and doing an explode or other fieldwise nested operation, while inline will be useful when we're doing things like sorting a complex field. Inline will also be appropriate where we have extremely sparse record sets. We'll just need transformation methods between the two variations. In the case of a fieldwise map type field, the field is virtual and only exists to contain its child fields.
- Non-static DataTypes: We need types that don't fit the static data type model above. Examples include fixed-width types (e.g. a 10-byte string), polymorphic (inline encoded) types (number or string depending on the record) and repeated nested versions of our other types. These are a little more gnarly since we need to support canonicalization of them. Optiq has some methods for handling this kind of type system, so it probably makes sense to leverage that system.

*Expression Type Materialization*
- LogicalExpression type materialization: Right now, LogicalExpressions include support for late type binding. As part of the record batch execution path, these need to get materialized with correct casting, etc. based on the actual schema found. As such, we need a function that takes a LogicalExpression tree, applies a materialized BatchSchema and returns a new LogicalExpression tree with full type settings. As part of this process, all types need to be cast as necessary and full validation of the tree should be done. Timothy has validation work pending on a pull request that would be a good piece of code to leverage for this. We also have a visitor model for the expression tree that should be able to aid in constructing the updated LogicalExpression. (A sketch of the idea is below.)
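To make the RPC handshake item concrete, here's a rough sketch of the self-removing handler idea against Netty 4. HandshakeHandler and HandshakeMessage are made-up names for illustration (the real handshake would be a protobuf message, and the Bit <> Bit variant would also carry the client's DrillEndpoint):

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public class HandshakeHandler extends ChannelInboundHandlerAdapter {
  private static final int PROTOCOL_VERSION = 1;   // placeholder version constant

  @Override
  public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
    if (!(msg instanceof HandshakeMessage)) {
      ctx.close();                                 // anything before a handshake is a protocol violation
      return;
    }
    HandshakeMessage handshake = (HandshakeMessage) msg;
    if (handshake.getVersion() != PROTOCOL_VERSION) {
      ctx.close();                                 // failed handshake: log and disconnect
      return;
    }
    // Successful handshake: in the Bit <> Bit case we would record the remote
    // DrillEndpoint here, then drop out of the pipeline so later traffic skips the check.
    ctx.pipeline().remove(this);
  }

  /** Hypothetical stand-in for the real handshake protobuf message. */
  public interface HandshakeMessage {
    int getVersion();
  }
}

The nice property is that the handler only sits in the pipeline for the first message, so steady-state traffic pays no handshake cost.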
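For the required/optional/repeated split, one possible shape is below. MajorType, MinorType and DataMode are illustrative names only, not an existing API:

public final class MajorType {
  public enum DataMode { REQUIRED, OPTIONAL, REPEATED }
  public enum MinorType { INT, BIGINT, FLOAT8, VARCHAR, BIT }   // should map to standard SQL types

  private final MinorType minorType;
  private final DataMode mode;

  public MajorType(MinorType minorType, DataMode mode) {
    this.minorType = minorType;
    this.mode = mode;
  }

  public MinorType getMinorType() { return minorType; }
  public DataMode getMode() { return mode; }

  // Relationship mapping between the variants, e.g. required INT -> nullable INT.
  public MajorType asRequired() { return new MajorType(minorType, DataMode.REQUIRED); }
  public MajorType asOptional() { return new MajorType(minorType, DataMode.OPTIONAL); }
  public MajorType asRepeated() { return new MajorType(minorType, DataMode.REPEATED); }
}

The point is that the mode is baked into the type itself, so every consumer can see at a glance which variant it is dealing with and how to get from one to the others.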
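For the type materialization item, here's a stripped-down sketch of the visitor-based rewrite. LogicalExpression, BatchSchema and the node classes are simplified stand-ins for the real ones, and MajorType is from the type sketch above:

import java.util.ArrayList;
import java.util.List;

interface LogicalExpression {
  <T> T accept(ExprVisitor<T> visitor);
}

interface ExprVisitor<T> {
  T visitFieldReference(FieldReference ref);
  T visitFunctionCall(FunctionCall call);
}

final class FieldReference implements LogicalExpression {
  final String path;
  final MajorType type;              // null until materialized
  FieldReference(String path, MajorType type) { this.path = path; this.type = type; }
  public <T> T accept(ExprVisitor<T> v) { return v.visitFieldReference(this); }
}

final class FunctionCall implements LogicalExpression {
  final String name;
  final List<LogicalExpression> args;
  FunctionCall(String name, List<LogicalExpression> args) { this.name = name; this.args = args; }
  public <T> T accept(ExprVisitor<T> v) { return v.visitFunctionCall(this); }
}

interface BatchSchema {
  MajorType typeOf(String path);     // type actually observed in the incoming batch
}

// Rebuilds the tree with concrete types pulled from the batch schema; a fuller
// version would also insert implicit casts and validate the whole tree.
final class MaterializingVisitor implements ExprVisitor<LogicalExpression> {
  private final BatchSchema schema;
  MaterializingVisitor(BatchSchema schema) { this.schema = schema; }

  public LogicalExpression visitFieldReference(FieldReference ref) {
    return new FieldReference(ref.path, schema.typeOf(ref.path));   // bind the found type
  }

  public LogicalExpression visitFunctionCall(FunctionCall call) {
    List<LogicalExpression> newArgs = new ArrayList<>();
    for (LogicalExpression arg : call.args) {
      newArgs.add(arg.accept(this));                                // recurse bottom-up
    }
    return new FunctionCall(call.name, newArgs);
  }
}

The entry point would then be something like expr.accept(new MaterializingVisitor(schema)), which hands back the fully typed copy of the tree.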
- LogicalExpression to Java expression conversion: We need to be able to convert our logical expressions into Java code expressions. Initially, this should be done in a simplistic way, using something like implicit boxing just to get something working. This will likely be specialized per major type (nullable, non-nullable and repeated), and a framework that distinguishes LogicalExpressions by these types probably makes the most sense.

*JDBC*
- The Drill JDBC driver layer needs to be updated to leverage our ZooKeeper coordination locations so that it can correctly find the cluster. (A loose discovery sketch is below.)
- The Drill JDBC driver should also manage reconnects, so that if it loses its connection to a particular Drillbit partner it will reconnect to another available node in the cluster.
- Someone should point SQuirreL at Julian's latest work and see how things go...

*ByteCode Engineering*
- We need to put together a concrete class materialization strategy. My thinking for relational operators and code generation is that in most cases we'll have an interface and a template class for a particular relational operator. We will build a template class that has all the generic stuff implemented but makes calls to empty methods where it expects lower-level operations to occur. This allows things like looping and certain types of null management to be fully materialized in source code without having to deal with the complexities of ByteCode generation. It also eases testing complexity. When a particular implementation is required, the Drillbit will be responsible for generating updated method bodies as required for the record-level expressions, marking all the methods and the class as final, then loading the implementation into the query-level classloader. Note that the production Drillbit will never load the template class into the JVM and will simply utilize it in ByteCode form. I was hoping someone could take a look at pulling together a cohesive approach to doing this using ASM and Janino (likely utilizing the JDK commons-compiler mode). The interface should be pretty simple: the input is an interface, a template class name, a set of (method_signature, method_body_text) objects and a varargs of objects that are required for object instantiation. The return should be an instance of the interface. It should check things like each provided method_signature matching an available method block, the method blocks being replaced being empty, the object constructor matching the set of object arguments provided in the instantiation request, etc. (A sketch of the entry point follows below.)
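For the ZooKeeper piece of the JDBC work, the discovery idea is roughly the following. The znode path and the host:port encoding are invented for illustration; the real layout comes from our cluster coordination code:

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class DrillbitLocator {
  // Picks one registered Drillbit; on connection loss the driver would call this again
  // to find another live node instead of failing the session.
  public static String pickDrillbit(String zkConnect) throws Exception {
    ZooKeeper zk = new ZooKeeper(zkConnect, 30000, new Watcher() {
      public void process(WatchedEvent event) { /* ignore for this sketch */ }
    });
    try {
      List<String> bits = zk.getChildren("/drill/drillbits", false);  // hypothetical path
      if (bits.isEmpty()) {
        throw new IllegalStateException("no Drillbits registered");
      }
      Collections.shuffle(bits);      // naive spread across the cluster
      return bits.get(0);             // assume "host:port" is encoded in the node name
    } finally {
      zk.close();
    }
  }
}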
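For the class materialization strategy, the entry point I have in mind is roughly the shape below. All names are hypothetical; the implementation would sit on top of ASM plus Janino in JDK commons-compiler mode:

import java.util.List;

/** A method to be rewritten in the template: its signature plus the generated Java body. */
final class MethodReplacement {
  final String signature;   // e.g. "void doEval(int inIndex, int outIndex)"
  final String bodyText;    // record-level expression code generated per query
  MethodReplacement(String signature, String bodyText) {
    this.signature = signature;
    this.bodyText = bodyText;
  }
}

interface ClassMaterializer {
  /**
   * Loads the template's ByteCode (never the template class itself), swaps the empty
   * method bodies for the generated ones, marks the methods and class final, defines
   * the result in the query-level classloader and returns a constructed instance.
   * Should fail fast if a signature doesn't match an empty template method or if
   * constructorArgs don't match an available constructor.
   */
  <T> T materialize(Class<T> iface,
                    String templateClassName,
                    List<MethodReplacement> methods,
                    Object... constructorArgs) throws ClassTransformationException;
}

/** Hypothetical exception for signature/constructor mismatches and compile failures. */
class ClassTransformationException extends Exception {
  ClassTransformationException(String message) { super(message); }
}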
*ByteBuf Improvements*
- Our BufferAllocator should support child allocators (getChild()) with their own memory maximums and accounting, so we can determine the memory overhead of particular queries. We also need to be able to release an entire child allocation at once. (A minimal sketch is below.)
- We need to create a number of primitive-type-specific wrapping classes for ByteBuf. These additions include fixed-offset indexing for operations (e.g. index 1 of an int buffer should be at byte 4), support for unsigned values (my preference would be to leverage the work in Guava if that makes sense) and changing the hard bounds checks to softer assert checks to increase production performance. While we could do this through the ByteBuf interface, from everything I've experienced and read we need to minimize issues with inlining and performance, so we really need to be able to modify/refer to PooledUnsafeDirectByteBuf directly in the wrapping classes. Of course, it is a final, package-private class. Short term, that means we really need to create a number of specific buffer types that wrap it and just put them in the io.netty.buffer package (or alternatively create a Drill version or wrapper).
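On the allocator side, a minimal sketch of the getChild() idea (names illustrative):

import io.netty.buffer.ByteBuf;

public interface BufferAllocator extends AutoCloseable {
  ByteBuf buffer(int size);

  /** A child with its own maximum, so per-query memory overhead can be accounted separately. */
  BufferAllocator getChild(long maxBytes);

  /** Bytes currently held by this allocator (and its children). */
  long getAllocatedMemory();

  /** Releases the entire (child) allocation at once. */
  @Override
  void close();
}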
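And for the typed wrappers, roughly the following, shown here against the plain ByteBuf interface since PooledUnsafeDirectByteBuf is package-private; the real version would live in io.netty.buffer so it can refer to the concrete class directly:

import io.netty.buffer.ByteBuf;

public final class IntBuf {
  private static final int WIDTH = 4;   // fixed element width in bytes
  private final ByteBuf buf;
  private final int capacity;           // capacity in ints, not bytes

  public IntBuf(ByteBuf buf) {
    this.buf = buf;
    this.capacity = buf.capacity() / WIDTH;
  }

  /** Element-indexed access: index 1 lands at byte offset 4. */
  public int get(int index) {
    assert index >= 0 && index < capacity : "index out of bounds";   // soft check only
    return buf.getInt(index * WIDTH);
  }

  public void set(int index, int value) {
    assert index >= 0 && index < capacity : "index out of bounds";
    buf.setInt(index * WIDTH, value);
  }

  /** Unsigned read, widening a 4-byte unsigned value into a long. */
  public long getUnsigned(int index) {
    assert index >= 0 && index < capacity : "index out of bounds";
    return buf.getUnsignedInt(index * WIDTH);
  }
}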
