I'm working on the execwork stuff, and if someone would like to help out, here are a couple of things that need doing. I figured I'd drop them here and see if anyone wants to work on them in the next couple of days. If so, let me know; otherwise I'll be picking them up soon.
*RPC*
- RPC Layer Handshakes: Currently, I haven't implemented the handshake that should happen in either the User <> Bit or the Bit <> Bit layer. The plan was to use an additional inserted event handler that removes itself from the event pipeline after a successful handshake, or disconnects the channel on a failed handshake (with appropriate logging). The main validation at this point will simply be confirming that both endpoints are running the same protocol version. The only other information currently needed is that, in Bit <> Bit communication, the client should inform the server of its DrillEndpoint so that the server can map it for future communication in the other direction. (A rough handler sketch follows below.)

*DataTypes*
- General Expansion: Currently, we have a hodgepodge of datatypes within org.apache.drill.common.expression.types.DataType. We need to clean this up. There should be types that map to standard SQL types. My thinking is that we should actually have separate types for nullable, non-nullable and repeated values (required, optional and repeated in protobuf vernacular), since we'll generally operate on those values completely differently (and each type should reveal which it is). We should also have a relationship mapping from each to the others (e.g. how to convert a signed 32-bit int into a nullable signed 32-bit int). (A small sketch of this follows below.)
- Map Types: We don't need a nullable map, but we will need different map types: inline and fieldwise. I think these will be useful for the execution engine and will be leveraged depending on the particular needs -- for example, fieldwise will be a natural fit where we're operating on columnar data and doing an explode or other fieldwise nested operation, while inline will be useful when we're doing things like sorting a complex field. Inline will also be appropriate where we have extremely sparse record sets. We'll just need transformation methods between the two variations. In the case of a fieldwise map type field, the field is virtual and only exists to contain its child fields.
- Non-static DataTypes: We need types that don't fit the static data type model above. Examples include fixed-width types (e.g. a 10-byte string), polymorphic (inline encoded) types (number or string depending on the record) and repeated nested versions of our other types. These are a little more gnarly since we need to support canonicalization of them. Optiq has some methods for handling this kind of type system, so it probably makes sense to leverage that system.

*Expression Type Materialization*
- LogicalExpression type materialization: Right now, LogicalExpressions include support for late type binding. As part of the record batch execution path, these need to get materialized with correct casting, etc. based on the actual schema found. As such, we need a function that takes a LogicalExpression tree, applies a materialized BatchSchema and returns a new LogicalExpression tree with full type settings. As part of this process, all types need to be cast as necessary and full validation of the tree should be done. Timothy has validation work pending on a pull request that would be a good piece of code to leverage for this. We also have a visitor model for the expression tree that should be able to aid in constructing the updated LogicalExpression. (A sketch of the idea is below.)
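To make the RPC handshake item concrete, here's a rough sketch of the self-removing handler idea against Netty 4. HandshakeHandler and HandshakeMessage are made-up names for illustration (the real handshake would be a protobuf message, and the Bit <> Bit variant would also carry the client's DrillEndpoint):

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public class HandshakeHandler extends ChannelInboundHandlerAdapter {
  private static final int PROTOCOL_VERSION = 1;   // placeholder version constant

  @Override
  public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
    if (!(msg instanceof HandshakeMessage)) {
      ctx.close();                                 // anything before a handshake is a protocol violation
      return;
    }
    HandshakeMessage handshake = (HandshakeMessage) msg;
    if (handshake.getVersion() != PROTOCOL_VERSION) {
      ctx.close();                                 // failed handshake: log and disconnect
      return;
    }
    // Successful handshake: in the Bit <> Bit case we would record the remote
    // DrillEndpoint here, then drop out of the pipeline so later traffic skips the check.
    ctx.pipeline().remove(this);
  }

  /** Hypothetical stand-in for the real handshake protobuf message. */
  public interface HandshakeMessage {
    int getVersion();
  }
}

The nice property is that the handler only sits in the pipeline for the first message, so steady-state traffic pays no handshake cost.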
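For the required/optional/repeated split, one possible shape is below. MajorType, MinorType and DataMode are illustrative names only, not an existing API:

public final class MajorType {
  public enum DataMode { REQUIRED, OPTIONAL, REPEATED }
  public enum MinorType { INT, BIGINT, FLOAT8, VARCHAR, BIT }   // should map to standard SQL types

  private final MinorType minorType;
  private final DataMode mode;

  public MajorType(MinorType minorType, DataMode mode) {
    this.minorType = minorType;
    this.mode = mode;
  }

  public MinorType getMinorType() { return minorType; }
  public DataMode getMode() { return mode; }

  // Relationship mapping between the variants, e.g. required INT -> nullable INT.
  public MajorType asRequired() { return new MajorType(minorType, DataMode.REQUIRED); }
  public MajorType asOptional() { return new MajorType(minorType, DataMode.OPTIONAL); }
  public MajorType asRepeated() { return new MajorType(minorType, DataMode.REPEATED); }
}

The point is that the mode is baked into the type itself, so every consumer can see at a glance which variant it is dealing with and how to get from one to the others.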
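For the type materialization item, here's a stripped-down sketch of the visitor-based rewrite. LogicalExpression, BatchSchema and the node classes are simplified stand-ins for the real ones, and MajorType is from the type sketch above:

import java.util.ArrayList;
import java.util.List;

interface LogicalExpression {
  <T> T accept(ExprVisitor<T> visitor);
}

interface ExprVisitor<T> {
  T visitFieldReference(FieldReference ref);
  T visitFunctionCall(FunctionCall call);
}

final class FieldReference implements LogicalExpression {
  final String path;
  final MajorType type;              // null until materialized
  FieldReference(String path, MajorType type) { this.path = path; this.type = type; }
  public <T> T accept(ExprVisitor<T> v) { return v.visitFieldReference(this); }
}

final class FunctionCall implements LogicalExpression {
  final String name;
  final List<LogicalExpression> args;
  FunctionCall(String name, List<LogicalExpression> args) { this.name = name; this.args = args; }
  public <T> T accept(ExprVisitor<T> v) { return v.visitFunctionCall(this); }
}

interface BatchSchema {
  MajorType typeOf(String path);     // type actually observed in the incoming batch
}

// Rebuilds the tree with concrete types pulled from the batch schema; a fuller
// version would also insert implicit casts and validate the whole tree.
final class MaterializingVisitor implements ExprVisitor<LogicalExpression> {
  private final BatchSchema schema;
  MaterializingVisitor(BatchSchema schema) { this.schema = schema; }

  public LogicalExpression visitFieldReference(FieldReference ref) {
    return new FieldReference(ref.path, schema.typeOf(ref.path));   // bind the found type
  }

  public LogicalExpression visitFunctionCall(FunctionCall call) {
    List<LogicalExpression> newArgs = new ArrayList<>();
    for (LogicalExpression arg : call.args) {
      newArgs.add(arg.accept(this));                                // recurse bottom-up
    }
    return new FunctionCall(call.name, newArgs);
  }
}

The entry point would then be something like expr.accept(new MaterializingVisitor(schema)), which hands back the fully typed copy of the tree.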
- LogicalExpression to Java expression conversion: We need to be able to convert our logical expressions into Java code expressions. Initially, this should be done in a simplistic way, using something like implicit boxing just to get something working. This will likely be specialized per major type (nullable, non-nullable and repeated), and a framework that distinguishes LogicalExpressions by these types probably makes the most sense.

*JDBC*
- The Drill JDBC driver layer needs to be updated to leverage our ZooKeeper coordination locations so that it can correctly find the cluster. (A loose discovery sketch is below.)
- The Drill JDBC driver should also manage reconnects, so that if it loses its connection to a particular Drillbit partner it will reconnect to another available node in the cluster.
- Someone should point SQuirreL at Julian's latest work and see how things go...

*ByteCode Engineering*
- We need to put together a concrete class materialization strategy. My thinking for relational operators and code generation is that in most cases we'll have an interface and a template class for a particular relational operator. We will build a template class that has all the generic stuff implemented but makes calls to empty methods where it expects lower-level operations to occur. This allows things like looping and certain types of null management to be fully materialized in source code without having to deal with the complexities of ByteCode generation. It also eases testing complexity. When a particular implementation is required, the Drillbit will be responsible for generating updated method bodies as required for the record-level expressions, marking all the methods and the class as final, then loading the implementation into the query-level classloader. Note that the production Drillbit will never load the template class into the JVM and will simply utilize it in ByteCode form. I was hoping someone could take a look at pulling together a cohesive approach to doing this using ASM and Janino (likely utilizing the JDK commons-compiler mode). The interface should be pretty simple: the input is an interface, a template class name, a set of (method_signature, method_body_text) objects and a varargs of objects that are required for object instantiation. The return should be an instance of the interface. It should check things like each provided method_signature matching an available method block, the method blocks being replaced being empty, the object constructor matching the set of object arguments provided in the instantiation request, etc. (A sketch of the entry point follows below.)
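For the ZooKeeper piece of the JDBC work, the discovery idea is roughly the following. The znode path and the host:port encoding are invented for illustration; the real layout comes from our cluster coordination code:

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class DrillbitLocator {
  // Picks one registered Drillbit; on connection loss the driver would call this again
  // to find another live node instead of failing the session.
  public static String pickDrillbit(String zkConnect) throws Exception {
    ZooKeeper zk = new ZooKeeper(zkConnect, 30000, new Watcher() {
      public void process(WatchedEvent event) { /* ignore for this sketch */ }
    });
    try {
      List<String> bits = zk.getChildren("/drill/drillbits", false);  // hypothetical path
      if (bits.isEmpty()) {
        throw new IllegalStateException("no Drillbits registered");
      }
      Collections.shuffle(bits);      // naive spread across the cluster
      return bits.get(0);             // assume "host:port" is encoded in the node name
    } finally {
      zk.close();
    }
  }
}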
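For the class materialization strategy, the entry point I have in mind is roughly the shape below. All names are hypothetical; the implementation would sit on top of ASM plus Janino in JDK commons-compiler mode:

import java.util.List;

/** A method to be rewritten in the template: its signature plus the generated Java body. */
final class MethodReplacement {
  final String signature;   // e.g. "void doEval(int inIndex, int outIndex)"
  final String bodyText;    // record-level expression code generated per query
  MethodReplacement(String signature, String bodyText) {
    this.signature = signature;
    this.bodyText = bodyText;
  }
}

interface ClassMaterializer {
  /**
   * Loads the template's ByteCode (never the template class itself), swaps the empty
   * method bodies for the generated ones, marks the methods and class final, defines
   * the result in the query-level classloader and returns a constructed instance.
   * Should fail fast if a signature doesn't match an empty template method or if
   * constructorArgs don't match an available constructor.
   */
  <T> T materialize(Class<T> iface,
                    String templateClassName,
                    List<MethodReplacement> methods,
                    Object... constructorArgs) throws ClassTransformationException;
}

/** Hypothetical exception for signature/constructor mismatches and compile failures. */
class ClassTransformationException extends Exception {
  ClassTransformationException(String message) { super(message); }
}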
*ByteBuf Improvements*
- Our BufferAllocator should support child allocators (getChild()) with their own memory maximums and accounting, so we can determine the memory overhead of particular queries. We also need to be able to release an entire child allocation at once. (A minimal sketch is below.)
- We need to create a number of primitive-type-specific wrapping classes for ByteBuf. These additions include fixed-offset indexing for operations (e.g. index 1 of an int buffer should be at byte 4), support for unsigned values (my preference would be to leverage the work in Guava if that makes sense) and changing the hard bounds checks to softer assert checks to increase production performance. While we could do this through the ByteBuf interface, from everything I've experienced and read we need to minimize issues with inlining and performance, so we really need to be able to modify/refer to PooledUnsafeDirectByteBuf directly in the wrapping classes. Of course, it is a final, package-private class. Short term, that means we really need to create a number of specific buffer types that wrap it and just put them in the io.netty.buffer package (or alternatively create a Drill version or wrapper).
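On the allocator side, a minimal sketch of the getChild() idea (names illustrative):

import io.netty.buffer.ByteBuf;

public interface BufferAllocator extends AutoCloseable {
  ByteBuf buffer(int size);

  /** A child with its own maximum, so per-query memory overhead can be accounted separately. */
  BufferAllocator getChild(long maxBytes);

  /** Bytes currently held by this allocator (and its children). */
  long getAllocatedMemory();

  /** Releases the entire (child) allocation at once. */
  @Override
  void close();
}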
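And for the typed wrappers, roughly the following, shown here against the plain ByteBuf interface since PooledUnsafeDirectByteBuf is package-private; the real version would live in io.netty.buffer so it can refer to the concrete class directly:

import io.netty.buffer.ByteBuf;

public final class IntBuf {
  private static final int WIDTH = 4;   // fixed element width in bytes
  private final ByteBuf buf;
  private final int capacity;           // capacity in ints, not bytes

  public IntBuf(ByteBuf buf) {
    this.buf = buf;
    this.capacity = buf.capacity() / WIDTH;
  }

  /** Element-indexed access: index 1 lands at byte offset 4. */
  public int get(int index) {
    assert index >= 0 && index < capacity : "index out of bounds";   // soft check only
    return buf.getInt(index * WIDTH);
  }

  public void set(int index, int value) {
    assert index >= 0 && index < capacity : "index out of bounds";
    buf.setInt(index * WIDTH, value);
  }

  /** Unsigned read, widening a 4-byte unsigned value into a long. */
  public long getUnsigned(int index) {
    assert index >= 0 && index < capacity : "index out of bounds";
    return buf.getUnsignedInt(index * WIDTH);
  }
}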
