Re: B[yi]teSize execwork tasks someone could potentially help out with...

David Alves Thu, 25 Apr 2013 21:23:27 -0700

… btw thank you for all the work in laying this out.

Best
David


On Apr 25, 2013, at 11:10 PM, David Alves <[email protected]> wrote:

> Hi Jacques
> 
>       I can take the RPC stuff.
>       Have you made any progress in Bit<>Bit comms?
> 
> Best
> David
> 
> On Apr 25, 2013, at 11:06 PM, Jacques Nadeau <[email protected]> wrote:
> 
>> I'm working on the execwork stuff and if someone would like to help out,
>> here are a couple of things that need doing.  I figured I'd drop them here
>> and see if anyone wants to work on them in the next couple of days.  If so,
>> let me know; otherwise I'll be picking them up soon.
>> 
>> *RPC*
>> - RPC Layer Handshakes: Currently, I haven't implemented the handshake that
>> should happen in either the User <> Bit or the Bit <> Bit layer.  The plan
>> was to use an additional inserted event handler that removed itself from
>> the event pipeline after a successful handshake or disconnected the channel
>> on a failed handshake (with appropriate logging).  The main validation at
>> this point will be simply confirming that both endpoints are running on the
>> same protocol version.   The only other information that is currently
>> needed is that, in the Bit <> Bit communication, the client should
>> inform the server of its DrillEndpoint so that the server can then map that
>> for future communication in the other direction.
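
A minimal sketch of the handshake check described above, in plain Java with no Netty dependency. Every name here (HandshakeSketch, Handshake, PROTOCOL_VERSION, the string channel ids) is hypothetical and not taken from the execwork code. It covers only the version validation and the endpoint mapping; the self-removing pipeline handler is left out.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Logger;

// Hypothetical sketch: none of these names come from Drill or Netty.
final class HandshakeSketch {
    private static final Logger LOG = Logger.getLogger("HandshakeSketch");

    // Assumed version constant; the real value would live with the protocol definition.
    static final int PROTOCOL_VERSION = 1;

    /** First message on a new channel. endpoint is null for User <> Bit. */
    static final class Handshake {
        final int protocolVersion;
        final String endpoint;

        Handshake(int protocolVersion, String endpoint) {
            this.protocolVersion = protocolVersion;
            this.endpoint = endpoint;
        }
    }

    // Server-side map from channel id to the endpoint the client announced.
    private final Map<String, String> endpointByChannel = new ConcurrentHashMap<>();

    /** Returns true if the channel may proceed, false if it should be disconnected. */
    boolean accept(String channelId, Handshake handshake) {
        if (handshake.protocolVersion != PROTOCOL_VERSION) {
            LOG.warning("Handshake failed on " + channelId + ": expected protocol version "
                + PROTOCOL_VERSION + " but got " + handshake.protocolVersion);
            return false;
        }
        if (handshake.endpoint != null) {
            // Bit <> Bit: remember the endpoint for traffic in the other direction.
            endpointByChannel.put(channelId, handshake.endpoint);
        }
        return true;
    }

    /** Endpoint recorded for a channel, if the client supplied one. */
    Optional<String> endpointFor(String channelId) {
        return Optional.ofNullable(endpointByChannel.get(channelId));
    }
}
```

In a real pipeline the handler would call something like accept() on the first inbound message, then either remove itself or close the channel depending on the result.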
>> 
>> *DataTypes*
>> - General Expansion: Currently, we have a hodgepodge of datatypes within
>> the org.apache.drill.common.expression.types.DataType.  We need to clean
>> this up.  There should be types that map to standard sql types.  My
>> thinking is that we should actually have separate types for each of
>> nullable, non-nullable and repeated (optional, required and repeated in
>> protobuf vernacular) since we'll generally operate with those values
>> completely differently (and that each type should reveal which it is).  We
>> should also have a relationship mapping from each to the other (e.g. how to
>> convert a signed 32 bit int into a nullable signed 32 bit int).
>> 
>> - Map Types: We don't need nullable but we will need different map types:
>> inline and fieldwise.  I think these will be useful for the execution engine
>> and will be leveraged depending on the particular needs-- for example
>> fieldwise will be a natural fit where we're operating on columnar data and
>> doing an explode or other fieldwise nested operation and inline will be
>> useful when we're doing things like sorting a complex field.  Inline will
>> also be appropriate where we have extremely sparse record sets.  We'll just
>> need transformation methods between the two variations.  In the case of a
>> fieldwise map type field, the field is virtual and only exists to contain
>> its child fields.
>> 
>> - Non-static DataTypes: We have a need for types that don't fit the static data
>> type model above.  Examples include fixed width types (e.g. 10 byte
>> string), polymorphic (inline encoded) types (number or string depending on
>> record) and repeated nested versions of our other types.  These are a
>> little more gnarly as we need to support canonicalization of these.  Optiq
>> has some methods for how to handle this kind of type system so it probably
>> makes sense to leverage that system.
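
One possible shape for the static types above, sketched in Java. All names (TypeSketch, DataMode, MinorType, MajorType) are hypothetical and this is not the existing org.apache.drill.common.expression.types.DataType. The refusal to make a repeated type nullable is an assumption for illustration, and map types and non-static types are not covered.

```java
import java.util.Objects;

// Hypothetical sketch; not the existing org.apache.drill DataType class.
final class TypeSketch {
    /** Protobuf-style cardinality: non-nullable, nullable, repeated. */
    enum DataMode { REQUIRED, OPTIONAL, REPEATED }

    /** A small sample of value types; a real list would cover the standard SQL types. */
    enum MinorType { INT32, INT64, VARCHAR }

    /** A value type paired with its mode, so each type reveals which it is. */
    static final class MajorType {
        final MinorType minorType;
        final DataMode mode;

        private MajorType(MinorType minorType, DataMode mode) {
            this.minorType = Objects.requireNonNull(minorType);
            this.mode = Objects.requireNonNull(mode);
        }

        static MajorType of(MinorType minorType, DataMode mode) {
            return new MajorType(minorType, mode);
        }

        boolean isNullable() { return mode == DataMode.OPTIONAL; }

        boolean isRepeated() { return mode == DataMode.REPEATED; }

        /** Relationship mapping: the same value type in another mode. */
        MajorType withMode(DataMode newMode) {
            return mode == newMode ? this : new MajorType(minorType, newMode);
        }

        /** For example, signed 32 bit int to nullable signed 32 bit int. */
        MajorType asNullable() {
            if (isRepeated()) {
                // Assumption: repeated types have no nullable form.
                throw new UnsupportedOperationException("repeated types have no nullable form here");
            }
            return withMode(DataMode.OPTIONAL);
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof MajorType)) return false;
            MajorType that = (MajorType) other;
            return minorType == that.minorType && mode == that.mode;
        }

        @Override
        public int hashCode() { return Objects.hash(minorType, mode); }

        @Override
        public String toString() { return mode + " " + minorType; }
    }
}
```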
>> 
>> *Expression Type Materialization*
>> - LogicalExpression type materialization: Right now, LogicalExpressions
>> include support for late type binding.  As part of the record batch
>> execution path, these need to get materialized with correct casting, etc
>> based on the actual found schema.  As such, we need to have a function
>> which takes a LogicalExpression tree, applies a materialized BatchSchema
>> and returns a new LogicalExpression tree with full type settings.  As part
>> of this process, all types need to be cast as necessary and full validation
>> of the tree should be done.  Timothy has pending validation work on a
>> pull request that would be a good piece of code to leverage for that
>> need.  We also have a visitor model for the expression tree
>> that should be able to aid in the updated LogicalExpression construction.
>> - LogicalExpression to Java expression conversion: We need to be able to
>> convert our logical expressions into Java code expressions.  Initially,
>> this should be done in a simplistic way, using something like implicit
>> boxing and the like just to get something working.  This will likely be
>> specialized per major type (nullable, non-nullable and repeated) and a
>> framework might make the most sense actually just distinguishing the
>> LogicalExpression by these types.
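
A toy version of the type materialization step, to make the visitor idea concrete. The expression classes, the three-value Type enum and the "widen INT to BIGINT" cast rule are all assumptions made for this sketch. They are not Drill's LogicalExpression, BatchSchema or casting rules, and the schema is reduced to a map from field name to type.

```java
import java.util.Map;

// Hypothetical sketch; these are not Drill's LogicalExpression or BatchSchema classes.
final class MaterializeSketch {
    /** LATE marks a type that has not been bound yet. */
    enum Type { INT, BIGINT, LATE }

    interface Expr {
        Type type();
        <T> T accept(Visitor<T> visitor);
    }

    interface Visitor<T> {
        T visitField(Field field);
        T visitCast(Cast cast);
        T visitAdd(Add add);
    }

    static final class Field implements Expr {
        final String name;
        final Type type;
        Field(String name, Type type) { this.name = name; this.type = type; }
        public Type type() { return type; }
        public <T> T accept(Visitor<T> visitor) { return visitor.visitField(this); }
    }

    static final class Cast implements Expr {
        final Expr input;
        final Type to;
        Cast(Expr input, Type to) { this.input = input; this.to = to; }
        public Type type() { return to; }
        public <T> T accept(Visitor<T> visitor) { return visitor.visitCast(this); }
    }

    static final class Add implements Expr {
        final Expr left;
        final Expr right;
        Add(Expr left, Expr right) { this.left = left; this.right = right; }
        // The sum only has a concrete type once both operands agree.
        public Type type() { return left.type() == right.type() ? left.type() : Type.LATE; }
        public <T> T accept(Visitor<T> visitor) { return visitor.visitAdd(this); }
    }

    /** Rebuilds the tree against a schema (field name to type), inserting casts. */
    static final class Materializer implements Visitor<Expr> {
        private final Map<String, Type> schema;

        Materializer(Map<String, Type> schema) { this.schema = schema; }

        public Expr visitField(Field field) {
            Type bound = schema.get(field.name);
            if (bound == null || bound == Type.LATE) {
                throw new IllegalArgumentException("Unknown field: " + field.name);
            }
            return new Field(field.name, bound);
        }

        public Expr visitCast(Cast cast) {
            if (cast.to == Type.LATE) {
                throw new IllegalArgumentException("Cast target must be a concrete type");
            }
            return new Cast(cast.input.accept(this), cast.to);
        }

        public Expr visitAdd(Add add) {
            Expr left = add.left.accept(this);
            Expr right = add.right.accept(this);
            // Assumed rule: widen the INT side to BIGINT when the operands differ.
            if (left.type() != right.type()) {
                if (left.type() == Type.INT) {
                    left = new Cast(left, Type.BIGINT);
                } else {
                    right = new Cast(right, Type.BIGINT);
                }
            }
            return new Add(left, right);
        }
    }
}
```

Calling expr.accept(new Materializer(schema)) returns a new tree and leaves the late-bound one untouched, which matches the "returns a new LogicalExpression tree" requirement above.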
>> 
>> *JDBC*
>> - The Drill JDBC driver layer needs to be updated to leverage our zookeeper
>> coordination locations so that it can correctly find the cluster location.
>> - The Drill JDBC driver should also manage reconnects so that if it loses
>> connection with a particular Drillbit partner, that it will reconnect to
>> another available node in the cluster.
>> - Someone should point SQuirreL at Julian's latest work and see how things
>> go...
>> 
>> *ByteCode Engineering*
>> - We need to put together a concrete class materialization strategy.  My
>> thinking for relational operators and code generation is that in most
>> cases, we'll have an interface and a template class for a particular
>> relational operator.  We will build a template class that has all the
>> generic stuff implemented but will make calls to empty methods where it
>> expects lower level operations to occur.  This allows things like the
>> looping and certain types of null management to be fully materialized in
>> source code without having to deal with the complexities of ByteCode
>> generation.  It also eases testing complexity.  When a particular
>> implementation is required, the Drillbit will be responsible for generating
>> updated method bodies as required for the record-level expressions, marking
>> all the methods and class as final, then loading the implementation into
>> the query-level classloader.  Note that the production Drillbit will never
>> load the template class into the JVM and will simply utilize it in ByteCode
>> form.  I was hoping someone can take a look at trying to pull together a
>> cohesive approach to doing this using ASM and Janino (likely utilizing the
>> JDK commons-compiler mode).  The interface should be pretty simple: input
>> is an interface, a template class name, a set of (method_signature,
>> method_body_text) objects and a varargs of objects that are required for
>> object instantiation.  The return should be an instance of the interface.
>> The interface should check things like: each provided method_signature
>> matches an available method block, the method blocks being replaced are
>> empty, the object constructor matches the set of object arguments provided
>> by the object instantiation request, etc.
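
A rough sketch of the up-front checks only, using plain reflection rather than ASM or Janino. It does not generate or load any code, and it cannot see whether a method body is empty (that needs bytecode inspection). The names TemplateCheckSketch, validate and findConstructor are hypothetical. Two simplifications are assumed: methods are keyed by name instead of full signature, and constructor parameters are reference types.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Map;

// Hypothetical sketch of the request validation; no code generation happens here.
final class TemplateCheckSketch {
    /** Throws IllegalArgumentException if the materialization request is inconsistent. */
    static void validate(Class<?> iface, Class<?> template,
                         Map<String, String> bodyByMethodName, Object... ctorArgs) {
        if (!iface.isInterface() || !iface.isAssignableFrom(template)) {
            throw new IllegalArgumentException(template + " does not implement " + iface);
        }
        for (String name : bodyByMethodName.keySet()) {
            if (!hasReplaceableMethod(template, name)) {
                throw new IllegalArgumentException("No replaceable method named " + name);
            }
        }
        if (findConstructor(template, ctorArgs) == null) {
            throw new IllegalArgumentException("No constructor matches the supplied arguments");
        }
    }

    // A method counts as replaceable if the template declares it and it is not final.
    private static boolean hasReplaceableMethod(Class<?> template, String name) {
        for (Method method : template.getDeclaredMethods()) {
            if (method.getName().equals(name) && !Modifier.isFinal(method.getModifiers())) {
                return true;
            }
        }
        return false;
    }

    /** First declared constructor whose parameters accept the arguments, else null. */
    static Constructor<?> findConstructor(Class<?> template, Object... args) {
        for (Constructor<?> candidate : template.getDeclaredConstructors()) {
            Class<?>[] params = candidate.getParameterTypes();
            if (params.length != args.length) continue;
            boolean matches = true;
            for (int i = 0; i < params.length; i++) {
                // Simplification: primitive parameters never match a boxed argument here.
                if (args[i] != null && !params[i].isInstance(args[i])) matches = false;
            }
            if (matches) return candidate;
        }
        return null;
    }

    // Sample interface and template used only to exercise the checks.
    interface Filter { boolean accept(int value); }

    static class FilterTemplate implements Filter {
        private final String label;
        FilterTemplate(String label) { this.label = label; }
        public boolean accept(int value) { return doEval(value); }
        boolean doEval(int value) { return false; } // placeholder body to be replaced
        String label() { return label; }
    }
}
```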
>> 
>> *ByteBuf Improvements*
>> - Our BufferAllocator should support child allocators (getChild()) with
>> their own memory maximums and accounting (so we can determine the memory
>> overhead of particular queries).  We also need to be able to release entire
>> child allocations at once.
>> - We need to create a number of primitive type specific wrapping classes
>> for ByteBuf.  These additions include fixed offset indexing for operations
>> (e.g. index 1 of an int buffer should be at 4 bytes), adding support for
>> unsigned values (my preference would be to leverage the work in Guava if
>> that makes sense) and modifying the hard bounds checks to softer assert
>> checks to increase production performance.  While we could do this
>> utilizing the ByteBuf interface, from everything I've experienced and read,
>> we need to minimize issues with inlining and performance so we really need
>> to be able to modify/refer to PooledUnsafeDirectByteBuf directly for the
>> wrapping classes.  Of course, it is a final package private class.  Short
>> term that means we really need to create a number of specific buffer types
>> that wrap it and just put them in the io.netty.buffer package (or
>> alternatively create a Drill version or wrapper).
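
To illustrate the fixed offset indexing and unsigned support for one primitive type, here is a sketch of an int wrapper. It is built on java.nio.ByteBuffer as a stand-in, not on Netty's ByteBuf or PooledUnsafeDirectByteBuf, and the class name IntBufSketch is hypothetical. ByteBuffer still performs its own hard bounds checks, so this shows only the shape of the API, not the performance benefit; the unsigned handling uses plain masking rather than Guava.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; java.nio.ByteBuffer stands in for Netty's ByteBuf.
final class IntBufSketch {
    private static final int WIDTH = 4; // bytes per int

    private final ByteBuffer buffer;

    IntBufSketch(int capacityInInts) {
        this.buffer = ByteBuffer.allocateDirect(capacityInInts * WIDTH);
    }

    /** Capacity in ints, not bytes. */
    int capacity() {
        return buffer.capacity() / WIDTH;
    }

    /** Index 1 lands at byte offset 4, index 2 at byte offset 8, and so on. */
    void set(int index, int value) {
        // Soft check: only active when the JVM runs with -ea.
        assert index >= 0 && index < capacity() : "index out of range: " + index;
        buffer.putInt(index * WIDTH, value);
    }

    int get(int index) {
        assert index >= 0 && index < capacity() : "index out of range: " + index;
        return buffer.getInt(index * WIDTH);
    }

    /** Reads the stored 32 bits as an unsigned value in a long. */
    long getUnsigned(int index) {
        return get(index) & 0xFFFFFFFFL;
    }

    /** Stores the low 32 bits of an unsigned value in the range 0 to 2^32 - 1. */
    void setUnsigned(int index, long value) {
        assert value >= 0 && value <= 0xFFFFFFFFL : "not an unsigned 32 bit value: " + value;
        set(index, (int) value);
    }
}
```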
>
