Jacques, I think this is the one I emailed you before that has no licensing info.
Tim

Sent from my iPhone

On Apr 26, 2013, at 9:30 AM, David Alves <[email protected]> wrote:

> I've looked through it and it looks like it can leverage shared memory,
> which I was looking for anyway.
> I also like the way garbage collection works (GC in Java also clears
> off-heap).
> I'll take a deeper look during the weekend.
>
> -david
>
> On Apr 26, 2013, at 11:25 AM, Jacques Nadeau <[email protected]> wrote:
>
>> I've looked at that in the past and think the idea of using it here is
>> very good. It seems like ByteBuf is nice, as it has things like
>> endianness capabilities, reference counting and management, and direct
>> Netty support. On the flip side, LArray is nice for its large-array
>> capabilities and better input/output interfaces. The best approach
>> might be to define a new ByteBuf implementation that leverages LArray.
>> I'll take a look at this in a few days if someone else doesn't want to.
>>
>> j
>>
>> On Fri, Apr 26, 2013 at 8:39 AM, kishore g <[email protected]> wrote:
>>
>>> For *ByteBuf Improvements*, have you looked at LArray
>>> (https://github.com/xerial/larray)? It has those wrappers and I found
>>> it quite useful. The same person has also written a Java version of
>>> Snappy compression. Not sure if you guys have plans to add
>>> compression, but one of the nice things I could do was use the memory
>>> offsets for the source (compressed data) and dest (uncompressed
>>> array) and do the decompression off-heap. It supports the need for
>>> lookup by index and has wrappers for most of the primitive data
>>> types.
>>>
>>> Are you looking at something like this?
>>>
>>> thanks,
>>> Kishore G
>>>
>>> On Fri, Apr 26, 2013 at 7:53 AM, Jacques Nadeau <[email protected]> wrote:
>>>
>>>> They are on the list, but the list is long :)
>>>>
>>>> Have a good weekend.
>>>>
>>>> On Thu, Apr 25, 2013 at 9:51 PM, Timothy Chen <[email protected]> wrote:
>>>>
>>>>> So if no one picks anything up, you will be done with all the work
>>>>> in the next couple of days? :)
>>>>>
>>>>> Would like to help out, but I'm traveling to LA over the weekend.
>>>>>
>>>>> I'll sync with you Monday to see how I can help then.
>>>>>
>>>>> Tim
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Apr 25, 2013, at 9:06 PM, Jacques Nadeau <[email protected]> wrote:
>>>>>
>>>>>> I'm working on the execwork stuff, and if someone would like to
>>>>>> help out, here are a couple of things that need doing. I figured
>>>>>> I'd drop them here and see if anyone wants to work on them in the
>>>>>> next couple of days. If so, let me know; otherwise I'll be picking
>>>>>> them up soon.
>>>>>>
>>>>>> *RPC*
>>>>>> - RPC Layer Handshakes: Currently, I haven't implemented the
>>>>>> handshake that should happen in either the User <> Bit or the
>>>>>> Bit <> Bit layer. The plan was to use an additional inserted event
>>>>>> handler that removes itself from the event pipeline after a
>>>>>> successful handshake, or disconnects the channel on a failed
>>>>>> handshake (with appropriate logging). The main validation at this
>>>>>> point will simply be confirming that both endpoints are running on
>>>>>> the same protocol version. The only other information currently
>>>>>> needed is that, in Bit <> Bit communication, the client should
>>>>>> inform the server of its DrillEndpoint so that the server can map
>>>>>> that for future communication in the other direction.
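[A minimal sketch of the self-removing handshake handler described above,
using Netty 4's pipeline API. The HandshakeMessage placeholder and the
PROTOCOL_VERSION constant are illustrative assumptions, not the actual
Drill protobuf classes.]

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;

    public class HandshakeHandler extends ChannelInboundHandlerAdapter {

      // Assumed constant; the real check would compare protobuf-defined versions.
      static final int PROTOCOL_VERSION = 1;

      // Illustrative placeholder for the first message exchanged on a connection.
      public static class HandshakeMessage {
        public int protocolVersion;
      }

      @Override
      public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        if (!(msg instanceof HandshakeMessage)) {
          // Anything arriving before a successful handshake is a protocol violation.
          ctx.close();
          return;
        }
        HandshakeMessage handshake = (HandshakeMessage) msg;
        if (handshake.protocolVersion != PROTOCOL_VERSION) {
          // Failed handshake: disconnect the channel (with appropriate logging).
          ctx.close();
          return;
        }
        // For Bit <> Bit connections, the server would also record the client's
        // DrillEndpoint here so it can route traffic in the other direction.
        // Successful handshake: remove this handler so later traffic skips it.
        ctx.pipeline().remove(this);
      }
    }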
>>>>>>
>>>>>> *DataTypes*
>>>>>> - General Expansion: Currently, we have a hodgepodge of datatypes
>>>>>> within org.apache.drill.common.expression.types.DataType. We need
>>>>>> to clean this up. There should be types that map to standard SQL
>>>>>> types. My thinking is that we should actually have separate types
>>>>>> for each of nullable, non-nullable and repeated (optional, required
>>>>>> and repeated in protobuf vernacular), since we'll generally operate
>>>>>> on those values completely differently (and each type should reveal
>>>>>> which it is). We should also have a relationship mapping from each
>>>>>> to the other (e.g. how to convert a signed 32-bit int into a
>>>>>> nullable signed 32-bit int).
>>>>>>
>>>>>> - Map Types: We don't need nullable, but we will need different map
>>>>>> types: inline and fieldwise. I think these will be useful for the
>>>>>> execution engine and will be leveraged depending on the particular
>>>>>> needs--for example, fieldwise will be a natural fit where we're
>>>>>> operating on columnar data and doing an explode or other fieldwise
>>>>>> nested operation, and inline will be useful when we're doing things
>>>>>> like sorting a complex field. Inline will also be appropriate where
>>>>>> we have extremely sparse record sets. We'll just need
>>>>>> transformation methods between the two variations. In the case of
>>>>>> a fieldwise map type field, the field is virtual and only exists to
>>>>>> contain its child fields.
>>>>>>
>>>>>> - Non-static DataTypes: We need types that don't fit the static
>>>>>> data type model above. Examples include fixed-width types (e.g. a
>>>>>> 10-byte string), polymorphic (inline-encoded) types (number or
>>>>>> string depending on the record) and repeated nested versions of our
>>>>>> other types. These are a little more gnarly, as we need to support
>>>>>> canonicalization of them. Optiq has some methods for handling this
>>>>>> kind of type system, so it probably makes sense to leverage that.
>>>>>>
>>>>>> *Expression Type Materialization*
>>>>>> - LogicalExpression type materialization: Right now,
>>>>>> LogicalExpressions include support for late type binding. As part
>>>>>> of the record batch execution path, these need to get materialized
>>>>>> with correct casting, etc. based on the actual found schema. As
>>>>>> such, we need a function which takes a LogicalExpression tree,
>>>>>> applies a materialized BatchSchema and returns a new
>>>>>> LogicalExpression tree with full type settings (see the sketch
>>>>>> below). As part of this process, all types need to be cast as
>>>>>> necessary and full validation of the tree should be done. Timothy
>>>>>> has pending work on validation in a pull request that would be a
>>>>>> good piece of code to leverage for this. We also have a visitor
>>>>>> model for the expression tree that should be able to aid in
>>>>>> constructing the updated LogicalExpression.
>>>>>> - LogicalExpression to Java expression conversion: We need to be
>>>>>> able to convert our logical expressions into Java code expressions.
>>>>>> Initially, this should be done in a simplistic way, using things
>>>>>> like implicit boxing just to get something working. This will
>>>>>> likely be specialized per major type (nullable, non-nullable and
>>>>>> repeated), and a framework that distinguishes LogicalExpressions by
>>>>>> these types might actually make the most sense.
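[A minimal, self-contained sketch of the materialization pass described
above. All class names here (LogicalExpr, FieldRef, Cast, FuncCall) are
hypothetical stand-ins for Drill's actual expression classes, and plain
recursion stands in for the visitor model mentioned in the thread.]

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    abstract class LogicalExpr {
      // Resolved type name, e.g. "INT32"; null while still late-bound.
      abstract String type();
      // Returns a new, fully-typed tree; collects problems into errors.
      abstract LogicalExpr materialize(Map<String, String> schema, List<String> errors);
    }

    final class FieldRef extends LogicalExpr {
      final String path; final String boundType;
      FieldRef(String path, String boundType) { this.path = path; this.boundType = boundType; }
      String type() { return boundType; }
      LogicalExpr materialize(Map<String, String> schema, List<String> errors) {
        String found = schema.get(path);   // late binding resolved against the found schema
        if (found == null) errors.add("unknown field: " + path);
        return new FieldRef(path, found);
      }
    }

    final class Cast extends LogicalExpr {
      final LogicalExpr input; final String target;
      Cast(LogicalExpr input, String target) { this.input = input; this.target = target; }
      String type() { return target; }
      LogicalExpr materialize(Map<String, String> schema, List<String> errors) {
        return new Cast(input.materialize(schema, errors), target);
      }
    }

    final class FuncCall extends LogicalExpr {
      final String name; final List<String> argTypes; final String returnType;
      final List<LogicalExpr> args;
      FuncCall(String name, List<String> argTypes, String returnType, List<LogicalExpr> args) {
        this.name = name; this.argTypes = argTypes; this.returnType = returnType; this.args = args;
      }
      String type() { return returnType; }
      LogicalExpr materialize(Map<String, String> schema, List<String> errors) {
        List<LogicalExpr> newArgs = new ArrayList<LogicalExpr>();
        for (int i = 0; i < args.size(); i++) {
          LogicalExpr a = args.get(i).materialize(schema, errors);
          if (a.type() != null && !a.type().equals(argTypes.get(i))) {
            a = new Cast(a, argTypes.get(i));   // insert the implicit cast
          }
          newArgs.add(a);
        }
        return new FuncCall(name, argTypes, returnType, newArgs);
      }
    }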
>>>>>>
>>>>>> *JDBC*
>>>>>> - The Drill JDBC driver layer needs to be updated to leverage our
>>>>>> ZooKeeper coordination locations so that it can correctly find the
>>>>>> cluster.
>>>>>> - The Drill JDBC driver should also manage reconnects, so that if
>>>>>> it loses its connection to a particular Drillbit partner, it will
>>>>>> reconnect to another available node in the cluster.
>>>>>> - Someone should point SQuirreL at Julian's latest work and see how
>>>>>> things go...
>>>>>>
>>>>>> *ByteCode Engineering*
>>>>>> - We need to put together a concrete class materialization
>>>>>> strategy. My thinking for relational operators and code generation
>>>>>> is that in most cases we'll have an interface and a template class
>>>>>> for a particular relational operator. We will build a template
>>>>>> class that has all the generic stuff implemented but makes calls to
>>>>>> empty methods where it expects lower-level operations to occur.
>>>>>> This allows things like looping and certain types of null
>>>>>> management to be fully materialized in source code without having
>>>>>> to deal with the complexities of bytecode generation. It also
>>>>>> eases testing complexity. When a particular implementation is
>>>>>> required, the Drillbit will be responsible for generating updated
>>>>>> method bodies as required for the record-level expressions, marking
>>>>>> all the methods and the class as final, then loading the
>>>>>> implementation into the query-level classloader. Note that the
>>>>>> production Drillbit will never load the template class into the JVM
>>>>>> and will simply utilize it in bytecode form. I was hoping someone
>>>>>> could take a look at pulling together a cohesive approach to doing
>>>>>> this using ASM and Janino (likely utilizing the JDK
>>>>>> commons-compiler mode). The interface should be pretty simple:
>>>>>> input is an interface, a template class name, a set of
>>>>>> (method_signature, method_body_text) objects and a varargs of
>>>>>> objects that are required for object instantiation. The return
>>>>>> should be an instance of the interface. The implementation should
>>>>>> check things like: each provided method_signature matches an
>>>>>> available method block, the method blocks being replaced are empty,
>>>>>> and the object constructor matches the set of arguments provided by
>>>>>> the instantiation request.
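[A sketch of what that entry point might look like. The names
(ClassMaterializer, ClassTransformationException) are illustrative
assumptions; the actual work of rewriting the template's bytecode with ASM
and compiling the injected method bodies with Janino would sit behind this
signature.]

    import java.util.Map;

    // Illustrative exception for any failure during compilation or bytecode merging.
    class ClassTransformationException extends Exception {
      ClassTransformationException(String message) { super(message); }
    }

    interface ClassMaterializer {
      /**
       * @param iface           the interface the generated class must implement
       * @param templateClass   fully-qualified name of the template class; read
       *                        as bytecode only, never loaded into the production JVM
       * @param methodBodies    method signature -> Java source text for its new
       *                        body; each signature must match an empty method
       *                        block in the template
       * @param constructorArgs arguments for the template's constructor
       * @return an instance of the finalized class, loaded in the query-level
       *         classloader
       */
      <T> T materialize(Class<T> iface,
                        String templateClass,
                        Map<String, String> methodBodies,
                        Object... constructorArgs) throws ClassTransformationException;
    }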
>>>>>>
>>>>>> *ByteBuf Improvements*
>>>>>> - Our BufferAllocator should support child allocators (getChild())
>>>>>> with their own memory maximums and accounting (so we can determine
>>>>>> the memory overhead of particular queries). We also need to be
>>>>>> able to release entire child allocations at once.
>>>>>> - We need to create a number of primitive-type-specific wrapping
>>>>>> classes for ByteBuf. These additions include fixed-offset indexing
>>>>>> for operations (e.g. index 1 of an int buffer should be at byte 4),
>>>>>> adding support for unsigned values (my preference would be to
>>>>>> leverage the work in Guava if that makes sense) and changing the
>>>>>> hard bounds checks to softer assert checks to increase production
>>>>>> performance. While we could do this using the ByteBuf interface,
>>>>>> from everything I've experienced and read we need to minimize
>>>>>> inlining and performance issues, so we really need to be able to
>>>>>> modify/refer to PooledUnsafeDirectByteBuf directly for the wrapping
>>>>>> classes. Of course, it is a final, package-private class. Short
>>>>>> term, that means we really need to create a number of specific
>>>>>> buffer types that wrap it and just put them in the io.netty.buffer
>>>>>> package (or alternatively create a Drill version or wrapper).
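[As a rough illustration of the second bullet, a sketch of one such
primitive-specific wrapper, written here against the generic ByteBuf
interface; per the note above, the real version would live in
io.netty.buffer and target PooledUnsafeDirectByteBuf directly. The class
name and shape are assumptions.]

    import io.netty.buffer.ByteBuf;

    public final class IntBufWrapper {
      private final ByteBuf buf;
      private final int valueCount;

      public IntBufWrapper(ByteBuf buf, int valueCount) {
        this.buf = buf;
        this.valueCount = valueCount;
      }

      public int get(int index) {
        // Soft bounds check: active in tests, compiled out when assertions
        // are disabled in production.
        assert index >= 0 && index < valueCount : "index out of bounds: " + index;
        return buf.getInt(index << 2);  // fixed-offset indexing: slot i is at byte i*4
      }

      public void set(int index, int value) {
        assert index >= 0 && index < valueCount : "index out of bounds: " + index;
        buf.setInt(index << 2, value);
      }

      // Unsigned read, along the lines of Guava's UnsignedInts helpers.
      public long getUnsigned(int index) {
        return get(index) & 0xFFFFFFFFL;
      }
    }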
