I've looked through it, and it looks like it can leverage shared memory, which I was looking for anyway. I also like the way garbage collection works (GC in Java also clears the off-heap memory). I'll take a deeper look during the weekend.
-david

On Apr 26, 2013, at 11:25 AM, Jacques Nadeau <[email protected]> wrote:

I've looked at that in the past and think the idea of using it here is very good. ByteBuf seems nice as it has things like endianness capabilities, reference counting and management, and direct Netty support. On the flip side, LArray is nice for its large-array capabilities and better input/output interfaces. The best approach might be to define a new ByteBuf implementation that leverages LArray. I'll take a look at this in a few days if someone else doesn't want to.

j

On Fri, Apr 26, 2013 at 8:39 AM, kishore g <[email protected]> wrote:

For *ByteBuf Improvements*, have you looked at LArrayJ (https://github.com/xerial/larray)? It has those wrappers, and I found it quite useful. The same author has also written a Java version of Snappy compression. I'm not sure whether you plan to add compression, but one of the nice things I could do was use the memory offsets for the source (compressed data) and destination (uncompressed array) and do the decompression off-heap. It supports lookup by index and has wrappers for most of the primitive data types.

Are you looking at something like this?

thanks,
Kishore G

On Fri, Apr 26, 2013 at 7:53 AM, Jacques Nadeau <[email protected]> wrote:

They are on the list, but the list is long :)

Have a good weekend.

On Thu, Apr 25, 2013 at 9:51 PM, Timothy Chen <[email protected]> wrote:

So if no one picks anything up, you will be done with all the work in the next couple of days? :)

I would like to help out, but I'm traveling to LA over the weekend.

I'll sync with you Monday to see how I can help then.

Tim

Sent from my iPhone

On Apr 25, 2013, at 9:06 PM, Jacques Nadeau <[email protected]> wrote:

I'm working on the execwork stuff, and if someone would like to help out, here are a couple of things that need doing. I figured I'd drop them here and see if anyone wants to work on them in the next couple of days. If so, let me know; otherwise, I'll be picking them up soon.

*RPC*
- RPC Layer Handshakes: Currently, I haven't implemented the handshake that should happen in either the User <> Bit or the Bit <> Bit layer. The plan was to use an additional inserted event handler that removes itself from the event pipeline after a successful handshake, or disconnects the channel on a failed handshake (with appropriate logging). The main validation at this point will simply be confirming that both endpoints are running the same protocol version. The only other information currently needed is that, in Bit <> Bit communication, the client should inform the server of its DrillEndpoint so that the server can map that for future communication in the other direction.
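To make the pipeline idea concrete, here is a minimal sketch of such a self-removing handler, assuming Netty 4. HandshakeMessage and the version constant are hypothetical placeholders; the real message would be one of our protobufs.

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;

    public class HandshakeHandler extends ChannelInboundHandlerAdapter {

      // Placeholder for the real (protobuf) handshake message.
      public static class HandshakeMessage {
        int protocolVersion;
      }

      private static final int PROTOCOL_VERSION = 1; // placeholder value

      @Override
      public void channelRead(ChannelHandlerContext ctx, Object msg) {
        if (msg instanceof HandshakeMessage
            && ((HandshakeMessage) msg).protocolVersion == PROTOCOL_VERSION) {
          // Successful handshake: remove this handler so subsequent
          // messages flow straight through the rest of the pipeline.
          ctx.pipeline().remove(this);
        } else {
          // Failed handshake (or unexpected first message): log and
          // disconnect the channel.
          ctx.close();
        }
      }
    }

The Bit <> Bit variant would additionally capture the client's DrillEndpoint from the handshake message before removing itself.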
*DataTypes*
- General Expansion: Currently, we have a hodgepodge of datatypes within org.apache.drill.common.expression.types.DataType. We need to clean this up. There should be types that map to standard SQL types. My thinking is that we should actually have separate types for each of nullable, non-nullable and repeated (required, optional and repeated in protobuf vernacular), since we'll generally operate on those values completely differently (and each type should reveal which it is). We should also have a relationship mapping from each to the others (e.g. how to convert a signed 32-bit int into a nullable signed 32-bit int). A sketch of this modeling appears after this list.

- Map Types: We don't need nullable, but we will need different map types: inline and fieldwise. I think these will be useful for the execution engine and will be leveraged depending on particular needs -- for example, fieldwise will be a natural fit where we're operating on columnar data and doing an explode or other fieldwise nested operation, while inline will be useful when we're doing things like sorting a complex field. Inline will also be appropriate where we have extremely sparse record sets. We'll just need transformation methods between the two variations. In the case of a fieldwise map type field, the field is virtual and only exists to contain its child fields.

- Non-static DataTypes: We need types that don't fit the static data type model above. Examples include fixed-width types (e.g. a 10-byte string), polymorphic (inline-encoded) types (number or string depending on the record) and repeated nested versions of our other types. These are a little more gnarly, as we need to support canonicalization of them. Optiq has some methods for handling this kind of type system, so it probably makes sense to leverage that.
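As promised above, a sketch of the per-cardinality type modeling (all names illustrative, nothing settled):

    // One type object per (kind, cardinality) pair, so every consumer can
    // tell whether it is dealing with required, optional or repeated values.
    enum Cardinality { REQUIRED, OPTIONAL, REPEATED }

    interface MajorType {
      Cardinality getCardinality();

      // The relationship mapping between variants, e.g. calling
      // asOptional() on a required signed 32-bit int type yields the
      // nullable signed 32-bit int type.
      MajorType asRequired();
      MajorType asOptional();
      MajorType asRepeated();
    }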
*Expression Type Materialization*
- LogicalExpression type materialization: Right now, LogicalExpressions include support for late type binding. As part of the record batch execution path, these need to get materialized with correct casting, etc., based on the actual found schema. As such, we need a function that takes a LogicalExpression tree, applies a materialized BatchSchema and returns a new LogicalExpression tree with full type settings. As part of this process, all types need to be cast as necessary, and full validation of the tree should be done. Timothy has pending validation work on a pull request that would be a good piece of code to leverage for that need. We also have a visitor model for the expression tree that should be able to aid in constructing the updated LogicalExpression. A sketch of the entry point appears after this list.
- LogicalExpression to Java expression conversion: We need to be able to convert our logical expressions into Java code expressions. Initially, this should be done in a simplistic way, using things like implicit boxing just to get something working. This will likely be specialized per major type (nullable, non-nullable and repeated), and a framework might make the most sense, actually just distinguishing the LogicalExpressions by these types.
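The entry point for the first item might be shaped roughly like this (a sketch only; LogicalExpression and BatchSchema are the existing types named above, everything else is illustrative):

    // Takes a late-bound expression tree plus the schema actually found at
    // runtime and returns a fully typed tree, inserting casts where needed
    // and validating as it goes (likely implemented with the existing
    // expression visitor model).
    interface ExpressionMaterializer {
      LogicalExpression materialize(LogicalExpression expr, BatchSchema schema);
    }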
*JDBC*
- The Drill JDBC driver layer needs to be updated to leverage our ZooKeeper coordination locations so that it can correctly find the cluster location.
- The Drill JDBC driver should also manage reconnects, so that if it loses its connection with a particular Drillbit partner, it will reconnect to another available node in the cluster.
- Someone should point SQuirreL at Julian's latest work and see how things go...

*ByteCode Engineering*
- We need to put together a concrete class materialization strategy. My thinking for relational operators and code generation is that in most cases we'll have an interface and a template class for a particular relational operator. We will build a template class that has all the generic stuff implemented but makes calls to empty methods where it expects lower-level operations to occur. This allows things like looping and certain types of null management to be fully materialized in source code without having to deal with the complexities of bytecode generation. It also eases testing complexity. When a particular implementation is required, the Drillbit will be responsible for generating updated method bodies as required for the record-level expressions, marking all the methods and the class as final, then loading the implementation into the query-level classloader. Note that the production Drillbit will never load the template class into the JVM and will simply utilize it in bytecode form. I was hoping someone could take a look at pulling together a cohesive approach to doing this using ASM and Janino (likely utilizing the JDK commons-compiler mode). The interface should be pretty simple: the input is an interface, a template class name, a set of (method_signature, method_body_text) objects and a varargs of objects that are required for object instantiation. The return should be an instance of the interface. The implementation should check things like the provided method_signatures against the available method blocks, that the method blocks being replaced are empty, that the object constructor matches the set of object arguments provided by the instantiation request, etc.
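Roughly, the contract I have in mind (a sketch; all names are illustrative):

    import java.util.Set;

    // Produces a final implementation of iface by replacing the empty
    // method bodies of the named template class and loading the result
    // into the query-level classloader.
    interface ClassMaterializer {

      // Pairs a method signature in the template with the Java source
      // text of its generated body.
      class MethodBody {
        final String signature; // must match an empty method in the template
        final String bodyText;  // record-level expression code

        MethodBody(String signature, String bodyText) {
          this.signature = signature;
          this.bodyText = bodyText;
        }
      }

      // constructorArgs must match a constructor of the template class;
      // validation failures should fail fast with a descriptive exception.
      <T> T materialize(Class<T> iface,
                        String templateClassName,
                        Set<MethodBody> methodBodies,
                        Object... constructorArgs);
    }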
*ByteBuf Improvements*
- Our BufferAllocator should support child allocators (getChild()) with their own memory maximums and accounting (so we can determine the memory overhead of particular queries). We also need to be able to release an entire child allocation at once.
- We need to create a number of primitive-type-specific wrapping classes for ByteBuf. These additions include fixed-offset indexing for operations (e.g. index 1 of an int buffer should be at 4 bytes), support for unsigned values (my preference would be to leverage the work in Guava if that makes sense) and changing the hard bounds checks to softer assert checks to increase production performance. While we could do this via the ByteBuf interface, from everything I've experienced and read, we need to minimize issues with inlining and performance, so we really need to be able to modify/refer to PooledUnsafeDirectByteBuf directly for the wrapping classes. Of course, it is a final package-private class. Short term, that means we really need to create a number of specific buffer types that wrap it and just put them in the io.netty.buffer package (or alternatively create a Drill version or wrapper).
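To illustrate both items, a sketch (the real wrappers would sit on PooledUnsafeDirectByteBuf in the io.netty.buffer package rather than on the ByteBuf interface, and all names here are illustrative):

    import io.netty.buffer.ByteBuf;

    // Child allocators with their own limits and accounting.
    interface BufferAllocator extends AutoCloseable {
      ByteBuf buffer(int size);
      BufferAllocator getChild(long maxBytes); // per-query accounting
      long getAllocatedMemory();
      void close(); // releases the entire child allocation at once
    }

    // Fixed-offset int view over a buffer: logical index i lives at byte
    // i * 4. Bounds are checked with asserts only, so production runs
    // (without -ea) skip the checks entirely.
    final class IntBuf {
      private final ByteBuf delegate;

      IntBuf(ByteBuf delegate) {
        this.delegate = delegate;
      }

      int get(int index) {
        assert index >= 0 && (index + 1) * 4 <= delegate.capacity();
        return delegate.getInt(index * 4);
      }

      void set(int index, int value) {
        assert index >= 0 && (index + 1) * 4 <= delegate.capacity();
        delegate.setInt(index * 4, value);
      }
    }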