Good point, I'll try to ask the author. It's a pretty recent lib so that might be an oversight…
-david

On Apr 26, 2013, at 12:04 PM, Timothy Chen <[email protected]> wrote:

> Jacques, I think this is the one I emailed you before that has no
> licensing info.
>
> Tim
>
> Sent from my iPhone
>
> On Apr 26, 2013, at 9:30 AM, David Alves <[email protected]> wrote:
>
>> I've looked through it, and it looks like it can leverage shared memory,
>> which I was looking for anyway.
>> I also like the way garbage collection works (GC in Java also clears the
>> off-heap memory).
>> I'll take a deeper look during the weekend.
>>
>> -david
>>
>> On Apr 26, 2013, at 11:25 AM, Jacques Nadeau <[email protected]> wrote:
>>
>>> I've looked at that in the past and think the idea of using it here is
>>> very good. ByteBuf seems nice as it has things like endianness support,
>>> reference counting and management, and direct Netty integration. On the
>>> flip side, LArray is nice for its large-array capabilities and better
>>> input/output interfaces. The best approach might be to define a new
>>> ByteBuf implementation that leverages LArray. I'll take a look at this
>>> in a few days if someone else doesn't want to.
>>>
>>> j
>>>
>>> On Fri, Apr 26, 2013 at 8:39 AM, kishore g <[email protected]> wrote:
>>>
>>>> For *ByteBuf Improvements*, have you looked at LArrayJ
>>>> (https://github.com/xerial/larray)? It has those wrappers, and I found
>>>> it quite useful. The same author has also written a Java version of
>>>> Snappy compression. Not sure if you have plans to add compression, but
>>>> one of the nice things I could do was use the memory offsets for the
>>>> source (compressed data) and destination (uncompressed array) and do
>>>> the decompression off-heap. It supports lookup by index and has
>>>> wrappers for most of the primitive data types.
>>>>
>>>> Are you looking at something like this?
>>>>
>>>> thanks,
>>>> Kishore G
>>>>
>>>> On Fri, Apr 26, 2013 at 7:53 AM, Jacques Nadeau <[email protected]> wrote:
>>>>
>>>>> They are on the list, but the list is long :)
>>>>>
>>>>> Have a good weekend.
>>>>>
>>>>> On Thu, Apr 25, 2013 at 9:51 PM, Timothy Chen <[email protected]> wrote:
>>>>>
>>>>>> So if no one picks anything up, you will be done with all the work
>>>>>> in the next couple of days? :)
>>>>>>
>>>>>> I'd like to help out, but I'm traveling to LA over the weekend.
>>>>>>
>>>>>> I'll sync with you Monday to see how I can help then.
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Apr 25, 2013, at 9:06 PM, Jacques Nadeau <[email protected]> wrote:
>>>>>>
>>>>>>> I'm working on the execwork stuff, and if someone would like to help
>>>>>>> out, here are a couple of things that need doing. I figured I'd drop
>>>>>>> them here and see if anyone wants to work on them in the next couple
>>>>>>> of days. If so, let me know; otherwise I'll be picking them up soon.
>>>>>>>
>>>>>>> *RPC*
>>>>>>> - RPC Layer Handshakes: Currently, I haven't implemented the
>>>>>>> handshake that should happen in either the User <> Bit or the
>>>>>>> Bit <> Bit layer. The plan is to use an additional inserted event
>>>>>>> handler that removes itself from the event pipeline after a
>>>>>>> successful handshake, or disconnects the channel on a failed
>>>>>>> handshake (with appropriate logging). The main validation at this
>>>>>>> point will simply be confirming that both endpoints are running the
>>>>>>> same protocol version. The only other information currently needed
>>>>>>> is that, in Bit <> Bit communication, the client should inform the
>>>>>>> server of its DrillEndpoint so that the server can map that for
>>>>>>> future communication in the other direction.
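
A minimal sketch of such a self-removing handshake handler, assuming Netty
4's final API; the Handshake type, version constant and logging are
illustrative, not anything specified in the thread:

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    /** Hypothetical sketch: first handler in the pipeline. It validates the
     *  peer's protocol version, then removes itself so that steady-state
     *  traffic no longer pays for the check. */
    public class HandshakeHandler extends ChannelInboundHandlerAdapter {
      private static final Logger logger =
          LoggerFactory.getLogger(HandshakeHandler.class);
      private static final int PROTOCOL_VERSION = 1;  // illustrative constant

      /** Stand-in for the real handshake message (presumably protobuf). */
      public static final class Handshake {
        public final int version;
        public Handshake(int version) { this.version = version; }
      }

      @Override
      public void channelRead(ChannelHandlerContext ctx, Object msg) {
        if (!(msg instanceof Handshake)) {
          logger.error("Received {} before handshake; disconnecting.",
              msg.getClass());
          ctx.close();  // failed handshake: disconnect with logging
          return;
        }
        int version = ((Handshake) msg).version;
        if (version != PROTOCOL_VERSION) {
          logger.error("Protocol version mismatch: expected {}, got {}.",
              PROTOCOL_VERSION, version);
          ctx.close();  // failed handshake: disconnect with logging
          return;
        }
        ctx.pipeline().remove(this);  // success: drop out of the pipeline
      }
    }
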
>>>>>>> *DataTypes*
>>>>>>> - General Expansion: Currently, we have a hodgepodge of datatypes
>>>>>>> within org.apache.drill.common.expression.types.DataType. We need to
>>>>>>> clean this up. There should be types that map to the standard SQL
>>>>>>> types. My thinking is that we should actually have separate types for
>>>>>>> nullable, non-nullable and repeated values (optional, required and
>>>>>>> repeated in protobuf vernacular), since we'll generally operate on
>>>>>>> those values completely differently (and each type should reveal
>>>>>>> which it is). We should also have a relationship mapping from each to
>>>>>>> the others (e.g. how to convert a signed 32-bit int into a nullable
>>>>>>> signed 32-bit int); see the sketch below.
>>>>>>>
>>>>>>> - Map Types: We don't need nullable here, but we will need two
>>>>>>> different map types: inline and fieldwise. I think these will be
>>>>>>> useful for the execution engine and will be leveraged depending on
>>>>>>> the particular needs: for example, fieldwise will be a natural fit
>>>>>>> where we're operating on columnar data and doing an explode or other
>>>>>>> fieldwise nested operation, while inline will be useful when we're
>>>>>>> doing things like sorting a complex field. Inline will also be
>>>>>>> appropriate where we have extremely sparse record sets. We'll just
>>>>>>> need transformation methods between the two variations. In the case
>>>>>>> of a fieldwise map type field, the field is virtual and only exists
>>>>>>> to contain its child fields.
>>>>>>>
>>>>>>> - Non-static DataTypes: We need types that don't fit the static data
>>>>>>> type model above. Examples include fixed-width types (e.g. a 10-byte
>>>>>>> string), polymorphic (inline-encoded) types (number or string
>>>>>>> depending on the record) and repeated nested versions of our other
>>>>>>> types. These are a little more gnarly, as we need to support
>>>>>>> canonicalization of them. Optiq has some methods for handling this
>>>>>>> kind of type system, so it probably makes sense to leverage that.
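
A minimal sketch of the mode-aware type idea above; every name here
(MajorType, MinorType, DataMode) is illustrative, not something the thread
fixes:

    /** Hypothetical sketch: each type carries its protobuf-style
     *  cardinality and can produce its nullable counterpart. */
    public final class MajorType {

      /** Optional = nullable, required = non-nullable, in protobuf terms. */
      public enum DataMode { OPTIONAL, REQUIRED, REPEATED }

      /** Abbreviated set of SQL-ish base types. */
      public enum MinorType { INT, BIGINT, VARCHAR, FLOAT8 }

      private final MinorType minorType;
      private final DataMode mode;

      public MajorType(MinorType minorType, DataMode mode) {
        this.minorType = minorType;
        this.mode = mode;
      }

      public MinorType getMinorType() { return minorType; }
      public DataMode getMode() { return mode; }

      /** Relationship mapping, e.g. required signed 32-bit int ->
       *  nullable signed 32-bit int. */
      public MajorType asNullable() {
        return mode == DataMode.OPTIONAL
            ? this
            : new MajorType(minorType, DataMode.OPTIONAL);
      }
    }
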
>>>>>>> *Expression Type Materialization*
>>>>>>> - LogicalExpression type materialization: Right now,
>>>>>>> LogicalExpressions include support for late type binding. As part of
>>>>>>> the record batch execution path, these need to be materialized with
>>>>>>> correct casting, etc., based on the actual schema found. As such, we
>>>>>>> need a function which takes a LogicalExpression tree, applies a
>>>>>>> materialized BatchSchema and returns a new LogicalExpression tree
>>>>>>> with full type settings (see the sketch below). As part of this
>>>>>>> process, all types need to be cast as necessary, and full validation
>>>>>>> of the tree should be done. Timothy has pending validation work on a
>>>>>>> pull request that would be a good piece of code to leverage for this.
>>>>>>> We also have a visitor model for the expression tree that should be
>>>>>>> able to aid in constructing the updated LogicalExpression.
>>>>>>>
>>>>>>> - LogicalExpression to Java expression conversion: We need to be able
>>>>>>> to convert our logical expressions into Java code expressions.
>>>>>>> Initially, this should be done in a simplistic way, using things like
>>>>>>> implicit boxing just to get something working. This will likely be
>>>>>>> specialized per major type (nullable, non-nullable and repeated), and
>>>>>>> a framework that distinguishes LogicalExpressions by these types
>>>>>>> might actually make the most sense.
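
A rough sketch of the materialization function described above. Only
LogicalExpression and BatchSchema are named in the thread; the visitor
interface, node types and copy helpers here are all assumed:

    import java.util.ArrayList;
    import java.util.List;

    /** Hypothetical sketch: rewrite a late-bound expression tree against a
     *  concrete schema, resolving each node's type and inserting casts. */
    public final class ExpressionMaterializer {

      public static LogicalExpression materialize(LogicalExpression expr,
                                                  BatchSchema schema) {
        return expr.accept(new Materializer(schema));  // assumed visitor hook
      }

      private static final class Materializer
          implements ExprVisitor<LogicalExpression> {  // assumed interface
        private final BatchSchema schema;

        Materializer(BatchSchema schema) { this.schema = schema; }

        @Override
        public LogicalExpression visitFieldReference(FieldReference ref) {
          // Late binding is resolved here: look up the field's actual type.
          MajorType resolved = schema.getField(ref.getPath()).getType();
          return ref.withType(resolved);  // assumed copy-with-type helper
        }

        @Override
        public LogicalExpression visitFunctionCall(FunctionCall call) {
          // Materialize children first, then cast any argument whose
          // resolved type does not match the function's expected type.
          List<LogicalExpression> args = new ArrayList<>();
          for (int i = 0; i < call.argCount(); i++) {    // assumed accessors
            LogicalExpression typed = call.arg(i).accept(this);
            args.add(typed.getType().equals(call.expectedArgType(i))
                ? typed
                : new CastExpression(typed, call.expectedArgType(i)));
          }
          return call.withArgs(args);  // assumed copy-with-children helper
        }
      }
    }
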
>>>>>>> *JDBC*
>>>>>>> - The Drill JDBC driver layer needs to be updated to leverage our
>>>>>>> ZooKeeper coordination locations so that it can correctly find the
>>>>>>> cluster.
>>>>>>> - The Drill JDBC driver should also manage reconnects, so that if it
>>>>>>> loses its connection to a particular Drillbit partner, it will
>>>>>>> reconnect to another available node in the cluster.
>>>>>>> - Someone should point SQuirreL at Julian's latest work and see how
>>>>>>> things go...
>>>>>>>
>>>>>>> *ByteCode Engineering*
>>>>>>> - We need to put together a concrete class materialization strategy.
>>>>>>> My thinking for relational operators and code generation is that in
>>>>>>> most cases we'll have an interface and a template class for a
>>>>>>> particular relational operator. We will build a template class that
>>>>>>> has all the generic stuff implemented but makes calls to empty
>>>>>>> methods where it expects lower-level operations to occur. This allows
>>>>>>> things like looping and certain types of null management to be fully
>>>>>>> materialized in source code, without having to deal with the
>>>>>>> complexities of bytecode generation. It also eases testing
>>>>>>> complexity. When a particular implementation is required, the
>>>>>>> Drillbit will be responsible for generating updated method bodies as
>>>>>>> required for the record-level expressions, marking all the methods
>>>>>>> and the class as final, then loading the implementation into the
>>>>>>> query-level classloader. Note that the production Drillbit will never
>>>>>>> load the template class into the JVM and will simply utilize it in
>>>>>>> bytecode form. I was hoping someone could take a look at pulling
>>>>>>> together a cohesive approach to doing this using ASM and Janino
>>>>>>> (likely utilizing the JDK commons-compiler mode). The interface
>>>>>>> should be pretty simple (a sketch follows at the end of this
>>>>>>> message): input is an interface, a template class name, a set of
>>>>>>> (method_signature, method_body_text) objects and a varargs of objects
>>>>>>> that are required for object instantiation. The return should be an
>>>>>>> instance of the interface. The implementation should check things
>>>>>>> like: each provided method_signature matches an available method
>>>>>>> block; the method blocks being replaced are empty; the object
>>>>>>> constructor matches the set of object arguments provided by the
>>>>>>> instantiation request; etc.
>>>>>>>
>>>>>>> *ByteBuf Improvements*
>>>>>>> - Our BufferAllocator should support child allocators (getChild())
>>>>>>> with their own memory maximums and accounting (so we can determine
>>>>>>> the memory overhead of particular queries). We also need to be able
>>>>>>> to release entire child allocations at once.
>>>>>>> - We need to create a number of primitive-type-specific wrapping
>>>>>>> classes for ByteBuf (also sketched below). These additions include
>>>>>>> fixed-offset indexing for operations (e.g. index 1 of an int buffer
>>>>>>> should be at byte 4), support for unsigned values (my preference
>>>>>>> would be to leverage the work in Guava if that makes sense) and
>>>>>>> softening the hard bounds checks into assert-based checks to increase
>>>>>>> production performance. While we could do this using the ByteBuf
>>>>>>> interface, from everything I've experienced and read, we need to
>>>>>>> minimize inlining and performance issues, so we really need to be
>>>>>>> able to modify/refer to PooledUnsafeDirectByteBuf directly for the
>>>>>>> wrapping classes. Of course, it is a final, package-private class.
>>>>>>> Short term, that means we really need to create a number of specific
>>>>>>> buffer types that wrap it and just put them in the io.netty.buffer
>>>>>>> package (or alternatively create a Drill version or wrapper).
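
For the *ByteCode Engineering* item above, a possible shape for the
generation entry point; all names are illustrative, and the validation
rules are the ones listed in the message:

    import java.util.Set;

    /** Hypothetical sketch of the class materialization interface. */
    public interface ClassGenerator {

      /** One (method_signature, method_body_text) pair targeting an empty
       *  method block in the template class. */
      final class MethodReplacement {
        public final String signature;  // e.g. "void doEval(int in, int out)"
        public final String bodyText;   // Java source, compiled via Janino
        public MethodReplacement(String signature, String bodyText) {
          this.signature = signature;
          this.bodyText = bodyText;
        }
      }

      /**
       * Merge the generated method bodies into the template's bytecode
       * (e.g. with ASM), mark the class and its methods final, load the
       * result into the query-level classloader, and return an instance.
       * Implementations should validate that each signature names an empty
       * method block in the template and that constructorArgs match a
       * template constructor.
       */
      <T> T implement(Class<T> iface,
                      String templateClassName,
                      Set<MethodReplacement> replacements,
                      Object... constructorArgs);
    }
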

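And for the *ByteBuf Improvements* item, a minimal sketch of a fixed-offset
int wrapper with assert-based bounds checks. The class and method names are
illustrative; as the message notes, a real version would wrap
PooledUnsafeDirectByteBuf directly and therefore live in io.netty.buffer:

    package io.netty.buffer;  // placed here to reach package-private types

    /** Hypothetical sketch: an int-indexed view over a ByteBuf. Index 1
     *  maps to byte offset 4; bounds are checked with asserts rather than
     *  hard checks. */
    public final class IntBuf {
      private final ByteBuf delegate;

      public IntBuf(ByteBuf delegate) { this.delegate = delegate; }

      public int get(int index) {
        assert index >= 0 && (index + 1) * 4 <= delegate.capacity()
            : "index out of bounds: " + index;
        return delegate.getInt(index * 4);  // 4 bytes per int element
      }

      public void set(int index, int value) {
        assert index >= 0 && (index + 1) * 4 <= delegate.capacity()
            : "index out of bounds: " + index;
        delegate.setInt(index * 4, value);
      }
    }

Since asserts are no-ops unless the JVM runs with -ea, the bounds checks
stay active in tests but disappear in production, which matches the goal
stated in the message.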