Re: B[yi]teSize execwork tasks someone could potentially help out with...

Jacques Nadeau Sat, 27 Apr 2013 12:52:10 -0700

Great news!  Thanks for running that down.

J


On Sat, Apr 27, 2013 at 8:54 AM, kishore g <[email protected]> wrote:
> Good news, the author of larray got back and he will add the apache license
> to the source.
>  On Apr 26, 2013 11:13 AM, "kishore g" <[email protected]> wrote:
>
>> I have interacted with the Author, let me know if you want me to check.
>> Good thing was that he is responsive and even added a few things for me.
>>
>>
>> On Fri, Apr 26, 2013 at 10:27 AM, Timothy Chen <[email protected]> wrote:
>>
>>> Ya, just bringing that up again. Doubt it will be a blocker.
>>>
>>> Tim
>>>
>>>
>>> On Fri, Apr 26, 2013 at 10:12 AM, David Alves <[email protected]>
>>> wrote:
>>>
>>> > good point, i'll try and ask the author.
>>> > it's a pretty recent lib so that might be an oversight…
>>> >
>>> > -david
>>> >
>>> > On Apr 26, 2013, at 12:04 PM, Timothy Chen <[email protected]> wrote:
>>> >
>>> > > Jacques I think this is the one I emailed you before that has no
>>> > licensing info.
>>> > >
>>> > > Tim
>>> > >
>>> > > Sent from my iPhone
>>> > >
>>> > > On Apr 26, 2013, at 9:30 AM, David Alves <[email protected]>
>>> wrote:
>>> > >
>>> > >> i've looked through it and looks like it can leverage shared memory,
>>> > which I was looking for anyway.
>>> > >> I also like the way garbage collection works (gc in java also clears
>>> > off-heap).
>>> > >> I'll take a deeper look during the weekend.
>>> > >>
>>> > >> -david
>>> > >>
>>> > >> On Apr 26, 2013, at 11:25 AM, Jacques Nadeau <[email protected]>
>>> > wrote:
>>> > >>
>>> > >>> I've looked at that in the past and think the idea of using it here is
>>> > very
>>> > >>> good.  It seems like ByteBuf is nice as it has things like endianness
>>> > >>> capabilities, reference counting and management and Netty direct
>>> > support.
>>> > >>> On the flipside, larray is nice for its large array capabilities and
>>> > >>> better input/output interfaces.  The best approach might be to
>>> define
>>> > a new
>>> > >>> ByteBuf implementation that leverages LArray.  I'll take a look at
>>> > this in
>>> > >>> a few days if someone else doesn't want to.
>>> > >>>
>>> > >>> j
>>> > >>>
>>> > >>> On Fri, Apr 26, 2013 at 8:39 AM, kishore g <[email protected]>
>>> > wrote:
>>> > >>>
>>> > >>>> For *ByteBuf Improvements*, have you looked at LArrayJ
>>> > >>>> https://github.com/xerial/larray. It has those wrappers and I
>>> found
>>> > it
>>> > >>>> quite useful. The same person has also written java version for
>>> snappy
>>> > >>>> compression. Not sure if you guys have plan to add compression, but
>>> > one of
>>> > >>>> the nice things I could do was use the memory offsets for
>>> > source(compressed
>>> > >>>> data) and dest(uncompressed array) and do the decompression
>>> off-heap.
>>> > It
>>> > >>>> supports the need for looking up by index and has wrappers for most
>>> > of the
>>> > >>>> primitive data types.
>>> > >>>>
>>> > >>>> Are you looking at something like this?
>>> > >>>>
>>> > >>>> thanks,
>>> > >>>> Kishore G
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>> On Fri, Apr 26, 2013 at 7:53 AM, Jacques Nadeau <
>>> [email protected]>
>>> > >>>> wrote:
>>> > >>>>
>>> > >>>>> They are on the list but the list is long :)
>>> > >>>>>
>>> > >>>>> Have a good weekend.
>>> > >>>>>
>>> > >>>>> On Thu, Apr 25, 2013 at 9:51 PM, Timothy Chen <[email protected]>
>>> > wrote:
>>> > >>>>>
>>> > >>>>>> So if no one picks anything up you will be done with all the
>>> work in
>>> > >>>> the
>>> > >>>>>> next couple of days? :)
>>> > >>>>>>
>>> > >>>>>> Would like to help out but I'm traveling to la over the weekend.
>>> > >>>>>>
>>> > >>>>>> I'll sync with you Monday to see how I can help then.
>>> > >>>>>>
>>> > >>>>>> Tim
>>> > >>>>>>
>>> > >>>>>> Sent from my iPhone
>>> > >>>>>>
>>> > >>>>>> On Apr 25, 2013, at 9:06 PM, Jacques Nadeau <[email protected]>
>>> > >>>> wrote:
>>> > >>>>>>
>>> > >>>>>>> I'm working on the execwork stuff and if someone would like to
>>> help
>>> > >>>>> out,
>>> > >>>>>>> here are a couple of things that need doing.  I figured I'd drop
>>> > them
>>> > >>>>>> here
>>> > >>>>>>> and see if anyone wants to work on them in the next couple of
>>> days.
>>> > >>>> If
>>> > >>>>>> so,
>>> > >>>>>>> let me know otherwise I'll be picking them up soon.
>>> > >>>>>>>
>>> > >>>>>>> *RPC*
>>> > >>>>>>> - RPC Layer Handshakes: Currently, I haven't implemented the
>>> > >>>> handshake
>>> > >>>>>> that
>>> > >>>>>>> should happen in either the User <> Bit or the Bit <> Bit layer.
>>> >  The
>>> > >>>>>> plan
>>> > >>>>>>> was to use an additional inserted event handler that removed
>>> itself
>>> > >>>>> from
>>> > >>>>>>> the event pipeline after a successful handshake or disconnected
>>> the
>>> > >>>>>> channel
>>> > >>>>>>> on a failed handshake (with appropriate logging).  The main
>>> > >>>> validation
>>> > >>>>> at
>>> > >>>>>>> this point will be simply confirming that both endpoints are
>>> > running
>>> > >>>> on
>>> > >>>>>> the
>>> > >>>>>>> same protocol version.   The only other information that is
>>> > currently
>>> > >>>>>>> needed is that in the Bit <> Bit communication, the client
>>> > >>>> should
>>> > >>>>>>> inform the server of its DrillEndpoint so that the server can
>>> then
>>> > >>>> map
>>> > >>>>>> that
>>> > >>>>>>> for future communication in the other direction.
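
The handshake check described above could look roughly like the following. This is only a sketch under assumptions: the class names Handshake and HandshakeValidator, the string form of the endpoint, and the connection id are all hypothetical, and the code deliberately leaves out Netty, so it shows the validation logic rather than the self-removing pipeline handler.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical handshake message: protocol version plus an optional sender endpoint. */
final class Handshake {
    final int protocolVersion;
    final String endpoint; // null for User <> Bit, where no endpoint is exchanged

    Handshake(int protocolVersion, String endpoint) {
        this.protocolVersion = protocolVersion;
        this.endpoint = endpoint;
    }
}

/** Server-side check, independent of any networking library. */
final class HandshakeValidator {
    private final int serverVersion;
    // Maps a connection id to the endpoint announced by the client (Bit <> Bit only).
    private final Map<String, String> endpointsByConnection = new HashMap<>();

    HandshakeValidator(int serverVersion) {
        this.serverVersion = serverVersion;
    }

    /** Returns true if accepted; false means the caller should log and close the channel. */
    boolean accept(String connectionId, Handshake h) {
        if (h.protocolVersion != serverVersion) {
            return false;
        }
        if (h.endpoint != null) {
            endpointsByConnection.put(connectionId, h.endpoint);
        }
        return true;
    }

    /** Endpoint recorded for a connection, if the client announced one. */
    Optional<String> endpointFor(String connectionId) {
        return Optional.ofNullable(endpointsByConnection.get(connectionId));
    }
}
```

In a real pipeline the handler calling accept() would remove itself on true and disconnect the channel on false, as the message describes.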
>>> > >>>>>>>
>>> > >>>>>>> *DataTypes*
>>> > >>>>>>> - General Expansion: Currently, we have a hodgepodge of
>>> datatypes
>>> > >>>>> within
>>> > >>>>>>> the org.apache.drill.common.expression.types.DataType.  We need
>>> to
>>> > >>>>> clean
>>> > >>>>>>> this up.  There should be types that map to standard sql types.
>>>  My
>>> > >>>>>>> thinking is that we should actually have separate types for each
>>> > for
>>> > >>>>>>> nullable, non-nullable and repeated (required, optional and
>>> > repeated
>>> > >>>> in
>>> > >>>>>>> protobuf vernacular) since we'll generally operate with those
>>> > values
>>> > >>>>>>> completely differently (and that each type should reveal which
>>> it
>>> > >>>> is).
>>> > >>>>>> We
>>> > >>>>>>> should also have a relationship mapping from each to the other
>>> > (e.g.
>>> > >>>>> how
>>> > >>>>>> to
>>> > >>>>>>> convert a signed 32 bit int into a nullable signed 32 bit int).
>>> > >>>>>>>
>>> > >>>>>>> - Map Types: We don't need nullable but we will need different
>>> map
>>> > >>>>> types:
>>> > >>>>>>> inline and fieldwise.  I think these will be useful for the
>>> execution
>>> > >>>>> engine
>>> > >>>>>>> and will be leveraged depending on the particular needs-- for
>>> > example
>>> > >>>>>>> fieldwise will be a natural fit where we're operating on
>>> columnar
>>> > >>>> data
>>> > >>>>>> and
>>> > >>>>>>> doing an explode or other fieldwise nested operation and inline
>>> > will
>>> > >>>> be
>>> > >>>>>>> useful when we're doing things like sorting a complex field.
>>> >  Inline
>>> > >>>>> will
>>> > >>>>>>> also be appropriate where we have extremely sparse record sets.
>>> > >>>> We'll
>>> > >>>>>> just
>>> > >>>>>>> need transformation methods between the two variations.  In the
>>> > case
>>> > >>>>> of a
>>> > >>>>>>> fieldwise map type field, the field is virtual and only exists
>>> to
>>> > >>>>> contain
>>> > >>>>>>> its child fields.
>>> > >>>>>>>
>>> > >>>>>>> - Non-static DataTypes: We have a need for types that don't fit the
>>> > >>>> static
>>> > >>>>>> data
>>> > >>>>>>> type model above.  Examples include fixed width types (e.g. 10
>>> byte
>>> > >>>>>>> string), polymorphic (inline encoded) types (number or string
>>> > >>>> depending
>>> > >>>>>> on
>>> > >>>>>>> record) and repeated nested versions of our other types.  These
>>> > are a
>>> > >>>>>>> little more gnarly as we need to support canonicalization of
>>> these.
>>> > >>>>>> Optiq
>>> > >>>>>>> has some methods for how to handle this kind of type system so
>>> it
>>> > >>>>>> probably
>>> > >>>>>>> makes sense to leverage that system.
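
One possible encoding of the required/optional/repeated split from the General Expansion item is to pair a base type with a mode, so that every (type, mode) combination is a distinct value that can report which it is. The sketch below assumes that encoding. The names DataMode, MinorType and TypeSketch are hypothetical and are not taken from org.apache.drill.common.expression.types.DataType; the map types and non-static types are left out.

```java
/** Protobuf-style cardinality of a value. */
enum DataMode { REQUIRED, OPTIONAL, REPEATED }

/** A few illustrative SQL-like base types; a real list would be longer. */
enum MinorType { INT, BIGINT, VARCHAR }

/** A (base type, mode) pair; each combination acts as its own type. */
final class TypeSketch {
    final MinorType minor;
    final DataMode mode;

    TypeSketch(MinorType minor, DataMode mode) {
        this.minor = minor;
        this.mode = mode;
    }

    boolean isNullable() {
        return mode == DataMode.OPTIONAL;
    }

    boolean isRepeated() {
        return mode == DataMode.REPEATED;
    }

    /** Relationship mapping: same base type, different mode. */
    TypeSketch withMode(DataMode newMode) {
        return new TypeSketch(minor, newMode);
    }

    @Override
    public String toString() {
        return mode + " " + minor;
    }
}
```

For example, new TypeSketch(MinorType.INT, DataMode.REQUIRED).withMode(DataMode.OPTIONAL) gives the nullable counterpart of a signed 32 bit int.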
>>> > >>>>>>>
>>> > >>>>>>> *Expression Type Materialization*
>>> > >>>>>>> - LogicalExpression type materialization: Right now,
>>> > >>>> LogicalExpressions
>>> > >>>>>>> include support for late type binding.  As part of the record
>>> batch
>>> > >>>>>>> execution path, these need to get materialized with correct
>>> > casting,
>>> > >>>>> etc
>>> > >>>>>>> based on the actual found schema.  As such, we need to have a
>>> > >>>> function
>>> > >>>>>>> which takes a LogicalExpression tree, applies a materialized
>>> > >>>>> BatchSchema
>>> > >>>>>>> and returns a new LogicalExpression tree with full type
>>> settings.
>>> >  As
>>> > >>>>>> part
>>> > >>>>>>> of this process, all types need to be cast as necessary and full
>>> > >>>>>> validation
>>> > >>>>>>> of the tree should be done.  Timothy has pending work for
>>> > >>>> validation
>>> > >>>>>>> specifically on a pull request that would be a good piece of
>>> code
>>> > to
>>> > >>>>>>> leverage for that need.  We also have a visitor model for the
>>> > expression
>>> > >>>>> tree
>>> > >>>>>>> that should be able to aid in the updated LogicalExpression
>>> > >>>>> construction.
>>> > >>>>>>> -LogicalExpression to Java expression conversion: We need to be
>>> > able
>>> > >>>> to
>>> > >>>>>>> convert our logical expressions into Java code expressions.
>>> > >>>> Initially,
>>> > >>>>>>> this should be done in a simplistic way, using something like
>>> > >>>> implicit
>>> > >>>>>>> boxing and the like just to get something working.  This will
>>> > likely
>>> > >>>> be
>>> > >>>>>>> specialized per major type (nullable, non-nullable and repeated)
>>> > and
>>> > >>>> a
>>> > >>>>>>> framework might make the most sense, actually just distinguishing the
>>> > >>>>>>> LogicalExpression by these types.
>>> > >>>>>>>
>>> > >>>>>>> *JDBC*
>>> > >>>>>>> - The Drill JDBC driver layer needs to be updated to leverage
>>> our
>>> > >>>>>> zookeeper
>>> > >>>>>>> coordination locations so that it can correctly find the cluster
>>> > >>>>>> location.
>>> > >>>>>>> - The Drill JDBC driver should also manage reconnects so that
>>> if it
>>> > >>>>> loses
>>> > >>>>>>> connection with a particular Drillbit partner, that it will
>>> > reconnect
>>> > >>>>> to
>>> > >>>>>>> another available node in the cluster.
>>> > >>>>>>> - Someone should point SQuirreL at Julian's latest work and see
>>> how
>>> > >>>>>> things
>>> > >>>>>>> go...
>>> > >>>>>>>
>>> > >>>>>>> *ByteCode Engineering*
>>> > >>>>>>> - We need to put together a concrete class materialization
>>> > strategy.
>>> > >>>>> My
>>> > >>>>>>> thinking for relational operators and code generation is that in
>>> > most
>>> > >>>>>>> cases, we'll have an interface and a template class for a
>>> > particular
>>> > >>>>>>> relational operator.  We will build a template class that has
>>> all
>>> > the
>>> > >>>>>>> generic stuff implemented but will make calls to empty methods
>>> > where
>>> > >>>> it
>>> > >>>>>>> expects lower level operations to occur.  This allows things
>>> like
>>> > the
>>> > >>>>>>> looping and certain types of null management to be fully
>>> > materialized
>>> > >>>>> in
>>> > >>>>>>> source code without having to deal with the complexities of
>>> > ByteCode
>>> > >>>>>>> generation.  It also eases testing complexity.  When a
>>> particular
>>> > >>>>>>> implementation is required, the Drillbit will be responsible for
>>> > >>>>>> generating
>>> > >>>>>>> updated method bodies as required for the record-level
>>> expressions,
>>> > >>>>>> marking
>>> > >>>>>>> all the methods and class as final, then loading the
>>> implementation
>>> > >>>>> into
>>> > >>>>>>> the query-level classloader.  Note that the production Drillbit
>>> > will
>>> > >>>>>> never
>>> > >>>>>>> load the template class into the JVM and will simply utilize it
>>> in
>>> > >>>>>> ByteCode
>>> > >>>>>>> form.  I was hoping someone can take a look at trying to pull
>>> > >>>> together
>>> > >>>>> a
>>> > >>>>>>> cohesive approach to doing this using ASM and Janino (likely
>>> > >>>> utilizing
>>> > >>>>>> the
>>> > >>>>>>> JDK commons-compiler mode).  The interface should be pretty
>>> simple:
>>> > >>>>> input
>>> > >>>>>>> is an interface, a template class name, a set of
>>> (method_signature,
>>> > >>>>>>> method_body_text) objects and a varargs of objects that are
>>> > required
>>> > >>>>> for
>>> > >>>>>>> object instantiation.  The return should be an instance of the
>>> > >>>>> interface.
>>> > >>>>>>> The interface should check things like method_signature
>>> provided to
>>> > >>>>>>> available method blocks, the method blocks being replaced are
>>> > empty,
>>> > >>>>> the
>>> > >>>>>>> object constructor matches the set of object argument provided
>>> by
>>> > the
>>> > >>>>>>> object instantiation request, etc.
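
The request checks listed above (signatures match available methods, constructor matches the supplied arguments) can be sketched with plain reflection. This is a simplification under stated assumptions: MaterializationChecks, RowFilter and RowFilterTemplate are hypothetical names, methods are matched by name only, constructors by argument count only, and the "method block is empty" check is omitted because it needs bytecode inspection. Reflection also loads the template class, which the proposal wants to avoid in production, so this illustrates only the shape of the validation, not the ASM/Janino approach itself.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.util.Map;

/** Hypothetical operator interface and template, used only for illustration. */
interface RowFilter {
    boolean accept(int value);
}

class RowFilterTemplate implements RowFilter {
    private final int limit;

    RowFilterTemplate(int limit) {
        this.limit = limit;
    }

    public boolean accept(int value) {
        return value < limit && doEval(value);
    }

    boolean doEval(int value) {
        return false; // placeholder body that generated code would replace
    }
}

final class MaterializationChecks {
    /** Validates a request; methodBodies maps a method name to replacement source text. */
    static void validate(Class<?> iface, Class<?> template,
                         Map<String, String> methodBodies, Object... ctorArgs) {
        if (!iface.isInterface() || !iface.isAssignableFrom(template)) {
            throw new IllegalArgumentException(
                template.getName() + " does not implement " + iface.getName());
        }
        for (String name : methodBodies.keySet()) {
            boolean found = false;
            for (Method m : template.getDeclaredMethods()) {
                if (m.getName().equals(name)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                throw new IllegalArgumentException("No template method named " + name);
            }
        }
        boolean ctorFound = false;
        for (Constructor<?> c : template.getDeclaredConstructors()) {
            if (c.getParameterCount() == ctorArgs.length) {
                ctorFound = true;
                break;
            }
        }
        if (!ctorFound) {
            throw new IllegalArgumentException(
                "No constructor taking " + ctorArgs.length + " arguments");
        }
    }
}
```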
>>> > >>>>>>>
>>> > >>>>>>> *ByteBuf Improvements*
>>> > >>>>>>> - Our BufferAllocator should support child allocators
>>> (getChild())
>>> > >>>> with
>>> > >>>>>>> their own memory maximums and accounting (so we can determine
>>> the
>>> > >>>>> memory
>>> > >>>>>>> overhead to particular queries).  We also need to be able to
>>> > release
>>> > >>>>>> entire
>>> > >>>>>>> child allocations at once.
>>> > >>>>>>> - We need to create a number of primitive type specific wrapping
>>> > >>>>> classes
>>> > >>>>>>> for ByteBuf.  These additions include fixed offset indexing for
>>> > >>>>>> operations
>>> > >>>>>>> (e.g. index 1 of an int buffer should be at 4 bytes), adding
>>> > support
>>> > >>>>> for
>>> > >>>>>>> unsigned values (my preference would be to leverage the work in
>>> > Guava
>>> > >>>>> if
>>> > >>>>>>> that makes sense) and modifying the hard bounds checks to softer
>>> > >>>> assert
>>> > >>>>>>> checks to increase production performance.  While we could do
>>> this
>>> > >>>>>>> utilizing the ByteBuf interface, from everything I've
>>> experienced
>>> > and
>>> > >>>>>> read,
>>> > >>>>>>> we need to minimize issues with inlining and performance so we
>>> > really
>>> > >>>>>> need
>>> > >>>>>>> to be able to modify/refer to PooledUnsafeDirectByteBuf directly
>>> > for
>>> > >>>>> the
>>> > >>>>>>> wrapping classes.  Of course, it is a final package private
>>> class.
>>> > >>>>> Short
>>> > >>>>>>> term that means we really need to create a number of specific
>>> > buffer
>>> > >>>>>> types
>>> > >>>>>>> that wrap it and just put them in the io.netty.buffer package
>>> (or
>>> > >>>>>>> alternatively create a Drill version or wrapper).
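
The two ByteBuf items above can be sketched without Netty. The code below is a minimal illustration that uses java.nio.ByteBuffer as a stand-in for ByteBuf; IntBufferView and AccountingAllocator are hypothetical names, and the allocator only does accounting (it does not free memory itself). Note that ByteBuffer keeps its own hard bounds checks, so the assert here only shows where a softer check would sit. The unsigned read uses Integer.toUnsignedLong (Java 8 and later); Guava's UnsignedInts.toLong would be the equivalent on older JDKs.

```java
import java.nio.ByteBuffer;

/** Fixed-width int view: element i lives at byte offset i * 4. */
final class IntBufferView {
    private static final int WIDTH = 4;
    private final ByteBuffer buf;

    IntBufferView(ByteBuffer buf) {
        this.buf = buf;
    }

    void set(int index, int value) {
        assert index >= 0 && index < capacity() : "index out of range"; // active only with -ea
        buf.putInt(index * WIDTH, value);
    }

    int get(int index) {
        assert index >= 0 && index < capacity() : "index out of range";
        return buf.getInt(index * WIDTH);
    }

    /** Reads the same 32 bits as an unsigned value, widened to long. */
    long getUnsigned(int index) {
        return Integer.toUnsignedLong(get(index));
    }

    int capacity() {
        return buf.capacity() / WIDTH;
    }
}

/** Accounting-only allocator with per-child limits. */
final class AccountingAllocator {
    private final AccountingAllocator parent;
    private final long max;
    private long used;

    AccountingAllocator(long max) {
        this(null, max);
    }

    private AccountingAllocator(AccountingAllocator parent, long max) {
        this.parent = parent;
        this.max = max;
    }

    AccountingAllocator getChild(long childMax) {
        return new AccountingAllocator(this, childMax);
    }

    ByteBuffer allocate(int bytes) {
        reserve(bytes);
        return ByteBuffer.allocateDirect(bytes);
    }

    private void reserve(long bytes) {
        if (used + bytes > max) {
            throw new IllegalStateException("allocator limit exceeded");
        }
        if (parent != null) {
            parent.reserve(bytes); // may throw before this allocator's count changes
        }
        used += bytes;
    }

    /** Releases everything this allocator reserved, in one step. */
    void releaseAll() {
        if (parent != null) {
            parent.unreserve(used);
        }
        used = 0;
    }

    private void unreserve(long bytes) {
        used -= bytes;
        if (parent != null) {
            parent.unreserve(bytes);
        }
    }

    long used() {
        return used;
    }
}
```

A per-query child would be created with root.getChild(limit), and child.used() then reports that query's memory overhead until releaseAll() returns it to the parent.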
>>> > >>
>>> >
>>> >
>>>
>>
>>
