Good news, the author of larray got back and he will add the Apache license to the source.

On Apr 26, 2013 11:13 AM, "kishore g" <[email protected]> wrote:
I have interacted with the author; let me know if you want me to check. The good thing is that he is responsive and even added a few things for me.

On Fri, Apr 26, 2013 at 10:27 AM, Timothy Chen <[email protected]> wrote:

Ya, just bringing that up again. Doubt it will be a blocker.

Tim

On Fri, Apr 26, 2013 at 10:12 AM, David Alves <[email protected]> wrote:

Good point, I'll try and ask the author. It's a pretty recent lib, so that might be an oversight…

-david

On Apr 26, 2013, at 12:04 PM, Timothy Chen <[email protected]> wrote:

Jacques, I think this is the one I emailed you before that has no licensing info.

Tim

Sent from my iPhone

On Apr 26, 2013, at 9:30 AM, David Alves <[email protected]> wrote:

I've looked through it, and it looks like it can leverage shared memory, which I was looking for anyway. I also like the way garbage collection works (GC in Java also clears off-heap). I'll take a deeper look during the weekend.

-david

On Apr 26, 2013, at 11:25 AM, Jacques Nadeau <[email protected]> wrote:

I've looked at that in the past and think the idea of using it here is very good. ByteBuf is nice, as it has things like endianness capabilities, reference counting and management, and direct Netty support. On the flip side, larray is nice for its large-array capabilities and better input/output interfaces. The best approach might be to define a new ByteBuf implementation that leverages LArray. I'll take a look at this in a few days if someone else doesn't want to.
j

On Fri, Apr 26, 2013 at 8:39 AM, kishore g <[email protected]> wrote:

For *ByteBuf Improvements*, have you looked at LArrayJ (https://github.com/xerial/larray)? It has those wrappers, and I found it quite useful. The same person has also written a Java version of Snappy compression. Not sure if you have plans to add compression, but one of the nice things I could do was use the memory offsets for the source (compressed data) and dest (uncompressed array) and do the decompression off-heap. It supports the need for looking up by index and has wrappers for most of the primitive data types.

Are you looking at something like this?

thanks,
Kishore G

On Fri, Apr 26, 2013 at 7:53 AM, Jacques Nadeau <[email protected]> wrote:

They are on the list, but the list is long :)

Have a good weekend.

On Thu, Apr 25, 2013 at 9:51 PM, Timothy Chen <[email protected]> wrote:

So if no one picks anything up, you will be done with all the work in the next couple of days? :)

I would like to help out, but I'm traveling to LA over the weekend. I'll sync with you Monday to see how I can help then.

Tim

Sent from my iPhone

On Apr 25, 2013, at 9:06 PM, Jacques Nadeau <[email protected]> wrote:

I'm working on the execwork stuff, and if someone would like to help out, here are a couple of things that need doing. I figured I'd drop them here and see if anyone wants to work on them in the next couple of days.
If so, let me know; otherwise I'll be picking them up soon.

*RPC*
- RPC Layer Handshakes: Currently, I haven't implemented the handshake that should happen in either the User <> Bit or the Bit <> Bit layer. The plan was to use an additional inserted event handler that removes itself from the event pipeline after a successful handshake, or disconnects the channel on a failed handshake (with appropriate logging). The main validation at this point will simply be confirming that both endpoints are running on the same protocol version. The only other information currently needed is that, in Bit <> Bit communication, the client should inform the server of its DrillEndpoint so that the server can map that for future communication in the other direction.

*DataTypes*
- General Expansion: Currently, we have a hodgepodge of datatypes within org.apache.drill.common.expression.types.DataType. We need to clean this up. There should be types that map to standard SQL types. My thinking is that we should actually have separate types for each of nullable, non-nullable, and repeated (required, optional, and repeated in protobuf vernacular), since we'll generally operate on those values completely differently (and each type should reveal which it is). We should also have a relationship mapping from each to the other (e.g.
how to convert a signed 32-bit int into a nullable signed 32-bit int).

- Map Types: We don't need nullable, but we will need different map types: inline and fieldwise. I think these will be useful for the execution engine and will be leveraged depending on the particular needs. For example, fieldwise will be a natural fit where we're operating on columnar data and doing an explode or other fieldwise nested operation, while inline will be useful when we're doing things like sorting a complex field. Inline will also be appropriate where we have extremely sparse record sets. We'll just need transformation methods between the two variations. In the case of a fieldwise map type field, the field is virtual and only exists to contain its child fields.

- Non-static DataTypes: We need types that don't fit the static data type model above. Examples include fixed-width types (e.g. a 10-byte string), polymorphic (inline-encoded) types (number or string depending on the record), and repeated nested versions of our other types. These are a little more gnarly, as we need to support canonicalization of them. Optiq has some methods for handling this kind of type system, so it probably makes sense to leverage that.

*Expression Type Materialization*
- LogicalExpression type materialization: Right now, LogicalExpressions include support for late type binding.
As part of the record batch execution path, these need to get materialized with correct casting, etc., based on the actual found schema. As such, we need a function that takes a LogicalExpression tree, applies a materialized BatchSchema, and returns a new LogicalExpression tree with full type settings. As part of this process, all types need to be cast as necessary, and full validation of the tree should be done. Timothy has pending validation work on a pull request that would be a good piece of code to leverage for that need. We also have a visitor model for the expression tree that should be able to aid in constructing the updated LogicalExpression.

- LogicalExpression to Java expression conversion: We need to be able to convert our logical expressions into Java code expressions. Initially, this should be done in a simplistic way, using things like implicit boxing just to get something working. This will likely be specialized per major type (nullable, non-nullable, and repeated), and a framework that just distinguishes LogicalExpressions by these types might actually make the most sense.

*JDBC*
- The Drill JDBC driver layer needs to be updated to leverage our ZooKeeper coordination locations so that it can correctly find the cluster location.
- The Drill JDBC driver should also manage reconnects, so that if it loses its connection with a particular Drillbit partner, it will reconnect to another available node in the cluster.
- Someone should point SQuirreL at Julian's latest work and see how things go...

*ByteCode Engineering*
- We need to put together a concrete class materialization strategy. My thinking for relational operators and code generation is that in most cases we'll have an interface and a template class for a particular relational operator. We will build a template class that has all the generic stuff implemented but makes calls to empty methods where it expects lower-level operations to occur. This allows things like looping and certain types of null management to be fully materialized in source code without having to deal with the complexities of bytecode generation. It also eases testing complexity. When a particular implementation is required, the Drillbit will be responsible for generating updated method bodies as required for the record-level expressions, marking all the methods and the class as final, and then loading the implementation into the query-level classloader. Note that the production Drillbit will never load the template class into the JVM and will simply utilize it in ByteCode form.
I was hoping someone could take a look at trying to pull together a cohesive approach to doing this using ASM and Janino (likely utilizing the JDK commons-compiler mode). The interface should be pretty simple: input is an interface, a template class name, a set of (method_signature, method_body_text) objects, and a varargs of objects that are required for object instantiation. The return should be an instance of the interface. The implementation should check things like the method_signature provided against the available method blocks, that the method blocks being replaced are empty, that the object constructor matches the set of object arguments provided in the instantiation request, etc.

*ByteBuf Improvements*
- Our BufferAllocator should support child allocators (getChild()) with their own memory maximums and accounting (so we can determine the memory overhead of particular queries). We also need to be able to release entire child allocations at once.
- We need to create a number of primitive-type-specific wrapping classes for ByteBuf. These additions include fixed offset indexing for operations (e.g. index 1 of an int buffer should be at 4 bytes), adding support for unsigned values (my preference would be to leverage the work in Guava if that makes sense), and modifying the hard bounds checks to softer assert checks to increase production performance.
While we could do this utilizing the ByteBuf interface, from everything I've experienced and read, we need to minimize issues with inlining and performance, so we really need to be able to modify/refer to PooledUnsafeDirectByteBuf directly for the wrapping classes. Of course, it is a final package-private class. Short term, that means we really need to create a number of specific buffer types that wrap it and just put them in the io.netty.buffer package (or alternatively create a Drill version or wrapper).
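The "separate types per mode" idea from the *DataTypes* section above can be sketched in a few lines. This is a hypothetical illustration only, not Drill's actual type code; the names `MajorTypeSketch`, `Mode`, and `asMode` are invented for the example.

```java
// Hypothetical sketch: every type carries a minor type (e.g. INT32)
// plus a mode (required/optional/repeated, in protobuf vernacular),
// with a relationship mapping from each mode to the others, e.g.
// signed 32-bit int -> nullable signed 32-bit int.
class MajorTypeSketch {
  enum Mode { REQUIRED, OPTIONAL, REPEATED }

  final String minorType; // e.g. "INT32"
  final Mode mode;

  MajorTypeSketch(String minorType, Mode mode) {
    this.minorType = minorType;
    this.mode = mode;
  }

  // Relationship mapping: same minor type, different mode.
  MajorTypeSketch asMode(Mode newMode) {
    return new MajorTypeSketch(minorType, newMode);
  }

  @Override public String toString() { return minorType + ":" + mode; }
}
```

Because the mode is part of the type itself, code that handles nullable values differently from required ones can dispatch on it directly, which is the point of the proposal.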
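The *RPC* handshake described above boils down to a protocol-version check plus, for Bit <> Bit, registering the client's endpoint for return traffic. A minimal sketch of just the validation step, with an invented `Hello` message (Drill's real handshake messages and the pipeline handler are not shown):

```java
// Hypothetical handshake validation: confirm both endpoints run the
// same protocol version; in Bit <> Bit communication the client also
// sends its DrillEndpoint so the server can map it for communication
// in the other direction. On a mismatch, the handshake handler would
// disconnect the channel (with appropriate logging).
class HandshakeSketch {
  static final int PROTOCOL_VERSION = 1;

  static final class Hello {
    final int protocolVersion;
    final String endpoint; // client's DrillEndpoint ("host:port"), Bit <> Bit only

    Hello(int protocolVersion, String endpoint) {
      this.protocolVersion = protocolVersion;
      this.endpoint = endpoint;
    }
  }

  /** Returns the endpoint to register on success, or null to signal disconnect. */
  static String validate(Hello hello) {
    if (hello.protocolVersion != PROTOCOL_VERSION) {
      return null; // versions differ: drop the channel
    }
    return hello.endpoint;
  }
}
```

In the real pipeline, the handler performing this check would remove itself after a successful handshake, as the section describes.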
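The child-allocator idea under *ByteBuf Improvements* (per-child maximums, per-query accounting, release-all-at-once) could look roughly like the following. The class and method names are invented, it tracks a plain byte counter rather than real buffers, and the parent/child accounting linkage is elided for brevity.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical accounting-only sketch: getChild() creates an allocator
// with its own maximum, allocate() tracks bytes against that maximum
// (making per-query memory overhead visible), and close() releases the
// entire child allocation at once.
class AllocatorSketch {
  private final AtomicLong allocated = new AtomicLong();
  private final long max;

  AllocatorSketch(long max) { this.max = max; }

  AllocatorSketch getChild(long childMax) { return new AllocatorSketch(childMax); }

  /** Reserve bytes; returns false if this allocator's maximum would be exceeded. */
  boolean allocate(long bytes) {
    long now = allocated.addAndGet(bytes);
    if (now > max) {
      allocated.addAndGet(-bytes); // roll back the failed reservation
      return false;
    }
    return true;
  }

  long getAllocated() { return allocated.get(); }

  /** Release the whole child allocation in one shot. */
  void close() { allocated.set(0); }
}
```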
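The primitive wrapper idea (fixed offset indexing plus soft assert bounds checks) can be illustrated with `java.nio.ByteBuffer` standing in for Netty's PooledUnsafeDirectByteBuf, which, as noted above, is a final package-private class. `IntBufView` is an invented name for the sketch.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical int wrapper: logical index i maps to byte offset 4*i
// (so index 1 of an int buffer is at byte 4), and bounds are checked
// with soft asserts (compiled out unless -ea is set) rather than hard
// checks, to keep the production fast path cheap.
class IntBufView {
  private final ByteBuffer buf;
  private final int capacity; // capacity in ints, not bytes

  IntBufView(int nInts) {
    this.buf = ByteBuffer.allocateDirect(nInts * 4).order(ByteOrder.LITTLE_ENDIAN);
    this.capacity = nInts;
  }

  void set(int index, int value) {
    assert index >= 0 && index < capacity : "index out of range: " + index;
    buf.putInt(index * 4, value); // fixed-offset indexing
  }

  int get(int index) {
    assert index >= 0 && index < capacity : "index out of range: " + index;
    return buf.getInt(index * 4);
  }
}
```

The same pattern would repeat per primitive type (long at offset 8*i, short at 2*i, and so on), which is why the section talks about "a number of" specific buffer types.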
