Re: Dynamic UDFs support

Arina Yelchiyeva Tue, 26 Jul 2016 10:12:20 -0700

Sure, I'll add this option. I'll send a link to the final document once it's
done.


On Tue, Jul 26, 2016 at 8:06 PM Keys Botzum <kbot...@maprtech.com> wrote:

> +1
>
> Keys
> _______________________________
> Keys Botzum
> Senior Principal Technologist
> kbot...@maprtech.com <mailto:kbot...@maprtech.com>
> 443-718-0098
> MapR Technologies
> http://www.mapr.com <http://www.mapr.com/>
> > On Jul 26, 2016, at 1:05 PM, yuliya Feldman <yufeld...@yahoo.com.INVALID>
> wrote:
> >
> > I want to make sure (and will also make a note in the design doc) that we
> > have an option to disable dynamic loading/unloading of UDFs until we are
> > able to do proper authentication AND authorization of the user(s).
> >
> >      From: Arina Yelchiyeva <arina.yelchiy...@gmail.com <mailto:
> arina.yelchiy...@gmail.com>>
> > To: dev@drill.apache.org <mailto:dev@drill.apache.org>
> > Sent: Monday, July 25, 2016 9:09 AM
> > Subject: Re: Dynamic UDFs support
> >
> > My fault, agree, DROP is more appropriate.
> > Thanks Julian!
> >
> > On Mon, Jul 25, 2016 at 7:07 PM Julian Hyde <jhyde.apa...@gmail.com
> <mailto:jhyde.apa...@gmail.com>> wrote:
> >
> >> But don't call it DELETE. In SQL the opposite of CREATE is DROP.
> >>
> >> Julian
> >>
> >>> On Jul 25, 2016, at 8:48 AM, Keys Botzum <kbot...@maprtech.com
> <mailto:kbot...@maprtech.com>> wrote:
> >>>
> >>> I like the approach to handling DELETE. This is very useful. I think an
> >> implementation that does not guarantee consistent behavior is perfectly
> >> fine for use that is targeted at developers that are working on UDFs. As
> >> long as the docs make the intent clear this makes me very happy.
> >>>
> >>> I'll defer to others more expert than I on the remainder of the design.
> >>>
> >>> Keys
> >>> _______________________________
> >>> Keys Botzum
> >>> Senior Principal Technologist
> >>> kbot...@maprtech.com
> >>> 443-718-0098
> >>> MapR Technologies
> >>> http://www.mapr.com
> >>>> On Jul 25, 2016, at 9:55 AM, Arina Yelchiyeva <
> >> arina.yelchiy...@gmail.com <mailto:arina.yelchiy...@gmail.com>> wrote:
> >>>>
> >>>> Taking into account all previous comments and the discussion we had with
> >>>> Parth and Paul, please find my design notes below (I am going to prepare
> >>>> a proper design document; I just want to see whether everyone agrees with
> >>>> the raw version).
> >>>> I propose we use lazy-init for dynamically loaded UDFs. In that case,
> >>>> when a user issues the CREATE UDF command, the foreman will only validate
> >>>> the jar and update the ZK function registry; a function will be loaded to
> >>>> the appropriate drillbit only when it is needed (during the planning
> >>>> stage or fragment execution). We might add listeners (as Paul proposed)
> >>>> to pre-load UDFs. I didn't include them in the current release, to keep
> >>>> the solution simple, but we might reconsider this.
> >>>> I have looked at the issue with class loading and unloading. If we ship
> >>>> each jar with its own classloader, DELETE functionality can be introduced
> >>>> in the current release, at least marked as experimental or for developer
> >>>> use only, to ease the UDF development process.
> >>>>
> >>>> Any comments are welcome.
> >>>>
> >>>> *Invariants*
> >>>>
> >>>> 1. DFS staging area where user copies jar to be loaded
> >>>>
> >>>> 2. DFS udf area (former registration area) where all validated jars
> >>>> are present
> >>>>
> >>>> 3. ZK function registry - contains the list of all dynamically loaded
> >>>> UDFs and their jars. A UDF name will be represented as a combination of
> >>>> its name and input parameters.
> >>>>
> >>>> 4. Lazy-init - all dynamically loaded UDFs will be loaded to a drillbit
> >>>> upon request, i.e. if the drillbit receives a query or fragment that
> >>>> contains such a UDF
> >>>>
> >>>> 5. Currently only CREATE and DELETE statements are supported
> >>>>
> >>>>
> >>>> *Adding UDFs*
> >>>>
> >>>> 1. User copies source and binary (hereinafter jar) to DFS staging area
> >>>> 2. User issues CREATE UDF command
> >>>> 3. Foreman receives request to create UDF:
> >>>> a) checks if jar is present in staging area
> >>>> b) copies jar to temporary DFS location
> >>>> c) validates UDFs present in jar locally:
> >>>> 1) copies jar to temporary local fs
> >>>> 2) scans jar using temporary classloader
> >>>> 3) checks if there are any duplicates in local function registry
> >>>> 4) returns list of UDFs to be registered
> >>>> d) validates UDFs present in jar in ZK:
> >>>> 1) takes list of dynamically loaded UDFs from ZK
> >>>> 2) checks if there are no duplicates either by jar name or among UDFs
> >>>> 3) moves jar from DFS temporary area to DFS udf area
> >>>> 4) updates ZK with list of new dynamic UDFs
> >>>> 5) removes jar from staging area
> >>>> 6) returns confirmation to user that UDFs were registered
> >>>>
> >>>>
> >>>> *Lazy-init*
> >>>>
> >>>> 1. User issues query with dynamically loaded UDF.
> >>>>
> >>>> 2. During the planning stage or fragment execution, if the UDF is not
> >>>> present in the local function registry, the drillbit:
> >>>>
> >>>> a) checks if such a UDF is present in the ZK function registry
> >>>>
> >>>> b) if present, loads the UDF using the jar name, otherwise returns an error
> >>>>
> >>>> c) proceeds with the planning stage or fragment execution
> >>>>
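
The lazy-init lookup above can be sketched as follows. This is a minimal model, assuming both registries are dicts that map a jar name to the set of function signatures it provides; `resolve_udf` is a hypothetical name, and copying the entry into the local dict stands in for fetching the jar and loading it on the drillbit.

```python
def resolve_udf(sig, local_registry, zk_registry):
    """Return the jar that provides sig, loading it locally on first use."""
    # Already loaded on this drillbit: nothing to do.
    for jar, sigs in local_registry.items():
        if sig in sigs:
            return jar

    # Step a: check the ZK function registry.
    for jar, sigs in zk_registry.items():
        if sig in sigs:
            # Step b: stand-in for loading the jar by name; every function
            # in the jar becomes available locally.
            local_registry[jar] = set(sigs)
            return jar

    # Step b, otherwise: the function is unknown, so report an error.
    raise LookupError(f"unknown function: {sig}")
```

Because the whole jar is loaded at once, a later query that uses another function from the same jar is served from the local registry without another ZK lookup.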
> >>>>
> >>>> *New drillbit registration / Drillbit re-start*
> >>>>
> >>>> Local udf directory is re-created to clean up previously loaded jars,
> >>>> if any
> >>>>
> >>>>
> >>>> *Delete UDF*
> >>>>
> >>>> Each jar that is going to be loaded dynamically will have its own
> >>>> classloader, which will solve the problem of loading and unloading
> >>>> classes with the same name.
> >>>>
> >>>>
> >>>> 1. User issues DELETE command (delete will operate on jar name level)
> >>>>
> >>>> 2. Foreman receives DELETE request:
> >>>>
> >>>> a) checks if such jar is present in ZK function registry
> >>>>
> >>>> b) creates ephemeral znode /udf/delete/jar_name
> >>>>
> >>>> c) removes record in ZK function registry
> >>>>
> >>>> d) removes jar from DFS udf area
> >>>>
> >>>> e) removes ephemeral znode from /udf/delete/jar_name
> >>>>
> >>>> f) returns confirmation to user that UDFs were deleted
> >>>>
> >>>> 3. Drillbits are subscribed to /udf/delete znode, when new znode with
> >> jar
> >>>> name appears, drillbit:
> >>>>
> >>>> a) removes all UDFs associated with jar name from local function
> >> registry
> >>>>
> >>>> b) removes jar from local udf directory
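
The delete flow above can be modelled in a few lines. This is a sketch under simplifying assumptions: `Drillbit`, `on_delete`, and `delete_udf` are hypothetical names, and a direct method call stands in for the watch on the /udf/delete znode. Real ZooKeeper notifications are asynchronous, so drillbits would not necessarily have reacted by the time the foreman returns.

```python
class Drillbit:
    """Holds a local function registry: jar name -> set of signatures."""

    def __init__(self):
        self.local_registry = {}

    def on_delete(self, jar):
        # Steps 3a and 3b: forget the jar's UDFs and the local jar copy.
        self.local_registry.pop(jar, None)


def delete_udf(jar, zk_registry, udf_area, drillbits):
    """Model of the foreman side of the delete flow (operates per jar)."""
    # Step 2a: the jar must be present in the ZK function registry.
    if jar not in zk_registry:
        raise LookupError(f"{jar} is not registered")

    # Step 2b: stand-in for creating the znode that subscribed drillbits see.
    for bit in drillbits:
        bit.on_delete(jar)

    # Steps 2c and 2d: remove the registry record and the jar itself.
    removed = zk_registry.pop(jar)
    udf_area.discard(jar)

    # Step 2f: report which UDFs were deleted.
    return sorted(removed)
```

A query fragment that still references one of the removed functions would fail at this point, which matches the first limitation listed in the notes.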
> >>>>
> >>>>
> >>>> *Limitations*
> >>>>
> >>>> 1. When a user runs the DELETE command, some queries that use the deleted
> >>>> UDFs may fail during fragment execution if the UDF has already been
> >>>> deleted from the local registry by that time. Ideally, before submitting
> >>>> the DELETE command, the user should make sure that no one is running
> >>>> queries using UDFs from that particular jar.
> >>>>
> >>>>
> >>>> 2. We encourage users not to delete any jars from DFS udf area
> >> manually, as
> >>>> it may lead to inconsistency between ZK function registry and DFS udf
> >> area.
> >>>>
> >>>>
> >>>> 3. The CREATE statement is not atomic between copying the validated jar
> >>>> to the DFS udf area and updating the ZK function registry with the list
> >>>> of new UDFs. If a failure occurs between these two steps, some unused
> >>>> jars may be left in the DFS udf area, but they won’t harm the current
> >>>> process. A LIST JARS command can be introduced to show used jars.
> >>>>
> >>>>
> >>>> Kind regards
> >>>> Arina
> >>>>
> >>>>> On Fri, Jul 22, 2016 at 7:15 PM Keys Botzum <kbot...@maprtech.com
> <mailto:kbot...@maprtech.com>>
> >> wrote:
> >>>>>
> >>>>> No disagreement on deferral, but I raised my initial concern precisely
> >>>>> because I'm concerned about the practicality of the "restart the
> >>>>> cluster" option. I cited my concerns about laptops and development
> >>>>> clusters. I was wondering if there might be some small things Drill
> >>>>> could do to help. If there is nothing that can be done to make this
> >>>>> easier, so be it, but I think that's going to be a big impediment.
> >>>>>
> >>>>> Keys
> >>>>> _______________________________
> >>>>> Keys Botzum
> >>>>> Senior Principal Technologist
> >>>>> kbot...@maprtech.com
> >>>>> 443-718-0098
> >>>>> MapR Technologies
> >>>>> http://www.mapr.com
> >>>>>>> On Jul 22, 2016, at 1:37 AM, Neeraja Rentachintala <
> >>>>>> nrentachint...@maprtech.com <mailto:nrentachint...@maprtech.com>>
> wrote:
> >>>>>>
> >>>>>> It seems like we are reaching a conclusion here in terms of starting
> >>>>>> with a simpler implementation, i.e. being able to deploy UDFs
> >>>>>> dynamically without Drillbit restarts, based on jars in a DFS
> >>>>>> location. Dropping functions dynamically is out of scope for version 1
> >>>>>> of this feature (we assume development of UDFs is happening on a user
> >>>>>> laptop or a dev cluster where it's OK to restart).
> >>>>>>
> >>>>>> -Neeraja
> >>>>>>
> >>>>>>> On Thu, Jul 21, 2016 at 11:56 AM, Keys Botzum <
> kbot...@maprtech.com <mailto:kbot...@maprtech.com>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Recognize the difficulty. Not suggesting this be addressed in the first
> >>>>>>> version. Just suggesting some thought about how a real user will work
> >>>>>>> around this. Maybe some doc and/or small changes can make this easier.
> >>>>>>>
> >>>>>>> Keys
> >>>>>>> _______________________________
> >>>>>>> Keys Botzum
> >>>>>>> Senior Principal Technologist
> >>>>>>> kbot...@maprtech.com <mailto:kbot...@maprtech.com>
> >>>>>>> 443-718-0098
> >>>>>>> MapR Technologies
> >>>>>>> http://www.mapr.com
> >>>>>>>> On Jul 21, 2016 1:45 PM, "Paul Rogers" <prog...@maprtech.com>
> >> wrote:
> >>>>>>>>
> >>>>>>>> Hi All,
> >>>>>>>>
> >>>>>>>> Adding a dynamic DROP would, of course, be a great addition! The
> >> reason
> >>>>>>>> for suggesting we skip that was to control project scope.
> >>>>>>>>
> >>>>>>>> Dynamic DROP requires a synchronization step. Here’s the scenario:
> >>>>>>>>
> >>>>>>>> * Foreman A starts a query using UDF U.
> >>>>>>>> * Foreman B receives a request to drop UDF U, followed by a
> request
> >> to
> >>>>>>> add
> >>>>>>>> a new version of U, U’.
> >>>>>>>>
> >>>>>>>> How do we drop a function that may be in use? There are some
> tricky
> >>>>> bits
> >>>>>>>> to work out, which seemed too overwhelming to consider all in one
> >> go.
> >>>>>>>>
> >>>>>>>> Clearly just dropping U and adding a new version of U with the
> same
> >>>>> name
> >>>>>>>> leads to issues if not synchronized. If a Drillbit D is running a
> >> query
> >>>>>>>> with U when it receives notice to drop U, should D complete the
> >> query
> >>>>> or
> >>>>>>>> fail it? If the query completes, then how does D deal with the
> >> request
> >>>>> to
> >>>>>>>> register U’, which has the same name?
> >>>>>>>>
> >>>>>>>> Do we globally synchronize function deletion? (The foreman B that
> >>>>>>> receives
> >>>>>>>> the drop request waits for all queries using U to finish.) But,
> how
> >> do
> >>>>> we
> >>>>>>>> know which queries use U?
> >>>>>>>>
> >>>>>>>> An eventually consistent approach is to track the age of the
> oldest
> >>>>>>>> running query. Suppose B drops U at time T. Any query received
> >> after T
> >>>>>>> that
> >>>>>>>> uses U will fail in planning. A new U’ can’t be registered until
> all
> >>>>>>>> queries that started before T complete.
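
The eventually consistent rule described above can be made concrete with a small model. This is a sketch only, with hypothetical names (`DropTracker` and its methods) and logical timestamps passed in by the caller; it assumes a single place that sees all query start and finish events, which is exactly the central coordination that Drill lacks.

```python
class DropTracker:
    """Tracks running queries and dropped functions by logical time."""

    def __init__(self):
        self.running = {}   # query id -> start time
        self.dropped = {}   # function name -> drop time T

    def start_query(self, qid, functions, now):
        # Any query received after T that uses a dropped function
        # fails in planning.
        for func in functions:
            if func in self.dropped:
                raise LookupError(f"{func} has been dropped")
        self.running[qid] = now

    def finish_query(self, qid):
        self.running.pop(qid, None)

    def drop(self, func, now):
        self.dropped[func] = now

    def can_reregister(self, func):
        """True once every query that started before the drop has finished."""
        drop_time = self.dropped.get(func)
        if drop_time is None:
            return False  # not dropped, so the old version is still live
        oldest = min(self.running.values(), default=None)
        return oldest is None or oldest >= drop_time

    def reregister(self, func):
        if not self.can_reregister(func):
            raise RuntimeError(f"queries older than the drop of {func} are still running")
        del self.dropped[func]
```

Only the start time of the oldest running query is compared with T, so the tracker does not need to know which queries actually use the dropped function. The cost is that an unrelated long-running query also delays registration of the new version.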
> >>>>>>>>
> >>>>>>>> The primary challenge we face in both the CREATE and DROP cases is
> >> that
> >>>>>>>> Drill is distributed with little central coordination. That’s
> great
> >> for
> >>>>>>>> scale, but makes it hard to design features that require
> >> coordination.
> >>>>>>> Some
> >>>>>>>> other tools solve this problem with a data dictionary (or
> >> “metastore").
> >>>>>>>> Alas, Drill does not have such a concept. So a seemingly simple
> >> feature
> >>>>>>>> like dynamic UDF becomes a major design challenge to get right.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> - Paul
> >>>>>>>>
> >>>>>>>>>> On Jul 21, 2016, at 7:21 AM, Neeraja Rentachintala <
> >>>>>>>>> nrentachint...@maprtech.com> wrote:
> >>>>>>>>>
> >>>>>>>>> The whole point of this feature is to avoid Drill cluster restarts,
> >>>>>>>>> as the name 'Dynamic' UDFs indicates.
> >>>>>>>>> So any design that requires restarts would, I think, defeat the
> >>>>>>>>> purpose.
> >>>>>>>>>
> >>>>>>>>> I also think this is an example of a feature where we start with a
> >>>>>>>>> simple design that serves the purpose, take feedback on how it is
> >>>>>>>>> being deployed/used in real user situations, and improve it in
> >>>>>>>>> subsequent releases.
> >>>>>>>>>
> >>>>>>>>> -thanks
> >>>>>>>>> Neeraja
> >>>>>>>>>
> >>>>>>>>>> On Thu, Jul 21, 2016 at 6:32 AM, Keys Botzum <
> >> kbot...@maprtech.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> I think there are a lot of great ideas here. My one concern is
> the
> >>>>>>> lack
> >>>>>>>> of
> >>>>>>>>>> unload and thus presumably replace functionality. I'm just
> >> thinking
> >>>>>>>> about
> >>>>>>>>>> typical actual usage.
> >>>>>>>>>>
> >>>>>>>>>> In a typical development cycle someone writes something, tries
> it,
> >>>>>>>> learns,
> >>>>>>>>>> changes it, and tries again. Assuming I understand the design
> that
> >>>>>>>> change
> >>>>>>>>>> step requires a full Drill cluster restart. That is going to be
> >> very
> >>>>>>>>>> disruptive and will make UDF work nearly impossible without a
> >>>>>>> dedicated
> >>>>>>>>>> "private" cluster for Drill. I realize that people should have
> >> access
> >>>>>>> to
> >>>>>>>>>> the data they need and Drill in a development cluster but even
> >> then
> >>>>>>>>>> restarts can be hard since development clusters are often
> shared -
> >>>>> and
> >>>>>>>>>> that's assuming such a cluster exists. I realize of course Drill
> >> can
> >>>>>>> be
> >>>>>>>> run
> >>>>>>>>>> as a standalone Drillbit but I'm not convinced that desktops
> will
> >>>>> have
> >>>>>>>>>> adequate access to the needed data.
> >>>>>>>>>>
> >>>>>>>>>> Having dealt with Java classloading over the years, I'm not
> >> claiming
> >>>>>>>> class
> >>>>>>>>>> replacement is an easy thing so I'll defer to others on the
> >> priority
> >>>>>>> of
> >>>>>>>>>> that, but I'm wondering if there isn't some way to make UDF
> >>>>>>>> experimentation
> >>>>>>>>>> a bit easier/practical.
> >>>>>>>>>>
> >>>>>>>>>> Given the above, let me toss out some possibly naive ideas that
> >> maybe
> >>>>>>>> are
> >>>>>>>>>> workable:
> >>>>>>>>>> * can I easily run a standalone Drillbit on a Hadoop cluster
> node
> >>>>> that
> >>>>>>>> is
> >>>>>>>>>> already running Drill servers? I'm sure this can be done, but is
> >> it
> >>>>>>>> easy?
> >>>>>>>>>> Could we perhaps make this clearer as an explicit kind of thing?
> >>>>>>>>>> * is there a way that when I deploy a UDF I can constrain the #
> of
> >>>>>>> bits
> >>>>>>>> it
> >>>>>>>>>> is loaded into and perhaps even specify the bits?
> >>>>>>>>>> * The obvious corollary is that I'd want my query to run on those bits,
> >>>>>>>>>> and a not-too-disruptive way to restart just those bits
> >>>>>>>>>>
> >>>>>>>>>> The above may be obvious to Drill experts. If it is then perhaps
> >> the
> >>>>>>> UDF
> >>>>>>>>>> docs could just point out how to easily develop UDFs in an
> >> iterative
> >>>>>>>>>> fashion.
> >>>>>>>>>>
> >>>>>>>>>> Keys
> >>>>>>>>>> _______________________________
> >>>>>>>>>> Keys Botzum
> >>>>>>>>>> Senior Principal Technologist
> >>>>>>>>>> kbot...@maprtech.com <mailto:kbot...@maprtech.com>
> >>>>>>>>>> 443-718-0098
> >>>>>>>>>> MapR Technologies
> >>>>>>>>>> http://www.mapr.com <http://www.mapr.com/>
>
>
