[jira] [Commented] (CASSANDRA-7395) Support for pure user-defined functions (UDF)

Robert Stupp (JIRA) Wed, 09 Jul 2014 07:29:28 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056291#comment-14056291
 ]


Robert Stupp commented on CASSANDRA-7395:
-----------------------------------------

I like the approach of defining (and coding, as proposed in CASSANDRA-7526) 
UDFs directly in CQL, although it requires adding UDFs to the system keyspace 
and implicitly requires schema agreement, as for tables, indexes, UDTs etc. 
And if we agree that CASSANDRA-7526 is the right way to do it, then we must 
agree that Java 8 is required for C* 3.0 (except for the "pure Java" idea 
below).

Using something like {{CREATE FUNCTION sum(a bigint, b bigint) AS ( return a + 
b; )}} is much easier to understand and to maintain than {{AS 
foo.bar.Class.method}}. Bundles could be implemented like this:
{noformat}
CREATE BUNDLE Math AS (
  FUNCTION sum(a bigint, b bigint) {
    return a + b;
  }
);
{noformat}
But as an alternative to using Nashorn in the first step, it would be possible 
to use "plain" Java code via [Apache 
BCEL|https://commons.apache.org/proper/commons-bcel/], which does not have the 
Java 8 requirement. The language could be added as a parameter, for example 
{{FUNCTION sum(a bigint, b bigint) AS JAVA ...}} or {{AS JAVASCRIPT}} or Groovy 
or whatever.

The _deterministic_ option was intended for the use of UDFs in functional 
indexes: functional indexes require deterministic functions, whereas "normal" 
execution does not. So I'd like to keep this flag even in 
{{CREATE FUNCTION}} or {{CREATE BUNDLE ... FUNCTION}} syntax, but assume either 
deterministic or non-deterministic as the default.

In conclusion, a bundle in CQL syntax using BCEL could look like this:
{noformat}
CREATE OR UPDATE BUNDLE MyUDFs (
    FUNCTION double sin(input double) AS JAVA {
        return input == null ? null : Math.sin(input);
    }

    FUNCTION float sin(input float) AS JAVA {
        return input == null ? null : (float) Math.sin(input);
    }

    NON DETERMINISTIC FUNCTION double random() AS JAVA {
        return Math.random();
    }
)
{noformat}
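
To make the {{AS JAVA}} bodies above more concrete, here is a minimal sketch of the Java class such a bundle might correspond to. It assumes each {{FUNCTION}} becomes a static method and that CQL {{double}} / {{float}} map to the boxed {{Double}} / {{Float}} so that null can pass through. The class name {{MyUDFsSketch}} and this mapping are assumptions for illustration only; nothing in the ticket specifies them.

```java
// Hypothetical Java counterpart of the MyUDFs bundle (the mapping is an assumption).
final class MyUDFsSketch {
    private MyUDFsSketch() {}

    // double sin(input double): boxed type so that a CQL null can pass through.
    static Double sin(Double input) {
        return input == null ? null : Math.sin(input);
    }

    // float sin(input float): Math.sin returns double, so narrow explicitly.
    static Float sin(Float input) {
        return input == null ? null : (float) Math.sin(input);
    }

    // NON DETERMINISTIC: repeated calls may return different values.
    static Double random() {
        return Math.random();
    }

    public static void main(String[] args) {
        System.out.println(sin(0.0d));
        System.out.println(sin((Float) null));
        System.out.println(random());
    }
}
```

Note that the {{float}} overload needs the explicit cast, because {{Math.sin}} only exists for {{double}}.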

But we should keep some "backdoor" to pass the raw blob to a UDF - 
{{fooToBlob}} sounds straightforward, if it's cheap. If it's not cheap, it is 
at least possible, and if there is demand, we can add a special "raw" wildcard 
type for UDF parameters later.

UDFs could be held in a table:
{noformat}
CREATE TABLE system.user_functions (
   bundle       text,       -- bundle name
   signature    text,       -- function name + argument types; might be an MD5 hash of these
   name         text,       -- function name
   arguments    list<text>, -- list of CQL argument types
   return_type  text,       -- CQL return type
   language     text,       -- programming language
   body         text,       -- code
   PRIMARY KEY ( ( bundle ), signature )
);
{noformat}
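
As a rough sketch of how the {{signature}} column could be derived, the following assumes the plain form is the function name followed by the comma-separated CQL argument types in parentheses, and the hashed form is the hex-encoded MD5 of that string. The class and method names, and the exact string layout, are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Hypothetical helper; the signature layout is an assumption, not part of the proposal.
final class UdfSignatureSketch {
    private UdfSignatureSketch() {}

    // Plain form: function name plus argument types, e.g. "sin(double)".
    static String plainSignature(String name, List<String> argumentTypes) {
        return name + "(" + String.join(",", argumentTypes) + ")";
    }

    // Hashed form: lower-case hex MD5 of the plain signature.
    static String md5Signature(String name, List<String> argumentTypes) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                    plainSignature(name, argumentTypes).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }
}
```

Because the argument types are part of the signature, overloads such as {{sin(double)}} and {{sin(float)}} get distinct clustering keys within the same bundle.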

Altogether, this approach does not expose internals to UDFs, and using/porting 
{{DataType}} + {{TypeCodec}} + {{CassandraTypeParser}} from the Java Driver to 
parse "complex" CQL types is not a big deal - primitive types can easily be 
parsed using {{CQL3Type.Native.valueOf(parsedTypeDef.toUpperCase())}}.
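
The primitive-type lookup can be sketched as follows. The {{NativeType}} enum is a hypothetical, much shortened stand-in for {{CQL3Type.Native}}, and the {{parseNative}} helper with its "return null for anything non-primitive" behaviour is an assumption for illustration.

```java
import java.util.Locale;

// Hypothetical sketch; NativeType stands in for CQL3Type.Native.
final class NativeTypeParseSketch {
    private NativeTypeParseSketch() {}

    enum NativeType { BIGINT, DOUBLE, FLOAT, INT, TEXT }

    // Returns the matching native type, or null if the definition is not a
    // known primitive and would need the "complex" type parser instead.
    static NativeType parseNative(String parsedTypeDef) {
        try {
            return NativeType.valueOf(parsedTypeDef.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

Passing {{Locale.ROOT}} is a small design choice: the no-argument {{toUpperCase()}} depends on the default locale, and under a Turkish locale "bigint" would not upper-case to "BIGINT".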

As a "marketing bullet list":
* pure CQL functionality
* no C* internals exposed
* support for "pure Java" plus scripting languages
* support for the raw representation of types (using {{fooToBlob}})
* no periodic polling of filesystem or system tables
* UDFs distributed "transparently" using schema agreement
* no tooling necessary - cqlsh and everything that supports CQL is enough
* UDF development help could be integrated, for example, into "DevCenter", which 
would itself compile a UDF bundle and allow testing / execution of individual 
functions - since it's based on Eclipse, it might even be possible to "debug" 
UDFs in Java and Nashorn-supported scripting languages - but that's stuff for 
another ticket...
* Access rules can be enforced using Java {{SecureClassLoader}} (UDF invocation 
surrounded with {{Thread.setContextClassLoader(...)}})
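
The "invocation surrounded with a context class loader switch" from the last bullet can be sketched like this. It only shows the set/restore pattern; the class and method names are hypothetical, and the actual access rules (which would live in a {{SecureClassLoader}} subclass) are not shown.

```java
import java.util.function.Supplier;

// Hypothetical sketch of wrapping a UDF call with a dedicated class loader.
final class UdfInvocationSketch {
    private UdfInvocationSketch() {}

    // Runs udfCall with udfLoader as the thread's context class loader and
    // restores the previous loader afterwards, even if the call throws.
    static <T> T invokeWithLoader(ClassLoader udfLoader, Supplier<T> udfCall) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(udfLoader);
        try {
            return udfCall.get();
        } finally {
            current.setContextClassLoader(previous);
        }
    }
}
```

Note that {{setContextClassLoader}} is an instance method, so it is called on {{Thread.currentThread()}}; restoring the previous loader in {{finally}} keeps a failing UDF from leaking its loader into later work on the same thread.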

Drawbacks:
* no official support for using external code
* cluster schema agreement on UDFs necessary
* changes to UDF bundles force recompilation on each node - but that should 
not be a big issue since UDFs should be small and efficient - they are not 
"full-blown libraries"

I'm still not sure whether prepared statements must be invalidated if the 
bundle changes. As long as a UDF with the same signature exists, execution can 
continue - and if the bundle/function is removed, execution will fail (which is 
ok).

Yes - I really like the "pure CQL" idea - simple to understand - easy for users 
to start with - explanation would just need two bullet points on a slide. I 
think it's worth the BCEL and schema agreement effort.

> Support for pure user-defined functions (UDF)
> ---------------------------------------------
>
>                 Key: CASSANDRA-7395
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7395
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>              Labels: cql
>             Fix For: 3.0
>
>         Attachments: 7395-v2.diff, 7395.diff
>
>
> We have some tickets for various aspects of UDF (CASSANDRA-4914, 
> CASSANDRA-5970, CASSANDRA-4998) but they all suffer from various degrees of 
> ocean-boiling.
> Let's start with something simple: allowing pure user-defined functions in 
> the SELECT clause of a CQL query.  That's it.
> By "pure" I mean, must depend only on the input parameters.  No side effects. 
>  No exposure to C* internals.  Column values in, result out.  
> http://en.wikipedia.org/wiki/Pure_function



--
This message was sent by Atlassian JIRA
(v6.2#6252)
