This is an automated email from the ASF dual-hosted git repository.

imaxon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/asterixdb.git

commit eed4941d2d8dfb3395ae17a36c349ecac12c9cce
Author: Ian Maxon <[email protected]>
AuthorDate: Thu Apr 29 12:46:44 2021 -0700

    [ASTERIXDB-2894] Update UDF docs
    
    - user model changes: no
    - storage format changes: no
    - interface changes: no
    
    Details:
    
    - Update API examples to include type
    - Include details about typing and execution model
    
    Change-Id: Id9780d72960f9094c29f7f5766185782069fe7cf
    Reviewed-on: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/11225
    Reviewed-by: Ian Maxon <[email protected]>
    Reviewed-by: Dmitry Lychagin <[email protected]>
    Integration-Tests: Jenkins <[email protected]>
    Tested-by: Jenkins <[email protected]>
---
 .../src/main/user-defined_function/udf.md          | 63 +++++++++++++++++-----
 1 file changed, 50 insertions(+), 13 deletions(-)

diff --git a/asterixdb/asterix-doc/src/main/user-defined_function/udf.md b/asterixdb/asterix-doc/src/main/user-defined_function/udf.md
index 7ca23bb..655113b 100644
--- a/asterixdb/asterix-doc/src/main/user-defined_function/udf.md
+++ b/asterixdb/asterix-doc/src/main/user-defined_function/udf.md
@@ -19,7 +19,7 @@
 
 ## <a name="introduction">Introduction</a>
 
-Apache AsterixDB supports three languages for writing user-defined functions (UDFs): SQL++, Java and Python
+Apache AsterixDB supports three languages for writing user-defined functions (UDFs): SQL++, Java, and Python
 A user can encapsulate data processing logic into a UDF and invoke it
 later repeatedly. For SQL++ functions, a user can refer to [SQL++ Functions](sqlpp/manual.html#Functions)
 for their usages. This document will focus on UDFs in languages other than SQL++
@@ -27,8 +27,10 @@ for their usages. This document will focus on UDFs in languages other than SQL++
 
 ## <a name="authentication">Endpoints and Authentication</a>
 
-The UDF endpoint is not enabled by default until authentication has been configured properly. To enable it, we
-will need to set the path to the credential file and populate it with our username and password.
+The UDF API endpoint used to deploy functions is not enabled by default until authentication has been configured properly.
+Even if the endpoint is enabled, it is only accessible on the loopback interface on each NC to restrict access.
+
+To enable it, we need to set the path to the credential file and populate it with our username and password.
 
 The credential file is a simple `/etc/passwd` style text file with usernames and corresponding `bcrypt` hashed and salted
 passwords. You can populate this on your own if you would like, but the `asterixhelper` utility can write the entries as
@@ -50,9 +52,7 @@ Now,restart the cluster if it was already started to allow the Cluster Controlle
 ## <a name="installingUDF">Installing a Java UDF Library</a>
 
 To install a UDF package to the cluster, we need to send a Multipart Form-data HTTP request to the `/admin/udf` endpoint
-of the CC at the normal API port (`19002` by default). The request should use HTTP Basic authentication. This means your
-credentials will *not* be obfuscated or encrypted *in any way*, so submit to this endpoint over localhost or a network
-where you know your traffic is safe from eavesdropping. Any suitable tool will do, but for the example here I will use
+of the CC at the normal API port (`19004` by default). Any suitable tool will do, but for the example here I will use
 `curl` which is widely available.
 
 For example, to install a library with the following criteria:
@@ -65,7 +65,7 @@ For example, to install a library with the following criteria:
 
 we would execute
 
-    curl -v -u admin:admin -X POST -F 'data=@./lib.zip' localhost:19002/admin/udf/udfs/testlib
+    curl -v -u admin:admin -X POST -F 'data=@./lib.zip' -F 'type=java' localhost:19004/admin/udf/udfs/testlib
 
 Any response other than `200` indicates an error in deployment.
 
@@ -119,7 +119,7 @@ scikit-learn here (our method is obviously better!), but it's just included as a
 
 Then, deploy it the same as the Java UDF was, with the library name `pylib` in `udfs` dataverse
 
-    curl -v -u admin:admin -X POST -F 'data=@./lib.pyz' localhost:19002/admin/udf/udfs/pylib
+    curl -v -u admin:admin -X POST -F 'data=@./lib.pyz' -F 'type=python' localhost:19002/admin/udf/udfs/pylib
 
 With the library deployed, we can define a function within it for use. For example, to expose the Python function
 `sentiment` in the module `sentiment_mod` in the class `sent_model`, the `CREATE FUNCTION` would be as follows
@@ -131,14 +131,14 @@ With the library deployed, we can define a function within it for use. For examp
       AS "sentiment_mod", "sent_model.sentiment" AT pylib;
 
 By default, AsterixDB will treat all external functions as deterministic. It means the function must return the same
-result for the same input, irrespective of when or how many times the function is called on that input. 
-This particular function behaves the same on each input, so it satisfies the deterministic property. 
+result for the same input, irrespective of when or how many times the function is called on that input.
+This particular function behaves the same on each input, so it satisfies the deterministic property.
 This enables better optimization of queries including this function.
-If a function is not deterministic then it should be declared as such by using `WITH` sub-clause:
+If a function is not deterministic then it should be declared as such by using a `WITH` sub-clause:
 
     USE udfs;
 
-    CREATE FUNCTION sentiment(a)
+    CREATE FUNCTION sentiment(text)
       AS "sentiment_mod", "sent_model.sentiment" AT pylib
       WITH { "deterministic": false }
 
@@ -155,6 +155,43 @@ With the function now defined, it can then be used as any other scalar SQL++ fun
     SELECT t.msg as msg, sentiment(t.msg) as sentiment
     FROM Tweets t;
 
+## <a name="pytpes">Python Type Mappings</a>
+
+Currently only a subset of AsterixDB types are supported in Python UDFs. The supported types are as follows:
+
+- Integer types (int8,16,32,64)
+- Floating point types (float, double)
+- String
+- Boolean
+- Arrays, Sets (cast to lists)
+- Objects (cast to dict)
+
+Unsupported types can be cast to these in SQL++ first in order to be passed to a Python UDF
+
+## <a name="execution">Execution Model For UDFs</a>
+
+AsterixDB queries are deployed across the cluster as Hyracks jobs. A Hyracks job has a lifecycle that can be simplified
+for the purposes of UDFs to
+ - A pre-run phase which allocates resources, `open`
+ - The time during which the job has data flowing through it, `nextFrame`
+ - Cleanup and shutdown in `close`.
+
+If a SQL++ function is defined as a member of a class in the library, the class will be instantiated
+during `open`. The class will exist in memory for the lifetime of the query. Therefore if your function needs to reference
+files or other data that would be costly to load per-call, making it a member variable that is initialized in the constructor
+of the object will greatly increase the performance of the SQL++ function.
+
+For each function invoked during a query, there will be an independent instance of the function per data partition. This
+means that the function must not assume there is any global state or that it can assume things about the layout
+of the data. The execution of the function will be parallel to the same degree as the level of data parallelism in the
+cluster.
+
+After initialization, the function bound in the SQL++ function definition is called once per tuple during the query
+execution (i.e. `nextFrame`). Unless the function specifies `null-call` in the `WITH` clause, `NULL` values will be
+skipped.
+
+At the close of the query, the function is torn down and not re-used in any way. All functions should assume that
+nothing will persist in-memory outside of the lifetime of a query, and any behavior contrary to this is undefined.
 
 ## <a id="UDFOnFeeds">Attaching a UDF on Data Feeds</a>
 
@@ -239,7 +276,7 @@ If you want to uninstall the UDF library, simply issue a `DELETE` against the en
 functions declared with the library are removed. First we'll drop the function we declared earlier:
 
     USE udfs;
-    DROP FUNCTION mysum@2;
+    DROP FUNCTION mysum(a,b);
 
 Then issue the proper `DELETE` request
 
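Illustration only, not part of the commit: the execution model added by this patch (class instantiated during `open`, bound method called once per tuple during `nextFrame`) might look like the following on the Python side. This is a minimal sketch under stated assumptions. The names `sent_model` and `sentiment` come from the `CREATE FUNCTION` example in the patch; the word lists, the scoring rule, and the single positional `text` parameter are hypothetical stand-ins, and the way AsterixDB actually passes arguments to the method may differ.

```python
# Hypothetical contents of sentiment_mod.py. Only the class and method names
# are taken from the patch; the lexicon and scoring are assumed for the sketch.


class sent_model:
    def __init__(self):
        # Per the patch text, the class is instantiated during `open`, once
        # per data partition, so costly setup belongs here. A small in-memory
        # lexicon stands in for whatever file or model a real library loads.
        self.positive = {"good", "great", "love"}
        self.negative = {"bad", "awful", "hate"}

    def sentiment(self, text):
        # Per the patch text, this is called once per tuple during
        # `nextFrame`. It reads only state prepared in the constructor and
        # keeps no global state between calls.
        words = text.lower().split()
        score = sum(w in self.positive for w in words) - sum(
            w in self.negative for w in words
        )
        return 1 if score > 0 else 0
```

Keeping the lexicon as a member variable means it is built once per instance rather than once per call, which is the performance point the new section makes. Because each partition gets its own instance, nothing here relies on state shared across instances.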
