This is an automated email from the ASF dual-hosted git repository.

imaxon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/asterixdb.git
commit eed4941d2d8dfb3395ae17a36c349ecac12c9cce
Author: Ian Maxon <[email protected]>
AuthorDate: Thu Apr 29 12:46:44 2021 -0700

    [ASTERIXDB-2894] Update UDF docs

    - user model changes: no
    - storage format changes: no
    - interface changes: no

    Details:
    - Update API examples to include type
    - Include details about typing and execution model

    Change-Id: Id9780d72960f9094c29f7f5766185782069fe7cf
    Reviewed-on: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/11225
    Reviewed-by: Ian Maxon <[email protected]>
    Reviewed-by: Dmitry Lychagin <[email protected]>
    Integration-Tests: Jenkins <[email protected]>
    Tested-by: Jenkins <[email protected]>
---
 .../src/main/user-defined_function/udf.md | 63 +++++++++++++++++-----
 1 file changed, 50 insertions(+), 13 deletions(-)

diff --git a/asterixdb/asterix-doc/src/main/user-defined_function/udf.md b/asterixdb/asterix-doc/src/main/user-defined_function/udf.md
index 7ca23bb..655113b 100644
--- a/asterixdb/asterix-doc/src/main/user-defined_function/udf.md
+++ b/asterixdb/asterix-doc/src/main/user-defined_function/udf.md
@@ -19,7 +19,7 @@
 ## <a name="introduction">Introduction</a>

-Apache AsterixDB supports three languages for writing user-defined functions (UDFs): SQL++, Java and Python
+Apache AsterixDB supports three languages for writing user-defined functions (UDFs): SQL++, Java, and Python
 A user can encapsulate data processing logic into a UDF and invoke it later repeatedly. For SQL++ functions, a user can refer to [SQL++ Functions](sqlpp/manual.html#Functions)
 for their usages. This document will focus on UDFs in languages other than SQL++
@@ -27,8 +27,10 @@ for their usages. This document will focus on UDFs in languages other than SQL++
 ## <a name="authentication">Endpoints and Authentication</a>

-The UDF endpoint is not enabled by default until authentication has been configured properly. To enable it, we
-will need to set the path to the credential file and populate it with our username and password.
+The UDF API endpoint used to deploy functions is not enabled by default until authentication has been configured properly.
+Even if the endpoint is enabled, it is only accessible on the loopback interface on each NC to restrict access.
+
+To enable it, we need to set the path to the credential file and populate it with our username and password.

 The credential file is a simple `/etc/passwd` style text file with usernames and corresponding `bcrypt` hashed and salted passwords. You can populate this on your own if you would like, but the `asterixhelper` utility can write the entries as
@@ -50,9 +52,7 @@ Now,restart the cluster if it was already started to allow the Cluster Controlle
 ## <a name="installingUDF">Installing a Java UDF Library</a>

 To install a UDF package to the cluster, we need to send a Multipart Form-data HTTP request to the `/admin/udf` endpoint
-of the CC at the normal API port (`19002` by default). The request should use HTTP Basic authentication. This means your
-credentials will *not* be obfuscated or encrypted *in any way*, so submit to this endpoint over localhost or a network
-where you know your traffic is safe from eavesdropping. Any suitable tool will do, but for the example here I will use
+of the CC at the normal API port (`19004` by default). Any suitable tool will do, but for the example here I will use
 `curl` which is widely available.

 For example, to install a library with the following criteria:
@@ -65,7 +65,7 @@ For example, to install a library with the following criteria:

 we would execute

-    curl -v -u admin:admin -X POST -F 'data=@./lib.zip' localhost:19002/admin/udf/udfs/testlib
+    curl -v -u admin:admin -X POST -F 'data=@./lib.zip' -F 'type=java' localhost:19004/admin/udf/udfs/testlib

 Any response other than `200` indicates an error in deployment.
@@ -119,7 +119,7 @@ scikit-learn here (our method is obviously better!), but it's just included as a

 Then, deploy it the same as the Java UDF was, with the library name `pylib` in `udfs` dataverse

-    curl -v -u admin:admin -X POST -F 'data=@./lib.pyz' localhost:19002/admin/udf/udfs/pylib
+    curl -v -u admin:admin -X POST -F 'data=@./lib.pyz' -F 'type=python' localhost:19002/admin/udf/udfs/pylib

 With the library deployed, we can define a function within it for use. For example, to expose the Python function `sentiment` in the module `sentiment_mod` in the class `sent_model`, the `CREATE FUNCTION` would be as follows
@@ -131,14 +131,14 @@ With the library deployed, we can define a function within it for use. For examp
       AS "sentiment_mod", "sent_model.sentiment" AT pylib;

 By default, AsterixDB will treat all external functions as deterministic. It means the function must return the same
-result for the same input, irrespective of when or how many times the function is called on that input.
-This particular function behaves the same on each input, so it satisfies the deterministic property.
+result for the same input, irrespective of when or how many times the function is called on that input.
+This particular function behaves the same on each input, so it satisfies the deterministic property.
 This enables better optimization of queries including this function.
-If a function is not deterministic then it should be declared as such by using `WITH` sub-clause:
+If a function is not deterministic then it should be declared as such by using a `WITH` sub-clause:

     USE udfs;
-    CREATE FUNCTION sentiment(a)
+    CREATE FUNCTION sentiment(text)
       AS "sentiment_mod", "sent_model.sentiment" AT pylib
       WITH { "deterministic": false }
@@ -155,6 +155,43 @@ With the function now defined, it can then be used as any other scalar SQL++ fun
     SELECT t.msg as msg, sentiment(t.msg) as sentiment FROM Tweets t;

+## <a name="pytpes">Python Type Mappings</a>
+
+Currently only a subset of AsterixDB types are supported in Python UDFs. The supported types are as follows:
+
+- Integer types (int8,16,32,64)
+- Floating point types (float, double)
+- String
+- Boolean
+- Arrays, Sets (cast to lists)
+- Objects (cast to dict)
+
+Unsupported types can be cast to these in SQL++ first in order to be passed to a Python UDF
+
+## <a name="execution">Execution Model For UDFs</a>
+
+AsterixDB queries are deployed across the cluster as Hyracks jobs. A Hyracks job has a lifecycle that can be simplified
+for the purposes of UDFs to
+ - A pre-run phase which allocates resources, `open`
+ - The time during which the job has data flowing through it, `nextFrame`
+ - Cleanup and shutdown in `close`.
+
+If a SQL++ function is defined as a member of a class in the library, the class will be instantiated
+during `open`. The class will exist in memory for the lifetime of the query. Therefore if your function needs to reference
+files or other data that would be costly to load per-call, making it a member variable that is initialized in the constructor
+of the object will greatly increase the performance of the SQL++ function.
+
+For each function invoked during a query, there will be an independent instance of the function per data partition. This
+means that the function must not assume there is any global state or that it can assume things about the layout
+of the data. The execution of the function will be parallel to the same degree as the level of data parallelism in the
+cluster.
+
+After initialization, the function bound in the SQL++ function definition is called once per tuple during the query
+execution (i.e. `nextFrame`). Unless the function specifies `null-call` in the `WITH` clause, `NULL` values will be
+skipped.
+
+At the close of the query, the function is torn down and not re-used in any way. All functions should assume that
+nothing will persist in-memory outside of the lifetime of a query, and any behavior contrary to this is undefined.

 ## <a id="UDFOnFeeds">Attaching a UDF on Data Feeds</a>

@@ -239,7 +276,7 @@ If you want to uninstall the UDF library, simply issue a `DELETE` against the en
 functions declared with the library are removed. First we'll drop the function we declared earlier:

     USE udfs;
-    DROP FUNCTION mysum@2;
+    DROP FUNCTION mysum(a,b);

 Then issue the proper `DELETE` request
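
As a reading aid for the "Execution Model For UDFs" section added in this change, the following is a minimal sketch of how a class-based Python UDF might be laid out. The module, class, and method names follow the `CREATE FUNCTION` statement in the patch (`sentiment_mod`, `sent_model`, `sentiment`). The word lists, the integer score, and the `simulate_partition` helper are assumptions made purely for illustration: they are not part of AsterixDB, and the helper only imitates the open / per-tuple / close sequence locally.

```python
# sentiment_mod.py: hypothetical module layout matching
#   AS "sentiment_mod", "sent_model.sentiment" AT pylib
# from the CREATE FUNCTION statement in the patch.


class sent_model:
    def __init__(self):
        # Per the documented model, the class is instantiated during `open`,
        # so costly setup belongs here rather than in the per-tuple method.
        # These tiny word sets stand in for a real model file (assumption).
        self.positive = {"good", "great", "love"}
        self.negative = {"bad", "awful", "hate"}
        self.calls = 0  # per-instance state only, never global state

    def sentiment(self, text):
        # Called once per tuple; an AsterixDB string arrives as a Python str.
        self.calls += 1
        words = text.lower().split()
        pos = sum(1 for w in words if w in self.positive)
        neg = sum(1 for w in words if w in self.negative)
        return pos - neg


def simulate_partition(rows):
    """Local stand-in for one data partition (illustration only)."""
    model = sent_model()                          # corresponds to `open`
    results = [model.sentiment(r) for r in rows]  # corresponds to `nextFrame`
    return results, model.calls                   # instance discarded, as at `close`
```

Each call to `simulate_partition` builds its own `sent_model`, mirroring the statement that every data partition gets an independent instance and that nothing persists past the end of a query.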
