[jira] [Updated] (DRILL-6074) Corrections to UDF tutorial documentation page

Paul Rogers (JIRA) Sun, 07 Jan 2018 17:29:20 -0800

     [ 
https://issues.apache.org/jira/browse/DRILL-6074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Paul Rogers updated DRILL-6074:
-------------------------------
    Description: 
Consider the [UDF 
Tutorial|http://drill.apache.org/docs/tutorial-develop-a-simple-function/]. 
Some of the details are a bit off.

Step 3:

bq. The function will be generated dynamically, as you can see in the 
DrillSimpleFuncHolder, and the input parameters and output holders are defined 
using holders by annotations. Define the parameters using the \@Param 
annotation.

Better: Drill uses your function template to in-line your function code into 
Drill's own generated code. The \@Param annotation identifies the input 
arguments. The order of the annotated fields indicates the order of the 
function parameters. Each parameter field must be one of Drill's holder types.

bq. Use a holder classes to provide a buffer to manage larger objects in an 
efficient way: VarCharHolder or NullableVarCharHolder.

Better: Our function template tells Drill to handle nulls, so all three of our 
arguments can be declared using the VarCharHolder type.

(Then, fix the code to use that type. The bit about larger objects is probably 
obsolete: holders are the only way to work with any value, large or otherwise.)

bq. NOTE: Drill doesn’t actually use the Java heap for data being processed in 
a query but instead keeps this data off the heap and manages the life-cycle for 
us without using the Java garbage collector.

Better: NOTE: VARCHAR data is stored in direct memory. The DrillBuf object in 
the VarCharHolder provides access to the data for the VARCHAR.

(For context: simple types, such as INT, are stored on the heap when passed to 
a UDF, so we don't want to make a blanket statement.)
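
To make the proposed note concrete, here is a minimal sketch of reading a VARCHAR value out of direct memory. It does not use Drill's classes: {{java.nio.ByteBuffer}} stands in for DrillBuf, and the {{start}} and {{end}} offsets stand in for the fields of the same name on VarCharHolder. The class and method names are hypothetical and exist only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Stand-in sketch: ByteBuffer plays the role of DrillBuf. This is an
// assumption for illustration and is not Drill's API.
final class DirectVarCharSketch {

    // Decodes the UTF-8 bytes in [start, end) of the buffer into a String.
    static String readUtf8(ByteBuffer buffer, int start, int end) {
        byte[] bytes = new byte[end - start];
        ByteBuffer view = buffer.duplicate(); // leave the caller's position untouched
        view.position(start);
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // allocateDirect() reserves off-heap memory; allocate() would use the heap.
        ByteBuffer direct = ByteBuffer.allocateDirect(64);
        byte[] data = "hello drill".getBytes(StandardCharsets.UTF_8);
        direct.put(data);

        System.out.println(direct.isDirect());                 // prints "true"
        System.out.println(readUtf8(direct, 6, data.length));  // prints "drill"
    }
}
```

The point of the sketch is only that the value lives outside the Java heap and is addressed by a buffer plus two offsets; simple types such as INT do not need this.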

Step 4.

bq. Also, using the \@Output annotation, define the returned value as 
VarCharHolder type. Because you are manipulating a VarChar, you also have to 
inject a buffer that Drill uses for the output.

Better: Identify the function's return value using the \@Output annotation. 
Like parameters, the output must be a holder type. Drill, however, does not 
provide the output buffer; we have to request one using the \@Inject 
annotation. The injected field must be of type DrillBuf. Then, in our code, we 
set the output holder to point to the injected buffer.

Step 5. The code is inefficient and not a good example. Replace this:

{code}
    out.end = outputValue.getBytes().length;
    buffer.setBytes(0, outputValue.getBytes());
{code}

With this:

{code}
    byte[] result = outputValue.getBytes();
    out.end = result.length;
    buffer.setBytes(0, result);
{code}

(But see comments for additional changes.)

While we are at it, we might as well make another line a bit more readable.

{code}
    String outputValue = (new StringBuilder(maskSubString)).append(stringValue.substring(numberOfCharToReplace)).toString();
{code}

Should be rewritten as:

{code}
    String outputValue = new StringBuilder(maskSubString)
        .append(stringValue.substring(numberOfCharToReplace))
        .toString();
{code}
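
For reference, the masking logic that this line belongs to can be sketched in plain Java, with no Drill classes involved. The variable names {{maskSubString}}, {{stringValue}} and {{numberOfCharToReplace}} come from the snippet above; the class name, the method signature, the clamping of the count, and the loop that builds the mask prefix are assumptions made for illustration.

```java
// Plain-Java sketch of the masking step; not the tutorial's actual UDF.
final class MaskSketch {

    // Replaces the first toReplace characters of stringValue with maskValue.
    static String mask(String stringValue, String maskValue, int toReplace) {
        // Assumed clamp: never replace more characters than the input has.
        int numberOfCharToReplace =
            Math.max(0, Math.min(toReplace, stringValue.length()));

        // Assumed way of building the mask prefix: repeat maskValue once
        // per replaced character.
        StringBuilder repeated = new StringBuilder();
        for (int i = 0; i < numberOfCharToReplace; i++) {
            repeated.append(maskValue);
        }
        String maskSubString = repeated.toString();

        // The line under discussion, in its more readable form.
        String outputValue = new StringBuilder(maskSubString)
            .append(stringValue.substring(numberOfCharToReplace))
            .toString();
        return outputValue;
    }

    public static void main(String[] args) {
        System.out.println(mask("4111-1111", "*", 4)); // prints "****-1111"
    }
}
```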

Then in the list of steps:

bq. Gets the number of character to replace

The word "character" should be "characters" (plural).

And:

bq. Creates and populates the output buffer

Better:

* Copies the new string into the temporary DrillBuf
* Sets up the output holder to point to the data in the DrillBuf
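
The two steps above can be sketched as follows. This is not Drill code: {{java.nio.ByteBuffer}} stands in for the injected DrillBuf, and {{HolderStandIn}} is a hypothetical struct that mimics the {{start}}, {{end}} and {{buffer}} fields of VarCharHolder. Encoding explicitly as UTF-8 is also an assumption; the snippets in this issue call {{getBytes()}} with no argument.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class OutputHolderSketch {

    // Hypothetical stand-in for VarCharHolder: a plain struct with fields.
    static final class HolderStandIn {
        int start;
        int end;
        ByteBuffer buffer;
    }

    // Step 1: copy the new string into the temporary buffer.
    // Step 2: point the output holder at the data in that buffer.
    static void writeOutput(String outputValue, ByteBuffer buffer, HolderStandIn out) {
        byte[] result = outputValue.getBytes(StandardCharsets.UTF_8); // encode once

        // Write from offset 0. Throws BufferOverflowException if the
        // buffer is too small; this sketch does not grow the buffer.
        ByteBuffer target = buffer.duplicate();
        target.clear();
        target.put(result);

        out.buffer = buffer;
        out.start = 0;
        out.end = result.length;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(64);
        HolderStandIn out = new HolderStandIn();
        writeOutput("****-1111", buffer, out);
        System.out.println(out.start + ".." + out.end); // prints "0..9"
    }
}
```

Encoding the string once and reusing the byte array is the same fix proposed for Step 5: the length and the copied bytes are then guaranteed to agree.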

Then:

bq. Even to a seasoned Java developer, the eval() method might look a bit 
strange because Drill generates the final code on the fly to fulfill a query 
request. This technique leverages Java’s just-in-time (JIT) compiler for 
maximum speed.

Better: Even to a seasoned Java developer, the eval() method might look a bit 
strange. It is best to think of the UDF declaration as a Domain-Specific 
Language (DSL) that Drill uses to describe the function. Drill uses the 
declaration to in-line your function into generated code. That is, Drill does 
not call your function code; instead Drill extracts the code and copies it into 
Drill's own generated code.

(Note: the bit about the JIT compiler is plain wrong. Drill's code generation 
has nothing to do with Java's JIT compiler.)

Basic Coding Rules

bq. To leverage Java’s just-in-time (JIT) compiler for maximum speed, you need 
to adhere to some basic rules.

Better: Drill's code generation mechanism supports a restricted subset of Java, 
meaning that you must adhere to some basic rules.

bq. Do not use imports. Instead, use the fully qualified class name as required 
by the Google Guava API packaged in Apache Drill and as shown in "Step 3: 
Declare input parameters".

(This mixes up a couple of ideas.) Better: Do not use imports. Instead, use the 
fully qualified class name.

bq. Manipulate the ValueHolders classes, for example VarCharHolder and 
IntHolder, as structs by calling helper methods, such as 
getStringFromVarCharHolder and toStringFromUTF8 as shown in "Step 5: Implement 
the eval() function".
bq. Do not call methods such as toString because this causes serious problems.

Better: Do not call any methods on the holder classes. The holders will be 
optimized away by Drill's scalar replacement mechanism.

Some additional restrictions:

* All class fields (member variables) must be preceded by one of the three 
annotations discussed above (\@Param, \@Output or \@Inject), or by the 
\@Workspace annotation which identifies internal temporary fields. (If you omit 
the annotations, then queries using your function will fail at runtime.)
* Do not use static fields (such as to declare constants). If you must declare 
constants, declare them in a class other than the UDF class.
* Do not pass holders to other functions; all references must be within your 
UDF.
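
The first two restrictions can be illustrated with a small reflection check. This is a hypothetical helper, not something Drill provides: the four annotations declared here are local stand-ins with the same simple names as the ones discussed above, and the {{violations}} method is an assumption about how one might test a class before deploying it.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

final class UdfFieldRulesSketch {

    // Local stand-ins for the four annotations; not Drill's own types.
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD) @interface Param {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD) @interface Output {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD) @interface Inject {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD) @interface Workspace {}

    // Lists every field that is static or lacks one of the four annotations.
    static List<String> violations(Class<?> udfClass) {
        List<String> problems = new ArrayList<>();
        for (Field f : udfClass.getDeclaredFields()) {
            if (f.isSynthetic()) {
                continue; // ignore compiler- or tool-generated fields
            }
            if (Modifier.isStatic(f.getModifiers())) {
                problems.add(f.getName() + ": static field");
            } else if (!(f.isAnnotationPresent(Param.class)
                    || f.isAnnotationPresent(Output.class)
                    || f.isAnnotationPresent(Inject.class)
                    || f.isAnnotationPresent(Workspace.class))) {
                problems.add(f.getName() + ": missing annotation");
            }
        }
        return problems;
    }

    // Follows the rules: every field is annotated and none is static.
    static final class GoodUdf {
        @Param String input;
        @Output String out;
        @Inject Object buffer;
        @Workspace int scratch;
    }

    // Breaks the rules: a static constant and an unannotated field.
    static final class BadUdf {
        @Param String input;
        static final int LIMIT = 10;
        int counter;
    }

    public static void main(String[] args) {
        System.out.println(violations(GoodUdf.class)); // prints "[]"
        System.out.println(violations(BadUdf.class).size()); // prints "2"
    }
}
```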

Prepare the Package

bq. Because Drill generates the source, ...

Better: Because Drill copies your code into its own generated code, ...

Basic Coding Rules
Build and Deploy the Function
Test the New Function

The above three lines probably want to be headings; they appear as normal text.

bq. Add the JAR files to Drill, by copying them to the following location: 
<Drill installation directory>/jars/3rdparty

Perhaps add the following: Be sure to copy the jars into the above folder each 
time you rebuild, reinstall or upgrade Drill. If running in a cluster, copy the 
jars to the Drill installation on every node.

As an alternative, you can create a site directory (need link; do we describe 
this anywhere except in the Drill-on-YARN PR?) and copy your files into 
the {{$DRILL_SITE/jars}} folder. This way, you need not remember to copy the 
jars each time you reinstall Drill.


  was:
Consider the [UDF 
Tutorial|http://drill.apache.org/docs/tutorial-develop-a-simple-function/]. 
Some of the details are a bit off.

Step 3:

bq. The function will be generated dynamically, as you can see in the 
DrillSimpleFuncHolder, and the input parameters and output holders are defined 
using holders by annotations. Define the parameters using the \@Param 
annotation.

Better: Drill uses your function template to in-line your function code into 
Drill's own generated code. The \@Param annotation identifies the input 
arguments. The order of the annotated fields indicates the order of the 
function parameters. Each parameter field must be one of Drill's holder types.

bq. Use a holder classes to provide a buffer to manage larger objects in an 
efficient way: VarCharHolder or NullableVarCharHolder.

Better: Our function template tells Drill to handle nulls, so all three of our 
arguments can be declared using the VarCharHolder type.

(Then, fix the code to use that type. The bit about larger objects is probably 
obsolete: holders are the only way to work with any value: large or otherwise.)

bq. NOTE: Drill doesn’t actually use the Java heap for data being processed in 
a query but instead keeps this data off the heap and manages the life-cycle for 
us without using the Java garbage collector.

Better: NOTE: VARCHAR data is stored in direct memory. The DrillBuf object in 
the VarCharHolder provides access to the data for the VARCHAR.

(For context: simple types, such as INT, are stored on the heap when passed to 
a UDF, so we don't want to make a blanket statement.)

Step 4.

bq. Also, using the \@Output annotation, define the returned value as 
VarCharHolder type. Because you are manipulating a VarChar, you also have to 
inject a buffer that Drill uses for the output.

Better: Identify the function's return value using the \@Output annotation. 
Like parameters, the output must be a holder type. Drill, however, does not 
provide the output buffer; we have to request one using the \@Inject 
annotation. The injected field must be of type DrillBuf. Then, in our code, we 
set the output holder to point to the injected buffer.

Step 5. The code is inefficient and not a good example. Replace this:

{code}
    out.end = outputValue.getBytes().length;
    buffer.setBytes(0, outputValue.getBytes());
{code}

With this:

{code}
    byte result[] = outputValue.getBytes();
    out.end = result.length;
    buffer.setBytes(0, result);
{code}

While we are at it, we might as well make another line a bit more readable.

{code}
    String outputValue = (new StringBuilder(maskSubString)).append(stringValue.substring(numberOfCharToReplace)).toString();
{code}

Should be rewritten as:

{code}
    String outputValue = new StringBuilder(maskSubString)
        .append(stringValue.substring(numberOfCharToReplace)
        .toString();
{code}

Then in the list of steps:

bq. Gets the number of character to replace

The word "character" should be "characters" (plural)

And:

bq. Creates and populates the output buffer

Better:

* Copies the new string into the temporary DrillBuf
* Sets up the output holder to point to the data in the DrillBuf

Then:

bq. Even to a seasoned Java developer, the eval() method might look a bit 
strange because Drill generates the final code on the fly to fulfill a query 
request. This technique leverages Java’s just-in-time (JIT) compiler for 
maximum speed.

Better: Even to a seasoned Java developer, the eval() method might look a bit 
strange. It is best to think of the UDF declaration as a Domain-Specific 
Language (DSL) that Drill uses to describe the function. Drill uses the 
declaration to in-line your function into generated code. That is, Drill does 
not call your function code; instead Drill extracts the code and copies it into 
Drill's own generated code.

(Note: the bit about the JIT compiler is plain wrong. Drills code generation 
has nothing to do with Java's JIT compiler.)

Basic Coding Rules

bq. To leverage Java’s just-in-time (JIT) compiler for maximum speed, you need 
to adhere to some basic rules.

Better: Drill's code generation mechanism supports a restricted subset of Java, 
meaning that you must adhere to some basic rules.

bq. Do not use imports. Instead, use the fully qualified class name as required 
by the Google Guava API packaged in Apache Drill and as shown in "Step 3: 
Declare input parameters".

(This mixes up a couple of ideas.) Better: Do not use imports. Instead, use the 
fully qualified class name.

bq. Manipulate the ValueHolders classes, for example VarCharHolder and 
IntHolder, as structs by calling helper methods, such as 
getStringFromVarCharHolder and toStringFromUTF8 as shown in "Step 5: Implement 
the eval() function".
bq. Do not call methods such as toString because this causes serious problems.

Better: Do not call any methods on the holder classes. The holders will be 
optimized away by Drill's scalar replacement mechanism.

Some additional restrictions:

* All class fields (member variables) must be preceded by one of the three 
annotations discussed above (\@Param, \@Output or \@Inject), or by the 
\@Workspace annotation which identifies internal temporary fields. (If you omit 
the annotations, then functions using your query will fail at runtime.)
* Do not use static fields (such as to declare constants.) If you must declare 
constants, declare them in a class other than the UDF class.

Prepare the Package

bq. Because Drill generates the source, ...

Better: Because Drill copies your code into is own generated code, ...

Basic Coding Rules
Build and Deploy the Function
Test the New Function

The above three lines probably want to be a heading; it appears as normal text.

bq. Add the JAR files to Drill, by copying them to the following location: 
<Drill installation directory>/jars/3rdparty

Perhaps add the following: Be sure to copy the jars into the above folder each 
time you rebuild, reinstall or upgrade Drill. If running in a cluster, copy the 
jars to the Drill installation on every node.

As an alternative, you can create a site directory as described (need link. Do 
we describe this anywhere except in the Drill-on-YARN PR?) Copy your files into 
the {{$DRILL_SITE/jars}} folder. This way, you need not remember to copy the 
jars each time you reinstall Drill.



> Corrections to UDF tutorial documentation page
> ----------------------------------------------
>
>                 Key: DRILL-6074
>                 URL: https://issues.apache.org/jira/browse/DRILL-6074
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Documentation
>            Reporter: Paul Rogers
>            Assignee: Bridget Bevens
>            Priority: Minor
>              Labels: doc-impacting
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
