http://git-wip-us.apache.org/repos/asf/apex-site/blob/d396fa83/content/docs/apex-3.4/mkdocs/search_index.json
----------------------------------------------------------------------
diff --git a/content/docs/apex-3.4/mkdocs/search_index.json 
b/content/docs/apex-3.4/mkdocs/search_index.json
index 3512a2f..611f195 100644
--- a/content/docs/apex-3.4/mkdocs/search_index.json
+++ b/content/docs/apex-3.4/mkdocs/search_index.json
@@ -12,7 +12,7 @@
         }, 
         {
             "location": "/apex_development_setup/", 
-            "text": "Apache Apex Development Environment Setup\n\n\nThis 
document discusses the steps needed for setting up a development environment 
for creating applications that run on the Apache Apex 
platform.\n\n\nDevelopment Tools\n\n\nThere are a few tools that will be 
helpful when developing Apache Apex applications, including:\n\n\n\n\n\n\ngit\n 
- A revision control system (version 1.7.1 or later). There are multiple git 
clients available for Windows (\nhttp://git-scm.com/download/win\n for 
example), so download and install a client of your choice.\n\n\n\n\n\n\njava 
JDK\n (not JRE) - Includes the Java Runtime Environment as well as the Java 
compiler and a variety of tools (version 1.7.0_79 or later). Can be downloaded 
from the Oracle website.\n\n\n\n\n\n\nmaven\n - Apache Maven is a build system 
for Java projects (version 3.0.5 or later). It can be downloaded from 
\nhttps://maven.apache.org/download.cgi\n.\n\n\n\n\n\n\nIDE\n (Optional) - If 
you prefer to use an IDE (Integra
 ted Development Environment) such as \nNetBeans\n, \nEclipse\n or 
\nIntelliJ\n, install that as well.\n\n\n\n\n\n\nAfter installing these tools, 
make sure that the directories containing the executable files are in your PATH 
environment variable.\n\n\n\n\nWindows\n - Open a console window and enter the 
command \necho %PATH%\n to see the value of the \nPATH\n variable and verify 
that the above directories for Java, git, and maven executables are present.  
For JDK executables like \njava\n and \njavac\n, the directory might be something 
like \nC:\\Program Files\\Java\\jdk1.7.0\\_80\\bin\n; for \ngit\n it might be 
\nC:\\Program Files\\Git\\bin\n; and for maven it might be 
\nC:\\Users\\user\\Software\\apache-maven-3.3.3\\bin\n.  If not, you can change 
its value by clicking on the button at \nControl Panel\n \n \nAdvanced System 
Settings\n \n \nAdvanced tab\n \n \nEnvironment Variables\n.\n\n\nLinux and 
Mac\n - Open a console/terminal window and enter the command \necho $PATH\n to 
see the value
  of the \nPATH\n variable and verify that the above directories for Java, git, 
and maven executables are present.  If not, make sure software is downloaded 
and installed, and optionally PATH reference is added and exported  in a 
\n~/.profile\n or \n~/.bash_profile\n.  For example, to add maven located in 
\n/sfw/maven/apache-maven-3.3.3\n to PATH, add the line: \nexport 
PATH=$PATH:/sfw/maven/apache-maven-3.3.3/bin\n\n\n\n\nConfirm by running the 
following commands and comparing with the output shown in the table 
below:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCommand\n\n\nOutput\n\n\n\n\n\n\njavac 
-version\n\n\njavac 1.7.0_80\n\n\n\n\n\n\njava -version\n\n\njava version 
\n1.7.0_80\n\n\nJava(TM) SE Runtime Environment (build 1.7.0_80-b15)\n\n\nJava 
HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)\n\n\n\n\n\n\ngit 
--version\n\n\ngit version 2.6.1.windows.1\n\n\n\n\n\n\nmvn 
--version\n\n\nApache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 
2015-04-22T06:57:37-05:00)\n\n\n...\n
 \n\n\n\n\n\n\n\n\n\n\nCreating New Apex Project\n\n\nAfter development tools 
are configured, you can now use the maven archetype to create a basic Apache 
Apex project.  \nNote:\n When executing the commands below, replace \n3.4.0\n 
by the \nlatest available version\n of Apache Apex.\n\n\n\n\n\n\nWindows\n - Create 
a new Windows command file called \nnewapp.cmd\n by copying the lines below, 
and execute it.  When you run this file, the properties will be displayed and 
you will be prompted with \nY: :\n; just press \nEnter\n to complete the 
project generation.  The caret (^) at the end of some lines indicates that a 
continuation line follows. \n\n\n@echo off\n@rem Script for creating a new 
application\nsetlocal\nmvn archetype:generate ^\n 
-DarchetypeGroupId=org.apache.apex ^\n -DarchetypeArtifactId=apex-app-archetype 
-DarchetypeVersion=3.4.0 ^\n -DgroupId=com.example 
-Dpackage=com.example.myapexapp -DartifactId=myapexapp ^\n 
-Dversion=1.0-SNAPSHOT\nendlocal\n\n\n\n\n\n\n\nLinux\n - Execute
  the lines below in a terminal window.  A new project will be created in the 
current working directory.  The backslash (\\) at the end of the lines indicates 
continuation.\n\n\nmvn archetype:generate \\\n 
-DarchetypeGroupId=org.apache.apex \\\n 
-DarchetypeArtifactId=apex-app-archetype -DarchetypeVersion=3.4.0 \\\n 
-DgroupId=com.example -Dpackage=com.example.myapexapp -DartifactId=myapexapp 
\\\n -Dversion=1.0-SNAPSHOT\n\n\n\n\n\n\n\nWhen the run completes successfully, 
you should see a new directory named \nmyapexapp\n containing a maven project 
for building a basic Apache Apex application. It includes 3 source 
files:\nApplication.java\n,  \nRandomNumberGenerator.java\n and 
\nApplicationTest.java\n. You can now build the application by stepping into 
the new directory and running the maven package command:\n\n\ncd myapexapp\nmvn 
clean package -DskipTests\n\n\n\nThe build should create the application 
package file \nmyapexapp/target/myapexapp-1.0-SNAPSHOT.apa\n. This application 
package c
 an then be used to launch the example application via the \napex\n CLI, or other 
visual management tools.  When running, this application will generate a stream 
of random numbers and print them out, each prefixed by the string \nhello 
world:\n.\n\n\nRunning Unit Tests\n\n\nTo run unit tests on Linux or OSX, 
simply run the usual maven command, for example: \nmvn test\n.\n\n\nOn Windows, 
an additional file, \nwinutils.exe\n, is required; download it 
from\n\nhttps://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip\n\nand
 unpack the archive to, say, \nC:\\hadoop\n; this file should be present 
under\n\nhadoop-common-2.2.0-bin-master\\bin\n within it.\n\n\nSet the 
\nHADOOP_HOME\n environment variable system-wide 
to\n\nc:\\hadoop\\hadoop-common-2.2.0-bin-master\n as described 
at:\n\nhttps://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/sysdm_advancd_environmnt_addchange_variable.mspx?mfr=true\n.
 You should now be able to run unit tests normally.\n\n\nIf you 
 prefer not to set the variable globally, you can set it on the command line or 
within\nyour IDE. For example, on the command line, specify the maven\nproperty 
\nhadoop.home.dir\n:\n\n\nmvn 
-Dhadoop.home.dir=c:\\hadoop\\hadoop-common-2.2.0-bin-master test\n\n\n\nor set 
the environment variable separately:\n\n\nset 
HADOOP_HOME=c:\\hadoop\\hadoop-common-2.2.0-bin-master\nmvn test\n\n\n\nWithin 
your IDE, set the environment variable and then run the desired\nunit test in 
the usual way. For example, with NetBeans you can 
add:\n\n\nEnv.HADOOP_HOME=c:/hadoop/hadoop-common-2.2.0-bin-master\n\n\n\nat 
\nProperties \n Actions \n Run project \n Set Properties\n.\n\n\nSimilarly, in 
Eclipse (Mars) add it to the\nproject properties at \nProperties \n Run/Debug 
Settings \n ApplicationTest\n\n Environment\n tab.\n\n\nBuilding Apex 
Demos\n\n\nIf you want to see more substantial Apex demo applications and the 
associated source code, you can follow these simple steps to check out and 
build them.\n\n\n\
 n\n\n\nCheck out the source code repositories:\n\n\ngit clone 
https://github.com/apache/incubator-apex-core\ngit clone 
https://github.com/apache/incubator-apex-malhar\n\n\n\n\n\n\n\nSwitch to the 
appropriate release branch and build each repository:\n\n\ncd 
incubator-apex-core\nmvn clean install -DskipTests\n\ncd 
incubator-apex-malhar\nmvn clean install -DskipTests\n\n\n\n\n\n\n\nThe 
\ninstall\n argument to the \nmvn\n command installs resources from each 
project to your local maven repository (typically \n.m2/repository\n under your 
home directory), and \nnot\n to the system directories, so Administrator 
privileges are not required. The  \n-DskipTests\n argument skips running unit 
tests since they take a long time. If this is a first-time installation, it 
might take several minutes to complete because maven will download a number of 
associated plugins.\n\n\nAfter the build completes, you should see the demo 
application package files in the target directory under each demo subdirect
 ory in \nincubator-apex-malhar/demos\n.\n\n\nSandbox\n\n\nTo jump start 
development with an Apache Hadoop single node cluster, \nDataTorrent Sandbox\n 
powered by VirtualBox is available on Windows, Linux, or Mac platforms.  The 
sandbox is configured by default to run with 6GB RAM; if your development 
machine has 16GB or more, you can increase the sandbox RAM to 8GB or more using 
the VirtualBox console.  This will yield better performance and support larger 
applications.  The advantage of developing in the sandbox is that most of the 
tools (e.g. \njdk\n, \ngit\n, \nmaven\n), Hadoop YARN and HDFS, and a 
distribution of Apache Apex and DataTorrent RTS are pre-installed.  The 
disadvantage is that the sandbox is a memory-limited environment, and requires 
settings changes and restarts to adjust memory available for development and 
testing.", 
+            "text": "Apache Apex Development Environment Setup\n\n\nThis 
document discusses the steps needed for setting up a development environment 
for creating applications that run on the Apache Apex 
platform.\n\n\nDevelopment Tools\n\n\nThere are a few tools that will be 
helpful when developing Apache Apex applications, including:\n\n\n\n\n\n\ngit\n 
- A revision control system (version 1.7.1 or later). There are multiple git 
clients available for Windows (\nhttp://git-scm.com/download/win\n for 
example), so download and install a client of your choice.\n\n\n\n\n\n\njava 
JDK\n (not JRE) - Includes the Java Runtime Environment as well as the Java 
compiler and a variety of tools (version 1.7.0_79 or later). Can be downloaded 
from the Oracle website.\n\n\n\n\n\n\nmaven\n - Apache Maven is a build system 
for Java projects (version 3.0.5 or later). It can be downloaded from 
\nhttps://maven.apache.org/download.cgi\n.\n\n\n\n\n\n\nIDE\n (Optional) - If 
you prefer to use an IDE (Integra
 ted Development Environment) such as \nNetBeans\n, \nEclipse\n or 
\nIntelliJ\n, install that as well.\n\n\n\n\n\n\nAfter installing these tools, 
make sure that the directories containing the executable files are in your PATH 
environment variable.\n\n\n\n\nWindows\n - Open a console window and enter the 
command \necho %PATH%\n to see the value of the \nPATH\n variable and verify 
that the above directories for Java, git, and maven executables are present.  
For JDK executables like \njava\n and \njavac\n, the directory might be something 
like \nC:\\Program Files\\Java\\jdk1.7.0\\_80\\bin\n; for \ngit\n it might be 
\nC:\\Program Files\\Git\\bin\n; and for maven it might be 
\nC:\\Users\\user\\Software\\apache-maven-3.3.3\\bin\n.  If not, you can change 
its value by clicking on the button at \nControl Panel\n \n \nAdvanced System 
Settings\n \n \nAdvanced tab\n \n \nEnvironment Variables\n.\n\n\nLinux and 
Mac\n - Open a console/terminal window and enter the command \necho $PATH\n to 
see the value
  of the \nPATH\n variable and verify that the above directories for Java, git, 
and maven executables are present.  If not, make sure software is downloaded 
and installed, and optionally PATH reference is added and exported  in a 
\n~/.profile\n or \n~/.bash_profile\n.  For example, to add maven located in 
\n/sfw/maven/apache-maven-3.3.3\n to PATH, add the line: \nexport 
PATH=$PATH:/sfw/maven/apache-maven-3.3.3/bin\n\n\n\n\nConfirm by running the 
following commands and comparing with the output shown in the table 
below:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCommand\n\n\nOutput\n\n\n\n\n\n\njavac 
-version\n\n\njavac 1.7.0_80\n\n\n\n\n\n\njava -version\n\n\njava version 
\n1.7.0_80\n\n\nJava(TM) SE Runtime Environment (build 1.7.0_80-b15)\n\n\nJava 
HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)\n\n\n\n\n\n\ngit 
--version\n\n\ngit version 2.6.1.windows.1\n\n\n\n\n\n\nmvn 
--version\n\n\nApache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 
2015-04-22T06:57:37-05:00)\n\n\n...\n
 \n\n\n\n\n\n\n\n\n\n\nCreating New Apex Project\n\n\nAfter development tools 
are configured, you can now use the maven archetype to create a basic Apache 
Apex project.  \nNote:\n When executing the commands below, replace \n3.4.0\n 
by the \nlatest available version\n of Apache Apex.\n\n\n\n\n\n\nWindows\n - Create 
a new Windows command file called \nnewapp.cmd\n by copying the lines below, 
and execute it.  When you run this file, the properties will be displayed and 
you will be prompted with \nY: :\n; just press \nEnter\n to complete the 
project generation.  The caret (^) at the end of some lines indicates that a 
continuation line follows. \n\n\n@echo off\n@rem Script for creating a new 
application\nsetlocal\nmvn archetype:generate ^\n 
-DarchetypeGroupId=org.apache.apex ^\n -DarchetypeArtifactId=apex-app-archetype 
-DarchetypeVersion=3.4.0 ^\n -DgroupId=com.example 
-Dpackage=com.example.myapexapp -DartifactId=myapexapp ^\n 
-Dversion=1.0-SNAPSHOT\nendlocal\n\n\n\n\n\n\n\nLinux\n - Execute
  the lines below in a terminal window.  A new project will be created in the 
current working directory.  The backslash (\\) at the end of the lines indicates 
continuation.\n\n\nmvn archetype:generate \\\n 
-DarchetypeGroupId=org.apache.apex \\\n 
-DarchetypeArtifactId=apex-app-archetype -DarchetypeVersion=3.4.0 \\\n 
-DgroupId=com.example -Dpackage=com.example.myapexapp -DartifactId=myapexapp 
\\\n -Dversion=1.0-SNAPSHOT\n\n\n\n\n\n\n\nWhen the run completes successfully, 
you should see a new directory named \nmyapexapp\n containing a maven project 
for building a basic Apache Apex application. It includes 3 source 
files:\nApplication.java\n,  \nRandomNumberGenerator.java\n and 
\nApplicationTest.java\n. You can now build the application by stepping into 
the new directory and running the maven package command:\n\n\ncd myapexapp\nmvn 
clean package -DskipTests\n\n\n\nThe build should create the application 
package file \nmyapexapp/target/myapexapp-1.0-SNAPSHOT.apa\n. This application 
package c
 an then be used to launch the example application via the \napex\n CLI, or other 
visual management tools.  When running, this application will generate a stream 
of random numbers and print them out, each prefixed by the string \nhello 
world:\n.\n\n\nRunning Unit Tests\n\n\nTo run unit tests on Linux or OSX, 
simply run the usual maven command, for example: \nmvn test\n.\n\n\nOn Windows, 
an additional file, \nwinutils.exe\n, is required; download it 
from\n\nhttps://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip\n\nand
 unpack the archive to, say, \nC:\\hadoop\n; this file should be present 
under\n\nhadoop-common-2.2.0-bin-master\\bin\n within it.\n\n\nSet the 
\nHADOOP_HOME\n environment variable system-wide 
to\n\nc:\\hadoop\\hadoop-common-2.2.0-bin-master\n as described 
at:\n\nhttps://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/sysdm_advancd_environmnt_addchange_variable.mspx?mfr=true\n.
 You should now be able to run unit tests normally.\n\n\nIf you 
 prefer not to set the variable globally, you can set it on the command line or 
within\nyour IDE. For example, on the command line, specify the maven\nproperty 
\nhadoop.home.dir\n:\n\n\nmvn 
-Dhadoop.home.dir=c:\\hadoop\\hadoop-common-2.2.0-bin-master test\n\n\n\nor set 
the environment variable separately:\n\n\nset 
HADOOP_HOME=c:\\hadoop\\hadoop-common-2.2.0-bin-master\nmvn test\n\n\n\nWithin 
your IDE, set the environment variable and then run the desired\nunit test in 
the usual way. For example, with NetBeans you can 
add:\n\n\nEnv.HADOOP_HOME=c:/hadoop/hadoop-common-2.2.0-bin-master\n\n\n\nat 
\nProperties \n Actions \n Run project \n Set Properties\n.\n\n\nSimilarly, in 
Eclipse (Mars) add it to the\nproject properties at \nProperties \n Run/Debug 
Settings \n ApplicationTest\n\n Environment\n tab.\n\n\nBuilding Apex 
Demos\n\n\nIf you want to see more substantial Apex demo applications and the 
associated source code, you can follow these simple steps to check out and 
build them.\n\n\n\
 n\n\n\nCheck out the source code repositories:\n\n\ngit clone 
https://github.com/apache/apex-core\ngit clone 
https://github.com/apache/apex-malhar\n\n\n\n\n\n\n\nSwitch to the appropriate 
release branch and build each repository:\n\n\ncd apex-core\nmvn clean install 
-DskipTests\n\ncd apex-malhar\nmvn clean install -DskipTests\n\n\n\n\n\n\n\nThe 
\ninstall\n argument to the \nmvn\n command installs resources from each 
project to your local maven repository (typically \n.m2/repository\n under your 
home directory), and \nnot\n to the system directories, so Administrator 
privileges are not required. The  \n-DskipTests\n argument skips running unit 
tests since they take a long time. If this is a first-time installation, it 
might take several minutes to complete because maven will download a number of 
associated plugins.\n\n\nAfter the build completes, you should see the demo 
application package files in the target directory under each demo subdirectory 
in \napex-malhar/demos\n.\n\n\nSandb
 ox\n\n\nTo jump start development with an Apache Hadoop single node cluster, 
\nDataTorrent Sandbox\n powered by VirtualBox is available on Windows, Linux, 
or Mac platforms.  The sandbox is configured by default to run with 6GB RAM; if 
your development machine has 16GB or more, you can increase the sandbox RAM to 
8GB or more using the VirtualBox console.  This will yield better performance 
and support larger applications.  The advantage of developing in the sandbox is 
that most of the tools (e.g. \njdk\n, \ngit\n, \nmaven\n), Hadoop YARN and 
HDFS, and a distribution of Apache Apex and DataTorrent RTS are pre-installed.  
The disadvantage is that the sandbox is a memory-limited environment, and 
requires settings changes and restarts to adjust memory available for 
development and testing.", 
             "title": "Development Setup"
         }, 
         {
@@ -37,7 +37,7 @@
         }, 
         {
             "location": "/apex_development_setup/#building-apex-demos", 
-            "text": "If you want to see more substantial Apex demo 
applications and the associated source code, you can follow these simple steps 
to check out and build them.    Check out the source code repositories:  git 
clone https://github.com/apache/incubator-apex-core\ngit clone 
https://github.com/apache/incubator-apex-malhar    Switch to the appropriate 
release branch and build each repository:  cd incubator-apex-core\nmvn clean 
install -DskipTests\n\ncd incubator-apex-malhar\nmvn clean install -DskipTests  
  The  install  argument to the  mvn  command installs resources from each 
project to your local maven repository (typically  .m2/repository  under your 
home directory), and  not  to the system directories, so Administrator 
privileges are not required. The   -DskipTests  argument skips running unit 
tests since they take a long time. If this is a first-time installation, it 
might take several minutes to complete because maven will download a number of 
associated plugins.  A
 fter the build completes, you should see the demo application package files in 
the target directory under each demo subdirectory in  
incubator-apex-malhar/demos .", 
+            "text": "If you want to see more substantial Apex demo 
applications and the associated source code, you can follow these simple steps 
to check out and build them.    Check out the source code repositories:  git 
clone https://github.com/apache/apex-core\ngit clone 
https://github.com/apache/apex-malhar    Switch to the appropriate release 
branch and build each repository:  cd apex-core\nmvn clean install 
-DskipTests\n\ncd apex-malhar\nmvn clean install -DskipTests    The  install  
argument to the  mvn  command installs resources from each project to your 
local maven repository (typically  .m2/repository  under your home directory), 
and  not  to the system directories, so Administrator privileges are not 
required. The   -DskipTests  argument skips running unit tests since they take 
a long time. If this is a first-time installation, it might take several 
minutes to complete because maven will download a number of associated plugins. 
 After the build completes, you should see
  the demo application package files in the target directory under each demo 
subdirectory in  apex-malhar/demos .", 
             "title": "Building Apex Demos"
         }, 
         {
@@ -821,6 +821,71 @@
             "title": "System Metrics"
         }, 
         {
+            "location": "/development_best_practices/", 
+            "text": "Development Best Practices\n\n\nThis document describes 
the best practices to follow when developing operators and other application 
components such as partitioners, stream codecs, etc. on the Apache Apex 
platform.\n\n\nOperators\n\n\nThese are general guidelines for all operators 
that are covered in the current section. The subsequent sections talk about 
special considerations for input and output operators.\n\n\n\n\nWhen writing a 
new operator to be used in an application, consider breaking it down 
into\n\n\nAn abstract operator that encompasses the core functionality but 
leaves application specific schemas and logic to the implementation.\n\n\nAn 
optional concrete operator also in the library that extends the abstract 
operator and provides commonly used schema types such as strings, byte[] or 
POJOs.\n\n\n\n\n\n\nFollow these conventions for the life cycle 
methods:\n\n\nDo one time initialization of entities that apply for the entire 
lifetime of the operator in t
 he \nsetup\n method, e.g., factory initializations. Initializations in 
\nsetup\n are done in the container where the operator is deployed. Allocating 
memory for fields in the constructor is not efficient as it would lead to extra 
garbage in memory for the following reason. The operator is instantiated on the 
client from where the application is launched, serialized and started on one of 
the Hadoop nodes in a container. So the constructor is first called on the 
client and if it were to initialize any of the fields, that state would be 
saved during serialization. In the Hadoop container the operator is 
deserialized and started. This would invoke the constructor again, which will 
initialize the fields but their state will get overwritten by the serialized 
state and the initial values would become garbage in memory.\n\n\nDo one time 
initialization for live entities in \nactivate\n method, e.g., opening 
connections to a database server or starting a thread for asynchronous 
operations. The \
 nactivate\n method is called right before processing starts so it is a better 
place for these initializations than at \nsetup\n which can lead to a delay 
before processing data from the live entity.  \n\n\nPerform periodic tasks 
based on processing time in application window boundaries.\n\n\nPerform 
initializations needed for each application window in 
\nbeginWindow\n.\n\n\nPerform aggregations needed for each application window  
in \nendWindow\n.\n\n\nTeardown of live entities (inverse of tasks performed 
during activate) should be in the \ndeactivate\n method.\n\n\nTeardown of 
lifetime entities (those initialized in setup method) should happen in the 
\nteardown\n method.\n\n\nIf the operator implementation is not finalized mark 
it with the \n@Evolving\n annotation.\n\n\n\n\n\n\nIf the operator needs to 
perform operations based on event time of the individual tuples and not the 
processing time, extend and use the \nWindowedOperator\n. Refer to 
documentation of that operator for deta
 ils on how to use it.\n\n\nIf an operator needs to do some work when it is not 
receiving any input, it should implement \nIdleTimeHandler\n interface. This 
interface contains \nhandleIdleTime\n method which will be called whenever the 
platform isn\u2019t doing anything else and the operator can do the work in 
this method. If for any reason the operator does not have any work to do when 
this method is called, it should sleep for a small amount of time such as that 
specified by the \nSPIN_MILLIS\n attribute so that it does not cause a busy 
wait when called repeatedly by the platform. Also, the method should not block 
and should return in a reasonable amount of time that is less than the streaming 
window size (which is 500ms by default).\n\n\nOften operators have customizable 
parameters such as information about locations of external systems or 
parameters that modify the behavior of the operator. Users should be able to 
specify these easily without having to change source code. This can be do
 ne by making them properties of the operator because they can then be 
initialized from external properties files.\n\n\nWhere possible default values 
should be provided for the properties in the source code.\n\n\nValidation rules 
should be specified for the properties using javax constraint validations that 
check whether the values specified for the properties are in the correct 
format, range or other operator requirements. Required properties should have 
at least a \n@NotNull\n validation specifying that they have to be specified by 
the user.\n\n\n\n\n\n\n\n\nCheckpointing\n\n\nCheckpointing is a process of 
snapshotting the state of an operator and saving it so that in case of failure 
the state can be used to restore the operator to a prior state and continue 
processing. It is automatically performed by the platform at a configurable 
interval. All operators in the application are checkpointed in a distributed 
fashion, thus allowing the entire state of the application to be saved and
  available for recovery if needed. Here are some things to remember when it 
comes to checkpointing:\n\n\n\n\nThe process of checkpointing involves 
snapshotting the state by serializing the operator and saving it to a store. 
This is done using a \nStorageAgent\n. By default a \nStorageAgent\n is already 
provided by the platform and it is called \nAsyncFSStorageAgent\n. It 
serializes the operator using Kryo and saves the serialized state 
asynchronously to a filesystem such as HDFS. There are other implementations of 
\nStorageAgent\n available such as \nGeodeKeyValueStorageAgent\n that stores 
the serialized state in Geode which is an in-memory replicated data 
grid.\n\n\nAll variables in the operator marked neither transient nor final are 
saved so any variables in the operator that are not part of the state should be 
marked transient. Specifically any variables like connection objects, i/o 
streams, ports are transient, because they need to be set up again on failure 
recovery.\n\n\nIf the
  operator does not keep any state between windows, mark it with the 
\n@Stateless\n annotation. This results in efficiencies during checkpointing 
and recovery. The operator will not be checkpointed and is always restored to 
the initial state.\n\n\nThe checkpoint interval can be set using the 
\nCHECKPOINT_WINDOW_COUNT\n attribute which specifies the interval in terms of 
number of streaming windows.\n\n\nIf the correct functioning of the operator 
requires the \nendWindow\n method be called before checkpointing can happen, 
then the checkpoint interval should align with application window interval 
i.e., it should be a multiple of application window interval. In this case the 
operator should be marked with \nOperatorAnnotation\n and 
\ncheckpointableWithinAppWindow\n set to false. If the window intervals are 
configured by the user and they don\u2019t align, it will result in a DAG 
validation error and application won\u2019t launch.\n\n\nIn some cases the 
operator state related to a piece of
  data needs to be purged once that data is no longer required by the 
application, otherwise the state will continue to build up indefinitely. The 
platform provides a way to let the operator know about this using a callback 
listener called \nCheckpointNotificationListener\n. This listener has a 
callback method called \ncommitted\n, which is called by the platform from time 
to time with a window id that has been processed successfully by all the 
operators in the DAG and hence is no longer needed. The operator can delete all 
the state corresponding to window ids less than or equal to the provided window 
id.\n\n\nSometimes operators need to perform some tasks just before 
checkpointing. For example, filesystem operators may want to flush the files 
just before checkpoint so they can be sure that all pending data is written to 
disk and no data is lost if there is an operator failure just after the 
checkpoint and the operator restarts from the checkpoint. To do this the 
operator would imple
 ment the same \nCheckpointNotificationListener\n interface and implement the 
\nbeforeCheckpoint\n method where it can do these tasks.\n\n\nIf the operator 
is going to have a large state, checkpointing the entire state each time 
becomes unviable. Furthermore, the amount of memory needed to hold the state 
could be larger than the amount of physical memory available. In these cases 
the operator should checkpoint the state incrementally and also manage the 
memory for the state more efficiently. The platform provides a utility called 
\nManagedState\n that uses a combination of in memory and disk cache to 
efficiently store and retrieve data in a performant, fault tolerant way and 
also checkpoint it in an incremental fashion. There are operators in the 
platform that use \nManagedState\n and can be used as a reference on how to use 
this utility such as Dedup or Join operators.\n\n\n\n\nInput 
Operators\n\n\nInput operators have additional requirements:\n\n\n\n\nThe 
\nemitTuples\n method impl
 emented by the operator, is called by the platform to give the operator an 
opportunity to emit some data. This method is always called within a window 
boundary but can be called multiple times within the same window. There are 
some important guidelines on how to implement this method:\n\n\nThis should not 
be a blocking method and should return in a reasonable time that is less than 
the streaming window size (which is 500ms by default). This also applies to 
other callback methods called by the platform such as \nbeginWindow\n, 
\nendWindow\n etc., but is more important here since this method will be called 
continuously by the platform.\n\n\nIf the operator needs to interact with 
external systems to obtain data and this can potentially take a long time, then 
this should be performed asynchronously in a different thread. Refer to the 
threading section below for the guidelines when using threading.\n\n\nIn each 
invocation, the method can emit any number of data tuples.\n\n\n\n\n\n\n\n\n
 Idempotence\n\n\nMany applications write data to external systems using output 
operators. To ensure that data is present exactly once in the external system 
even in a failure recovery scenario, the output operators expect the replayed 
windows during recovery to contain the same data as before the failure. This is 
called idempotency. Since operators within the DAG are merely responding to 
input data provided to them by the upstream operators and the input operator 
has no upstream operator, the responsibility of idempotent replay falls on the 
input operators.\n\n\n\n\nFor idempotent replay of data, the operator needs to 
store some meta-information for every window that would allow it to identify 
what data was sent in that window. This is called the idempotent state.\n\n\nIf 
the external source of the input operator allows replayability, this could be 
information such as offset of last piece of data in the window, an identifier 
of the last piece of data itself or number of data tuples sen
 t.\n\n\nHowever if the external source does not allow replayability from an 
operator specified point, then the entire data sent within the window may need 
to be persisted by the operator.\n\n\n\n\n\n\nThe platform provides a utility 
called \nWindowDataManager\n to allow operators to save and retrieve idempotent 
state every window. Operators should use this to implement 
idempotency.\n\n\n\n\nOutput Operators\n\n\nOutput operators typically connect 
to external storage systems such as filesystems, databases or key value stores 
to store data.\n\n\n\n\nIn some situations, the external systems may not be 
functioning in a reliable fashion. They may be having prolonged outages or 
performance problems. If the operator is being designed to work in such 
environments, it needs to be able to handle these problems gracefully and 
not block the DAG or fail. In these scenarios the operator should cache the 
data into a local store such as HDFS and interact with external systems in a 
separate threa
 d so as to not have problems in the operator lifecycle thread. This pattern is 
called the \nReconciler\n pattern and there are operators that implement this 
pattern available in the library for reference.\n\n\n\n\nEnd-to-End Exactly 
Once\n\n\nWhen output operators store data in external systems, it is important 
that they do not lose data or write duplicate data when there is a failure 
event and the DAG recovers from that failure. In failure recovery, the windows 
from the previous checkpoint are replayed and the operator receives this data 
again. The operator should ensure that it does not write this data again. 
Operator developers should figure out how to do this specifically for the 
operators they are developing depending on the logic of the operators. Below 
are examples of how a couple of existing output operators do this for 
reference.\n\n\n\n\nFile output operator that writes data to files keeps track 
of the file lengths in the state. These lengths are checkpointed and restored 
 on failure recovery. On restart, the operator truncates the file to the length 
equal to the length in the recovered state. This makes the data in the file 
same as it was at the time of checkpoint before the failure. The operator now 
writes the replayed data from the checkpoint in regular fashion as any other 
data. This ensures no data is lost or duplicated in the file.\n\n\nThe JDBC 
output operator that writes data to a database table writes the data in a 
window in a single transaction. It also writes the current window id into a 
meta table along with the data as part of the same transaction. It commits the 
transaction at the end of the window. When there is an operator failure before 
the final commit, the state of the database is that it contains the data from 
the previous fully processed window and its window id since the current window 
transaction isn\u2019t yet committed. On recovery, the operator reads this 
window id back from the meta table. It ignores all the replayed windows
  whose window id is less than or equal to the recovered window id and thus 
ensures that it does not duplicate data already present in the database. It 
starts writing data normally again when window id of data becomes greater than 
recovered window thus ensuring no data is 
lost.\n\n\n\n\nPartitioning\n\n\nPartitioning allows an operation to be scaled 
to handle more pieces of data than before but with a similar SLA. This is done 
by creating multiple instances of an operator and distributing the data among 
them. Input operators can also be partitioned to stream more pieces of data 
into the application. The platform provides a lot of flexibility and options 
for partitioning. Partitioning can happen once at startup or can be dynamically 
changed anytime while the application is running, and it can be done in a 
stateless or stateful way by distributing state from the old partitions to new 
partitions.\n\n\nIn the platform, the responsibility for partitioning is shared 
among different entitie
 s. These are:\n\n\n\n\nA \npartitioner\n that specifies \nhow\n to partition 
the operator, specifically it takes an old set of partitions and creates a new 
set of partitions. At the start of the application the old set has one 
partition and the partitioner can return more than one partition to start the 
application with multiple partitions. The partitioner can have any custom JAVA 
logic to determine the number of new partitions, set their initial state as a 
brand new state or derive it from the state of the old partitions. It also 
specifies how the data gets distributed among the new partitions. The new set 
doesn't have to contain only new partitions, it can carry over some old 
partitions if desired.\n\n\nAn optional \nstatistics (stats) listener\n that 
specifies \nwhen\n to partition. The reason it is optional is that it is needed 
only when dynamic partitioning is needed. With the stats listener, the stats 
can be used to determine when to partition.\n\n\nIn some cases the \noperat
 or\n itself should be aware of partitioning and would need to provide 
supporting code.\n\n\nIn case of input operators each partition should have a 
property or a set of properties that allow it to distinguish itself from the 
other partitions and fetch unique data.\n\n\n\n\n\n\nWhen an operator that was 
originally a single instance is split into multiple partitions with each 
partition working on a subset of data, the results of the partitions may need 
to be combined together to compute the final result. The combining logic would 
depend on the logic of the operator. This would be specified by the developer 
using a \nUnifier\n, which is deployed as another operator by the platform. If 
no \nUnifier\n is specified, the platform inserts a \ndefault unifier\n that 
merges the results of the multiple partition streams into a single stream. Each 
output port can have a different \nUnifier\n and this is specified by returning 
the corresponding \nUnifier\n in the \ngetUnifier\n method of the out
 put port. The operator developer should provide a custom \nUnifier\n wherever 
applicable.\n\n\nThe Apex \nengine\n that brings everything together and 
effects the partitioning.\n\n\n\n\nSince partitioning is critical for 
scalability of applications, operators must support it. There should be a 
strong reason for an operator to not support partitioning, such as the logic 
performed by the operator not lending itself to parallelism. In order to 
support partitioning, an operator developer, apart from developing the 
functionality of the operator, may also need to provide a partitioner, stats 
listener and supporting code in the operator as described in the steps above. 
The next sections delve into this. \n\n\nOut of the box partitioning\n\n\nThe 
platform comes with some built-in partitioning utilities that can be used in 
certain scenarios.\n\n\n\n\n\n\nStatelessPartitioner\n provides a default 
partitioner, that can be used for an operator in certain conditions. If the 
operator satisfies t
 hese conditions, the partitioner can be specified for the operator with a 
simple setting and no other partitioning code is needed. The conditions 
are:\n\n\n\n\nNo dynamic partitioning is needed, see next point about dynamic 
partitioning. \n\n\nThere is no distinct initial state for the partitions, 
i.e., all partitions start with the same initial state submitted during 
application launch.\n\n\n\n\nTypically input or output operators do not fall 
into this category, although there are some exceptions. This partitioner is 
mainly used with operators that are in the middle of the DAG, after the input 
and before the output operators. When used with non-input operators, only the 
data for the first declared input port is distributed among the different 
partitions. All other input ports are treated as broadcast and all partitions 
receive all the data for that 
port.\n\n\n\n\n\n\nStatelessThroughputBasedPartitioner\n in Malhar provides a 
dynamic partitioner based on throughput thresholds. Simil
 arly \nStatelessLatencyBasedPartitioner\n provides a latency based dynamic 
partitioner in RTS. If these partitioners can be used, then separate 
partitioning related code is not needed. The conditions under which these can 
be used are:\n\n\n\n\nThere is no distinct initial state for the 
partitions.\n\n\nThere is no state being carried over by the operator from one 
window to the next i.e., operator is stateless.\n\n\n\n\n\n\n\n\nCustom 
partitioning\n\n\nIn many cases, operators don\u2019t satisfy the above 
conditions and a built-in partitioner cannot be used. Custom partitioning code 
needs to be written by the operator developer. Below are guidelines for 
it.\n\n\n\n\nSince the operator developer is providing a \npartitioner\n for 
the operator, the partitioning code should be added to the operator itself by 
making the operator implement the Partitioner interface and implementing the 
required methods, rather than creating a separate partitioner. The advantage is 
the user of the operator
  does not have to explicitly figure out the partitioner and set it for the 
operator but still has the option to override this built-in partitioner with a 
different one.\n\n\nThe \npartitioner\n is responsible for setting the initial 
state of the new partitions, whether it is at the start of the application or 
when partitioning is happening while the application is running as in the 
dynamic partitioning case. In the dynamic partitioning scenario, the 
partitioner needs to take the state from the old partitions and distribute it 
among the new partitions. It is important to note that apart from the 
checkpointed state the partitioner also needs to distribute idempotent 
state.\n\n\nThe \npartitioner\n interface has two methods, \ndefinePartitions\n 
and \npartitioned\n. The method \ndefinePartitions\n is first called to 
determine the new partitions, and if enough resources are available on the 
cluster, the \npartitioned\n method is called passing in the new partitions. 
This happens both dur
 ing initial partitioning and dynamic partitioning. If resources are not 
available, partitioning is abandoned and existing partitions continue to run 
untouched. This means that any processing intensive operations should be 
deferred to the \npartitioned\n call instead of doing them in 
\ndefinePartitions\n, as they may not be needed if there are not enough 
resources available in the cluster.\n\n\nThe \npartitioner\n, along with 
creating the new partitions, should also specify how the data gets distributed 
across the new partitions. It should do this by specifying a mapping called 
\nPartitionKeys\n for each partition that maps the data to that partition. This 
mapping needs to be specified for every input port in the operator. If the 
\npartitioner\n wants to use the standard mapping it can use a utility method 
called \nDefaultPartition.assignPartitionKeys\n.\n\n\nWhen the partitioner is 
scaling the operator up to more partitions, try to reuse the existing 
partitions and create new partit
 ions to augment the current set. The reuse can be achieved by the partitioner 
returning the current partitions unchanged. This will result in the current 
partitions continuing to run untouched.\n\n\nIn case of dynamic partitioning, 
as mentioned earlier, a stats listener is also needed to determine when to 
re-partition. Like the \nPartitioner\n interface, the operator can also 
implement the \nStatsListener\n interface to provide a stats listener 
implementation that will be automatically used.\n\n\nThe \nStatsListener\n has 
access to all operator statistics to make its decision on partitioning. Apart 
from the statistics that the platform computes for the operators such as 
throughput, latency etc, operator developers can include their own business 
metrics by using the AutoMetric feature.\n\n\nIf the operator is not 
partitionable, mark it so with \nOperatorAnnotation\n and \npartitionable\n 
element set to false.\n\n\n\n\nStreamCodecs\n\n\nA \nStreamCodec\n is used in 
partitioning to dis
 tribute the data tuples among the partitions. The \nStreamCodec\n computes an 
integer hashcode for a data tuple and this is used along with \nPartitionKeys\n 
mapping to determine which partition or partitions receive the data tuple. If a 
\nStreamCodec\n is not specified, then a default one is used by the platform 
which returns the JAVA hashcode of the tuple. \n\n\nStreamCodec\n is also 
useful in another aspect of the application. It is used to serialize and 
deserialize the tuple to transfer it between operators. The default 
\nStreamCodec\n uses Kryo library for serialization. \n\n\nThe following 
guidelines are useful when considering a custom \nStreamCodec\n\n\n\n\nA custom 
\nStreamCodec\n is needed if the tuples need to be distributed based on a 
criteria different from the hashcode of the tuple. If the correct working of an 
operator depends on the data from the upstream operator being distributed using 
a custom criteria such as being sticky on a \u201ckey\u201d field within the tup
 le, then a custom \nStreamCodec\n should be provided by the operator 
developer. This codec can implement the custom criteria. The operator should 
also return this custom codec in the \ngetStreamCodec\n method of the input 
port.\n\n\nWhen implementing a custom \nStreamCodec\n for the purpose of using 
a different criteria to distribute the tuples, the codec can extend an existing 
\nStreamCodec\n and implement the hashcode method, so that the codec does not 
have to worry about the serialization and deserialization functionality. The 
Apex platform provides two pre-built \nStreamCodec\n implementations for this 
purpose, one is \nKryoSerializableStreamCodec\n that uses Kryo for 
serialization and another one \nJavaSerializationStreamCodec\n that uses JAVA 
serialization.\n\n\nDifferent \nStreamCodec\n implementations can be used for 
different inputs in a stream with multiple inputs when different criteria of 
distributing the tuples is desired between the multiple inputs. 
\n\n\n\n\nThreads\n
 \n\nThe operator lifecycle methods such as \nsetup\n, \nbeginWindow\n, 
\nendWindow\n, \nprocess\n in \nInputPorts\n are all called from a single 
operator lifecycle thread, by the platform, unbeknownst to the user. So the 
user does not have to worry about dealing with the issues arising from 
multi-threaded code. Use of separate threads in an operator is discouraged 
because in most cases the motivation for this is parallelism, but parallelism 
can already be achieved by using multiple partitions and furthermore mistakes 
can be made easily when writing multi-threaded code. When dealing with high 
volume and velocity data, the corner cases with incorrectly written 
multi-threaded code are encountered more easily and exposed. However, there are 
times when separate threads are needed, for example, when interacting with 
external systems the delay in retrieving or sending data can be large at times, 
blocking the operator and other DAG processing such as committed windows. In 
these cases the fo
 llowing guidelines must be followed strictly.\n\n\n\n\nThreads should be 
started in \nactivate\n and stopped in \ndeactivate\n. In \ndeactivate\n the 
operator should wait till any threads it launched, have finished execution. It 
can do so by calling \njoin\n on the threads or if using \nExecutorService\n, 
calling \nawaitTermination\n on the service.\n\n\nThreads should not call any 
methods on the ports directly as this can cause concurrency exceptions and also 
result in invalid states.\n\n\nThreads can share state with the lifecycle 
methods using data structures that are either explicitly protected by 
synchronization or are inherently thread safe such as thread safe 
queues.\n\n\nIf this shared state needs to be protected against failure then it 
needs to be persisted during checkpoint. To have a consistent checkpoint, the 
state should not be modified by the thread when it is being serialized and 
saved by the operator lifecycle thread during checkpoint. Since the checkpoint 
process ha
 ppens outside the window boundary the thread should be quiesced between 
\nendWindow\n and \nbeginWindow\n or more efficiently between pre-checkpoint 
and checkpointed callbacks.", 
+            "title": "Best Practices"
+        }, 
+        {
+            "location": 
"/development_best_practices/#development-best-practices", 
+            "text": "This document describes the best practices to follow when 
developing operators and other application components such as partitioners, 
stream codecs, etc. on the Apache Apex platform.", 
+            "title": "Development Best Practices"
+        }, 
+        {
+            "location": "/development_best_practices/#operators", 
+            "text": "These are general guidelines for all operators that are 
covered in the current section. The subsequent sections talk about special 
considerations for input and output operators.   When writing a new operator to 
be used in an application, consider breaking it down into  An abstract operator 
that encompasses the core functionality but leaves application specific schemas 
and logic to the implementation.  An optional concrete operator also in the 
library that extends the abstract operator and provides commonly used schema 
types such as strings, byte[] or POJOs.    Follow these conventions for the 
life cycle methods:  Do one time initialization of entities that apply for the 
entire lifetime of the operator in the  setup  method, e.g., factory 
initializations. Initializations in  setup  are done in the container where the 
operator is deployed. Allocating memory for fields in the constructor is not 
efficient as it would lead to extra garbage in memory for the following
  reason. The operator is instantiated on the client from where the application 
is launched, serialized and started on one of the Hadoop nodes in a container. So 
the constructor is first called on the client and if it were to initialize any 
of the fields, that state would be saved during serialization. In the Hadoop 
container the operator is deserialized and started. This would invoke the 
constructor again, which will initialize the fields but their state will get 
overwritten by the serialized state and the initial values would become garbage 
in memory.  Do one time initialization for live entities in  activate  method, 
e.g., opening connections to a database server or starting a thread for 
asynchronous operations. The  activate  method is called right before 
processing starts so it is a better place for these initializations than at  
setup  which can lead to a delay before processing data from the live entity.   
 Perform periodic tasks based on processing time in application window bou
 ndaries.  Perform initializations needed for each application window in  
beginWindow .  Perform aggregations needed for each application window  in  
endWindow .  Teardown of live entities (inverse of tasks performed during 
activate) should be in the  deactivate  method.  Teardown of lifetime entities 
(those initialized in setup method) should happen in the  teardown  method.  If 
the operator implementation is not finalized mark it with the  @Evolving  
annotation.    If the operator needs to perform operations based on event time 
of the individual tuples and not the processing time, extend and use the  
WindowedOperator . Refer to documentation of that operator for details on how 
to use it.  If an operator needs to do some work when it is not receiving any 
input, it should implement  IdleTimeHandler  interface. This interface contains 
 handleIdleTime  method which will be called whenever the platform isn\u2019t 
doing anything else and the operator can do the work in this method. If fo
 r any reason the operator does not have any work to do when this method is 
called, it should sleep for a small amount of time such as that specified by 
the  SPIN_MILLIS  attribute so that it does not cause a busy wait when called 
repeatedly by the platform. Also, the method should not block and return in a 
reasonable amount of time that is less than the streaming window size (which is 
500ms by default).  Often operators have customizable parameters such as 
information about locations of external systems or parameters that modify the 
behavior of the operator. Users should be able to specify these easily without 
having to change source code. This can be done by making them properties of the 
operator because they can then be initialized from external properties files.  
Where possible default values should be provided for the properties in the 
source code.  Validation rules should be specified for the properties using 
javax constraint validations that check whether the values specified 
 for the properties are in the correct format, range or other operator 
requirements. Required properties should have at least a  @NotNull  validation 
specifying that they have to be specified by the user.", 
+            "title": "Operators"
+        }, 
+        {
+            "location": "/development_best_practices/#checkpointing", 
+            "text": "Checkpointing is a process of snapshotting the state of 
an operator and saving it so that in case of failure the state can be used to 
restore the operator to a prior state and continue processing. It is 
automatically performed by the platform at a configurable interval. All 
operators in the application are checkpointed in a distributed fashion, thus 
allowing the entire state of the application to be saved and available for 
recovery if needed. Here are some things to remember when it comes to 
checkpointing:   The process of checkpointing involves snapshotting the state 
by serializing the operator and saving it to a store. This is done using a  
StorageAgent . By default a  StorageAgent  is already provided by the platform 
and it is called  AsyncFSStorageAgent . It serializes the operator using Kryo 
and saves the serialized state asynchronously to a filesystem such as HDFS. 
There are other implementations of  StorageAgent  available such as  
GeodeKeyValueStorageAge
 nt  that stores the serialized state in Geode which is an in-memory replicated 
data grid.  All variables in the operator marked neither transient nor final 
are saved so any variables in the operator that are not part of the state 
should be marked transient. Specifically any variables like connection objects, 
i/o streams, ports are transient, because they need to be set up again on 
failure recovery.  If the operator does not keep any state between windows, 
mark it with the  @Stateless  annotation. This results in efficiencies during 
checkpointing and recovery. The operator will not be checkpointed and is always 
restored to the initial state.  The checkpoint interval can be set using the  
CHECKPOINT_WINDOW_COUNT  attribute which specifies the interval in terms of 
number of streaming windows.  If the correct functioning of the operator 
requires the  endWindow  method be called before checkpointing can happen, then 
the checkpoint interval should align with application window interval i.e.
 , it should be a multiple of application window interval. In this case the 
operator should be marked with  OperatorAnnotation  and  
checkpointableWithinAppWindow  set to false. If the window intervals are 
configured by the user and they don\u2019t align, it will result in a DAG 
validation error and application won\u2019t launch.  In some cases the operator 
state related to a piece of data needs to be purged once that data is no longer 
required by the application, otherwise the state will continue to build up 
indefinitely. The platform provides a way to let the operator know about this 
using a callback listener called  CheckpointNotificationListener . This 
listener has a callback method called  committed , which is called by the 
platform from time to time with a window id that has been processed 
successfully by all the operators in the DAG and hence is no longer needed. The 
operator can delete all the state corresponding to window ids less than or 
equal to the provided window id.  So
 metimes operators need to perform some tasks just before checkpointing. For 
example, filesystem operators may want to flush the files just before 
checkpoint so they can be sure that all pending data is written to disk and no 
data is lost if there is an operator failure just after the checkpoint and the 
operator restarts from the checkpoint. To do this the operator would implement 
the same  CheckpointNotificationListener  interface and implement the  
beforeCheckpoint  method where it can do these tasks.  If the operator is going 
to have a large state, checkpointing the entire state each time becomes 
unviable. Furthermore, the amount of memory needed to hold the state could be 
larger than the amount of physical memory available. In these cases the 
operator should checkpoint the state incrementally and also manage the memory 
for the state more efficiently. The platform provides a utility called  ManagedState  that uses a combination of in-memory and disk cache to efficiently store and retrieve data in a performant, fault-tolerant way and also checkpoint it in an incremental fashion. There are operators in the platform that use  ManagedState , such as the Dedup and Join operators, which can be used as a reference on how to use this utility.", 
+            "title": "Checkpointing"
+        }, 
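To make the above concrete, here is a minimal Java sketch (not part of the original page) of an operator that keeps purgeable per-window state. It assumes the  CheckpointNotificationListener  interface nested in  com.datatorrent.api.Operator  and the  BaseOperator  base class from  com.datatorrent.common.util ; the per-window count map is a hypothetical example of state that can be purged in the  committed  callback.

import java.util.TreeMap;

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.Operator;
import com.datatorrent.common.util.BaseOperator;

public class RunningCountOperator extends BaseOperator implements Operator.CheckpointNotificationListener
{
  // checkpointed state: per-window counts that can be purged once committed
  private final TreeMap<Long, Long> countsPerWindow = new TreeMap<>();

  // not part of the state, so marked transient and re-created in setup()
  private transient StringBuilder scratch;
  private transient long currentWindowId;

  public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
  {
    @Override
    public void process(String tuple)
    {
      Long count = countsPerWindow.get(currentWindowId);
      countsPerWindow.put(currentWindowId, count == null ? 1L : count + 1);
    }
  };

  @Override
  public void setup(OperatorContext context)
  {
    scratch = new StringBuilder();
  }

  @Override
  public void beginWindow(long windowId)
  {
    currentWindowId = windowId;
  }

  @Override
  public void beforeCheckpoint(long windowId)
  {
    // flush any buffers just before the state is serialized by the StorageAgent
  }

  @Override
  public void checkpointed(long windowId)
  {
    // state up to windowId has been saved
  }

  @Override
  public void committed(long windowId)
  {
    // windows <= windowId are fully processed by the DAG; purge their state
    countsPerWindow.headMap(windowId, true).clear();
  }
}

Note how every field that is not part of the operator state is marked transient and rebuilt in  setup .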
+        {
+            "location": "/development_best_practices/#input-operators", 
+            "text": "Input operators have additional requirements:   The  
emitTuples  method, implemented by the operator, is called by the platform to 
give the operator an opportunity to emit some data. This method is always 
called within a window boundary but can be called multiple times within the 
same window. There are some important guidelines on how to implement this 
method:  This should not be a blocking method and should return in a reasonable 
time that is less than the streaming window size (which is 500ms by default). 
This also applies to other callback methods called by the platform such as  
beginWindow ,  endWindow  etc., but is more important here since this method 
will be called continuously by the platform.  If the operator needs to interact 
with external systems to obtain data and this can potentially take a long time, 
then this should be performed asynchronously in a different thread. Refer to 
the threading section below for the guidelines when using threading.  In each invocation, the method can emit any number of data tuples.", 
+            "title": "Input Operators"
+        }, 
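As an illustration of the  emitTuples  contract, here is a minimal sketch (class and constant names are illustrative) of an input operator that emits a bounded batch per invocation and returns quickly instead of blocking the lifecycle thread.

import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.common.util.BaseOperator;

public class SequenceInputOperator extends BaseOperator implements InputOperator
{
  public final transient DefaultOutputPort<Long> output = new DefaultOutputPort<>();

  private long next;                         // checkpointed, so replay continues from the right value
  private static final int MAX_PER_CALL = 1000;

  @Override
  public void emitTuples()
  {
    // Called repeatedly within a window; emit a bounded batch and return quickly
    // so the operator lifecycle thread stays responsive.
    for (int i = 0; i < MAX_PER_CALL; i++) {
      output.emit(next++);
    }
  }
}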
+        {
+            "location": "/development_best_practices/#idempotence", 
+            "text": "Many applications write data to external systems using 
output operators. To ensure that data is present exactly once in the external 
system even in a failure recovery scenario, the output operators expect that the replayed windows during recovery contain the same data as before the failure. 
This is called idempotency. Since operators within the DAG are merely 
responding to input data provided to them by the upstream operators and the 
input operator has no upstream operator, the responsibility of idempotent 
replay falls on the input operators.   For idempotent replay of data, the 
operator needs to store some meta-information for every window that would allow 
it to identify what data was sent in that window. This is called the idempotent 
state.  If the external source of the input operator allows replayability, this 
could be information such as the offset of the last piece of data in the window, an identifier of the last piece of data itself, or the number of data tuples sent.  However, if the external source does not allow replayability from an operator 
specified point, then the entire data sent within the window may need to be 
persisted by the operator.    The platform provides a utility called  
WindowDataManager  to allow operators to save and retrieve idempotent state 
every window. Operators should use this to implement idempotency.", 
+            "title": "Idempotence"
+        }, 
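The following sketch shows the idempotent-replay idea for a replayable source. For clarity the per-window metadata is kept in a simple checkpointed map here, whereas a real operator would normally delegate this bookkeeping to the  WindowDataManager  utility mentioned above;  readFromSource  is a hypothetical placeholder for the external source.

import java.util.TreeMap;

import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.common.util.BaseOperator;

public class IdempotentOffsetInputOperator extends BaseOperator implements InputOperator
{
  public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

  // idempotent state: last offset emitted in each window (checkpointed);
  // older entries can be purged in a committed() callback
  private final TreeMap<Long, Long> lastOffsetPerWindow = new TreeMap<>();
  private long currentOffset;

  private transient long currentWindowId;
  private transient boolean replaying;

  @Override
  public void beginWindow(long windowId)
  {
    currentWindowId = windowId;
    // if metadata exists for this window we are replaying it after recovery
    replaying = lastOffsetPerWindow.containsKey(windowId);
  }

  @Override
  public void emitTuples()
  {
    long limit = replaying ? lastOffsetPerWindow.get(currentWindowId) : currentOffset + 100;
    while (currentOffset < limit) {
      output.emit(readFromSource(currentOffset++));   // hypothetical replayable source
    }
  }

  @Override
  public void endWindow()
  {
    // remember exactly how far this window went so a replay emits the same data
    lastOffsetPerWindow.put(currentWindowId, currentOffset);
  }

  private String readFromSource(long offset)
  {
    return "record-" + offset;                        // placeholder for the external source
  }
}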
+        {
+            "location": "/development_best_practices/#output-operators", 
+            "text": "Output operators typically connect to external storage 
systems such as filesystems, databases or key value stores to store data.   In 
some situations, the external systems may not be functioning in a reliable 
fashion. They may be having prolonged outages or performance problems. If the 
operator is being designed to work in such environments, it needs to be able to handle these problems gracefully and not block the DAG or fail. In these 
scenarios the operator should cache the data into a local store such as HDFS 
and interact with external systems in a separate thread so as to not have 
problems in the operator lifecycle thread. This pattern is called the  
Reconciler  pattern and there are operators that implement this pattern 
available in the library for reference.", 
+            "title": "Output Operators"
+        }, 
+        {
+            "location": 
"/development_best_practices/#end-to-end-exactly-once", 
+            "text": "When output operators store data in external systems, it 
is important that they do not lose data or write duplicate data when there is a 
failure event and the DAG recovers from that failure. In failure recovery, the 
windows from the previous checkpoint are replayed and the operator receives 
this data again. The operator should ensure that it does not write this data 
again. Operator developers should figure out how to do this specifically for 
the operators they are developing depending on the logic of the operators. 
Below are examples of how a couple of existing output operators do this for 
reference.   The file output operator that writes data to files keeps track of the 
file lengths in the state. These lengths are checkpointed and restored on 
failure recovery. On restart, the operator truncates the file to the length 
equal to the length in the recovered state. This makes the data in the file the same as it was at the time of the checkpoint before the failure. The operator now writes the replayed data from the checkpoint in regular fashion as any 
other data. This ensures no data is lost or duplicated in the file.  The JDBC 
output operator that writes data to a database table writes the data in a 
window in a single transaction. It also writes the current window id into a 
meta table along with the data as part of the same transaction. It commits the 
transaction at the end of the window. When there is an operator failure before 
the final commit, the state of the database is that it contains the data from 
the previous fully processed window and its window id since the current window 
transaction isn\u2019t yet committed. On recovery, the operator reads this 
window id back from the meta table. It ignores all the replayed windows whose 
window id is less than or equal to the recovered window id and thus ensures 
that it does not duplicate data already present in the database. It starts writing data normally again when the window id of the data becomes greater than the recovered window id, thus ensuring no data is lost.", 
+            "title": "End-to-End Exactly Once"
+        }, 
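A standalone sketch of the window-id-in-a-meta-table technique described for the JDBC output operator, written against plain JDBC rather than the actual Malhar operator; the table names and SQL are illustrative and the meta table is assumed to be pre-created with a single row.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ExactlyOnceJdbcSketch
{
  private Connection con;
  private long committedWindowId;      // largest window id already committed to the database
  private long currentWindowId;

  public void setup(String url) throws SQLException
  {
    con = DriverManager.getConnection(url);
    con.setAutoCommit(false);
    try (Statement st = con.createStatement();
         ResultSet rs = st.executeQuery("SELECT window_id FROM meta")) {
      committedWindowId = rs.next() ? rs.getLong(1) : -1L;
    }
  }

  public void beginWindow(long windowId)
  {
    currentWindowId = windowId;
  }

  public void processTuple(String value) throws SQLException
  {
    if (currentWindowId <= committedWindowId) {
      return;   // replayed data already present in the table; skip it
    }
    try (PreparedStatement ps = con.prepareStatement("INSERT INTO data (value) VALUES (?)")) {
      ps.setString(1, value);
      ps.executeUpdate();
    }
  }

  public void endWindow() throws SQLException
  {
    if (currentWindowId <= committedWindowId) {
      return;
    }
    // the data and the window id are committed in the same transaction
    try (PreparedStatement ps = con.prepareStatement("UPDATE meta SET window_id = ?")) {
      ps.setLong(1, currentWindowId);
      ps.executeUpdate();
    }
    con.commit();
    committedWindowId = currentWindowId;
  }
}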
+        {
+            "location": "/development_best_practices/#partitioning", 
+            "text": "Partitioning allows an operation to be scaled to handle 
more pieces of data than before but with a similar SLA. This is done by 
creating multiple instances of an operator and distributing the data among 
them. Input operators can also be partitioned to stream more pieces of data 
into the application. The platform provides a lot of flexibility and options 
for partitioning. Partitioning can happen once at startup or can be dynamically 
changed anytime while the application is running, and it can be done in a 
stateless or stateful way by distributing state from the old partitions to new 
partitions.  In the platform, the responsibility for partitioning is shared 
among different entities. These are:   A  partitioner  that specifies  how  to 
partition the operator, specifically it takes an old set of partitions and 
creates a new set of partitions. At the start of the application the old set has one partition and the partitioner can return more than one partition to start the application with multiple partitions. The partitioner can have any custom Java logic to determine the number of new partitions, set their initial 
state as a brand new state or derive it from the state of the old partitions. 
It also specifies how the data gets distributed among the new partitions. The 
new set doesn't have to contain only new partitions, it can carry over some old 
partitions if desired.  An optional  statistics (stats) listener  that 
specifies  when  to partition. The reason it is optional is that it is required only when dynamic partitioning is needed. With the stats listener, the stats 
can be used to determine when to partition.  In some cases the  operator  
itself should be aware of partitioning and would need to provide supporting 
code.  In case of input operators each partition should have a property or a 
set of properties that allow it to distinguish itself from the other partitions 
and fetch unique data.    When an operator that was originally a single instance is split into multiple partitions with each partition working on a 
subset of data, the results of the partitions may need to be combined together 
to compute the final result. The combining logic would depend on the logic of 
the operator. This would be specified by the developer using a  Unifier , which 
is deployed as another operator by the platform. If no  Unifier  is specified, 
the platform inserts a  default unifier  that merges the results of the 
multiple partition streams into a single stream. Each output port can have a 
different  Unifier  and this is specified by returning the corresponding  
Unifier  in the  getUnifier  method of the output port. The operator developer 
should provide a custom  Unifier  wherever applicable.  The Apex  engine  that 
brings everything together and effects the partitioning.   Since partitioning 
is critical for scalability of applications, operators must support it. There 
should be a strong reason for an operator to not support partitioning, such as the logic performed by the operator not lending itself to 
parallelism. In order to support partitioning, an operator developer, apart 
from developing the functionality of the operator, may also need to provide a 
partitioner, stats listener and supporting code in the operator as described in 
the steps above. The next sections delve into this.", 
+            "title": "Partitioning"
+        }, 
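As an example of the  Unifier  mechanism, here is a sketch of an operator whose partitions emit partial sums and whose output port supplies a unifier that adds them up; it assumes the  Operator.Unifier  interface and the overridable  getUnifier  method of  DefaultOutputPort , and the class names are illustrative.

import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.Operator.Unifier;
import com.datatorrent.common.util.BaseOperator;

public class PartialSumOperator extends BaseOperator
{
  // output port that tells the platform how to merge the results of its partitions
  public final transient DefaultOutputPort<Long> sum = new DefaultOutputPort<Long>()
  {
    @Override
    public Unifier<Long> getUnifier()
    {
      return new SumUnifier();
    }
  };

  public static class SumUnifier extends BaseOperator implements Unifier<Long>
  {
    public final transient DefaultOutputPort<Long> out = new DefaultOutputPort<>();
    private long total;

    @Override
    public void beginWindow(long windowId)
    {
      total = 0;
    }

    @Override
    public void process(Long partialSum)
    {
      total += partialSum;   // combine the partial results from all partitions
    }

    @Override
    public void endWindow()
    {
      out.emit(total);
    }
  }
}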
+        {
+            "location": 
"/development_best_practices/#out-of-the-box-partitioning", 
+            "text": "The platform comes with some built-in partitioning 
utilities that can be used in certain scenarios.    StatelessPartitioner  
provides a default partitioner that can be used for an operator under certain 
conditions. If the operator satisfies these conditions, the partitioner can be 
specified for the operator with a simple setting and no other partitioning code 
is needed. The conditions are:   No dynamic partitioning is needed, see next 
point about dynamic partitioning.   There is no distinct initial state for the 
partitions, i.e., all partitions start with the same initial state submitted 
during application launch.   Typically input or output operators do not fall 
into this category, although there are some exceptions. This partitioner is 
mainly used with operators that are in the middle of the DAG, after the input 
and before the output operators. When used with non-input operators, only the 
data for the first declared input port is distributed among the different 
 partitions. All other input ports are treated as broadcast and all partitions 
receive all the data for that port.    StatelessThroughputBasedPartitioner  in 
Malhar provides a dynamic partitioner based on throughput thresholds. Similarly 
 StatelessLatencyBasedPartitioner  provides a latency based dynamic partitioner 
in RTS. If these partitioners can be used, then separate partitioning related 
code is not needed. The conditions under which these can be used are:   There 
is no distinct initial state for the partitions.  There is no state being 
carried over by the operator from one window to the next, i.e., the operator is stateless.", 
+            "title": "Out of the box partitioning"
+        }, 
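A sketch of the simple setting mentioned above, assuming the  PARTITIONER  operator attribute, the  DAG.setAttribute  overload for operator attributes, and the  StatelessPartitioner  constructor that takes a partition count;  MyComputeOperator  is a hypothetical operator that meets the listed conditions.

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.common.partitioner.StatelessPartitioner;
import com.datatorrent.common.util.BaseOperator;

public class StaticPartitionApp implements StreamingApplication
{
  /** Stands in for any operator that satisfies the StatelessPartitioner conditions above. */
  public static class MyComputeOperator extends BaseOperator
  {
  }

  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    MyComputeOperator compute = dag.addOperator("compute", new MyComputeOperator());
    // run four identical partitions of "compute"; no other partitioning code is required
    dag.setAttribute(compute, OperatorContext.PARTITIONER,
        new StatelessPartitioner<MyComputeOperator>(4));
  }
}

The same attribute can also be supplied from a configuration file at launch time instead of being set in code.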
+        {
+            "location": "/development_best_practices/#custom-partitioning", 
+            "text": "In many cases, operators don\u2019t satisfy the above 
conditions and a built-in partitioner cannot be used. Custom partitioning code 
needs to be written by the operator developer. Below are guidelines for it.   
Since the operator developer is providing a  partitioner  for the operator, the 
partitioning code should be added to the operator itself by making the operator 
implement the Partitioner interface and implementing the required methods, 
rather than creating a separate partitioner. The advantage is the user of the 
operator does not have to explicitly figure out the partitioner and set it for 
the operator but still has the option to override this built-in partitioner 
with a different one.  The  partitioner  is responsible for setting the initial 
state of the new partitions, whether it is at the start of the application or 
when partitioning is happening while the application is running as in the 
dynamic partitioning case. In the dynamic partitioning scenario, 
 the partitioner needs to take the state from the old partitions and distribute 
it among the new partitions. It is important to note that apart from the 
checkpointed state the partitioner also needs to distribute idempotent state.  
The  Partitioner  interface has two methods,  definePartitions  and  partitioned . The method  definePartitions  is first called to determine the new 
partitions, and if enough resources are available on the cluster, the  
partitioned  method is called passing in the new partitions. This happens both 
during initial partitioning and dynamic partitioning. If resources are not 
available, partitioning is abandoned and existing partitions continue to run 
untouched. This means that any processing intensive operations should be 
deferred to the  partitioned  call instead of doing them in  definePartitions , 
as they may not be needed if there are not enough resources available in the 
cluster.  The  partitioner , along with creating the new partitions, should also specify how the data gets distributed across the new partitions. It should do this 
by specifying a mapping called  PartitionKeys  for each partition that maps the 
data to that partition. This mapping needs to be specified for every input port 
in the operator. If the  partitioner  wants to use the standard mapping it can 
use a utility method called  DefaultPartition.assignPartitionKeys .  When the 
partitioner is scaling the operator up to more partitions, it should try to reuse the 
existing partitions and create new partitions to augment the current set. The 
reuse can be achieved by the partitioner returning the current partitions 
unchanged. This will result in the current partitions continuing to run 
untouched.  In case of dynamic partitioning, as mentioned earlier, a stats 
listener is also needed to determine when to re-partition. Like the  
Partitioner  interface, the operator can also implement the  StatsListener  
interface to provide a stats listener implementation that will be automatically used.  The  StatsListener  has access to all operator statistics to make its 
decision on partitioning. Apart from the statistics that the platform computes 
for the operators, such as throughput, latency, etc., operator developers can 
include their own business metrics by using the AutoMetric feature.  If the 
operator is not partitionable, mark it so with  OperatorAnnotation  and  
partitionable  element set to false.", 
+            "title": "Custom partitioning"
+        }, 
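Here is a sketch of an operator that carries its own partitioning support, as recommended above. It assumes the  Partitioner  and  StatsListener  interfaces from  com.datatorrent.api  and the  DefaultPartition.assignPartitionKeys  helper, whose exact signature should be checked against the Apex version in use; the partition count of four and the pass-through logic are illustrative.

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.DefaultPartition;
import com.datatorrent.api.Partitioner;
import com.datatorrent.api.StatsListener;
import com.datatorrent.common.util.BaseOperator;

public class SelfPartitioningOperator extends BaseOperator
    implements Partitioner<SelfPartitioningOperator>, StatsListener
{
  public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
  {
    @Override
    public void process(String tuple)
    {
      output.emit(tuple.toUpperCase());
    }
  };

  public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

  @Override
  public Collection<Partition<SelfPartitioningOperator>> definePartitions(
      Collection<Partition<SelfPartitioningOperator>> partitions, PartitioningContext context)
  {
    // keep this method cheap; heavy initialization belongs in partitioned(),
    // which is only called once cluster resources are actually available
    List<Partition<SelfPartitioningOperator>> newPartitions = new ArrayList<>();
    for (int i = 0; i < 4; i++) {
      // existing partitions could be carried over here, and their state redistributed
      newPartitions.add(new DefaultPartition<>(new SelfPartitioningOperator()));
    }
    // standard mapping of tuples to partitions for the first input port
    DefaultPartition.assignPartitionKeys(newPartitions, input);
    return newPartitions;
  }

  @Override
  public void partitioned(Map<Integer, Partition<SelfPartitioningOperator>> partitions)
  {
    // resources were granted; heavier per-partition initialization can be done here
  }

  @Override
  public Response processStats(BatchedOperatorStats stats)
  {
    Response response = new Response();
    response.repartitionRequired = false;   // set to true when the stats call for a repartition
    return response;
  }
}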
+        {
+            "location": "/development_best_practices/#streamcodecs", 
+            "text": "A  StreamCodec  is used in partitioning to distribute the 
data tuples among the partitions. The  StreamCodec  computes an integer 
hashcode for a data tuple and this is used along with  PartitionKeys  mapping 
to determine which partition or partitions receive the data tuple. If a  
StreamCodec  is not specified, then a default one is used by the platform which returns the Java hashcode of the tuple.   StreamCodec  is also useful in 
another aspect of the application. It is used to serialize and deserialize the 
tuple to transfer it between operators. The default  StreamCodec  uses Kryo 
library for serialization.   The following guidelines are useful when considering a custom  StreamCodec :   A custom  StreamCodec  is needed if the tuples need to be distributed based on a criterion different from the hashcode 
of the tuple. If the correct working of an operator depends on the data from 
the upstream operator being distributed using a custom criterion, such as being sticky on a \u201ckey\u201d field within the tuple, then a custom  StreamCodec  should 
be provided by the operator developer. This codec can implement the custom 
criteria. The operator should also return this custom codec in the  
getStreamCodec  method of the input port.  When implementing a custom  
StreamCodec  for the purpose of using a different criteria to distribute the 
tuples, the codec can extend an existing  StreamCodec  and implement the 
hashcode method, so that the codec does not have to worry about the 
serialization and deserialization functionality. The Apex platform provides two 
pre-built  StreamCodec  implementations for this purpose, one is  
KryoSerializableStreamCodec  that uses Kryo for serialization and another one  
JavaSerializationStreamCodec  that uses Java serialization.  Different  StreamCodec  implementations can be used for the different inputs of a stream with multiple inputs, when a different criterion for distributing the tuples is desired for each input.", 
+            "title": "StreamCodecs"
+        }, 
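A sketch of a key-sticky codec, assuming Malhar's  KryoSerializableStreamCodec  and the  getPartition  method of the  StreamCodec  contract (the method the text above refers to as the hashcode method);  KeyedEvent  is an illustrative tuple type.

import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.StreamCodec;
import com.datatorrent.common.util.BaseOperator;
import com.datatorrent.lib.codec.KryoSerializableStreamCodec;

public class KeyedOperator extends BaseOperator
{
  /** Illustrative tuple type with a key field used for sticky partitioning. */
  public static class KeyedEvent
  {
    public String key;
    public long value;
  }

  /** Distributes tuples by the key field only, so all tuples with the same key reach the same partition. */
  public static class KeyStreamCodec extends KryoSerializableStreamCodec<KeyedEvent>
  {
    @Override
    public int getPartition(KeyedEvent tuple)
    {
      return tuple.key.hashCode();
    }
  }

  public final transient DefaultInputPort<KeyedEvent> input = new DefaultInputPort<KeyedEvent>()
  {
    @Override
    public void process(KeyedEvent tuple)
    {
      // per-key processing relies on the sticky distribution provided by the codec
    }

    @Override
    public StreamCodec<KeyedEvent> getStreamCodec()
    {
      return new KeyStreamCodec();
    }
  };
}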
+        {
+            "location": "/development_best_practices/#threads", 
+            "text": "The operator lifecycle methods such as  setup ,  
beginWindow ,  endWindow , and  process  in  InputPorts  are all called by the platform from a single operator lifecycle thread. So 
the user does not have to worry about dealing with the issues arising from 
multi-threaded code. Use of separate threads in an operator is discouraged 
because in most cases the motivation for this is parallelism, but parallelism 
can already be achieved by using multiple partitions and furthermore mistakes 
can be made easily when writing multi-threaded code. When dealing with high 
volume and velocity data, the corner cases with incorrectly written 
multi-threaded code are encountered more easily and exposed. However, there are 
times when separate threads are needed, for example, when interacting with 
external systems the delay in retrieving or sending data can be large at times, 
blocking the operator and other DAG processing such as committed windows. In 
these cases
  the following guidelines must be followed strictly.   Threads should be 
started in  activate  and stopped in  deactivate . In  deactivate  the operator should wait till any threads it launched have finished execution. It can do so 
by calling  join  on the threads or if using  ExecutorService , calling  
awaitTermination  on the service.  Threads should not call any methods on the 
ports directly as this can cause concurrency exceptions and also result in 
invalid states.  Threads can share state with the lifecycle methods using data 
structures that are either explicitly protected by synchronization or are 
inherently thread safe such as thread safe queues.  If this shared state needs 
to be protected against failure then it needs to be persisted during 
checkpoint. To have a consistent checkpoint, the state should not be modified 
by the thread when it is being serialized and saved by the operator lifecycle 
thread during checkpoint. Since the checkpoint process happens outside the 
window
  boundary the thread should be quiesced between  endWindow  and  beginWindow  
or more efficiently between pre-checkpoint and checkpointed callbacks.", 
+            "title": "Threads"
+        }, 
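A sketch that follows the threading guidelines above for an input operator fetching from a slow external system: the thread is started in  activate , stopped and joined in  deactivate , data is handed over through a thread-safe queue, and tuples are emitted only from the lifecycle thread in  emitTuples .  fetchFromExternalSystem  is a placeholder, and the queue here is transient, so if its contents had to survive failures they would need to be moved into checkpointed state with the thread quiesced around the checkpoint as described above.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.Operator.ActivationListener;
import com.datatorrent.common.util.BaseOperator;

public class AsyncFetchInputOperator extends BaseOperator
    implements InputOperator, ActivationListener<OperatorContext>
{
  public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

  // thread-safe hand-off between the fetcher thread and the operator lifecycle thread
  private final transient BlockingQueue<String> queue = new LinkedBlockingQueue<>();
  private transient Thread fetcher;
  private transient volatile boolean running;

  @Override
  public void activate(OperatorContext context)
  {
    running = true;
    fetcher = new Thread(new Runnable()
    {
      @Override
      public void run()
      {
        while (running) {
          try {
            String record = fetchFromExternalSystem();   // may be slow; never blocks the DAG
            if (record != null) {
              queue.add(record);
            } else {
              Thread.sleep(100);                         // back off when there is nothing to fetch
            }
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
          }
        }
      }
    });
    fetcher.start();
  }

  @Override
  public void deactivate()
  {
    running = false;
    fetcher.interrupt();
    try {
      fetcher.join();               // wait for the thread to finish before shutdown
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  @Override
  public void emitTuples()
  {
    // emit only from the lifecycle thread; the fetcher thread never touches ports
    String record;
    while ((record = queue.poll()) != null) {
      output.emit(record);
    }
  }

  private String fetchFromExternalSystem()
  {
    return null;                    // placeholder for the real external call
  }
}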
+        {
             "location": "/apex_cli/", 
             "text": "Apache Apex Command Line Interface\n\n\nApex CLI, the 
Apache Apex command line interface, can be used to launch, monitor, and manage 
Apache Apex applications.  It provides a developer-friendly way of interacting with the Apache Apex platform.  Another advantage of Apex CLI is that it provides scope, by connecting and executing commands in the context of a specific application.  
Apex CLI enables easy integration with existing enterprise toolset for 
automated application monitoring and management.  Currently the following high 
level tasks are supported.\n\n\n\n\nLaunch or kill applications\n\n\nView 
system metrics including load, throughput, latency, etc.\n\n\nStart or stop 
tuple recording\n\n\nRead operator, stream, port properties and 
attributes\n\n\nWrite to operator properties\n\n\nDynamically change the 
application logical plan\n\n\nCreate custom macros\n\n\n\n\nApex CLI 
Commands\n\n\nApex CLI can be launched by running the following command\n\n\napex\n\n\n\nHelp on all commands is available via the \u201chelp\u201d command in the CLI\n\n\nGlobal 
Commands\n\n\nGLOBAL COMMANDS EXCEPT WHEN CHANGING LOGICAL PLAN:\n\nalias 
alias-name command\n    Create a command alias\n\nbegin-macro name\n    Begin 
Macro Definition ($1...$9 to access parameters and type 'end' to end the 
definition)\n\nconnect app-id\n    Connect to an app\n\ndump-properties-file 
out-file jar-file class-name\n    Dump the properties file of an app 
class\n\necho [arg ...]\n    Echo the arguments\n\nexit\n    Exit the 
CLI\n\nget-app-info app-id\n    Get the information of an 
app\n\nget-app-package-info app-package-file\n    Get info on the app package 
file\n\nget-app-package-operator-properties app-package-file operator-class\n   
 Get operator properties within the given app 
package\n\nget-app-package-operators [options] app-package-file [search-term]\n 
   Get operators within the given app package\n    Options:\n            
-parent    Specify the parent class for the operators\n\nget-config-parameter [parameter-name]\n    Get the configuration 
parameter\n\nget-jar-operator-classes [options] jar-files-comma-separated 
[search-term]\n    List operators in a jar list\n    Options:\n            
-parent    Specify the parent class for the 
operators\n\nget-jar-operator-properties jar-files-comma-separated 
operator-class-name\n    List properties in specified operator\n\nhelp 
[command]\n    Show help\n\nkill-app app-id [app-id ...]\n    Kill an app\n\n  
launch [options] jar-file/json-file/properties-file/app-package-file 
[matching-app-name]\n    Launch an app\n    Options:\n            -apconf \napp 
package configuration file\n        Specify an application\n                    
                                        configuration file\n                    
                                        within the app\n                        
                                    package if launching\n                      
                                      an app package.\n            -archives \ncomma separated list of archives\n    Specify comma\n                
                                            separated archives\n                
                                            to be unarchived on\n               
                                             the compute machines.\n            
-conf \nconfiguration file\n                      Specify an\n                  
                                          application\n                         
                                   configuration file.\n            -D 
\nproperty=value\n                             Use value for given\n            
                                                property.\n            
-exactMatch                                     Only consider\n                 
                                           applications with\n                  
                                          exact app name\n            -files 
\ncomma separated list of files\n          Specify comma\n 
                                                            separated files 
to\n                                                            be copied on 
the\n                                                            compute 
machines.\n            -ignorepom                                      Do not 
run maven to\n                                                            find 
the dependency\n            -libjars \ncomma separated list of libjars\n      
Specify comma\n                                                            
separated jar files\n                                                           
 or other resource\n                                                            
files to include in\n                                                           
 the classpath.\n            -local                                          
Run application in\n                                                            
local mode.\n            -originalAppId \napplication id\n       
           Specify original\n                                                   
         application\n                                                          
  identifier for restart.\n            -queue \nqueue name\n                    
         Specify the queue to\n                                                 
           launch the application\n\nlist-application-attributes\n    Lists the 
application attributes\nlist-apps [pattern]\n    List 
applications\nlist-operator-attributes\n    Lists the operator 
attributes\nlist-port-attributes\n    Lists the port attributes\nset-pager 
on/off\n    Set the pager program for output\nshow-logical-plan [options] 
jar-file/app-package-file [class-name]\n    List apps in a jar or show logical 
plan of an app class\n    Options:\n            -exactMatch                     
           Only consider exact match\n                                          
             for app name\n            -ignorepom                               
  Do not run 
 maven to find\n                                                       the 
dependency\n            -libjars \ncomma separated list of jars\n    Specify 
comma separated\n                                                       
jar/resource files to\n                                                       
include in the classpath.\nshutdown-app app-id [app-id ...]\n    Shutdown an 
app\nsource file\n    Execute the commands in a file\n\n\n\n\nCommands after 
connecting to an application\n\n\nCOMMANDS WHEN CONNECTED TO AN APP (via 
connect \nappid\n) EXCEPT WHEN CHANGING LOGICAL 
PLAN:\n\nbegin-logical-plan-change\n    Begin Logical Plan 
Change\ndump-properties-file out-file [jar-file] [class-name]\n    Dump the 
properties file of an app class\nget-app-attributes [attribute-name]\n    Get 
attributes of the connected app\nget-app-info [app-id]\n    Get the information 
of an app\nget-operator-attributes operator-name [attribute-name]\n    Get 
attributes of an operator\nget-operator-properties operator-name [property-name]\n    Get properties of a logical 
operator\nget-physical-operator-properties [options] operator-id\n    Get 
properties of a physical operator\n    Options:\n            -propertyName 
\nproperty name\n    The name of the property whose\n                           
                  value needs to be retrieved\n            -waitTime \nwait 
time\n            How long to wait to get the result\nget-port-attributes 
operator-name port-name [attribute-name]\n    Get attributes of a 
port\nget-recording-info [operator-id] [start-time]\n    Get tuple recording 
info\nkill-app [app-id ...]\n    Kill an app\nkill-container container-id 
[container-id ...]\n    Kill a container\nlist-containers\n    List 
containers\nlist-operators [pattern]\n    List operators\nset-operator-property 
operator-name property-name property-value\n    Set a property of an 
operator\nset-physical-operator-property operator-id property-name 
property-value\n    Set a property of an operator\nshow-logical-plan [options] [jar-file/app-package-file] [class-name]\n    Show 
logical plan of an app class\n    Options:\n            -exactMatch             
                   Only consider exact match\n                                  
                     for app name\n            -ignorepom                       
          Do not run maven to find\n                                            
           the dependency\n            -libjars \ncomma separated list of 
jars\n    Specify comma separated\n                                             
          jar/resource files to\n                                               
        include in the classpath.\nshow-physical-plan\n    Show physical 
plan\nshutdown-app [app-id ...]\n    Shutdown an app\nstart-recording 
operator-id [port-name] [num-windows]\n    Start recording\nstop-recording 
operator-id [port-name]\n    Stop recording\nwait timeout\n    Wait for 
completion of current application\n\n\n\n\nCommands when changing the logical plan\n\n\nCOMMANDS WHEN CHANGING LOGICAL PLAN (via 
begin-logical-plan-change):\n\nabort\n    Abort the plan 
change\nadd-stream-sink stream-name to-operator-name to-port-name\n    Add a 
sink to an existing stream\ncreate-operator operator-name class-name\n    
Create an operator\ncreate-stream stream-name from-operator-name from-port-name 
to-operator-name to-port-name\n    Create a stream\nhelp [command]\n    Show 
help\nremove-operator operator-name\n    Remove an operator\nremove-stream 
stream-name\n    Remove a stream\nset-operator-attribute operator-name 
attr-name attr-value\n    Set an attribute of an 
operator\nset-operator-property operator-name property-name property-value\n    
Set a property of an operator\nset-port-attribute operator-name port-name 
attr-name attr-value\n    Set an attribute of a port\nset-stream-attribute 
stream-name attr-name attr-value\n    Set an attribute of a 
stream\nshow-queue\n    Show the queue of the plan change\nsubmit\n    Submit 
the plan change\n\n\n\n\nExamples\n\n\nAn example of defining a custom macro.  The macro updates a running application by inserting a new operator.  It takes three parameters and executes a logical plan change.\n\n\napex\n begin-macro 
add-console-output\nmacro\n begin-logical-plan-change\nmacro\n create-operator 
$1 com.datatorrent.lib.io.ConsoleOutputOperator\nmacro\n create-stream 
stream_$1 $2 $3 $1 in\nmacro\n submit\n\n\n\n\nThen execute the 
\nadd-console-output\n macro like this\n\n\napex\n add-console-output xyz 
opername portname\n\n\n\n\nThis macro then expands to run the following 
command\n\n\nbegin-logical-plan-change\ncreate-operator xyz 
com.datatorrent.lib.io.ConsoleOutputOperator\ncreate-stream stream_xyz opername 
portname xyz in\nsubmit\n\n\n\n\nNote\n:  To perform runtime logical plan 
changes, like the ability to add new operators,\nthey must be part of the jar files 
that were deployed at application launch time.", 
             "title": "Apex CLI"
@@ -857,7 +922,7 @@
         }, 
         {
             "location": "/security/", 
-            "text": "Security\n\n\nApplications built on Apex run as native 
YARN applications on Hadoop. The security framework and apparatus in Hadoop 
apply to the applications. The default security mechanism in Hadoop is 
Kerberos.\n\n\nKerberos Authentication\n\n\nKerberos is a ticket based 
authentication system that provides authentication in a distributed environment 
where authentication is needed between multiple users, hosts and services. It 
is the de-facto authentication mechanism supported in Hadoop. To use Kerberos 
authentication, the Hadoop installation must first be configured for secure 
mode with Kerberos. Please refer to the administration guide of your Hadoop 
distribution on how to do that. Once Hadoop is configured, there is some 
configuration needed on Apex side as well.\n\n\nConfiguring security\n\n\nThere 
is Hadoop configuration and CLI configuration. Hadoop configuration may be 
optional.\n\n\nHadoop Configuration\n\n\nAn Apex application uses delegation 
tokens to 
 authenticate with the ResourceManager (YARN) and NameNode (HDFS) and these 
tokens are issued by those servers respectively. Since the application is 
long-running,\nthe tokens should be valid for the lifetime of the application. 
Hadoop has a configuration setting for the maximum lifetime of the tokens and 
they should be set to cover the lifetime of the application. There are separate 
settings for ResourceManager and NameNode delegation\ntokens.\n\n\nThe 
ResourceManager delegation token max lifetime is specified in \nyarn-site.xml\n 
and can be set as follows, for example for a lifetime of 1 
year\n\n\nproperty\n\n  
\nname\nyarn.resourcemanager.delegation.token.max-lifetime\n/name\n\n  
\nvalue\n31536000000\n/value\n\n\n/property\n\n\n\n\n\nThe NameNode delegation 
token max lifetime is specified in \nhdfs-site.xml\n and can be set as follows, for example for a lifetime of 1 year\n\n\nproperty\n\n   
\nname\ndfs.namenode.delegation.token.max-lifetime\n/name\n\n   
\nvalue\n31536000000\n/value\n\n \n/property\n\n\n\n\n\nCLI Configuration\n\n\nThe Apex command 
line interface is used to launch\napplications along with performing various 
other operations and administrative tasks on the applications. \u00a0When 
Kerberos security is enabled in Hadoop, a Kerberos ticket granting ticket (TGT) 
or the Kerberos credentials of the user are needed b

<TRUNCATED>

Reply via email to