Repository: incubator-airflow-site Updated Branches: refs/heads/asf-site 69cff4922 -> 28a3eb600
http://git-wip-us.apache.org/repos/asf/incubator-airflow-site/blob/28a3eb60/security.html ---------------------------------------------------------------------- diff --git a/security.html b/security.html index d0156b7..9bde2ce 100644 --- a/security.html +++ b/security.html @@ -13,6 +13,8 @@ + + @@ -81,7 +83,10 @@ - <ul class="current"> + + + + <ul class="current"> <li class="toctree-l1"><a class="reference internal" href="project.html">Project</a></li> <li class="toctree-l1"><a class="reference internal" href="license.html">License</a></li> <li class="toctree-l1"><a class="reference internal" href="start.html">Quick Start</a></li> @@ -104,7 +109,11 @@ <li class="toctree-l2"><a class="reference internal" href="#multi-tenancy">Multi-tenancy</a></li> <li class="toctree-l2"><a class="reference internal" href="#kerberos">Kerberos</a><ul> <li class="toctree-l3"><a class="reference internal" href="#limitations">Limitations</a></li> -<li class="toctree-l3"><a class="reference internal" href="#enabling-kerberos">Enabling kerberos</a></li> +<li class="toctree-l3"><a class="reference internal" href="#enabling-kerberos">Enabling kerberos</a><ul> +<li class="toctree-l4"><a class="reference internal" href="#airflow">Airflow</a></li> +<li class="toctree-l4"><a class="reference internal" href="#hadoop">Hadoop</a></li> +</ul> +</li> <li class="toctree-l3"><a class="reference internal" href="#using-kerberos-authentication">Using kerberos authentication</a></li> </ul> </li> @@ -119,8 +128,9 @@ </li> </ul> </li> -<li class="toctree-l2"><a class="reference internal" href="#ssl">SSL</a><ul> -<li class="toctree-l3"><a class="reference internal" href="#impersonation">Impersonation</a></li> +<li class="toctree-l2"><a class="reference internal" href="#ssl">SSL</a></li> +<li class="toctree-l2"><a class="reference internal" href="#impersonation">Impersonation</a><ul> +<li class="toctree-l3"><a class="reference internal" href="#default-impersonation">Default Impersonation</a></li> </ul> </li> 
</ul> @@ -198,7 +208,8 @@ to the web application is to do it at the network level, or by using SSH tunnels.</p> <p>It is however possible to switch on authentication by either using one of the supplied -backends or create your own.</p> +backends or creating your own.</p> +<p>Be sure to check out <a class="reference internal" href="api.html"><span class="doc">Experimental Rest API</span></a> for securing the API.</p> <div class="section" id="web-authentication"> <h2>Web Authentication<a class="headerlink" href="#web-authentication" title="Permalink to this headline">¶</a></h2> <div class="section" id="password"> @@ -217,7 +228,7 @@ attack. Creating a new user has to be done via a Python REPL on the same machine <div class="highlight-bash"><div class="highlight"><pre><span></span><span class="c1"># navigate to the airflow installation directory</span> $ <span class="nb">cd</span> ~/airflow $ python -Python 2.7.9 <span class="o">(</span>default, Feb <span class="m">10</span> 2015, 03:28:08<span class="o">)</span> +Python <span class="m">2</span>.7.9 <span class="o">(</span>default, Feb <span class="m">10</span> <span class="m">2015</span>, <span class="m">03</span>:28:08<span class="o">)</span> Type <span class="s2">"help"</span>, <span class="s2">"copyright"</span>, <span class="s2">"credits"</span> or <span class="s2">"license"</span> <span class="k">for</span> more information. >>> import airflow >>> from airflow import models, settings @@ -240,7 +251,7 @@ Type <span class="s2">"help"</span>, <span class="s2">"copyright& an encrypted connection to the ldap server as you probably do not want passwords to be readable on the network level.
It is however possible to configure without encryption if you really want to.</p> <p>Additionally, if you are using Active Directory, and are not explicitly specifying an OU that your users are in, -you will need to change <code class="docutils literal"><span class="pre">search_scope</span></code> to “SUBTREE”.</p> +you will need to change <code class="docutils literal"><span class="pre">search_scope</span></code> to “SUBTREE”.</p> <p>Valid search_scope options can be found in the <a class="reference external" href="http://ldap3.readthedocs.org/searches.html?highlight=search_scope">ldap3 Documentation</a></p> <div class="highlight-bash"><div class="highlight"><pre><span></span><span class="o">[</span>webserver<span class="o">]</span> <span class="nv">authenticate</span> <span class="o">=</span> True @@ -252,6 +263,11 @@ you will need to change <code class="docutils literal"><span class="pre">search_ <span class="nv">user_filter</span> <span class="o">=</span> <span class="nv">objectClass</span><span class="o">=</span>* <span class="c1"># in case of Active Directory you would use: user_name_attr = sAMAccountName</span> <span class="nv">user_name_attr</span> <span class="o">=</span> uid +<span class="c1"># group_member_attr should be set to match the *_filter settings</span> +<span class="c1"># e.g.:</span> +<span class="c1"># group_member_attr = groupMembership</span> +<span class="c1"># superuser_filter = groupMembership=CN=airflow-super-users...</span> +<span class="nv">group_member_attr</span> <span class="o">=</span> memberOf <span class="nv">superuser_filter</span> <span class="o">=</span> <span class="nv">memberOf</span><span class="o">=</span><span class="nv">CN</span><span class="o">=</span>airflow-super-users,OU<span class="o">=</span>Groups,OU<span class="o">=</span>RWC,OU<span class="o">=</span>US,OU<span class="o">=</span>NORAM,DC<span class="o">=</span>example,DC<span class="o">=</span>com <span class="nv">data_profiler_filter</span> <span class="o">=</span>
<span class="nv">memberOf</span><span class="o">=</span><span class="nv">CN</span><span class="o">=</span>airflow-data-profilers,OU<span class="o">=</span>Groups,OU<span class="o">=</span>RWC,OU<span class="o">=</span>US,OU<span class="o">=</span>NORAM,DC<span class="o">=</span>example,DC<span class="o">=</span>com <span class="nv">bind_user</span> <span class="o">=</span> <span class="nv">cn</span><span class="o">=</span>Manager,dc<span class="o">=</span>example,dc<span class="o">=</span>com @@ -269,7 +285,7 @@ you will need to change <code class="docutils literal"><span class="pre">search_ <h3>Roll your own<a class="headerlink" href="#roll-your-own" title="Permalink to this headline">¶</a></h3> <p>Airflow uses <code class="docutils literal"><span class="pre">flask_login</span></code> and exposes a set of hooks in the <code class="docutils literal"><span class="pre">airflow.default_login</span></code> module. You can -alter the content and make it part of the <code class="docutils literal"><span class="pre">PYTHONPATH</span></code> and configure it as a backend in <code class="docutils literal"><span class="pre">airflow.cfg`</span></code>.</p> +alter the content and make it part of the <code class="docutils literal"><span class="pre">PYTHONPATH</span></code> and configure it as a backend in <code class="docutils literal"><span class="pre">airflow.cfg</span></code>.</p> <div class="highlight-bash"><div class="highlight"><pre><span></span><span class="o">[</span>webserver<span class="o">]</span> <span class="nv">authenticate</span> <span class="o">=</span> True <span class="nv">auth_backend</span> <span class="o">=</span> mypackage.auth @@ -279,12 +295,13 @@ alter the content and make it part of the <code class="docutils literal"><span c </div> <div class="section" id="multi-tenancy"> <h2>Multi-tenancy<a class="headerlink" href="#multi-tenancy" title="Permalink to this headline">¶</a></h2> -<p>You can filter the list of dags in webserver by owner name, when 
authentication -is turned on, by setting webserver.filter_by_owner as true in your <code class="docutils literal"><span class="pre">airflow.cfg</span></code> -With this, when a user authenticates and logs into webserver, it will see only the dags -which it is owner of. A super_user, will be able to see all the dags although. -This makes the web UI a multi-tenant UI, where a user will only be able to see dags -created by itself.</p> +<p>You can filter the list of dags in the webserver by owner name when authentication +is turned on by setting <code class="docutils literal"><span class="pre">webserver:filter_by_owner</span></code> in your config. With this, a user will see +only the dags that they own, unless they are a superuser.</p> +<div class="highlight-bash"><div class="highlight"><pre><span></span><span class="o">[</span>webserver<span class="o">]</span> +<span class="nv">filter_by_owner</span> <span class="o">=</span> True +</pre></div> +</div> </div> <div class="section" id="kerberos"> <h2>Kerberos<a class="headerlink" href="#kerberos" title="Permalink to this headline">¶</a></h2> @@ -293,15 +310,16 @@ tickets for itself and store it in the ticket cache. The hooks and dags can make to authenticate against kerberized services.</p> <div class="section" id="limitations"> <h3>Limitations<a class="headerlink" href="#limitations" title="Permalink to this headline">¶</a></h3> -<p>Please note that at this time not all hooks have been adjusted to make use of this functionality yet. +<p>Please note that at this time, not all hooks have been adjusted to make use of this functionality. Also it does not integrate kerberos into the web interface and you will have to rely on network level security for now to make sure your service remains secure.</p> -<p>Celery integration has not been tried and tested yet.
However if you generate a key tab for every host -and launch a ticket renewer next to every worker it will most likely work.</p> +<p>Celery integration has not been tried and tested yet. However, if you generate a key tab for every +host and launch a ticket renewer next to every worker it will most likely work.</p> </div> <div class="section" id="enabling-kerberos"> <h3>Enabling kerberos<a class="headerlink" href="#enabling-kerberos" title="Permalink to this headline">¶</a></h3> -<p>#### Airflow</p> +<div class="section" id="airflow"> +<h4>Airflow<a class="headerlink" href="#airflow" title="Permalink to this headline">¶</a></h4> <p>To enable kerberos you will need to generate a (service) key tab.</p> <div class="highlight-bash"><div class="highlight"><pre><span></span><span class="c1"># in the kadmin.local or kadmin shell, create the airflow principal</span> kadmin: addprinc -randkey airflow/[email protected] @@ -317,7 +335,7 @@ your <code class="docutils literal"><span class="pre">airflow.cfg</span></code>< <span class="o">[</span>kerberos<span class="o">]</span> <span class="nv">keytab</span> <span class="o">=</span> /etc/airflow/airflow.keytab -<span class="nv">reinit_frequency</span> <span class="o">=</span> 3600 +<span class="nv">reinit_frequency</span> <span class="o">=</span> <span class="m">3600</span> <span class="nv">principal</span> <span class="o">=</span> airflow </pre></div> </div> @@ -326,7 +344,9 @@ your <code class="docutils literal"><span class="pre">airflow.cfg</span></code>< airflow kerberos </pre></div> </div> -<p>#### Hadoop</p> +</div> +<div class="section" id="hadoop"> +<h4>Hadoop<a class="headerlink" href="#hadoop" title="Permalink to this headline">¶</a></h4> <p>If you want to use impersonation, this needs to be enabled in <code class="docutils literal"><span class="pre">core-site.xml</span></code> of your hadoop config.</p> <div class="highlight-bash"><div class="highlight"><pre><span></span><property>
<name>hadoop.proxyuser.airflow.groups</name> @@ -346,17 +366,18 @@ airflow kerberos </div> <p>Of course, if you need to tighten your security, replace the asterisk with something more appropriate.</p> </div> +</div> <div class="section" id="using-kerberos-authentication"> <h3>Using kerberos authentication<a class="headerlink" href="#using-kerberos-authentication" title="Permalink to this headline">¶</a></h3> -<p>The hive hook has been updated to take advantage of kerberos authentication. To allow your DAGs to use it simply -update the connection details with, for example:</p> +<p>The hive hook has been updated to take advantage of kerberos authentication. To allow your DAGs to +use it, simply update the connection details with, for example:</p> <div class="highlight-bash"><div class="highlight"><pre><span></span><span class="o">{</span> <span class="s2">"use_beeline"</span>: true, <span class="s2">"principal"</span>: <span class="s2">"hive/[email protected]"</span><span class="o">}</span> </pre></div> </div> <p>Adjust the principal to your settings. The _HOST part will be replaced by the fully qualified domain name of the server.</p> <p>You can specify if you would like to use the dag owner as the user for the connection or the user specified in the login -section of the connection. For the login user specify the following as extra:</p> +section of the connection. For the login user, specify the following as extra:</p> <div class="highlight-bash"><div class="highlight"><pre><span></span><span class="o">{</span> <span class="s2">"use_beeline"</span>: true, <span class="s2">"principal"</span>: <span class="s2">"hive/[email protected]"</span>, <span class="s2">"proxy_user"</span>: <span class="s2">"login"</span><span class="o">}</span> </pre></div> </div> @@ -364,7 +385,7 @@ section of the connection.
For the login user specify the following as extra:</p <div class="highlight-bash"><div class="highlight"><pre><span></span><span class="o">{</span> <span class="s2">"use_beeline"</span>: true, <span class="s2">"principal"</span>: <span class="s2">"hive/[email protected]"</span>, <span class="s2">"proxy_user"</span>: <span class="s2">"owner"</span><span class="o">}</span> </pre></div> </div> -<p>and in your DAG, when initializing the HiveOperator, specify</p> +<p>and in your DAG, when initializing the HiveOperator, specify:</p> <div class="highlight-bash"><div class="highlight"><pre><span></span><span class="nv">run_as_owner</span><span class="o">=</span>True </pre></div> </div> @@ -378,8 +399,6 @@ section of the connection. For the login user specify the following as extra:</p against an installation of GitHub Enterprise using OAuth2. You can optionally specify a team whitelist (composed of slug cased team names) to restrict login to only members of those teams.</p> -<p><em>NOTE</em> If you do not specify a team whitelist, anyone with a valid account on -your GHE installation will be able to login to Airflow.</p> <div class="highlight-bash"><div class="highlight"><pre><span></span><span class="o">[</span>webserver<span class="o">]</span> <span class="nv">authenticate</span> <span class="o">=</span> True <span class="nv">auth_backend</span> <span class="o">=</span> airflow.contrib.auth.backends.github_enterprise_auth @@ -389,21 +408,26 @@ your GHE installation will be able to login to Airflow.</p> <span class="nv">client_id</span> <span class="o">=</span> oauth_key_from_github_enterprise <span class="nv">client_secret</span> <span class="o">=</span> oauth_secret_from_github_enterprise <span class="nv">oauth_callback_route</span> <span class="o">=</span> /example/ghe_oauth/callback -<span class="nv">allowed_teams</span> <span class="o">=</span> 1, 345, 23 +<span class="nv">allowed_teams</span> <span class="o">=</span> <span class="m">1</span>, <span 
class="m">345</span>, <span class="m">23</span> </pre></div> </div> +<div class="admonition note"> +<p class="first admonition-title">Note</p> +<p class="last">If you do not specify a team whitelist, anyone with a valid account on +your GHE installation will be able to log in to Airflow.</p> +</div> <div class="section" id="setting-up-ghe-authentication"> <h4>Setting up GHE Authentication<a class="headerlink" href="#setting-up-ghe-authentication" title="Permalink to this headline">¶</a></h4> <p>An application must be set up in GHE before you can use the GHE authentication backend. In order to set up an application:</p> <ol class="arabic simple"> <li>Navigate to your GHE profile</li> -<li>Select ‘Applications’ from the left hand nav</li> -<li>Select the ‘Developer Applications’ tab</li> -<li>Click ‘Register new application’</li> -<li>Fill in the required information (the ‘Authorization callback URL’ must be fully qualifed e.g. <a class="reference external" href="http://airflow.example.com/example/ghe_oauth/callback">http://airflow.example.com/example/ghe_oauth/callback</a>)</li> -<li>Click ‘Register application’</li> -<li>Copy ‘Client ID’, ‘Client Secret’, and your callback route to your airflow.cfg according to the above example</li> +<li>Select ‘Applications’ from the left hand nav</li> +<li>Select the ‘Developer Applications’ tab</li> +<li>Click ‘Register new application’</li> +<li>Fill in the required information (the ‘Authorization callback URL’ must be fully qualified, e.g. <a class="reference external" href="http://airflow.example.com/example/ghe_oauth/callback">http://airflow.example.com/example/ghe_oauth/callback</a>)</li> +<li>Click ‘Register application’</li> +<li>Copy ‘Client ID’, ‘Client Secret’, and your callback route to your airflow.cfg according to the above example</li> </ol> </div> </div> @@ -429,47 +453,67 @@ to only members of that domain.</p> backend.
In order to set up an application:</p> <ol class="arabic simple"> <li>Navigate to <a class="reference external" href="https://console.developers.google.com/apis/">https://console.developers.google.com/apis/</a></li> -<li>Select ‘Credentials’ from the left hand nav</li> -<li>Click ‘Create credentials’ and choose ‘OAuth client ID’</li> -<li>Choose ‘Web application’</li> -<li>Fill in the required information (the ‘Authorized redirect URIs’ must be fully qualifed e.g. <a class="reference external" href="http://airflow.example.com/oauth2callback">http://airflow.example.com/oauth2callback</a>)</li> -<li>Click ‘Create’</li> -<li>Copy ‘Client ID’, ‘Client Secret’, and your redirect URI to your airflow.cfg according to the above example</li> </ol> +<p>2. Select ‘Credentials’ from the left hand nav +3. Click ‘Create credentials’ and choose ‘OAuth client ID’ +4. Choose ‘Web application’ +5. Fill in the required information (the ‘Authorized redirect URIs’ must be fully qualified, e.g. <a class="reference external" href="http://airflow.example.com/oauth2callback">http://airflow.example.com/oauth2callback</a>) +6. Click ‘Create’ +7. Copy ‘Client ID’, ‘Client Secret’, and your redirect URI to your airflow.cfg according to the above example</p> </div> </div> </div> <div class="section" id="ssl"> <h2>SSL<a class="headerlink" href="#ssl" title="Permalink to this headline">¶</a></h2> <p>SSL can be enabled by providing a certificate and key.
Once enabled, be sure to use -“<a class="reference external" href="https://">https://</a>” in your browser.</p> +“<a class="reference external" href="https://">https://</a>” in your browser.</p> <div class="highlight-bash"><div class="highlight"><pre><span></span><span class="o">[</span>webserver<span class="o">]</span> <span class="nv">web_server_ssl_cert</span> <span class="o">=</span> <path to cert> <span class="nv">web_server_ssl_key</span> <span class="o">=</span> <path to key> </pre></div> </div> <p>Enabling SSL will not automatically change the web server port. If you want to use the -standard port 443, you’ll need to configure that too. Be aware that super user privileges +standard port 443, you’ll need to configure that too. Be aware that super user privileges (or cap_net_bind_service on Linux) are required to listen on port 443.</p> <div class="highlight-bash"><div class="highlight"><pre><span></span><span class="c1"># Optionally, set the server to listen on the standard SSL port.</span> -<span class="nv">web_server_port</span> <span class="o">=</span> 443 +<span class="nv">web_server_port</span> <span class="o">=</span> <span class="m">443</span> <span class="nv">base_url</span> <span class="o">=</span> https://<hostname or IP>:443 </pre></div> </div> +<p>Enable CeleryExecutor with SSL.
Ensure you properly generate client and server +certs and keys.</p> +<div class="highlight-bash"><div class="highlight"><pre><span></span><span class="o">[</span>celery<span class="o">]</span> +<span class="nv">CELERY_SSL_ACTIVE</span> <span class="o">=</span> True +<span class="nv">CELERY_SSL_KEY</span> <span class="o">=</span> <path to key> +<span class="nv">CELERY_SSL_CERT</span> <span class="o">=</span> <path to cert> +<span class="nv">CELERY_SSL_CACERT</span> <span class="o">=</span> <path to cacert> +</pre></div> +</div> +</div> <div class="section" id="impersonation"> -<h3>Impersonation<a class="headerlink" href="#impersonation" title="Permalink to this headline">¶</a></h3> +<h2>Impersonation<a class="headerlink" href="#impersonation" title="Permalink to this headline">¶</a></h2> <p>Airflow has the ability to impersonate a unix user while running task -instances based on the task’s <code class="docutils literal"><span class="pre">run_as_user</span></code> parameter, which takes a user’s name.</p> -<p><em>NOTE</em> For impersonations to work, Airflow must be run with <cite>sudo</cite> as subtasks are run +instances based on the task’s <code class="docutils literal"><span class="pre">run_as_user</span></code> parameter, which takes a user’s name.</p> +<p><strong>NOTE:</strong> For impersonation to work, Airflow must be run with <cite>sudo</cite> as subtasks are run with <cite>sudo -u</cite> and permissions of files are changed. Furthermore, the unix user needs to exist on the worker. Here is what a simple sudoers file entry could look like to achieve this, assuming Airflow is running as the <cite>airflow</cite> user.
Note that this means that the airflow user must be trusted and treated the same way as the root user.</p> +<div class="highlight-none"><div class="highlight"><pre><span></span>airflow ALL=(ALL) NOPASSWD: ALL +</pre></div> +</div> <p>Subtasks with impersonation will still log to the same folder, except that the files they log to will have permissions changed such that only the unix user can write to it.</p> -<p><em>Default impersonation</em> To prevent tasks that don’t use impersonation to be run with -<cite>sudo</cite> privileges, you can set the <cite>default_impersonation</cite> config in <cite>core</cite> which sets a -default user impersonate if <cite>run_as_user</cite> is not set.</p> +<div class="section" id="default-impersonation"> +<h3>Default Impersonation<a class="headerlink" href="#default-impersonation" title="Permalink to this headline">¶</a></h3> +<p>To prevent tasks that don’t use impersonation from being run with <cite>sudo</cite> privileges, you can set the +<code class="docutils literal"><span class="pre">core:default_impersonation</span></code> config which sets a default user to impersonate if <cite>run_as_user</cite> is +not set.</p> +<div class="highlight-bash"><div class="highlight"><pre><span></span><span class="o">[</span>core<span class="o">]</span> +<span class="nv">default_impersonation</span> <span class="o">=</span> airflow +</pre></div> +</div> </div> </div> </div> http://git-wip-us.apache.org/repos/asf/incubator-airflow-site/blob/28a3eb60/start.html ---------------------------------------------------------------------- diff --git a/start.html b/start.html index 511a95b..d1aa874 100644 --- a/start.html +++ b/start.html @@ -13,6 +13,8 @@ + + @@ -81,11 +83,14 @@ - <ul class="current"> + + + + <ul class="current"> <li class="toctree-l1"><a class="reference internal" href="project.html">Project</a></li> <li class="toctree-l1"><a class="reference internal" href="license.html">License</a></li> <li class="toctree-l1 current"><a class="current
reference internal" href="#">Quick Start</a><ul> -<li class="toctree-l2"><a class="reference internal" href="#what-s-next">What’s Next?</a></li> +<li class="toctree-l2"><a class="reference internal" href="#what-s-next">What’s Next?</a></li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="installation.html">Installation</a></li> @@ -174,17 +179,17 @@ <span class="nb">export</span> <span class="nv">AIRFLOW_HOME</span><span class="o">=</span>~/airflow <span class="c1"># install from pypi using pip</span> -pip install airflow +pip install apache-airflow <span class="c1"># initialize the database</span> airflow initdb <span class="c1"># start the web server, default port is 8080</span> -airflow webserver -p 8080 +airflow webserver -p <span class="m">8080</span> </pre></div> </div> <p>Upon running these commands, Airflow will create the <code class="docutils literal"><span class="pre">$AIRFLOW_HOME</span></code> folder -and lay an “airflow.cfg” file with defaults that get you going fast. You can +and lay an “airflow.cfg” file with defaults that get you going fast. You can inspect the file either in <code class="docutils literal"><span class="pre">$AIRFLOW_HOME/airflow.cfg</span></code>, or through the UI in the <code class="docutils literal"><span class="pre">Admin->Configuration</span></code> menu.
The PID file for the webserver will be stored in <code class="docutils literal"><span class="pre">$AIRFLOW_HOME/airflow-webserver.pid</span></code> or in <code class="docutils literal"><span class="pre">/run/airflow/webserver.pid</span></code> @@ -199,14 +204,14 @@ command line utilities.</p> be able to see the status of the jobs change in the <code class="docutils literal"><span class="pre">example1</span></code> DAG as you run the commands below.</p> <div class="highlight-bash"><div class="highlight"><pre><span></span><span class="c1"># run your first task instance</span> -airflow run example_bash_operator runme_0 2015-01-01 +airflow run example_bash_operator runme_0 <span class="m">2015</span>-01-01 <span class="c1"># run a backfill over 2 days</span> -airflow backfill example_bash_operator -s 2015-01-01 -e 2015-01-02 +airflow backfill example_bash_operator -s <span class="m">2015</span>-01-01 -e <span class="m">2015</span>-01-02 </pre></div> </div> <div class="section" id="what-s-next"> -<h2>What’s Next?<a class="headerlink" href="#what-s-next" title="Permalink to this headline">¶</a></h2> -<p>From this point, you can head to the <a class="reference internal" href="tutorial.html"><span class="doc">Tutorial</span></a> section for further examples or the <a class="reference internal" href="configuration.html"><span class="doc">Configuration</span></a> section if you’re ready to get your hands dirty.</p> +<h2>What’s Next?<a class="headerlink" href="#what-s-next" title="Permalink to this headline">¶</a></h2> +<p>From this point, you can head to the <a class="reference internal" href="tutorial.html"><span class="doc">Tutorial</span></a> section for further examples or the <a class="reference internal" href="configuration.html"><span class="doc">Configuration</span></a> section if you’re ready to get your hands dirty.</p> </div> </div> http://git-wip-us.apache.org/repos/asf/incubator-airflow-site/blob/28a3eb60/tutorial.html
---------------------------------------------------------------------- diff --git a/tutorial.html b/tutorial.html index 2a55053..ea3af9b 100644 --- a/tutorial.html +++ b/tutorial.html @@ -13,6 +13,8 @@ + + @@ -81,14 +83,17 @@ - <ul class="current"> + + + + <ul class="current"> <li class="toctree-l1"><a class="reference internal" href="project.html">Project</a></li> <li class="toctree-l1"><a class="reference internal" href="license.html">License</a></li> <li class="toctree-l1"><a class="reference internal" href="start.html">Quick Start</a></li> <li class="toctree-l1"><a class="reference internal" href="installation.html">Installation</a></li> <li class="toctree-l1 current"><a class="current reference internal" href="#">Tutorial</a><ul> <li class="toctree-l2"><a class="reference internal" href="#example-pipeline-definition">Example Pipeline definition</a></li> -<li class="toctree-l2"><a class="reference internal" href="#it-s-a-dag-definition-file">It’s a DAG definition file</a></li> +<li class="toctree-l2"><a class="reference internal" href="#it-s-a-dag-definition-file">It’s a DAG definition file</a></li> <li class="toctree-l2"><a class="reference internal" href="#importing-modules">Importing Modules</a></li> <li class="toctree-l2"><a class="reference internal" href="#default-arguments">Default Arguments</a></li> <li class="toctree-l2"><a class="reference internal" href="#instantiate-a-dag">Instantiate a DAG</a></li> @@ -103,7 +108,7 @@ <li class="toctree-l3"><a class="reference internal" href="#backfill">Backfill</a></li> </ul> </li> -<li class="toctree-l2"><a class="reference internal" href="#what-s-next">What’s Next?</a></li> +<li class="toctree-l2"><a class="reference internal" href="#what-s-next">What’s Next?</a></li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="configuration.html">Configuration</a></li> @@ -202,7 +207,7 @@ complicated, a line by line explanation follows below.</p> <span class="s1">'owner'</span><span class="p">:</span>
<span class="s1">'airflow'</span><span class="p">,</span> <span class="s1">'depends_on_past'</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span> <span class="s1">'start_date'</span><span class="p">:</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2015</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> - <span class="s1">'email'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'[email protected]'</span><span class="p">],</span> + <span class="s1">'email'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'[email protected]'</span><span class="p">],</span> <span class="s1">'email_on_failure'</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span> <span class="s1">'email_on_retry'</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span> <span class="s1">'retries'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> @@ -247,10 +252,10 @@ complicated, a line by line explanation follows below.</p> </div> </div> <div class="section" id="it-s-a-dag-definition-file"> -<h2>It’s a DAG definition file<a class="headerlink" href="#it-s-a-dag-definition-file" title="Permalink to this headline">¶</a></h2> +<h2>It’s a DAG definition file<a class="headerlink" href="#it-s-a-dag-definition-file" title="Permalink to this headline">¶</a></h2> <p>One thing to wrap your head around (it may not be very intuitive for everyone at first) is that this Airflow Python script is really -just a configuration file specifying the DAG’s structure as code. +just a configuration file specifying the DAG’s structure as code. The actual tasks defined here will run in a different context from the context of this script.
Different tasks run on different workers at different points in time, which means that this script cannot be used @@ -258,14 +263,14 @@ to cross communicate between tasks. Note that for this purpose we have a more advanced feature called <code class="docutils literal"><span class="pre">XCom</span></code>.</p> <p>People sometimes think of the DAG definition file as a place where they can do some actual data processing - that is not the case at all! -The script’s purpose is to define a DAG object. It needs to evaluate +The script’s purpose is to define a DAG object. It needs to evaluate quickly (seconds, not minutes) since the scheduler will execute it periodically to reflect the changes if any.</p> </div> <div class="section" id="importing-modules"> <h2>Importing Modules<a class="headerlink" href="#importing-modules" title="Permalink to this headline">¶</a></h2> <p>An Airflow pipeline is just a Python script that happens to define an -Airflow DAG object. Let’s start by importing the libraries we will need.</p> +Airflow DAG object. Let’s start by importing the libraries we will need.</p> <div class="code python highlight-default"><div class="highlight"><pre><span></span><span class="c1"># The DAG object; we'll need this to instantiate a DAG</span> <span class="kn">from</span> <span class="nn">airflow</span> <span class="k">import</span> <span class="n">DAG</span> @@ -276,8 +281,8 @@ Airflow DAG object. Let’s start by importing the libraries we will need.</ </div> <div class="section" id="default-arguments"> <h2>Default Arguments<a class="headerlink" href="#default-arguments" title="Permalink to this headline">¶</a></h2> -<p>We’re about to create a DAG and some tasks, and we have the choice to -explicitly pass a set of arguments to each task’s constructor +<p>We’re about to create a DAG and some tasks, and we have the choice to +explicitly pass a set of arguments to each task’s constructor (which would become redundant), or (better!)
we can define a dictionary of default parameters that we can use when creating tasks.</p> <div class="code python highlight-default"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">datetime</span> <span class="k">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span> @@ -286,7 +291,7 @@ of default parameters that we can use when creating tasks.</p> <span class="s1">'owner'</span><span class="p">:</span> <span class="s1">'airflow'</span><span class="p">,</span> <span class="s1">'depends_on_past'</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span> <span class="s1">'start_date'</span><span class="p">:</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2015</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> - <span class="s1">'email'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'[email protected]'</span><span class="p">],</span> + <span class="s1">'email'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'[email protected]'</span><span class="p">],</span> <span class="s1">'email_on_failure'</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span> <span class="s1">'email_on_retry'</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span> <span class="s1">'retries'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> @@ -298,7 +303,7 @@ of default parameters that we can use when creating tasks.</p> <span class="p">}</span> </pre></div> </div> -<p>For more information about the BaseOperator’s parameters and what they do, +<p>For more information about the BaseOperator’s parameters and what they do, refer to the :py:class:<code class="docutils literal"><span class="pre">airflow.models.BaseOperator</span></code> 
documentation.</p> <p>Also, note that you could easily define different sets of arguments that would serve different purposes. An example of that would be to have @@ -306,7 +311,7 @@ different settings between a production and development environment.</p> </div> <div class="section" id="instantiate-a-dag"> <h2>Instantiate a DAG<a class="headerlink" href="#instantiate-a-dag" title="Permalink to this headline">¶</a></h2> -<p>We’ll need a DAG object to nest our tasks into. Here we pass a string +<p>We’ll need a DAG object to nest our tasks into. Here we pass a string that defines the <code class="docutils literal"><span class="pre">dag_id</span></code>, which serves as a unique identifier for your DAG. We also pass the default argument dictionary that we just defined and define a <code class="docutils literal"><span class="pre">schedule_interval</span></code> of 1 day for the DAG.</p> @@ -334,14 +339,14 @@ instantiated from an operator is called a constructor. The first argument </div> <p>Notice how we pass a mix of operator specific arguments (<code class="docutils literal"><span class="pre">bash_command</span></code>) and an argument common to all operators (<code class="docutils literal"><span class="pre">retries</span></code>) inherited -from BaseOperator to the operator’s constructor. This is simpler than +from BaseOperator to the operator’s constructor. This is simpler than passing every argument for every constructor call. 
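The layering of shared <code>default_args</code> with per-constructor arguments can be previewed with plain Python dictionaries. This is a simplified sketch of the behavior only — the names below are illustrative, not Airflow internals:

```python
# Simplified sketch: later dicts win when merging, mirroring how an
# explicitly passed argument beats default_args, which beats the
# operator's own default.
operator_defaults = {'retries': 0, 'owner': None}   # operator's built-in defaults
default_args = {'owner': 'airflow', 'retries': 1}   # shared default_args dict
explicit = {'retries': 3}                           # passed to one constructor

effective = {**operator_defaults, **default_args, **explicit}
print(effective['retries'], effective['owner'])  # 3 airflow
```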
Also, notice that in the second task we override the <code class="docutils literal"><span class="pre">retries</span></code> parameter with <code class="docutils literal"><span class="pre">3</span></code>.</p> <p>The precedence rules for a task are as follows:</p> <ol class="arabic simple"> <li>Explicitly passed arguments</li> <li>Values that exist in the <code class="docutils literal"><span class="pre">default_args</span></code> dictionary</li> <li>The operator’s default value, if one exists</li> </ol> <p>A task must include or inherit the arguments <code class="docutils literal"><span class="pre">task_id</span></code> and <code class="docutils literal"><span class="pre">owner</span></code>, otherwise Airflow will raise an exception.</p> @@ -357,7 +362,8 @@ templates.</p> <p>This tutorial barely scratches the surface of what you can do with templating in Airflow, but the goal of this section is to let you know this feature exists, get you familiar with double curly brackets, and -point to the most common template variable: <code class="docutils literal"><span class="pre">{{</span> <span class="pre">ds</span> <span class="pre">}}</span></code>.</p> +point to the most common template variable: <code class="docutils literal"><span class="pre">{{</span> <span class="pre">ds</span> <span class="pre">}}</span></code> (today’s “date +stamp”).</p> <div class="code python highlight-default"><div class="highlight"><pre><span></span><span class="n">templated_command</span> <span class="o">=</span> <span class="s2">"""</span> <span class="s2"> {</span><span class="si">% f</span><span class="s2">or i in range(5) %}</span> <span class="s2"> echo "{{ ds }}"</span> @@ -383,17 +389,26 @@ to understand how the parameter <code class="docutils literal"><span class="pre" <p>Files can also be passed to the <code class="docutils literal"><span class="pre">bash_command</span></code> argument, like <code class="docutils literal"><span 
class="pre">bash_command='templated_command.sh'</span></code>, where the file location is relative to the directory containing the pipeline file (<code class="docutils literal"><span class="pre">tutorial.py</span></code> in this case). This -may be desirable for many reasons, like separating your script’s logic and +may be desirable for many reasons, like separating your script’s logic and pipeline code, allowing for proper code highlighting in files composed in different languages, and general flexibility in structuring pipelines. It is also possible to define your <code class="docutils literal"><span class="pre">template_searchpath</span></code> as pointing to any folder locations in the DAG constructor call.</p> +<p>Using that same DAG constructor call, it is possible to define +<code class="docutils literal"><span class="pre">user_defined_macros</span></code> which allow you to specify your own variables. +For example, passing <code class="docutils literal"><span class="pre">dict(foo='bar')</span></code> to this argument allows you +to use <code class="docutils literal"><span class="pre">{{</span> <span class="pre">foo</span> <span class="pre">}}</span></code> in your templates. Moreover, specifying +<code class="docutils literal"><span class="pre">user_defined_filters</span></code> allows you to register your own filters. For example, +passing <code class="docutils literal"><span class="pre">dict(hello=lambda</span> <span class="pre">name:</span> <span class="pre">'Hello</span> <span class="pre">%s'</span> <span class="pre">%</span> <span class="pre">name)</span></code> to this argument allows +you to use <code class="docutils literal"><span class="pre">{{</span> <span class="pre">'world'</span> <span class="pre">|</span> <span class="pre">hello</span> <span class="pre">}}</span></code> in your templates. 
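Airflow's templating is plain Jinja underneath, so the custom-filter behavior described here can be tried out standalone with the <code>jinja2</code> package, outside Airflow (a sketch — Airflow wires the filter in for you via <code>user_defined_filters</code>):

```python
from jinja2 import Environment

env = Environment()
# Register the same illustrative filter the text passes via user_defined_filters.
env.filters['hello'] = lambda name: 'Hello %s' % name

rendered = env.from_string("{{ 'world' | hello }}").render()
print(rendered)  # Hello world
```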
For more information +regarding custom filters have a look at the +<a class="reference external" href="http://jinja.pocoo.org/docs/dev/api/#writing-filters">Jinja Documentation</a></p> <p>For more information on the variables and macros that can be referenced in templates, make sure to read through the <a class="reference internal" href="code.html#macros"><span class="std std-ref">Macros</span></a> section</p> </div> <div class="section" id="setting-up-dependencies"> <h2>Setting up Dependencies<a class="headerlink" href="#setting-up-dependencies" title="Permalink to this headline">¶</a></h2> -<p>We have two simple tasks that do not depend on each other. Here’s a few ways +<p>We have two simple tasks that do not depend on each other. Here’s a few ways you can define dependencies between them:</p> <div class="code python highlight-default"><div class="highlight"><pre><span></span><span class="n">t2</span><span class="o">.</span><span class="n">set_upstream</span><span class="p">(</span><span class="n">t1</span><span class="p">)</span> @@ -430,7 +445,7 @@ something like this:</p> <span class="s1">'owner'</span><span class="p">:</span> <span class="s1">'airflow'</span><span class="p">,</span> <span class="s1">'depends_on_past'</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span> <span class="s1">'start_date'</span><span class="p">:</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2015</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> - <span class="s1">'email'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'[email protected]'</span><span class="p">],</span> + <span class="s1">'email'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'[email protected]'</span><span class="p">],</span> <span class="s1">'email_on_failure'</span><span class="p">:</span> <span 
class="kc">False</span><span class="p">,</span> <span class="s1">'email_on_retry'</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span> <span class="s1">'retries'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> @@ -479,20 +494,20 @@ something like this:</p> <h2>Testing<a class="headerlink" href="#testing" title="Permalink to this headline">¶</a></h2> <div class="section" id="running-the-script"> <h3>Running the Script<a class="headerlink" href="#running-the-script" title="Permalink to this headline">¶</a></h3> -<p>Time to run some tests. First let’s make sure that the pipeline -parses. Let’s assume we’re saving the code from the previous step in +<p>Time to run some tests. First let’s make sure that the pipeline +parses. Let’s assume we’re saving the code from the previous step in <code class="docutils literal"><span class="pre">tutorial.py</span></code> in the DAGs folder referenced in your <code class="docutils literal"><span class="pre">airflow.cfg</span></code>. 
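The <code>python tutorial.py</code> smoke test works because a DAG file is ordinary Python: if compiling and executing it raises nothing, the definitions are at least syntactically sound and importable. A standalone illustration of the same idea (the source string below is a stand-in, not the real <code>tutorial.py</code>):

```python
# Stand-in for a DAG file's contents; a real check would read the file
# from the DAGs folder instead of using an inline string.
dag_source = (
    "from datetime import datetime, timedelta\n"
    "default_args = {'start_date': datetime(2015, 6, 1)}\n"
)

namespace = {}
# Equivalent of `python tutorial.py`: compile and execute. Any exception
# here means the DAG definition file is broken.
exec(compile(dag_source, 'tutorial.py', 'exec'), namespace)
print('start_date' in namespace['default_args'])  # True
```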
The default location for your DAGs is <code class="docutils literal"><span class="pre">~/airflow/dags</span></code>.</p> <div class="highlight-bash"><div class="highlight"><pre><span></span>python ~/airflow/dags/tutorial.py </pre></div> </div> -<p>If the script does not raise an exception it means that you haven’t done +<p>If the script does not raise an exception it means that you haven’t done anything horribly wrong, and that your Airflow environment is somewhat sound.</p> </div> <div class="section" id="command-line-metadata-validation"> <h3>Command Line Metadata Validation<a class="headerlink" href="#command-line-metadata-validation" title="Permalink to this headline">¶</a></h3> -<p>Let’s run a few commands to validate this script further.</p> +<p>Let’s run a few commands to validate this script further.</p> <div class="highlight-bash"><div class="highlight"><pre><span></span><span class="c1"># print the list of active DAGs</span> airflow list_dags @@ -506,36 +521,36 @@ airflow list_tasks tutorial --tree </div> <div class="section" id="id1"> <h3>Testing<a class="headerlink" href="#id1" title="Permalink to this headline">¶</a></h3> -<p>Let’s test by running the actual task instances on a specific date. The +<p>Let’s test by running the actual task instances on a specific date. 
The date specified in this context is an <code class="docutils literal"><span class="pre">execution_date</span></code>, which simulates the scheduler running your task or dag at a specific date + time:</p> <div class="highlight-bash"><div class="highlight"><pre><span></span><span class="c1"># command layout: command subcommand dag_id task_id date</span> <span class="c1"># testing print_date</span> -airflow <span class="nb">test</span> tutorial print_date 2015-06-01 +airflow <span class="nb">test</span> tutorial print_date <span class="m">2015</span>-06-01 <span class="c1"># testing sleep</span> -airflow <span class="nb">test</span> tutorial sleep 2015-06-01 +airflow <span class="nb">test</span> tutorial sleep <span class="m">2015</span>-06-01 </pre></div> </div> <p>Now remember what we did with templating earlier? See how this template gets rendered and executed by running this command:</p> <div class="highlight-bash"><div class="highlight"><pre><span></span><span class="c1"># testing templated</span> -airflow <span class="nb">test</span> tutorial templated 2015-06-01 +airflow <span class="nb">test</span> tutorial templated <span class="m">2015</span>-06-01 </pre></div> </div> <p>This should result in displaying a verbose log of events and ultimately running your bash command and printing the result.</p> <p>Note that the <code class="docutils literal"><span class="pre">airflow</span> <span class="pre">test</span></code> command runs task instances locally, outputs -their log to stdout (on screen), doesn’t bother with dependencies, and -doesn’t communicate state (running, success, failed, ...) to the database. +their log to stdout (on screen), doesn’t bother with dependencies, and +doesn’t communicate state (running, success, failed, …) to the database. 
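The trailing <code>2015-06-01</code> argument in those <code>airflow test</code> commands is the simulated <code>execution_date</code>. Its interpretation can be sketched with the standard library (simplified — Airflow's own CLI date parsing accepts more formats, including a time component):

```python
from datetime import datetime

# Simplified sketch: parse the CLI date argument into the simulated
# execution_date, defaulting the time-of-day to midnight.
execution_date = datetime.strptime('2015-06-01', '%Y-%m-%d')
print(execution_date.isoformat())  # 2015-06-01T00:00:00
```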
It simply allows testing a single task instance.</p> </div> <div class="section" id="backfill"> <h3>Backfill<a class="headerlink" href="#backfill" title="Permalink to this headline">¶</a></h3> -<p>Everything looks like it’s running fine so let’s run a backfill. +<p>Everything looks like it’s running fine so let’s run a backfill. <code class="docutils literal"><span class="pre">backfill</span></code> will respect your dependencies, emit logs into files and talk to -the database to record status. If you do have a webserver up, you’ll be able +the database to record status. If you do have a webserver up, you’ll be able to track the progress. <code class="docutils literal"><span class="pre">airflow</span> <span class="pre">webserver</span></code> will start a web server if you are interested in tracking the progress visually as your backfill progresses.</p> <p>Note that if you use <code class="docutils literal"><span class="pre">depends_on_past=True</span></code>, individual task instances @@ -547,17 +562,17 @@ which are used to populate the run schedule with task instances from this dag.</ <span class="c1"># airflow webserver --debug &</span> <span class="c1"># start your backfill on a date range</span> -airflow backfill tutorial -s 2015-06-01 -e 2015-06-07 +airflow backfill tutorial -s <span class="m">2015</span>-06-01 -e <span class="m">2015</span>-06-07 </pre></div> </div> </div> </div> <div class="section" id="what-s-next"> -<h2>What’s Next?<a class="headerlink" href="#what-s-next" title="Permalink to this headline">¶</a></h2> -<p>That’s it, you’ve written, tested and backfilled your very first Airflow +<h2>What’s Next?<a class="headerlink" href="#what-s-next" title="Permalink to this headline">¶</a></h2> +<p>That’s it, you’ve written, tested and backfilled your very first Airflow pipeline. 
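For a rough intuition of what the backfill's <code>-s 2015-06-01 -e 2015-06-07</code> range covered with a daily <code>schedule_interval</code>, the candidate run dates can be enumerated in plain Python (an illustration only — Airflow's exact execution-date semantics have additional subtleties):

```python
from datetime import date, timedelta

start, end = date(2015, 6, 1), date(2015, 6, 7)
# One candidate run per day across the inclusive -s/-e range.
run_dates = [start + timedelta(days=i) for i in range((end - start).days + 1)]
print(len(run_dates), run_dates[0], run_dates[-1])  # 7 2015-06-01 2015-06-07
```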
Merging your code into a code repository that has a master scheduler running against it should get it triggered and run every day.</p> -<p>Here’s a few things you might want to do next:</p> +<p>Here’s a few things you might want to do next:</p> <ul> <li><p class="first">Take an in-depth tour of the UI - click all the things!</p> </li> http://git-wip-us.apache.org/repos/asf/incubator-airflow-site/blob/28a3eb60/ui.html ---------------------------------------------------------------------- diff --git a/ui.html b/ui.html index 5852c52..92fa47d 100644 --- a/ui.html +++ b/ui.html @@ -13,6 +13,8 @@ + + @@ -81,7 +83,10 @@ - <ul class="current"> + + + + <ul class="current"> <li class="toctree-l1"><a class="reference internal" href="project.html">Project</a></li> <li class="toctree-l1"><a class="reference internal" href="license.html">License</a></li> <li class="toctree-l1"><a class="reference internal" href="start.html">Quick Start</a></li> @@ -175,7 +180,7 @@ <div class="section" id="ui-screenshots"> <h1>UI / Screenshots<a class="headerlink" href="#ui-screenshots" title="Permalink to this headline">¶</a></h1> <p>The Airflow UI makes it easy to monitor and troubleshoot your data pipelines. -Here’s a quick overview of some of the features and visualizations you +Here’s a quick overview of some of the features and visualizations you can find in the Airflow UI.</p> <div class="section" id="dags-view"> <h2>DAGs View<a class="headerlink" href="#dags-view" title="Permalink to this headline">¶</a></h2> @@ -197,7 +202,7 @@ the blocking ones.</p> <hr class="docutils" /> <div class="section" id="graph-view"> <h2>Graph View<a class="headerlink" href="#graph-view" title="Permalink to this headline">¶</a></h2> -<p>The graph view is perhaps the most comprehensive. Visualize your DAG’s +<p>The graph view is perhaps the most comprehensive. 
Visualize your DAG’s dependencies and their current status for a specific run.</p> <hr class="docutils" /> <img alt="_images/graph.png" src="_images/graph.png" /> @@ -207,7 +212,7 @@ dependencies and their current status for a specific run.</p> <h2>Variable View<a class="headerlink" href="#variable-view" title="Permalink to this headline">¶</a></h2> <p>The variable view allows you to list, create, edit or delete the key-value pair of a variable used during jobs. Value of a variable will be hidden if the key contains -any words in (‘password’, ‘secret’, ‘passwd’, ‘authorization’, ‘api_key’, ‘apikey’, ‘access_token’) +any words in (‘password’, ‘secret’, ‘passwd’, ‘authorization’, ‘api_key’, ‘apikey’, ‘access_token’) by default, but can be configured to show in clear-text.</p> <hr class="docutils" /> <img alt="_images/variable_hidden.png" src="_images/variable_hidden.png" /> @@ -242,7 +247,7 @@ provide yet more context.</p> <hr class="docutils" /> <div class="section" id="task-instance-context-menu"> <h2>Task Instance Context Menu<a class="headerlink" href="#task-instance-context-menu" title="Permalink to this headline">¶</a></h2> -<p>From the pages seen above (tree view, graph view, gantt, ...), it is always +<p>From the pages seen above (tree view, graph view, gantt, …), it is always possible to click on a task instance, and get to this rich context menu that can take you to more detailed metadata, and perform some actions.</p> <hr class="docutils" />
