This is an automated email from the ASF dual-hosted git repository.
dzamo pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/drill-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 8aa50b9 Website update.
8aa50b9 is described below
commit 8aa50b91d6d8d318449795d35df94aa9f3e3a4a5
Author: James Turton <[email protected]>
AuthorDate: Mon Aug 23 10:22:12 2021 +0200
Website update.
---
docs/orchestrating-queries-with-airflow/index.html | 24 +++++++++++-----------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/docs/orchestrating-queries-with-airflow/index.html b/docs/orchestrating-queries-with-airflow/index.html
index 6d1dac5..4f33288 100644
--- a/docs/orchestrating-queries-with-airflow/index.html
+++ b/docs/orchestrating-queries-with-airflow/index.html
@@ -1418,7 +1418,7 @@
<div class="int_text" align="left">
- <p>This tutorial walks through the development of Apache Airflow DAG
that implements a basic ETL process using Apache Drill. We’ll install Airflow
into a Python virtualenv using pip before writing and testing our new DAG.
Consult the <a
href="https://airflow.apache.org/docs/apache-airflow/stable/installation.html">Airflow
installation documentation</a> for more information about installing
Airflow.</p>
+ <p>This tutorial walks through the development of an Apache Airflow
DAG that implements a basic ETL process using Apache Drill. We’ll install
Airflow into a Python virtualenv using pip before writing and testing our new
DAG. Consult the <a
href="https://airflow.apache.org/docs/apache-airflow/stable/installation.html">Airflow
installation documentation</a> for more information about installing
Airflow.</p>
<p>I’ll be issuing commands using a shell on a Debian Linux machine in this
tutorial but it should be possible with a little translation to follow along on
other platforms.</p>
@@ -1439,11 +1439,11 @@ virtualenv <span class="nt">-p</span> /usr/bin/python3 <span class="nv">$VIRT_EN
<h2 id="install-airflow">Install Airflow</h2>
-<p>If you’ve read their installation guide you’ll have seen that the Airflow
project provides constraints files the pin the versions of its Python package
dependencies to known-good versions. In many cases things work fine without
constraints but, for the sake of reproducibility, we’ll apply the constraints
file applicable to our Python version using the script 0they provide for the
purpose.</p>
+<p>If you’ve read their installation guide, you’ll have seen that the Airflow
project provides constraints files that pin its Python package dependencies to
known-good versions. In many cases things work fine without constraints but,
for the sake of reproducibility, we’ll apply the constraints file applicable to
our Python version using the script they provide for the purpose.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre
class="highlight"><code><span class="nv">AIRFLOW_VERSION</span><span
class="o">=</span>2.1.2
<span class="nv">PYTHON_VERSION</span><span class="o">=</span><span
class="s2">"</span><span class="si">$(</span>python <span
class="nt">--version</span> | <span class="nb">cut</span> <span
class="nt">-d</span> <span class="s2">" "</span> <span class="nt">-f</span> 2 |
<span class="nb">cut</span> <span class="nt">-d</span> <span
class="s2">"."</span> <span class="nt">-f</span> 1-2<span
class="si">)</span><span class="s2">"</span>
<span class="nv">CONSTRAINT_URL</span><span class="o">=</span><span
class="s2">"https://raw.githubusercontent.com/apache/airflow/constraints-</span><span
class="k">${</span><span class="nv">AIRFLOW_VERSION</span><span
class="k">}</span><span class="s2">/constraints-</span><span
class="k">${</span><span class="nv">PYTHON_VERSION</span><span
class="k">}</span><span class="s2">.txt"</span>
-pip <span class="nb">install</span> <span
class="s2">"apache-0airflow==</span><span class="k">${</span><span
class="nv">AIRFLOW_VERSION</span><span class="k">}</span><span
class="s2">"</span> <span class="nt">--constraint</span> <span
class="s2">"</span><span class="k">${</span><span
class="nv">CONSTRAINT_URL</span><span class="k">}</span><span
class="s2">"</span>
+pip <span class="nb">install</span> <span
class="s2">"apache-airflow==</span><span class="k">${</span><span
class="nv">AIRFLOW_VERSION</span><span class="k">}</span><span
class="s2">"</span> <span class="nt">--constraint</span> <span
class="s2">"</span><span class="k">${</span><span
class="nv">CONSTRAINT_URL</span><span class="k">}</span><span
class="s2">"</span>
pip <span class="nb">install </span>apache-airflow-providers-apache-drill
</code></pre></div></div>
@@ -1452,7 +1452,7 @@ pip <span class="nb">install </span>apache-airflow-providers-apache-drill
<p>We’re just experimenting here so we’ll have Airflow set up a local SQLite
database and add an admin user for ourselves.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre
class="highlight"><code><span class="c"># Optional: change Airflow's data dir
from the default of ~/airflow</span>
<span class="nb">export </span><span class="nv">AIRFLOW_HOME</span><span
class="o">=</span>~/Development/airflow
-<span class="nb">mkdir</span> <span class="nt">-p</span> ~/Development/airflow/
+<span class="nb">mkdir</span> <span class="nt">-p</span> ~/Development/airflow
<span class="c"># Create a new SQLite database for Airflow</span>
airflow db init
@@ -1469,7 +1469,7 @@ airflow <span class="nb">users </span>create <span class="se">\</span>
<h2 id="configure-a-drill-connection">Configure a Drill connection</h2>
-<p>At this point we should have a working Airflow installation. Fire up the
web UI with <code class="language-plaintext highlighter-rouge">airflow
webserver</code> and browse to http://localhost:8080. Click on Admin ->
Connections. Add a new Drill connection called <code class="language-plaintext
highlighter-rouge">drill_tutorial</code>, setting configuration according to
your Drill environment. If you’re using embedded mode Drill locally like I am
then you’ll want the following co [...]
+<p>At this point we should have a working Airflow installation. Fire up the
web UI with <code class="language-plaintext highlighter-rouge">airflow
webserver</code> and browse to http://localhost:8080. Click on Admin ->
Connections and add a new Drill connection called <code
class="language-plaintext highlighter-rouge">drill_tutorial</code>, setting
configuration according to your Drill environment. If you’re using embedded
mode Drill locally like I am, then you’ll want the following [...]
<table>
<thead>
@@ -1508,15 +1508,15 @@ airflow <span class="nb">users </span>create <span class="se">\</span>
<h2 id="explore-the-source-data">Explore the source data</h2>
-<p>If you’ve built ETLs before you know that you can’t build anything until
you’ve come to grips with the source data. Let’s obtain a sample of the first
1m rows from the source take a look.</p>
+<p>If you’ve developed ETLs before you know that you can’t build anything
until you’ve come to grips with the source data. Let’s obtain a sample of the
first 1m rows from the source take a look.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre
class="highlight"><code>curl <span class="nt">-s</span>
https://data.cdc.gov/api/views/vbim-akqf/rows.csv<span
class="se">\?</span>accessType<span class="se">\=</span>DOWNLOAD | pv <span
class="nt">-lSs</span> 1000000 <span class="o">></span>
/tmp/cdc_covid_cases.csvh
</code></pre></div></div>
-<p>You can replace <code class="language-plaintext highlighter-rouge">pv -lSs
1000000</code> above with <code class="language-plaintext
highlighter-rouge">head -n1000000</code> or just drop it if you don’t mind
fetching the whole file. Downloading it with a web browser will also work
fine. Note that for a default Drill installation, saving with the file
extension <code class="language-plaintext highlighter-rouge">.csvh</code> does
matter for what follows because it will set <code class [...]
+<p>You can replace <code class="language-plaintext highlighter-rouge">pv -lSs
1000000</code> above with <code class="language-plaintext
highlighter-rouge">head -n1000000</code>, or just drop it if you don’t mind
fetching the whole file. Downloading the CSV file with a web browser will also
get the job done. Note that for a default Drill installation, saving with the
file extension <code class="language-plaintext highlighter-rouge">.csvh</code>
does matter for what follows because it wi [...]
-<p>It’s time to break out Drill. Instead of dumping my entire interactive SQL
session here, I’ll just list queries that I ran and the corresponding
observations that I made.</p>
-<div class="language-sql highlighter-rouge"><div class="highlight"><pre
class="highlight"><code><span class="k">select</span> <span class="o">*</span>
<span class="k">from</span> <span class="n">dfs</span><span
class="p">.</span><span class="n">tmp</span><span class="p">.</span><span
class="nv">`cdc_covid_case.csvh`</span>
+<p>It’s time to break out Drill. Instead of dumping my entire interactive SQL
session here, I’ll just list relevant queries that I ran and the corresponding
observations that I made.</p>
+<div class="language-sql highlighter-rouge"><div class="highlight"><pre
class="highlight"><code><span class="k">select</span> <span class="o">*</span>
<span class="k">from</span> <span class="n">dfs</span><span
class="p">.</span><span class="n">tmp</span><span class="p">.</span><span
class="nv">`cdc_covid_case.csvh`</span><span class="p">;</span>
<span class="c1">-- 1. In date fields, the empty string '' can be converted to
SQL NULL</span>
<span class="c1">-- 2. Age groups can be split into two numerical fields, with
the final</span>
<span class="c1">-- group being unbounded above.</span>
@@ -1535,7 +1535,7 @@ airflow <span class="nb">users </span>create <span class="se">\</span>
<span class="c1">-- so they cannot be transformed to nullable
booleans</span>
</code></pre></div></div>
-<p>So… this is what it feels like to be a data scientist 😆. Jokes aside, we
learned a lot of neccesary stuff pretty quickly there and it’s easy to see that
we could have carried on for a long way, testing ranges, casts and regexps and
even creating reports if we didn’t reign ourselves in. Let’s skip forward to
the ETL statement I ended up creating after exploring.</p>
+<p>So… this is what it feels like to be a data scientist 😆! Jokes aside, we
learned a lot of neccesary stuff pretty quickly there and it’s easy to see that
we could have carried on for a long way, testing ranges, casts and regexps and
even creating reports if we didn’t reign ourselves in. Let’s skip forward to
the ETL statement I ended up creating after exploring.</p>
<h2 id="develop-a-ctas-create-table-as-select-etl">Develop a CTAS (Create
Table As Select) ETL</h2>
@@ -1598,13 +1598,13 @@ airflow <span class="nb">users </span>create <span class="se">\</span>
<h2 id="develop-an-airflow-dag">Develop an Airflow DAG</h2>
-<p>The definition of our DAG will reside in a single Python script. The
complete listing of that script follows immediately, with my commentary
continuing as inline source code comments. You should save this script to a
new file at <code class="language-plaintext
highlighter-rouge">$AIRFLOW_HOME/dags/drill_tutorial.py</code>.</p>
+<p>The definition of our DAG will reside in a single Python script. The
complete listing of that script follows immediately, with my commentary
continuing as inline source code comments. You should save this script to a
new file at <code class="language-plaintext
highlighter-rouge">$AIRFLOW_HOME/dags/drill-tutorial.py</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre
class="highlight"><code><span class="s">'''
Uses the Apache Drill provider to transform, load and report from COVID case
data downloaded from the website of the CDC.
-Data source citatation.
+Data source citation.
Centers for Disease Control and Prevention, COVID-19 Response. COVID-19 Case
Surveillance Public Data Access, Summary, and Limitations.