This is an automated email from the ASF dual-hosted git repository.

dzamo pushed a commit to branch gh-pages
in repository https://gitbox.apache.org/repos/asf/drill.git

commit aa99123c5690cfacb74925df740b02f5c3b6350b
Author: James Turton <[email protected]>
AuthorDate: Thu Aug 5 16:01:44 2021 +0200

    Drill provider for Airflow blog post.
---
 .../install/047-installing-drill-on-the-cluster.md |  2 +-
 ...leased.md => 2018-03-18-drill-1.13-released.md} |  0
 .../en/2021-08-05-drill-provider-for-airflow.md    | 28 ++++++++++++++++++++++
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/_docs/en/install/047-installing-drill-on-the-cluster.md b/_docs/en/install/047-installing-drill-on-the-cluster.md
index de359b0..2761af9 100644
--- a/_docs/en/install/047-installing-drill-on-the-cluster.md
+++ b/_docs/en/install/047-installing-drill-on-the-cluster.md
@@ -16,7 +16,7 @@ You install Drill on nodes in the cluster, configure a cluster ID, and add Zooke
 
 ### (Optional) Create the site directory
 
-The site directory contains your site-specific files for Drill.  Putting these in a separate directory to the Drill installation means that upgrading Drill will not clobber your configuration and custom code.  It is possible to skip this step, meaning that your configuration and custom code will live in the `$DRILL_HOME/conf` and `$DRILL_HOME/jars/3rdparty` subdirectories respectively.
+The site directory contains your site-specific files for Drill.  Putting these in a separate directory to the Drill installation means that upgrading Drill will not overwrite your configuration and custom code.  It is possible to skip this step, meaning that your configuration and custom code will live in the `$DRILL_HOME/conf` and `$DRILL_HOME/jars/3rdparty` subdirectories respectively.
 
 Create the site directory in a suitable location, e.g.
 
diff --git a/blog/_posts/en/2018-3-18-drill-1.13-released.md b/blog/_posts/en/2018-03-18-drill-1.13-released.md
similarity index 100%
rename from blog/_posts/en/2018-3-18-drill-1.13-released.md
rename to blog/_posts/en/2018-03-18-drill-1.13-released.md
diff --git a/blog/_posts/en/2021-08-05-drill-provider-for-airflow.md b/blog/_posts/en/2021-08-05-drill-provider-for-airflow.md
new file mode 100644
index 0000000..b643924
--- /dev/null
+++ b/blog/_posts/en/2021-08-05-drill-provider-for-airflow.md
@@ -0,0 +1,28 @@
+---
+layout: post
+title: "Drill provider for Airflow"
+code: drill-provider-for-airflow
+excerpt: In its provider package release this month, the Apache Airflow project added a provider for interacting with Apache Drill.  This allows data engineers and data scientists to incorporate Drill queries in their Airflow DAGs, enabling the automation of big data and data science workflows.
+
+authors: ["jturton"]
+---
+
+You're building a new report, visualisation or ML model.  Most of the data involved comes from sources well known to you, but a new source has become available, allowing your team to measure and model new variables.  Eager to get to a prototype and an early sense of what the new analytics look like, you head straight for the first order of business and start to construct a first version of the dataset upon which your final output will be based.
+
+The data sources you need to combine are immediately accessible but heterogeneous: transactional data in PostgreSQL must be combined with data from another team that uses Splunk, lookup data maintained by an operational team in an Excel spreadsheet, thousands of XML exports received from a partner and some Parquet files already in your big data environment just for good measure.
+
+Using Drill iteratively, you query and join in each data source one at a time, applying grouping, filtering and other intensive transformations as you go, finally producing a dataset with the fields and grain you need.  You store it by adding CREATE TABLE AS in front of your final SELECT, then write a few counting and summing queries against the original data sources and your transformed dataset to check that your code produces the expected outputs.
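+
+As a minimal sketch of that pattern (run here from Python via the sqlalchemy-drill package introduced later in this post, though any SQL client works equally well; the connection string, storage plugin names and tables are purely illustrative):
+
+```python
+from sqlalchemy import create_engine, text
+
+# Hypothetical connection: a local Drill exposing its REST API on the
+# default port 8047, with no authentication.
+engine = create_engine("drill+sadrill://localhost:8047/dfs?use_ssl=False")
+
+# The prototype's final SELECT with CREATE TABLE AS prepended.  Every
+# workspace and table name below is a made-up example.
+ctas = text("""
+    CREATE TABLE dfs.tmp.`combined_dataset` AS
+    SELECT t.customer_id, r.region, SUM(t.amount) AS total_amount
+    FROM pg.public.transactions t
+    JOIN dfs.lookups.`regions.csvh` r ON r.region_id = t.region_id
+    GROUP BY t.customer_id, r.region
+""")
+
+with engine.connect() as conn:
+    conn.execute(ctas)
+```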
+
+Apart from possibly configuring some new storage plugins in the Drill web UI, you have so far not left DBeaver (or your editor of choice).  The onerous data exploration and plumbing parts of your project have flashed by in a blaze of SQL, and you move your dataset into the next tool for visualisation or modelling.  The results are good and you know that your users will immediately ask for the outputs to incorporate new data on a regular schedule.
+
+While Drill can assemble your dataset on the fly, as it did while you prototyped, doing that for the full set takes over 20 minutes, places more load on your data sources during office hours than you'd like and limits you to the history that the sources keep, in some cases only a few weeks.
+
+It's time for ETL, you concede.  In the past that meant choosing between keeping your working Drill SQL and scheduling it with 70s Unix tools like cron and Bash, or recreating your Drill SQL in other tools and languages, perhaps Apache Beam or PySpark, resorting to multiple tools if no single one of them is as omnivorous as Drill.  But this time it's different...
+
+[Apache Airflow](https://airflow.apache.org) is a workflow engine built in the Python programming ecosystem that has grown into a leading choice for orchestrating big data pipelines, amongst its other applications.  Perhaps the first point to understand about Airflow in the context of ETL is that it is designed only for workflow _control_, and not for data flow.  This makes it different from some of the ETL tools you might have encountered like Microsoft's SSIS or Pentaho's PDI which han [...]
+
+In contrast, Airflow is, unless you're doing it wrong, used only to instruct other software like Spark, Beam, PostgreSQL, Bash, Celery, Scikit-learn scripts, Slack (... the list of connectors is long and varied) to kick off actions at scheduled times.  While Airflow does load its schedules from the crontab format, a comparison to cron stops there.  Airflow can resolve and execute complex job DAGs with options for clustering, parallelism, retries, backfilling and performance monitoring.
+
+The exciting news for Drill users is that [a new provider package adding support for Drill](https://pypi.org/project/apache-airflow-providers-apache-drill/) was added to Airflow this month.  This provider is based on the [sqlalchemy-drill package](https://pypi.org/project/sqlalchemy-drill/), which provides Drill connectivity for Python programs.  This means that you can add tasks which execute queries on Drill to your Airflow DAGs without any hacky intermediate shell scripts, or build new [...]
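+
+To make that concrete, here is a minimal sketch of a DAG built on the new provider (the connection ID, schedule and SQL below are illustrative assumptions, not anything prescribed by the provider):
+
+```python
+from datetime import datetime
+
+from airflow import DAG
+from airflow.providers.apache.drill.operators.drill import DrillOperator
+
+with DAG(
+    dag_id="rebuild_combined_dataset",
+    start_date=datetime(2021, 8, 1),
+    schedule_interval="0 2 * * *",  # crontab syntax: every night at 02:00
+    catchup=False,
+) as dag:
+    # Drop and rebuild the dataset; each list entry runs as its own query.
+    # Assumes a Drill connection saved in Airflow as "drill_default" and
+    # the same hypothetical tables as the CTAS example above.
+    rebuild = DrillOperator(
+        task_id="rebuild",
+        drill_conn_id="drill_default",
+        sql=[
+            "DROP TABLE IF EXISTS dfs.tmp.`combined_dataset`",
+            """
+            CREATE TABLE dfs.tmp.`combined_dataset` AS
+            SELECT t.customer_id, r.region, SUM(t.amount) AS total_amount
+            FROM pg.public.transactions t
+            JOIN dfs.lookups.`regions.csvh` r ON r.region_id = t.region_id
+            GROUP BY t.customer_id, r.region
+            """,
+        ],
+    )
+```
+
+Retries, backfilling, parallelism and the rest of Airflow's machinery then apply to this task just as they do to any other.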
+
+In the coming days a basic tutorial for using Drill with Airflow will be added to this site, and this sentence replaced with a link.
