Repository: drill-site Updated Branches: refs/heads/asf-site 36b5428e1 -> edc3f206c
http://git-wip-us.apache.org/repos/asf/drill-site/blob/edc3f206/feed.xml ---------------------------------------------------------------------- diff --git a/feed.xml b/feed.xml index af7a909..fc2708d 100644 --- a/feed.xml +++ b/feed.xml @@ -6,11 +6,94 @@ </description> <link>/</link> <atom:link href="/feed.xml" rel="self" type="application/rss+xml"/> - <pubDate>Wed, 07 Feb 2018 18:35:47 -0800</pubDate> - <lastBuildDate>Wed, 07 Feb 2018 18:35:47 -0800</lastBuildDate> + <pubDate>Thu, 08 Feb 2018 16:20:28 -0800</pubDate> + <lastBuildDate>Thu, 08 Feb 2018 16:20:28 -0800</lastBuildDate> <generator>Jekyll v2.5.2</generator> <item> + <title>Running SQL Queries on Amazon S3</title> + <description><p>The functionality and sheer usefulness of Drill is growing fast. If you&#39;re a user of some of the popular BI tools out there like Tableau or SAP Lumira, now is a good time to take a look at how Drill can make your life easier, especially if you&#39;re faced with the task of quickly getting a handle on large sets of unstructured data. With schema generated on the fly, you can save a lot of time and headaches by running SQL queries on the data where it rests without knowing much about columns or formats. There&#39;s even more good news: Drill also works with data stored in the cloud. With a few simple steps, you can configure the S3 storage plugin for Drill and be off to the races running queries. In this post we&#39;ll look at how to configure Drill to access data stored in an S3 bucket.</p> + +<p>If you&#39;re more of a visual person, you can skip this article entirely and <a href="https://www.youtube.com/watch?v=w8gZ2nn_ZUQ">go straight to a video</a> I put together that walks through an end-to-end example with Tableau. 
This example is easily extended to other BI tools, as the steps are identical on the Drill side.</p> + +<p>At a high level, configuring Drill to access S3 bucket data is accomplished with the following steps; the first two must be performed on each node running a drillbit.</p> + +<ul> +<li>Download and install the <a href="http://www.jets3t.org/">JetS3t</a> JAR files and enable them.</li> +<li>Add your S3 credentials in the relevant XML configuration file.</li> +<li>Configure and enable the S3 storage plugin through the Drill web interface.</li> +<li>Connect your BI tool of choice and query away.</li> +</ul> + +<p>Consult the <a href="https://cwiki.apache.org/confluence/display/DRILL/Architectural+Overview">Architectural Overview</a> for a refresher on the architecture of Drill.</p> + +<h2 id="prerequisites">Prerequisites</h2> + +<p>These steps assume you have a <a href="https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+in+10+Minutes">typical Drill cluster and ZooKeeper quorum</a> configured and running. To access data in S3, you will need a configured S3 bucket and the required Amazon security credentials. An <a href="http://blogs.aws.amazon.com/security/post/Tx1R9KDN9ISZ0HF/Where-s-my-secret-access-key">Amazon blog post</a> has more information on how to get these from your account.</p> + +<h2 id="configuration-steps">Configuration Steps</h2> + +<p>To connect Drill to S3, all of the drillbit nodes will need to access code in the open-source JetS3t library. As of this writing, 0.9.2 is the latest version, but you might want to check <a href="https://jets3t.s3.amazonaws.com/toolkit/toolkit.html">the main page</a> to see if anything has been updated. 
Be sure to get version 0.9.2 or later, as earlier versions have a bug relating to reading Parquet data.</p> +<div class="highlight"><pre><code class="language-bash" data-lang="bash">wget http://bitbucket.org/jmurty/jets3t/downloads/jets3t-0.9.2.zip +unzip jets3t-0.9.2.zip +cp jets3t-0.9.2/jars/jets3t-0.9.2.jar <span class="nv">$DRILL_HOME</span>/jars/3rdparty +</code></pre></div> +<p>Next, enable the plugin by editing the file:</p> +<div class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$DRILL_HOME</span>/bin/hadoop_excludes.txt +</code></pre></div> +<p>and removing the line <code>jets3t</code>.</p> + +<p>Drill will need to know your S3 credentials in order to access data there. These credentials will need to be placed in the core-site.xml file for your installation. If you already have a core-site.xml file configured for your environment, add the following parameters to it; otherwise, create the file from scratch. If you do end up creating it from scratch, you will need to wrap these parameters with <code>&lt;configuration&gt;</code> and <code>&lt;/configuration&gt;</code>.</p> +<div class="highlight"><pre><code class="language-xml" data-lang="xml"><span class="nt">&lt;property&gt;</span> + <span class="nt">&lt;name&gt;</span>fs.s3.awsAccessKeyId<span class="nt">&lt;/name&gt;</span> + <span class="nt">&lt;value&gt;</span>ID<span class="nt">&lt;/value&gt;</span> +<span class="nt">&lt;/property&gt;</span> + +<span class="nt">&lt;property&gt;</span> + <span class="nt">&lt;name&gt;</span>fs.s3.awsSecretAccessKey<span class="nt">&lt;/name&gt;</span> + <span class="nt">&lt;value&gt;</span>SECRET<span class="nt">&lt;/value&gt;</span> +<span class="nt">&lt;/property&gt;</span> + +<span class="nt">&lt;property&gt;</span> + <span class="nt">&lt;name&gt;</span>fs.s3n.awsAccessKeyId<span class="nt">&lt;/name&gt;</span> + <span class="nt">&lt;value&gt;</span>ID<span class="nt">&lt;/value&gt;</span> +<span class="nt">&lt;/property&gt;</span> + +<span 
class="nt">&lt;property&gt;</span> + <span class="nt">&lt;name&gt;</span>fs.s3n.awsSecretAccessKey<span class="nt">&lt;/name&gt;</span> + <span class="nt">&lt;value&gt;</span>SECRET<span class="nt">&lt;/value&gt;</span> +<span class="nt">&lt;/property&gt;</span> +</code></pre></div> +<p>The steps so far give Drill enough information to connect to the S3 service. Remember, you have to do this on every node running a drillbit.</p> + +<p>Next, let&#39;s go into the Drill web interface and enable the S3 storage plugin. In this case you only need to connect to <strong>one</strong> of the nodes because Drill&#39;s configuration is synchronized across the cluster. Complete the following steps:</p> + +<ol> +<li>Point your browser to <code>http://&lt;host&gt;:8047</code></li> +<li>Select the &#39;Storage&#39; tab.</li> +<li>A good starting configuration for S3 can be entirely the same as the <code>dfs</code> plugin, except the connection parameter is changed to <code>s3://bucket</code>. So first select the <code>Update</code> button for <code>dfs</code>, then select the text area and copy its contents to the clipboard (on Windows, Ctrl-A, Ctrl-C works).</li> +<li>Press <code>Back</code>, then create a new plugin by typing a name into the <code>New Storage Plugin</code> field and pressing <code>Create</code>. You can choose any name, but a good convention is to use <code>s3-&lt;bucketname&gt;</code> so you can easily identify it later.</li> +<li>In the configuration area, paste the configuration you just grabbed from &#39;dfs&#39;. Change the line <code>connection: &quot;file:///&quot;</code> to <code>connection: &quot;s3://&lt;bucket&gt;&quot;</code>.</li> +<li>Click <code>Update</code>. You should see a message that indicates success.</li> +</ol> + +<p>At this point you can run queries on the data directly, and you have a couple of options for accessing it. 
You can use Drill Explorer and create a custom view (based on an SQL query) that you can then access in Tableau or other BI tools, or just use Drill directly from within the tool.</p> + +<p>You may want to check out the <a href="http://www.youtube.com/watch?v=jNUsprJNQUg">Tableau demo</a>.</p> + +<p>With just a few lines of configuration, you&#39;ve just opened the vast world of data available in the Amazon cloud and reduced the amount of work you have to do in advance to access data stored there with SQL. There are even some <a href="https://aws.amazon.com/datasets">public datasets</a> available directly on S3 that are great for experimentation.</p> + +<p>Happy Drilling!</p> +</description> + <pubDate>Fri, 09 Feb 2018 00:16:07 -0000</pubDate> + <link>/blog/2018/02/09/running-sql-queries-on-amazon-s3/</link> + <guid isPermaLink="true">/blog/2018/02/09/running-sql-queries-on-amazon-s3/</guid> + + + <category>blog</category> + + </item> + + <item> <title>Drill 1.12 Released</title> <description><p>Today, we&#39;re happy to announce the availability of Drill 1.12.0. You can download it <a href="https://drill.apache.org/download/">here</a>.</p> @@ -398,74 +481,5 @@ exist. Instead, Drill now returns <code>null</code> values for that </item> - <item> - <title>Drill 1.3 Released</title> - <description><p>Today I&#39;m happy to announce the availability of the Drill 1.3 release. This release addresses <a href="https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313820&amp;version=12332946">58 JIRAs</a> on top of the 1.2 release. Highlights include:</p> - -<h2 id="enhanced-amazon-s3-support">Enhanced Amazon S3 Support</h2> - -<p>Drill 1.3 utilizes a new library, called s3a, for reading data from S3. The s3a library includes improvements over the previous s3n library, such as higher performance and the ability to read large files (over 5GB).</p> - -<p>In addition to the new s3a library, Drill 1.3 makes it easier to set up your AWS credentials. 
Simply edit the file <code>conf/core-site.xml</code> in the Drill install directory. For more information, check out the <a href="/docs/s3-storage-plugin/">step-by-step instructions</a> in the documentation.</p> - -<h2 id="heterogeneous-types">Heterogeneous Types</h2> - -<p>Drill 1.3 includes support for mixed-type columns, often found in systems like MongoDB and file formats like JSON. For example, Drill can now handle columns that evolve from one data type to another over time.</p> - -<p>Drill 1.3 provides a collection of functions that enable you to test the data type of a value. For example, if you have a column that has both lists (arrays) and numbers, you can use the following query to extract the first element from the array values:</p> - -<p><code>SELECT 1 + CASE WHEN is_list(a) THEN a[0] ELSE a END FROM table;</code></p> - -<h2 id="text-file-headers">Text File Headers</h2> - -<p>Drill is now able to parse the header row in a text file (CSV, TSV, etc.). Prior to Drill 1.3, data had to be accessed through the <code>columns</code> array:</p> -<div class="highlight"><pre><code class="language-text" data-lang="text">SELECT columns[0], columns[1] FROM dfs.`/path/to/users.csv` -</code></pre></div> -<p>With Drill 1.3, you can use the actual column names in the CSV file:</p> -<div class="highlight"><pre><code class="language-text" data-lang="text">SELECT name, address FROM dfs.`/path/to/users.csv` -</code></pre></div> -<p>Enabling header parsing is as simple as setting the <code>extractHeader</code> parameter in the storage plugin configuration for the desired file extensions. For more information, check out <a href="/docs/text-files-csv-tsv-psv/">the documentation</a>.</p> - -<h2 id="sequence-files">Sequence Files</h2> - -<p>Drill now <a href="/docs/querying-sequence-files/">supports sequence files</a>, a format commonly used in the Hadoop ecosystem. 
A sequence file contains a series of keys and values, and querying it with Drill is as easy as querying any other self-describing format:</p> -<div class="highlight"><pre><code class="language-text" data-lang="text">SELECT * -FROM dfs.tmp.`simple.seq` -LIMIT 1; -+--------------+---------------+ -| binary_key | binary_value | -+--------------+---------------+ -| [B@70828f46 | [B@b8c765f | -+--------------+---------------+ -</code></pre></div> -<p>Drill&#39;s <code>CONVERT_FROM</code> function makes it easy to decode the binary values:</p> -<div class="highlight"><pre><code class="language-text" data-lang="text">SELECT CONVERT_FROM(binary_key, &#39;UTF8&#39;), CONVERT_FROM(binary_value, &#39;UTF8&#39;) -FROM dfs.tmp.`simple.seq` -LIMIT 1 -; -+-----------+-------------+ -| EXPR$0 | EXPR$1 | -+-----------+-------------+ -| key0 | value0 | -+-----------+-------------+ -</code></pre></div> -<h2 id="many-more-fixes">Many More Fixes</h2> - -<p>Drill 1.3 includes many other improvements, including enhancements related to querying Hive tables, MongoDB collections and Avro files. Check out the complete list of <a href="https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313820&amp;version=12332946">fixes and enhancements</a> for more information.</p> - -<p>Download the <a href="https://drill.apache.org/download/">Drill 1.3 release</a> now and let us know your thoughts.</p> - -<p>Drill On! 
-Jacques Nadeau</p> -</description> - <pubDate>Mon, 23 Nov 2015 00:00:00 -0800</pubDate> - <link>/blog/2015/11/23/drill-1.3-released/</link> - <guid isPermaLink="true">/blog/2015/11/23/drill-1.3-released/</guid> - - - <category>blog</category> - - </item> - </channel> </rss> http://git-wip-us.apache.org/repos/asf/drill-site/blob/edc3f206/index.html ---------------------------------------------------------------------- diff --git a/index.html b/index.html index 2543a6c..637a409 100644 --- a/index.html +++ b/index.html @@ -166,7 +166,7 @@ $(document).ready(function() { </div><!-- header --> <div class="alertbar"> - <div class="news">News:</div><div><a href="/blog/2017/12/15/drill-1.12-released/">Drill 1.12 Released</a><br/><span>(Bridget Bevens)</span></div><div><a href="/blog/2017/07/31/drill-1.11-released/">Drill 1.11 Released</a><br/><span>(Bridget Bevens)</span></div> + <div class="news">News:</div><div><a href="/blog/2018/02/09/running-sql-queries-on-amazon-s3/">Running SQL Queries on Amazon S3</a><br/><span>(Nick Amato)</span></div><div><a href="/blog/2017/12/15/drill-1.12-released/">Drill 1.12 Released</a><br/><span>(Bridget Bevens)</span></div> </div> <div class="mw introWrapper">