This is an automated email from the ASF dual-hosted git repository.
bridgetb pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/drill-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new ca00b83 edit create schema and refresh table metadata docs
ca00b83 is described below
commit ca00b83e5b9c8ad246656cee29dafcd2b44bbdad
Author: Bridget Bevens <[email protected]>
AuthorDate: Mon Apr 29 13:53:10 2019 -0700
edit create schema and refresh table metadata docs
---
docs/create-or-replace-schema/index.html | 8 +-
docs/refresh-table-metadata/index.html | 415 +++++++++++++++----------------
feed.xml | 4 +-
3 files changed, 211 insertions(+), 216 deletions(-)
diff --git a/docs/create-or-replace-schema/index.html b/docs/create-or-replace-schema/index.html
index 7ce7f74..b9eddf9 100644
--- a/docs/create-or-replace-schema/index.html
+++ b/docs/create-or-replace-schema/index.html
@@ -1316,13 +1316,13 @@
</div>
- Apr 25, 2019
+ Apr 29, 2019
<link href="/css/docpage.css" rel="stylesheet" type="text/css">
<div class="int_text" align="left">
- <p>Starting in Drill 1.16, you can define a schema for text files
using the CREATE OR REPLACE SCHEMA command. Running this command generates a
hidden .drill.schema file in the table’s root directory. The .drill.schema file
stores the schema definition in JSON format. Drill uses the schema file at
runtime if the exec.storage.enable_v3_text_reader and
store.table.use_schema_file options are enabled. Alternatively, you can create
the schema file manually. When created manually, the [...]
+ <p>Starting in Drill 1.16, you can define a schema for text files
using the CREATE OR REPLACE SCHEMA command. Running this command generates a
hidden <code>.drill.schema</code> file in the table’s root directory. The
<code>.drill.schema</code> file stores the schema definition in JSON format.
Drill uses the schema file at runtime if the
<code>exec.storage.enable_v3_text_reader</code> and
<code>store.table.use_schema_file</code> options are enabled. Alternatively,
you can create t [...]
<h2 id="syntax">Syntax</h2>
@@ -1503,7 +1503,7 @@ A property that sets how Drill handles blank column values. Accepts the followin
<p>The schema mode determines the ordering of columns returned for wildcard
(*) queries. The mode is set through the <code>drill.strict</code> property.
You can set this property to true (strict) or false (not strict). If you do not
indicate the mode, the default is false (not strict). </p>
-<p><strong>Not Strict (Default)</strong>
+<p><strong>Not Strict (Default)</strong><br>
Columns defined in the schema are projected in the defined order. Columns not
defined in the schema are appended to the defined columns, as shown: </p>
<div class="highlight"><pre><code class="language-text"
data-lang="text">create or replace schema (id int, start_date date format
'yyyy-MM-dd') for table dfs.tmp.`text_table` properties
('drill.strict' = 'false');
+------+-----------------------------------------+
@@ -1525,7 +1525,7 @@ select * from dfs.tmp.`text_table`;
</code></pre></div>
<p>Note that the “name” column, which was not included in the schema, was
appended to the end of the table.</p>
-<p><strong>Strict</strong>
+<p><strong>Strict</strong><br>
Setting the <code>drill.strict</code> property to “true” changes the schema
mode to strict, which means that the reader ignores any columns NOT included in
the schema. The query only returns the columns defined in the schema, as
shown:</p>
<div class="highlight"><pre><code class="language-text"
data-lang="text">create or replace schema (id int, start_date date format
'yyyy-MM-dd') for table dfs.tmp.`text_table` properties
('drill.strict' = 'true');
+------+-----------------------------------------+
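The column-ordering rule described in the hunks above can be sketched as a small Python model. This is an illustration only, not Drill code: the function name `project_columns` and the assumed file column order (`id`, `name`, `start_date`) are hypothetical, chosen to match the `text_table` example in the page.

```python
def project_columns(schema_cols, file_cols, strict=False):
    """Return the column order a wildcard (*) query would produce
    under the schema-mode rule described in the docs (toy model)."""
    # Columns defined in the schema come first, in the defined order.
    projected = list(schema_cols)
    if not strict:
        # Not strict (default): columns missing from the schema are appended.
        projected += [c for c in file_cols if c not in schema_cols]
    # Strict: columns not in the schema are ignored entirely.
    return projected


# Hypothetical column order in the underlying text file.
file_cols = ["id", "name", "start_date"]
schema_cols = ["id", "start_date"]

print(project_columns(schema_cols, file_cols))               # ['id', 'start_date', 'name']
print(project_columns(schema_cols, file_cols, strict=True))  # ['id', 'start_date']
```

The non-strict result places `name` last, which matches the note in the page that the column not included in the schema is appended to the end of the table.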
diff --git a/docs/refresh-table-metadata/index.html b/docs/refresh-table-metadata/index.html
index 8b17a86..2a8b71f 100644
--- a/docs/refresh-table-metadata/index.html
+++ b/docs/refresh-table-metadata/index.html
@@ -1316,7 +1316,7 @@
</div>
- Apr 23, 2019
+ Apr 29, 2019
<link href="/css/docpage.css" rel="stylesheet" type="text/css">
@@ -1354,7 +1354,7 @@ Required. The name of the table or directory for which Drill will refresh metada
<h3 id="metadata-storage">Metadata Storage</h3>
<ul>
-<li>Drill traverses directories for Parquet files and gathers the metadata
from the footer of the files. Drill stores the collected metadata in a metadata
cache file, <code>.drill.parquet_metadata</code>, at each directory
level.<br></li>
+<li>Drill traverses directories for Parquet files and gathers the metadata
from the footer of the files. Drill stores the collected metadata in a metadata
cache file, <code>.drill.parquet_file_metadata.v4</code>, a summary file,
<code>.drill.parquet_summary_metadata.v4</code>, and a directories file,
<code>.drill.parquet_metadata_directories</code>, at each directory
level.<br></li>
<li>The metadata cache file stores metadata for files in that directory, as
well as the metadata for the files in the subdirectories.<br></li>
<li>For each row group in a Parquet file, the metadata cache file stores the
column names in the row group and the column statistics, such as the min/max
values and null count.<br></li>
<li>If the Parquet data is updated, for example data is added to a file, Drill
automatically refreshes the Parquet metadata when you issue the next query
against the Parquet data.<br></li>
@@ -1404,217 +1404,212 @@ Sets the number of row groups that a table can have. You can increase the thresh
<p>Currently, Drill does not support runtime rowgroup pruning. </p>
-<!--
-## Examples
-These examples use a schema, `dfs.samples`, which points to the `/home`
directory. The `/home` directory contains a subdirectory, `parquet`, which
-contains the `nation.parquet` and a subdirectory, `dir1` with the
`region.parquet` file. You can access the `nation.parquet` and `region.parquet`
Parquet files in the `sample-data` directory of your Drill installation.
-
- [root@doc23 dir1]# pwd
- /home/parquet/dir1
-
- [root@doc23 parquet]# ls
- dir1 nation.parquet
-
- [root@doc23 dir1]# ls
- region.parquet
-
-Change schemas to use `dfs.samples`:
-
- use dfs.samples;
- +-------+------------------------------------------+
- | ok | summary |
- +-------+------------------------------------------+
- | true | Default schema changed to [dfs.samples] |
- +-------+------------------------------------------+
-
-### Running REFRESH TABLE METADATA on a Directory
-Running the REFRESH TABLE METADATA command on the `parquet` directory
generates metadata cache files at each directory level.
-
- REFRESH TABLE METADATA parquet;
- +-------+---------------------------------------------------+
- | ok | summary |
- +-------+---------------------------------------------------+
- | true | Successfully updated metadata for table parquet. |
- +-------+---------------------------------------------------+
-
-When looking at the `parquet` directory and `dir1` subdirectory, you can see
that a metadata cache file was created at each level:
-
- [root@doc23 parquet]# ls -la
- drwxr-xr-x 2 root root 95 Mar 18 17:49 dir1
- -rw-r--r-- 1 root root 2642 Mar 18 17:52 .drill.parquet_metadata
- -rw-r--r-- 1 root root 32 Mar 18 17:52 ..drill.parquet_metadata.crc
- -rwxr-xr-x 1 root root 1210 Mar 13 13:32 nation.parquet
-
- [root@doc23 dir1]# ls -la
- -rw-r--r-- 1 root root 1235 Mar 18 17:52 .drill.parquet_metadata
- -rw-r--r-- 1 root root 20 Mar 18 17:52 ..drill.parquet_metadata.crc
- -rwxr-xr-x 1 root root 455 Mar 18 17:41 region.parquet
-
-The following sections compare the content of the metadata cache file in the
`parquet` and `dir1` directories:
-
-**Content of the metadata cache file in the directory named `parquet` that
contains the nation.parquet file and subdirectory `dir1`.**
-
-
- [root@doc23 parquet]# cat .drill.parquet_metadata
- {
- "metadata_version" : "3.3",
- "columnTypeInfo" : {
- "`N_COMMENT`" : {
- "name" : [ "N_COMMENT" ],
- "primitiveType" : "BINARY",
- "originalType" : "UTF8",
- "precision" : 0,
- "scale" : 0,
- "repetitionLevel" : 0,
- "definitionLevel" : 0
- },
- "`N_NATIONKEY`" : {
- "name" : [ "N_NATIONKEY" ],
- "primitiveType" : "INT64",
- "originalType" : null,
- "precision" : 0,
- "scale" : 0,
- "repetitionLevel" : 0,
- "definitionLevel" : 0
- },
- "`R_REGIONKEY`" : {
- "name" : [ "R_REGIONKEY" ],
- "primitiveType" : "INT64",
- "originalType" : null,
- "precision" : 0,
- "scale" : 0,
- "repetitionLevel" : 0,
- "definitionLevel" : 0
- },
- "`R_COMMENT`" : {
- "name" : [ "R_COMMENT" ],
- "primitiveType" : "BINARY",
- "originalType" : "UTF8",
- "precision" : 0,
- "scale" : 0,
- "repetitionLevel" : 0,
- "definitionLevel" : 0
- },
- "`N_REGIONKEY`" : {
- "name" : [ "N_REGIONKEY" ],
- "primitiveType" : "INT64",
- "originalType" : null,
- "precision" : 0,
- "scale" : 0,
- "repetitionLevel" : 0,
- "definitionLevel" : 0
- },
- "`R_NAME`" : {
- "name" : [ "R_NAME" ],
- "primitiveType" : "BINARY",
- "originalType" : "UTF8",
- "precision" : 0,
- "scale" : 0,
- "repetitionLevel" : 0,
- "definitionLevel" : 0
- },
- "`N_NAME`" : {
- "name" : [ "N_NAME" ],
- "primitiveType" : "BINARY",
- "originalType" : "UTF8",
- "precision" : 0,
- "scale" : 0,
- "repetitionLevel" : 0,
- "definitionLevel" : 0
- }
+<h2 id="examples">Examples</h2>
+
+<p>These examples use a schema, <code>dfs.samples</code>, which points to the
<code>/tmp</code> directory. The <code>/tmp</code> directory contains the
following subdirectories and files used in the examples: </p>
+<div class="highlight"><pre><code class="language-text"
data-lang="text">[root@doc23 parquet1]# pwd
+/tmp/parquet1
+
+[root@doc23 parquet1]# ls
+parquet
+
+[root@doc23 parquet1]# cd parquet
+
+[root@doc23 parquet]# ls
+nation.parquet test
+
+[root@doc23 parquet]# cd test
+
+[root@doc23 test]# ls
+nation.parquet
+</code></pre></div>
+<p><strong>Note:</strong> You can access the sample
<code>nation.parquet</code> file in the <code>sample-data</code> directory of
your Drill installation.</p>
+
+<p>Switch to the <code>dfs.samples</code> schema: </p>
+<div class="highlight"><pre><code class="language-text" data-lang="text">use
dfs.samples;
++-------+------------------------------------------+
+| ok | summary |
++-------+------------------------------------------+
+| true | Default schema changed to [dfs.samples] |
++-------+------------------------------------------+
+</code></pre></div>
+<h3 id="running-refresh-table-metadata-on-a-directory">Running REFRESH TABLE
METADATA on a Directory</h3>
+
+<p>Running the REFRESH TABLE METADATA command on the “parquet1” directory
generates metadata cache files at each directory level.</p>
+<div class="highlight"><pre><code class="language-text"
data-lang="text">apache drill (dfs.samples)> REFRESH TABLE METADATA parquet1;
++------+---------------------------------------------------+
+| ok | summary |
++------+---------------------------------------------------+
+| true | Successfully updated metadata for table parquet1. |
++------+---------------------------------------------------+
+</code></pre></div>
+<p>When looking at the “parquet1” directory and its subdirectories, you can
see that hidden metadata cache and summary files were created at each level:</p>
+
+<p><strong>Note:</strong> The CRC files are cyclic redundancy check (CRC)
checksum files used to verify the data integrity of other files. </p>
+<div class="highlight"><pre><code class="language-text"
data-lang="text">[root@doc23 parquet1]# ls -la
+total 36
+drwxr-xr-x 3 root root 284 Apr 29 11:46 .
+drwxrwxrwt. 51 root root 8192 Apr 29 11:44 ..
+-rw-r--r-- 1 root root 1037 Apr 29 11:46 .drill.parquet_file_metadata.v4
+-rw-r--r-- 1 root root 20 Apr 29 11:46 ..drill.parquet_file_metadata.v4.crc
+-rw-r--r-- 1 root root 51 Apr 29 11:46 .drill.parquet_metadata_directories
+-rw-r--r-- 1 root root 12 Apr 29 11:46
..drill.parquet_metadata_directories.crc
+-rw-r--r-- 1 root root 1334 Apr 29 11:46 .drill.parquet_summary_metadata.v4
+-rw-r--r-- 1 root root 20 Apr 29 11:46
..drill.parquet_summary_metadata.v4.crc
+drwxr-xr-x 3 root root 212 Apr 29 11:30 parquet
+
+[root@doc23 parquet1]# cd parquet
+[root@doc23 parquet]# ls -la
+total 20
+drwxr-xr-x 3 root root 212 Apr 29 11:30 .
+drwxr-xr-x 3 root root 284 Apr 29 11:46 ..
+-rw-r--r-- 1 root root 1021 Apr 29 11:46 .drill.parquet_file_metadata.v4
+-rw-r--r-- 1 root root 16 Apr 29 11:46 ..drill.parquet_file_metadata.v4.crc
+-rw-r--r-- 1 root root 1315 Apr 29 11:46 .drill.parquet_summary_metadata.v4
+-rw-r--r-- 1 root root 20 Apr 29 11:46
..drill.parquet_summary_metadata.v4.crc
+-rwxr-xr-x 1 root root 1210 Apr 29 11:23 nation.parquet
+drwxr-xr-x 2 root root 200 Apr 29 11:46 test
+
+[root@doc23 test]# ls -la
+total 20
+drwxr-xr-x 2 root root 200 Apr 29 11:46 .
+drwxr-xr-x 3 root root 212 Apr 29 11:30 ..
+-rw-r--r-- 1 root root 517 Apr 29 11:46 .drill.parquet_file_metadata.v4
+-rw-r--r-- 1 root root 16 Apr 29 11:46 ..drill.parquet_file_metadata.v4.crc
+-rw-r--r-- 1 root root 1308 Apr 29 11:46 .drill.parquet_summary_metadata.v4
+-rw-r--r-- 1 root root 20 Apr 29 11:46
..drill.parquet_summary_metadata.v4.crc
+-rwxr-xr-x 1 root root 1210 Apr 29 11:23 nation.parquet
+</code></pre></div>
+<p>Looking at the <code>.drill.parquet_file_metadata.v4</code> file in the
<code>/tmp/parquet1</code> directory, you can see that the file contains the
paths to the Parquet files in the subdirectories, as well as metadata for those
files: </p>
+<div class="highlight"><pre><code class="language-text"
data-lang="text">[root@doc23 parquet1]# cat .drill.parquet_file_metadata.v4
+{
+ "files" : [ {
+ "path" : "parquet/test/nation.parquet",
+ "length" : 1210,
+ "rowGroups" : [ {
+ "start" : 4,
+ "length" : 944,
+ "rowCount" : 25,
+ "hostAffinity" : {
+ "localhost" : 1.0
},
- "files" : [ {
- "path" : "dir1/region.parquet",
- "length" : 455,
- "rowGroups" : [ {
- "start" : 4,
- "length" : 250,
- "rowCount" : 5,
- "hostAffinity" : {
- "localhost" : 1.0
- },
- "columns" : [ ]
- } ]
+ "columns" : [ {
+ "name" : [ "N_NATIONKEY" ],
+ "nulls" : -1
+ }, {
+ "name" : [ "N_NAME" ],
+ "nulls" : -1
}, {
- "path" : "nation.parquet",
- "length" : 1210,
- "rowGroups" : [ {
- "start" : 4,
- "length" : 944,
- "rowCount" : 25,
- "hostAffinity" : {
- "localhost" : 1.0
- },
- "columns" : [ ]
- } ]
- } ],
- "directories" : [ "dir1" ],
- "drillVersion" : "1.16.0-SNAPSHOT"
-
-**Content of the directory named `dir1` that contains the `region.parquet`
file and no subdirectories.**
-
- [root@doc23 dir1]# cat .drill.parquet_metadata
- {
- "metadata_version" : "3.3",
- "columnTypeInfo" : {
- "`R_REGIONKEY`" : {
- "name" : [ "R_REGIONKEY" ],
- "primitiveType" : "INT64",
- "originalType" : null,
- "precision" : 0,
- "scale" : 0,
- "repetitionLevel" : 0,
- "definitionLevel" : 0
- },
- "`R_COMMENT`" : {
- "name" : [ "R_COMMENT" ],
- "primitiveType" : "BINARY",
- "originalType" : "UTF8",
- "precision" : 0,
- "scale" : 0,
- "repetitionLevel" : 0,
- "definitionLevel" : 0
- },
- "`R_NAME`" : {
- "name" : [ "R_NAME" ],
- "primitiveType" : "BINARY",
- "originalType" : "UTF8",
- "precision" : 0,
- "scale" : 0,
- "repetitionLevel" : 0,
- "definitionLevel" : 0
- }
+ "name" : [ "N_REGIONKEY" ],
+ "nulls" : -1
+ }, {
+ "name" : [ "N_COMMENT" ],
+ "nulls" : -1
+ } ]
+ } ]
+ }, {
+ "path" : "parquet/nation.parquet",
+ "length" : 1210,
+ "rowGroups" : [ {
+ "start" : 4,
+ "length" : 944,
+ "rowCount" : 25,
+ "hostAffinity" : {
+ "localhost" : 1.0
},
- "files" : [ {
- "path" : "region.parquet",
- "length" : 455,
- "rowGroups" : [ {
- "start" : 4,
- "length" : 250,
- "rowCount" : 5,
- "hostAffinity" : {
- "localhost" : 1.0
- },
- "columns" : [ ]
- } ]
- } ],
- "directories" : [ ],
- "drillVersion" : "1.16.0-SNAPSHOT"
- }
-
-### Verifying that the Planner is Using the Metadata Cache File
-
-When the planner uses metadata cache files, the query plan includes the
`usedMetadataFile` flag. You can access the query plan in the Drill Web UI, by
clicking on the query in the Profiles page, or by running the EXPLAIN PLAN FOR
command, as shown:
-
- EXPLAIN PLAN FOR SELECT * FROM parquet;
-
- | 00-00 Screen
- 00-01 Project(**=[$0])
- 00-02 Scan(table=[[dfs, samples, parquet]],
groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/home/parquet]],
selectionRoot=/home/parquet, numFiles=1, numRowGroups=2, usedMetadataFile=true,
cacheFileRoot=/home/parquet, columns=[`**`]]])
- |...
-
--->
-
+ "columns" : [ {
+ "name" : [ "N_NATIONKEY" ],
+ "nulls" : -1
+ }, {
+ "name" : [ "N_NAME" ],
+ "nulls" : -1
+ }, {
+ "name" : [ "N_REGIONKEY" ],
+ "nulls" : -1
+ }, {
+ "name" : [ "N_COMMENT" ],
+ "nulls" : -1
+ } ]
+ } ]
+ } ]
+</code></pre></div>
+<p>Looking at the <code>.drill.parquet_summary_metadata.v4</code> file in the
<code>parquet1</code> directory, you can see type information for each column
in the files, the list of subdirectories, and which columns are marked as
interesting (useful when you specify columns in the REFRESH TABLE METADATA
command): </p>
+<div class="highlight"><pre><code class="language-text"
data-lang="text">[root@doc23 parquet1]# cat .drill.parquet_summary_metadata.v4
+{
+ "columnTypeInfo" : {
+ "`N_COMMENT`" : {
+ "name" : [ "N_COMMENT" ],
+ "primitiveType" : "BINARY",
+ "originalType" : "UTF8",
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0,
+ "totalNullCount" : -1,
+ "isInteresting" : true
+ },
+ "`N_NATIONKEY`" : {
+ "name" : [ "N_NATIONKEY" ],
+ "primitiveType" : "INT64",
+ "originalType" : null,
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0,
+ "totalNullCount" : -1,
+ "isInteresting" : true
+ },
+ "`N_REGIONKEY`" : {
+ "name" : [ "N_REGIONKEY" ],
+ "primitiveType" : "INT64",
+ "originalType" : null,
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0,
+ "totalNullCount" : -1,
+ "isInteresting" : true
+ },
+ "`N_NAME`" : {
+ "name" : [ "N_NAME" ],
+ "primitiveType" : "BINARY",
+ "originalType" : "UTF8",
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0,
+ "totalNullCount" : -1,
+ "isInteresting" : true
+ }
+ },
+ "directories" : [ "parquet/test", "parquet" ],
+ "drillVersion" : "1.16.0-SNAPSHOT",
+ "totalRowCount" : 50,
+ "allColumnsInteresting" : true,
+ "metadata_version" : "4"
+</code></pre></div>
+<h3
id="verifying-that-the-planner-is-using-the-metadata-cache-or-summary-files">Verifying
that the Planner is Using the Metadata Cache or Summary Files</h3>
+
+<p>When the planner uses metadata cache files, the query plan includes the
<code>usedMetadataFile</code> flag. You can access the query plan in the Drill
Web UI, by clicking on the query in the Profiles page, or by running the
EXPLAIN PLAN FOR command, as shown:</p>
+<div class="highlight"><pre><code class="language-text"
data-lang="text">apache drill (dfs.samples)> explain plan for select * from
parquet1;
++----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
+| text
| json
|
++----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
+| 00-00 Screen
+00-01 Project(**=[$0])
+00-02 Scan(table=[[dfs, samples, parquet1]],
groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/parquet1]],
selectionRoot=/tmp/parquet1, numFiles=1, numRowGroups=2, usedMetadataFile=true,
cacheFileRoot=/tmp/parquet1, columns=[`**`]]])
+ |
+</code></pre></div>
+<p>When you run the EXPLAIN PLAN FOR command with a COUNT(*) query, as shown,
you can see that the query planner uses the summary file and avoids reading the
larger metadata cache file. The query plan includes the
<code>usedMetadataSummaryFile</code> flag.</p>
+<div class="highlight"><pre><code class="language-text"
data-lang="text">apache drill (dfs.samples)> explain plan for select
count(*) from parquet1;
++----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
+| text
| json
|
++----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
+| 00-00 Screen
+00-01 Project(EXPR$0=[$0])
+00-02 DirectScan(groupscan=[files =
[file:/tmp/parquet1/.drill.parquet_summary_metadata.v4], numFiles = 1,
usedMetadataSummaryFile = true, DynamicPojoRecordReader{records = [[50]]}])
+ |
+</code></pre></div>
<div class="doc-nav">
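As a rough companion to the refresh-table-metadata hunk above, the following Python sketch checks an EXPLAIN PLAN text for the two flags the page mentions and cross-checks the row counts in the sample metadata. It is not part of Drill: the helper names `metadata_flags` and `total_row_count` are hypothetical, the plan strings are trimmed excerpts of the output shown in the page, and the assumption that `totalRowCount` equals the sum of the per-row-group `rowCount` values is inferred from the sample (25 + 25 = 50), not from a specification.

```python
import re

# Matches either flag, with or without spaces around "=" (both forms appear
# in the plan output shown in the page).
FLAG_RE = re.compile(r"\b(usedMetadataFile|usedMetadataSummaryFile)\s*=\s*true\b")


def metadata_flags(plan_text):
    """Return the sorted metadata flags set to true in an EXPLAIN PLAN text."""
    return sorted(set(FLAG_RE.findall(plan_text)))


def total_row_count(file_metadata):
    """Sum rowCount over every row group in a file-metadata document."""
    return sum(rg["rowCount"]
               for f in file_metadata["files"]
               for rg in f["rowGroups"])


# Trimmed excerpts of the two plans shown in the page.
select_plan = ("selectionRoot=/tmp/parquet1, numFiles=1, numRowGroups=2, "
               "usedMetadataFile=true, cacheFileRoot=/tmp/parquet1")
count_plan = ("numFiles = 1, usedMetadataSummaryFile = true, "
              "DynamicPojoRecordReader{records = [[50]]}")

# Reduced form of the sample .drill.parquet_file_metadata.v4 content.
file_metadata = {
    "files": [
        {"path": "parquet/test/nation.parquet", "rowGroups": [{"rowCount": 25}]},
        {"path": "parquet/nation.parquet", "rowGroups": [{"rowCount": 25}]},
    ]
}

print(metadata_flags(select_plan))     # ['usedMetadataFile']
print(metadata_flags(count_plan))      # ['usedMetadataSummaryFile']
print(total_row_count(file_metadata))  # 50
```

The computed total agrees with the `totalRowCount` value of 50 in the sample summary file and with the `records = [[50]]` entry in the COUNT(*) plan.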
diff --git a/feed.xml b/feed.xml
index aeb26d3..729a31a 100644
--- a/feed.xml
+++ b/feed.xml
@@ -6,8 +6,8 @@
</description>
<link>/</link>
<atom:link href="/feed.xml" rel="self" type="application/rss+xml"/>
- <pubDate>Fri, 26 Apr 2019 12:50:50 -0700</pubDate>
- <lastBuildDate>Fri, 26 Apr 2019 12:50:50 -0700</lastBuildDate>
+ <pubDate>Mon, 29 Apr 2019 13:50:47 -0700</pubDate>
+ <lastBuildDate>Mon, 29 Apr 2019 13:50:47 -0700</lastBuildDate>
<generator>Jekyll v2.5.2</generator>
<item>