This is an automated email from the ASF dual-hosted git repository.
bridgetb pushed a commit to branch gh-pages
in repository https://gitbox.apache.org/repos/asf/drill.git
The following commit(s) were added to refs/heads/gh-pages by this push:
new 46b33f9 add doc for analyze refresh schema
46b33f9 is described below
commit 46b33f942771d9872ae67235ad05fa57871c421d
Author: Bridget Bevens <[email protected]>
AuthorDate: Fri Apr 26 12:45:50 2019 -0700
add doc for analyze refresh schema
---
.../020-query-profiles.md | 6 +-
.../sql-commands/009-analyze-table.md | 323 ++++++++++
.../sql-commands/011-refresh-table-metadata.md | 282 +++++++++
.../sql-commands/021-create-schema.md | 650 +++++++++++++++++++++
4 files changed, 1258 insertions(+), 3 deletions(-)
diff --git
a/_docs/performance-tuning/identifying-performance-issues/020-query-profiles.md
b/_docs/performance-tuning/identifying-performance-issues/020-query-profiles.md
index 7ce4652..2730da4 100644
---
a/_docs/performance-tuning/identifying-performance-issues/020-query-profiles.md
+++
b/_docs/performance-tuning/identifying-performance-issues/020-query-profiles.md
@@ -1,6 +1,6 @@
---
title: "Query Profiles"
-date: 2019-04-22
+date: 2019-04-26
parent: "Identifying Performance Issues"
---
@@ -57,8 +57,8 @@ The following table lists the warnings, icons, configurable
options related to t
| Warning Message
| Icon
| Related Option(s)
| Description
[...]
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|----------------------------------------------------------------------------
[...]
-| !WARNING: No fragments have made any progress in the last 300 seconds.
(See Last Progress below.)
| None
|
drill.exec.http.profile.warning.progress.threshold [...]
-| !WARNING: Some operators have data spilled to disk. This will result in
performance loss. (See Avg Peak memory and Max Peak Memory below)
| 
Look for the icon in the Avg Peak Memory and Max Peak Memory columns to find
the operators that spilled data to disk. | See Sort-Based and
Hash-Based Memory Constrained operators for related options. | Hovering
the mouse over the icon [...]
+| !WARNING: No fragments have made any progress in the last 300 seconds.
(See Last Progress below.)
| None
[...]
+| !WARNING: Some operators have data spilled to disk. This will result in
performance loss. (See Avg Peak memory and Max Peak Memory below)
| 
Look for the icon in the Avg Peak Memory and Max Peak Memory columns to find
the operators that spilled data to disk. | See [Sort-Based and
Hash-Based Memory Constrained
Operators]({{site.baseurl}}/docs/sort-based-and-hash-based-memory-constr [...]
| !WARNING: Some of the operators spent more time waiting for data than
processing it. (See AVG WAIT TIME as compared to Average Process Time for the
operators below) | 
Look for the icon in the Max Process Time or Max Wait Time columns to locate
the operators that waited or processed too long. |
drill.exec.http.profile.warning.time.skew.min
| Sets the minimum threshold for opera [...]
| See warning for drill.exec.http.profile.warning.time.skew.min
| 
|
drill.exec.http.profile.warning.time.skew.ratio.process
| Defines the threshold ratio for [...]
| See warning for drill.exec.http.profile.warning.time.skew.min
| 
|
drill.exec.http.profile.warning.time.skew.ratio.wait
| Defines the threshold ratio for wa [...]
diff --git a/_docs/sql-reference/sql-commands/009-analyze-table.md
b/_docs/sql-reference/sql-commands/009-analyze-table.md
new file mode 100644
index 0000000..9aa65f9
--- /dev/null
+++ b/_docs/sql-reference/sql-commands/009-analyze-table.md
@@ -0,0 +1,323 @@
+---
+title: "ANALYZE TABLE"
+date: 2019-04-23
+parent: "SQL Commands"
+---
+
+Starting in Drill 1.16, Drill supports the ANALYZE TABLE statement. The
ANALYZE TABLE statement computes statistics on Parquet data stored in tables
and directories. ANALYZE TABLE writes statistics to a JSON file in the
`.stats.drill` directory, for example `/user/table1/.stats.drill/0_0.json`. The
optimizer in Drill uses these statistics to estimate filter, aggregation, and
join cardinalities to create more efficient query plans.
+
+You can run the ANALYZE TABLE statement to calculate statistics for tables,
columns, and directories with Parquet data; however, Drill will not use the
statistics for query planning unless you enable the `planner.statistics.use`
option, as shown:
+
+ SET `planner.statistics.use` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at
`http://<drill-hostname-or-ip>:8047/options`.
+
+## Syntax
+
+The ANALYZE TABLE statement supports the following syntax:
+
+ ANALYZE TABLE [workspace.]table_name COMPUTE STATISTICS [(column1,
column2,...)] [SAMPLE number PERCENT]
+
+
+## Parameters
+
+*workspace*
+Optional. A configured storage plugin and workspace, like `dfs.samples`. For
example, in `dfs.samples`, `dfs` is the file system storage plugin and `samples`
is the workspace configured to point to a directory on the file system.
+
+*table_name*
+The name of the table or directory for which Drill will generate statistics.
+
+*COMPUTE STATISTICS*
+Generates statistics for the table, columns, or directory specified.
+
+*column*
+The name of the column(s) for which Drill will generate statistics.
+
+*SAMPLE*
+Optional. Indicates that compute statistics should run on a subset of the data.
+
+*number PERCENT*
+An integer that specifies the percentage of data on which to compute
statistics. For example, if a table has 100 rows, `SAMPLE 50 PERCENT` indicates
that statistics should be computed on 50 rows. The optimizer selects the rows
at random.
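+
+For example, based on the syntax above, the column list and the SAMPLE clause
can be combined in a single statement. This sketch reuses the nation1 table and
columns from the examples later in this topic:
+
+    ANALYZE TABLE dfs.samples.nation1 COMPUTE STATISTICS (N_NATIONKEY, N_REGIONKEY) SAMPLE 50 PERCENT;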
+
+## Related Command
+
+If you drop a table on which you have run ANALYZE TABLE, the statistics are
automatically removed with the table:
+
+ DROP TABLE [IF EXISTS] [workspace.]name
+
+If you want to remove statistics for a table (and keep the table), you must
remove the directory in which Drill stores the statistics:
+
+ DROP TABLE [IF EXISTS] [workspace.]name/.stats.drill
+
+If you have already issued the ANALYZE TABLE statement against specific
columns, a table, or directory, you must run the DROP TABLE statement with
`/.stats.drill` before you can successfully run the ANALYZE TABLE statement
against the data source again:
+
+ DROP TABLE dfs.samples.`nation1/.stats.drill`;
+
+Note that `/.stats.drill` is the directory to which the JSON file with
statistics is written.
+
+## Usage Notes
+- The ANALYZE TABLE statement can compute statistics for Parquet data stored
in tables, columns, and directories.
+- The user running the ANALYZE TABLE statement must have read and write
permissions on the data source.
+- The optimizer in Drill computes the following types of statistics for each
column:
+ - Rowcount (total number of entries in the table)
+ - Nonnullrowcount (total number of non-null entries in the table)
+ - NDV (total distinct values in the table)
+ - Avgwidth (average width of columns/average number of characters in a
column)
+ - Majortype (data type of the column values)
+ - Histogram (represents the frequency distribution of values (numeric
data) in a column; designed for estimations on data with skewed distribution;
sorts data into “buckets” such that each bucket contains the same number of
rows determined by ceiling(num_rows/n) where n is the number of buckets; the
number of distinct values in each bucket depends on the distribution of the
column's values)
+
+- ANALYZE TABLE can compute statistics on nested scalar columns; however, you
must explicitly state the columns, for example:
+
+ ANALYZE TABLE employee_table COMPUTE STATISTICS
(name.firstname, name.lastname);
+- ANALYZE TABLE can compute statistics at the root directory level, but not at
the partition level.
+- Drill does not compute statistics for complex types (maps, arrays).
+
+## Related Options
+You can set the following options related to the ANALYZE TABLE statement at
the system or session level with the SET (session level) or ALTER SYSTEM SET
(system level) statements, or through the Drill Web UI at
`http://<drill-hostname-or-ip>:8047/options`:
+
+- **planner.statistics.use**
+Enables the query planner to use statistics. When disabled, ANALYZE TABLE
generates statistics, but the query planner will not use the statistics unless
this option is enabled. Disabled (false) by default.
+- **exec.statistics.ndv_accuracy**
+Controls the trade-off between NDV statistic computation memory cost and
accuracy. Controls the amount of memory for estimates. More memory produces
more accurate estimates. The default value should suffice for most scenarios.
Default is 20. Range is 0-30.
+- **exec.statistics.ndv_extrapolation_bf_elements**
+Controls the trade-off between NDV statistics computation memory cost and
sampling extrapolation accuracy. Relates specifically to SAMPLE. The default
value should suffice for most scenarios. Increasing the value requires
additional memory. Default is 1000000.
+- **exec.statistics.ndv_extrapolation_bf_fpprobability**
+Controls the trade-off between NDV statistics computation memory cost and
sampling extrapolation accuracy. Controls the overall accuracy of statistics
when using sampling. Default is 10 percent. Range is 0-100.
+- **exec.statistics.deterministic_sampling**
+Turns deterministic sampling on and off. Relates specifically to SAMPLE.
Default is false.
+- **exec.statistics.tdigest_compression**
+Controls the 'compression' factor for the TDigest algorithm used for histogram
statistics. Controls trade-off between t-digest quantile statistic storage cost
and accuracy. Higher values use more groups (clusters) for the t-digest and
improve accuracy at the expense of extra storage. Positive integer values in
the range [1, 10000]. Default is 100.
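+
+For example, a minimal sketch of setting one of these options at the system
level with ALTER SYSTEM SET (shown here with
`exec.statistics.deterministic_sampling`; substitute the option and value you
need):
+
+    ALTER SYSTEM SET `exec.statistics.deterministic_sampling` = true;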
+
+
+
+## Reserved Keywords
+
+The ANALYZE TABLE statement introduces the following reserved keywords:
+
+ Analyze
+ Compute
+ Estimate
+ Statistics
+ Sample
+
+If you use any of these words in a Drill query, you must enclose the word in
backticks. For example, if you query a table named “estimate,” you would
enclose the word "estimate" in backticks, as shown:
+
+ SELECT * FROM `estimate`;
+
+
+## ANALYZE TABLE Performance
+
+- After you run the ANALYZE TABLE statement, you can view the profile for
ANALYZE in the Drill Web UI. Go to
`http://<drill-hostname-or-ip>:8047/profiles`, and click the ANALYZE TABLE
statement for which you want to view the profile.
+- Should you notice any performance issues, you may want to decrease the value
of the `planner.slice_target` option, as shown in the sketch after this list.
+- Generating statistics on large data sets can unnecessarily consume time and
resources, such as memory and CPU. ANALYZE TABLE can compute statistics on a
sample (subset of the data indicated as a percentage) to limit the amount of
resources needed for computation. Drill still scans the entire data set, but
only computes on the rows selected for sampling. Rows are randomly selected for
the sample. Note that the quality of statistics increases with the sample size.
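+
+The following is a hedged sketch of lowering `planner.slice_target` at the
session level; the value shown is arbitrary and should be tuned for your
environment:
+
+    SET `planner.slice_target` = 10000;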
+
+## Limitations
+
+
+- Drill does not cache statistics.
+- ANALYZE TABLE runs only on directory-based Parquet tables.
+- ANALYZE TABLE cannot do the following:
+ - compute statistics on schema-less file formats, such as text and CSV
+ - provide up-to-date statistics for operational data due to potential
mismatches that can occur between operational updates and manually running
ANALYZE TABLE
+- Running the ANALYZE TABLE statement against multiple files in which some of
the files have null values and others have no null values may return the
following generic Drill error, which is not specific to the ANALYZE command:
+
+ Error: SYSTEM ERROR: IllegalStateException: Failure while
reading vector.
+ Expected vector class of
org.apache.drill.exec.vector.NullableBigIntVector
+ but was holding vector class
org.apache.drill.exec.vector.IntVector, field= [`o_custkey` (INT:REQUIRED)]
+
+  If you encounter this error, run the ANALYZE TABLE statement on each file
with null values individually instead of running the statement against all the
files at once.
+
+- Running the ANALYZE TABLE statement against a table with a metadata cache
file inadvertently updates the timestamp on the metadata cache file, which
automatically triggers the REFRESH TABLE METADATA command.
+
+## Examples
+
+These examples use a schema, `dfs.samples`, which points to the `/home`
directory. The `/home` directory contains a subdirectory, `parquet`, which
contains the `nation.parquet` and `region.parquet` files. You can access these
Parquet files in the `sample-data` directory of your Drill installation.
+
+ [root@doc23 parquet]# pwd
+ /home/parquet
+
+ [root@doc23 parquet]# ls
+ nation.parquet region.parquet
+
+Change schemas to use `dfs.samples`:
+
+ use dfs.samples;
+ +-------+------------------------------------------+
+ | ok | summary |
+ +-------+------------------------------------------+
+ | true | Default schema changed to [dfs.samples] |
+ +-------+------------------------------------------+
+
+### Enabling Statistics for Query Planning
+
+You can run the ANALYZE TABLE statement at any time to compute statistics;
however, you must enable the following option if you want Drill to use
statistics during query planning:
+
+ set `planner.statistics.use`=true;
+ +-------+----------------------------------+
+ | ok | summary |
+ +-------+----------------------------------+
+ | true | planner.statistics.use updated. |
+ +-------+----------------------------------+
+
+### Computing Statistics on a Directory
+
+If you want to compute statistics for all Parquet data in a directory, you can
run the ANALYZE TABLE statement against the directory, as shown:
+
+ ANALYZE TABLE `/parquet` COMPUTE STATISTICS;
+ +-----------+----------------------------+
+ | Fragment | Number of records written |
+ +-----------+----------------------------+
+ | 0_0 | 4 |
+ +-----------+----------------------------+
+
+### Computing Statistics on a Table
+
+You can create a table from the data in the `nation.parquet` file, as shown:
+
+ CREATE TABLE nation1 AS SELECT * from `parquet/nation.parquet`;
+ +-----------+----------------------------+
+ | Fragment | Number of records written |
+ +-----------+----------------------------+
+ | 0_0 | 25 |
+ +-----------+----------------------------+
+
+Drill writes the table to the `/home` directory, which is where the
`dfs.samples` workspace points:
+
+ [root@doc23 home]# ls
+ nation1 parquet
+
+Changing to the `nation1` directory, you can see that the table is written as
a parquet file:
+
+ [root@doc23 home]# cd nation1
+ [root@doc23 nation1]# ls
+ 0_0_0.parquet
+
+You can run the ANALYZE TABLE statement on a subset of columns in the table to
generate statistics for those columns only, as shown:
+
+ ANALYZE TABLE dfs.samples.nation1 COMPUTE STATISTICS (N_NATIONKEY,
N_REGIONKEY);
+ +-----------+----------------------------+
+ | Fragment | Number of records written |
+ +-----------+----------------------------+
+ | 0_0 | 2 |
+ +-----------+----------------------------+
+
+Or, you can run the ANALYZE TABLE statement on the entire table if you want
statistics generated for all columns in the table:
+
+ ANALYZE TABLE dfs.samples.nation1 COMPUTE STATISTICS;
+ +-----------+----------------------------+
+ | Fragment | Number of records written |
+ +-----------+----------------------------+
+ | 0_0 | 4 |
+ +-----------+----------------------------+
+
+### Computing Statistics on a SAMPLE
+You can also run ANALYZE TABLE on a percentage of the data in a table using
the SAMPLE command, as shown:
+
+ ANALYZE TABLE dfs.samples.nation1 COMPUTE STATISTICS SAMPLE 50 PERCENT;
+ +-----------+----------------------------+
+ | Fragment | Number of records written |
+ +-----------+----------------------------+
+ | 0_0 | 4 |
+ +-----------+----------------------------+
+
+### Storing Statistics
+When you generate statistics, a statistics directory (`.stats.drill`) is
created with a JSON file that contains the statistical data.
+
+For tables, the `.stats.drill` directory is nested within the table directory.
For example, if you ran ANALYZE TABLE against a table named “nation1,” you
could access the statistics file in:
+
+ [root@doc23 home]# cd nation1/.stats.drill
+ [root@doc23 .stats.drill]# ls
+ 0_0.json
+
+For directories, a new directory is written with the same name as the
directory on which you ran ANALYZE TABLE, with `.stats.drill` appended. For
example, if you ran ANALYZE TABLE against a directory named “parquet,” you
could access the statistics file in:
+
+ [root@doc23 home]# cd parquet.stats.drill
+ [root@doc23 parquet.stats.drill]# ls
+ 0_0.json
+
+You can query the statistics file, as shown in the following two examples:
+
+ SELECT * FROM dfs.samples.`parquet.stats.drill`;
+
+--------------------+----------------------------------------------------------------------------------+
+ | statistics_version | directories
|
+
+--------------------+----------------------------------------------------------------------------------+
+ | v1 |
[{"computed":"2019-04-23","columns":[{"column":"`R_REGIONKEY`","majortype":{"type":"BIGINT","mode":"REQUIRED"},"schema":1.0,"rowcount":5.0,"nonnullrowcount":5.0,"ndv":5,"avgwidth":8.0,"histogram":{"category":"numeric-equi-depth","numRowsPerBucket":1,"buckets":[1.0,0.0,0.0,2.9999999999999996,2.0,4.0]}},{"column":"`R_NAME`","majortype":{"type":"VARCHAR","mode":"REQUIRED"},"schema":1.0,"rowcount":5.0,"nonnullrowcount":5.0,"ndv":5,"avgwidth":6.8,"histogram":{"buckets"
[...]
+
+--------------------+----------------------------------------------------------------------------------+
+
+
+
+ SELECT t.directories.columns[0].ndv as ndv,
t.directories.columns[0].rowcount as rc, t.directories.columns[0].non
nullrowcount AS nnrc, t.directories.columns[0].histogram as histogram FROM
dfs.samples.`parquet.stats.drill` t;
+
+-----+-----+------+----------------------------------------------------------------------------------+
+ | ndv | rc | nnrc | histogram
|
+
+-----+-----+------+----------------------------------------------------------------------------------+
+ | 5 | 5.0 | 5.0 |
{"category":"numeric-equi-depth","numRowsPerBucket":1,"buckets":[1.0,0.0,0.0,2.9999999999999996,2.0,4.0]}
|
+
+-----+-----+------+----------------------------------------------------------------------------------+
+
+### Dropping Statistics
+
+If you want to compute statistics on a table or directory that you have
already run the ANALYZE TABLE statement against, you must first drop the
statistics before you can run the ANALYZE TABLE statement on the table again.
+
+The following example demonstrates how to drop statistics on a table:
+
+ DROP TABLE dfs.samples.`parquet/.stats.drill`;
+ +-------+-------------------------------------+
+ | ok | summary |
+ +-------+-------------------------------------+
+ | true | Table [parquet/.stats.drill] dropped |
+ +-------+-------------------------------------+
+
+The following example demonstrates how to drop statistics on a directory:
+
+ DROP TABLE dfs.samples.`/parquet.stats.drill`;
+ +-------+------------------------------------+
+ | ok | summary |
+ +-------+------------------------------------+
+ | true | Table [parquet.stats.drill] dropped |
+ +-------+------------------------------------+
+
+When you drop statistics, the statistics directory no longer exists for the
table:
+
+ [root@doc23 home]# cd parquet/.stats.drill
+ -bash: cd: parquet/.stats.drill: No such file or directory
+
+ SELECT * FROM dfs.samples.`parquet/.stats.drill`;
+ Error: VALIDATION ERROR: From line 1, column 15 to line 1, column 17:
Object 'parquet/.stats.drill' not found within 'dfs.samples'
+ [Error Id: 0b9a0c35-f058-4e0a-91d5-034d095393d7 on doc23.lab:31010]
(state=,code=0)
+
+## Troubleshooting
+
+Typical errors you may get when running ANALYZE TABLE result from running the
statement against an individual file or against a data source other than
Parquet, as shown in the following examples:
+
+**Running ANALYZE TABLE on a file.**
+
+ ANALYZE TABLE `/parquet/nation.parquet` COMPUTE STATISTICS;
+
+--------+----------------------------------------------------------------------------------+
+ | ok | summary
|
+
+--------+----------------------------------------------------------------------------------+
+ | false | Table /parquet/nation.parquet is not supported by ANALYZE.
Support is currently limited to directory-based Parquet tables. |
+
+--------+----------------------------------------------------------------------------------+
+
+
+**Running ANALYZE TABLE on a data source other than Parquet.**
+
+ ANALYZE TABLE nation1_json COMPUTE STATISTICS;
+
+--------+----------------------------------------------------------------------------------+
+ | ok | summary
|
+
+--------+----------------------------------------------------------------------------------+
+ | false | Table nation1_json is not supported by ANALYZE. Support is
currently limited to directory-based Parquet tables. |
+
+--------+----------------------------------------------------------------------------------+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/_docs/sql-reference/sql-commands/011-refresh-table-metadata.md
b/_docs/sql-reference/sql-commands/011-refresh-table-metadata.md
new file mode 100644
index 0000000..d678619
--- /dev/null
+++ b/_docs/sql-reference/sql-commands/011-refresh-table-metadata.md
@@ -0,0 +1,282 @@
+---
+title: "REFRESH TABLE METADATA"
+date: 2019-04-23
+parent: "SQL Commands"
+---
+Run the REFRESH TABLE METADATA command on Parquet tables and directories to
generate a metadata cache file. REFRESH TABLE METADATA collects metadata from
the footers of Parquet files and writes the metadata to a metadata file
(`.drill.parquet_file_metadata.v4`) and a summary file
(`.drill.parquet_summary_metadata.v4`). The planner uses the metadata cache
file to prune extraneous data during the query planning phase. Run the REFRESH
TABLE METADATA command if planning time is a significant [...]
+
+Starting in Drill 1.11, Drill stores the paths to Parquet files as relative
paths instead of absolute paths. You can move partitioned Parquet directories
from one location in the distributed file system to another without issuing the
REFRESH TABLE METADATA command to rebuild the Parquet metadata cache files; the
metadata remains valid in the new location.
+
+## Syntax
+
+The REFRESH TABLE METADATA command supports the following syntax:
+
+ REFRESH TABLE METADATA [ COLUMNS ( column1, column2...) | NONE ]
table_path
+
+
+## Parameters
+*COLUMNS*
+Optional. Introduced in Drill 1.16. Use to indicate the columns on which the
command should refresh metadata. When omitted, metadata is collected for all
the columns.
+
+*column*
+Required when using the COLUMNS clause. The columns for which metadata will be
stored.
+
+*NONE*
+Optional. Introduced in Drill 1.16. When used, the REFRESH command does not
refresh any metadata; however, a summary file is generated and contains a
summary section (ColumnTypeInfo) that lists all the columns with their data
types.
+
+*table_path*
+Required. The name of the table or directory for which Drill will refresh
metadata.
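+
+For example, based on the syntax above, the following statements refresh all
metadata, refresh metadata for specific columns, and generate only the summary
file. The `dfs.samples.parquet` table path and column names are placeholders
taken from the examples later in this topic; substitute your own:
+
+    REFRESH TABLE METADATA dfs.samples.`parquet`;
+    REFRESH TABLE METADATA COLUMNS (R_REGIONKEY, R_NAME) dfs.samples.`parquet`;
+    REFRESH TABLE METADATA NONE dfs.samples.`parquet`;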
+
+## Related Command
+Run the [EXPLAIN]({{site.baseurl}}/docs/explain/) command to determine the
query execution time. If the query execution time is the most significant time
factor, running REFRESH TABLE METADATA will not improve query performance.
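+
+For example, a minimal sketch (the table path is a placeholder):
+
+    EXPLAIN PLAN FOR SELECT * FROM dfs.samples.`parquet`;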
+
+
+## Usage Notes
+
+### Metadata Storage
+- Drill traverses directories for Parquet files and gathers the metadata from
the footer of the files. Drill stores the collected metadata in a metadata
cache file, `.drill.parquet_metadata`, at each directory level.
+- The metadata cache file stores metadata for files in that directory, as well
as the metadata for the files in the subdirectories.
+- For each row group in a Parquet file, the metadata cache file stores the
column names in the row group and the column statistics, such as the min/max
values and null count.
+- If the Parquet data is updated, for example data is added to a file, Drill
automatically refreshes the Parquet metadata when you issue the next query
against the Parquet data.
+
+### Refreshing Columns
+- Starting in Drill 1.16, you can run the REFRESH TABLE METADATA command on
specific columns. When you run the command on specific columns, metadata is
refreshed for the indicated columns only.
+- Refreshing the metadata for sorted or partitioned columns improves the
planning time for queries that filter on these columns. The query planner uses
the min and max column statistics to determine which data meets the filter
criteria. The planner can prune files and row groups that do not meet the
filter criteria, which reduces the amount of time it takes the query planner to
scan the metadata during query planning. Refreshing column metadata (versus
entire tables) also reduces the siz [...]
+
+### Automatic Refresh
+
+- You only have to run the REFRESH TABLE METADATA command against a table once
to generate the initial metadata cache file. Thereafter, Drill automatically
refreshes stale cache files when you issue queries against the table. An
automatic refresh is triggered when data is modified. The query planner uses the
timestamp of the cache file and table to determine if the cache file is stale.
+- An automatic refresh updates the metadata cache file the same way the
REFRESH TABLE METADATA command did when it was last issued against a table. For
example, if you ran REFRESH TABLE METADATA on col1 and col2 in table t1, the
next time a query is issued against table t1, the query planner checks the
timestamp on the cache file and the table, t1. If the timestamp on the data is
later than the timestamp on the metadata cache file, automatic refresh is
triggered and metadata is refreshed [...]
+- Note that the elapsed time of a query that triggers an automatic refresh can
be greater than that of subsequent queries that use the metadata. To avoid
this, you can manually run the REFRESH TABLE METADATA command.
+
+### Query Planning and Execution
+
+- Drill reads the metadata cache file during query planning, which improves
the query planning time.
+- Parquet metadata caching has no effect on query execution time. At execution
time, Drill reads the files. Metadata caching will not improve query
performance if the time it takes to execute the query is greater than the time
used to plan a query.
+- Run [EXPLAIN]({{site.baseurl}}/docs/explain/) for the query to determine if
query execution time is the most significant time factor for the query. Compare
the query execution time to the overall time for the query to determine whether
metadata caching would be useful.
+
+### Not Supported
+- Parquet metadata caching does not benefit queries on Hive tables, HBase
tables, or text files. Drill only uses the Hive metastore to query Parquet
files when a query is issued against the Hive storage plugin.
+
+
+## Related Options
+You can set the following options related to the REFRESH TABLE METADATA
command at the system or session level with the SET (session level) or ALTER
SYSTEM SET (system level) statements, or through the Drill Web UI at
`http://<drill-hostname-or-ip>:8047/options`:
+
+- **planner.store.parquet.rowgroup.filter.pushdown.enabled**
+Enables filter pushdown optimization for Parquet files. Drill reads the file
metadata, stored in the footer, to eliminate row groups based on the filter
condition. Default is true. (Drill 1.9+)
+- **planner.store.parquet.rowgroup.filter.pushdown.threshold**
+Sets the number of row groups that a table can have. You can increase the
threshold if the filter can prune many row groups. However, if this setting is
too high, the filter evaluation overhead increases. Base this setting on the
data set. Reduce this setting if the planning time is significant or you do not
see any benefit at runtime. Default is 10000. (Drill 1.9+)
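+
+For example, a hedged sketch of raising the row group threshold at the session
level (the value is arbitrary; base it on your data set):
+
+    SET `planner.store.parquet.rowgroup.filter.pushdown.threshold` = 20000;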
+
+## Limitations
+Currently, Drill does not support runtime rowgroup pruning.
+
+<!--
+## Examples
+These examples use a schema, `dfs.samples`, which points to the `/home`
directory. The `/home` directory contains a subdirectory, `parquet`, which
+contains the `nation.parquet` and a subdirectory, `dir1` with the
`region.parquet` file. You can access the `nation.parquet` and `region.parquet`
Parquet files in the `sample-data` directory of your Drill installation.
+
+ [root@doc23 dir1]# pwd
+ /home/parquet/dir1
+
+ [root@doc23 parquet]# ls
+ dir1 nation.parquet
+
+ [root@doc23 dir1]# ls
+ region.parquet
+
+Change schemas to use `dfs.samples`:
+
+ use dfs.samples;
+ +-------+------------------------------------------+
+ | ok | summary |
+ +-------+------------------------------------------+
+ | true | Default schema changed to [dfs.samples] |
+ +-------+------------------------------------------+
+
+### Running REFRESH TABLE METADATA on a Directory
+Running the REFRESH TABLE METADATA command on the `parquet` directory
generates metadata cache files at each directory level.
+
+ REFRESH TABLE METADATA parquet;
+ +-------+---------------------------------------------------+
+ | ok | summary |
+ +-------+---------------------------------------------------+
+ | true | Successfully updated metadata for table parquet. |
+ +-------+---------------------------------------------------+
+
+When looking at the `parquet` directory and `dir1` subdirectory, you can see
that a metadata cache file was created at each level:
+
+ [root@doc23 parquet]# ls -la
+ drwxr-xr-x 2 root root 95 Mar 18 17:49 dir1
+ -rw-r--r-- 1 root root 2642 Mar 18 17:52 .drill.parquet_metadata
+ -rw-r--r-- 1 root root 32 Mar 18 17:52 ..drill.parquet_metadata.crc
+ -rwxr-xr-x 1 root root 1210 Mar 13 13:32 nation.parquet
+
+ [root@doc23 dir1]# ls -la
+ -rw-r--r-- 1 root root 1235 Mar 18 17:52 .drill.parquet_metadata
+ -rw-r--r-- 1 root root 20 Mar 18 17:52 ..drill.parquet_metadata.crc
+ -rwxr-xr-x 1 root root 455 Mar 18 17:41 region.parquet
+
+The following sections compare the content of the metadata cache file in the
`parquet` and `dir1` directories:
+
+**Content of the metadata cache file in the directory named `parquet` that
contains the nation.parquet file and subdirectory `dir1`.**
+
+
+ [root@doc23 parquet]# cat .drill.parquet_metadata
+ {
+ "metadata_version" : "3.3",
+ "columnTypeInfo" : {
+ "`N_COMMENT`" : {
+ "name" : [ "N_COMMENT" ],
+ "primitiveType" : "BINARY",
+ "originalType" : "UTF8",
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ },
+ "`N_NATIONKEY`" : {
+ "name" : [ "N_NATIONKEY" ],
+ "primitiveType" : "INT64",
+ "originalType" : null,
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ },
+ "`R_REGIONKEY`" : {
+ "name" : [ "R_REGIONKEY" ],
+ "primitiveType" : "INT64",
+ "originalType" : null,
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ },
+ "`R_COMMENT`" : {
+ "name" : [ "R_COMMENT" ],
+ "primitiveType" : "BINARY",
+ "originalType" : "UTF8",
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ },
+ "`N_REGIONKEY`" : {
+ "name" : [ "N_REGIONKEY" ],
+ "primitiveType" : "INT64",
+ "originalType" : null,
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ },
+ "`R_NAME`" : {
+ "name" : [ "R_NAME" ],
+ "primitiveType" : "BINARY",
+ "originalType" : "UTF8",
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ },
+ "`N_NAME`" : {
+ "name" : [ "N_NAME" ],
+ "primitiveType" : "BINARY",
+ "originalType" : "UTF8",
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ }
+ },
+ "files" : [ {
+ "path" : "dir1/region.parquet",
+ "length" : 455,
+ "rowGroups" : [ {
+ "start" : 4,
+ "length" : 250,
+ "rowCount" : 5,
+ "hostAffinity" : {
+ "localhost" : 1.0
+ },
+ "columns" : [ ]
+ } ]
+ }, {
+ "path" : "nation.parquet",
+ "length" : 1210,
+ "rowGroups" : [ {
+ "start" : 4,
+ "length" : 944,
+ "rowCount" : 25,
+ "hostAffinity" : {
+ "localhost" : 1.0
+ },
+ "columns" : [ ]
+ } ]
+ } ],
+ "directories" : [ "dir1" ],
+ "drillVersion" : "1.16.0-SNAPSHOT"
+    }
+
+**Content of the directory named `dir1` that contains the `region.parquet`
file and no subdirectories.**
+
+ [root@doc23 dir1]# cat .drill.parquet_metadata
+ {
+ "metadata_version" : "3.3",
+ "columnTypeInfo" : {
+ "`R_REGIONKEY`" : {
+ "name" : [ "R_REGIONKEY" ],
+ "primitiveType" : "INT64",
+ "originalType" : null,
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ },
+ "`R_COMMENT`" : {
+ "name" : [ "R_COMMENT" ],
+ "primitiveType" : "BINARY",
+ "originalType" : "UTF8",
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ },
+ "`R_NAME`" : {
+ "name" : [ "R_NAME" ],
+ "primitiveType" : "BINARY",
+ "originalType" : "UTF8",
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ }
+ },
+ "files" : [ {
+ "path" : "region.parquet",
+ "length" : 455,
+ "rowGroups" : [ {
+ "start" : 4,
+ "length" : 250,
+ "rowCount" : 5,
+ "hostAffinity" : {
+ "localhost" : 1.0
+ },
+ "columns" : [ ]
+ } ]
+ } ],
+ "directories" : [ ],
+ "drillVersion" : "1.16.0-SNAPSHOT"
+ }
+
+### Verifying that the Planner is Using the Metadata Cache File
+
+When the planner uses metadata cache files, the query plan includes the
`usedMetadataFile` flag. You can access the query plan in the Drill Web UI, by
clicking on the query in the Profiles page, or by running the EXPLAIN PLAN FOR
command, as shown:
+
+ EXPLAIN PLAN FOR SELECT * FROM parquet;
+
+ | 00-00 Screen
+ 00-01 Project(**=[$0])
+ 00-02 Scan(table=[[dfs, samples, parquet]],
groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/home/parquet]],
selectionRoot=/home/parquet, numFiles=1, numRowGroups=2, usedMetadataFile=true,
cacheFileRoot=/home/parquet, columns=[`**`]]])
+ |...
+
+-->
diff --git a/_docs/sql-reference/sql-commands/021-create-schema.md
b/_docs/sql-reference/sql-commands/021-create-schema.md
new file mode 100644
index 0000000..f236735
--- /dev/null
+++ b/_docs/sql-reference/sql-commands/021-create-schema.md
@@ -0,0 +1,650 @@
+---
+title: "CREATE OR REPLACE SCHEMA"
+date: 2019-04-25
+parent: "SQL Commands"
+---
+
+Starting in Drill 1.16, you can define a schema for text files using the
CREATE OR REPLACE SCHEMA command. Running this command generates a hidden
.drill.schema file in the table’s root directory. The .drill.schema file stores
the schema definition in JSON format. Drill uses the schema file at runtime if
the exec.storage.enable_v3_text_reader and store.table.use_schema_file options
are enabled. Alternatively, you can create the schema file manually. When
created manually, the file conten [...]
+
+## Syntax
+
+The CREATE OR REPLACE SCHEMA command supports the following syntax:
+
+ CREATE [OR REPLACE] SCHEMA
+ [LOAD 'file:///path/to/file']
+ [(column_name data_type nullability format default properties
{prop='val', ...})]
+ [FOR TABLE `table_name`]
+ [PATH 'file:///schema_file_path/schema_file_name']
+ [PROPERTIES ('key1'='value1', 'key2'='value2', ...)]
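+
+For example, a minimal sketch that combines several of these clauses, echoing
the examples later in this topic (PATH is omitted because it cannot be combined
with FOR TABLE):
+
+    create or replace schema
+    (id int not null default '-1', start_date date format 'yyyy-MM-dd')
+    for table dfs.tmp.`text_table`
+    properties ('drill.strict' = 'true');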
+
+## Parameters
+
+*OR REPLACE*
+Drops the existing schema and replaces it with the new schema. Only supported
when using FOR TABLE. Not supported when using PATH; this restriction prevents
malicious deletion of arbitrary files. You must manually delete any schema file
created in a custom location.
+
+*LOAD*
+Loads the raw schema (a list of column names with their attributes) from a
file. You must indicate the path to the file after the LOAD keyword. Note that
columns must either be listed inline or provided through the LOAD clause; at
least one of the two is required for successful schema creation.
+
+*column_name*
+Name of the column for which schema is created. Case-insensitive.
+
+*data_type*
+Data type defined for the column. See Supported Data Types.
+
+*format*
+Sets the format for date and time data types when converting from string.
+
+*default*
+Sets a default value for non-nullable columns, such that queries return the
default value instead of null.
+
+*properties*
+Keyword to include optional properties. See Related Properties below.
+
+*property*
+Name of the property applied to the column or table.
+
+*value*
+Value set for the indicated property.
+
+*table_name*
+Name of the table associated with the schema being created or replaced.
Enclose the table name in backticks if there are spaces after the FOR TABLE
keywords. If the table does not exist, the command fails and schema is not
created. If you indicate the table name without schema, the table is assumed to
be in the current workspace, and you must specify the PATH property. If you
indicate FOR TABLE and PATH, or you do not indicate either, the CREATE SCHEMA
command fails. In this case, the ta [...]
+
+*PATH*
+Path to the schema file. You must indicate the path to the file after the PATH
keyword.
+
+*properties*
+List of properties as key-value pairs in parentheses.
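+
+For example, a hedged sketch of creating a schema in a custom location with
PATH instead of FOR TABLE; the file path is a hypothetical placeholder, and OR
REPLACE is omitted because it is not supported with PATH:
+
+    create schema (id int, name varchar) path 'file:///tmp/schemas/text_table.schema';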
+
+## Related Options
+
+You must enable the following options for Drill to use the schema created
during query execution:
+
+ set `exec.storage.enable_v3_text_reader` = true;
+ +------+---------------------------------------------+
+ | ok | summary |
+ +------+---------------------------------------------+
+ | true | exec.storage.enable_v3_text_reader updated. |
+ +------+---------------------------------------------+
+
+ set `store.table.use_schema_file` = true;
+ +------+--------------------------------------+
+ | ok | summary |
+ +------+--------------------------------------+
+ | true | store.table.use_schema_file updated. |
+ +------+--------------------------------------+
+
+## Related Properties
+
+When you create a schema, you can set the following properties within the
CREATE [OR REPLACE] SCHEMA command:
+
+**drill.strict**
+A table property that determines the ordering of columns returned for wildcard
(*) queries. Accepts a value of true or false. See Schema Mode (Column Order).
+
+**drill.format**
+A column property that ensures proper conversion when converting string values
to date and time data types. See Format for Date, Time Conversion.
+
+**drill.default**
+A column property that sets non-nullable columns to a “default” value when
creating the schema. See Column Modes (Nullable and Non-Nullable Columns).
+
+**drill.blank-as**
+A property that sets how Drill handles blank column values. Accepts the
following values:
+- **null**: If the column is nullable, treat the blank as null. If
non-nullable, leave the blank unchanged.
+- **0**: Replace blanks with the value "0" for numeric types.
+- **skip**: Skip blank values. This sets the column to its default value: NULL
for nullable columns, the default value for non-nullable columns.
+- If left empty, blanks have no special meaning. A blank is parsed as any
other string, which typically produces an error.
+
+See Handling Policy for Blank Column Values.
+
+### Setting Properties
+Include properties after the “properties” keyword, as shown in the following
example where the date format is set to `'yyyy-MM-dd'` through the
`drill.format` column property:
+
+ create or replace schema (start_date date properties {'drill.format' =
'yyyy-MM-dd'}) for table dfs.tmp.`text_table`;
+
+Alternatively, you can use “default” and “format” as keywords when creating
schema, as shown in the following examples:
+
+Setting the default for the non-nullable column “id” to -1 using the keyword
“default”:
+
+ create or replace schema (id int not null default '-1') for table
dfs.tmp.`text_table`;
+
+When you query the text_table, all blank values in the “id” column return a
value of -1.
+
+### Storing Properties
+The defined schema and configured properties are stored and reflected in the
schema file, `.drill.schema`, which you can see when you run DESCRIBE SCHEMA
FOR TABLE.
+
+ describe schema for table dfs.tmp.`text_table`;
+
+----------------------------------------------------------------------------------+
+ | schema
|
+
+----------------------------------------------------------------------------------+
+ | {
+ "table" : "dfs.tmp.`text_table`",
+ "schema" : {
+ "columns" : [
+ {
+ "name" : "id",
+ "type" : "INT",
+ "mode" : "REQUIRED",
+ "properties" : {
+ "drill.default" : "-1"
+ }
+ },
+ {
+ "name" : "start_date",
+ "type" : "DATE",
+ "mode" : "REQUIRED",
+ "properties" : {
+ "drill.format" : "yyyy-MM-dd",
+ "drill.default" : "2017-01-01"
+ }
+ }
+ ],
+ "properties" : {
+ "drill.strict" : "true"
+ }
+ },
+ "version" : 1
+ } |
+
+----------------------------------------------------------------------------------+
+
+
+## Related Commands
+
+
+ DROP SCHEMA [IF EXISTS] FOR TABLE `table_name`
+See Dropping Schema for a Table in the Examples section at the end of this
topic.
+
+
+ DESCRIBE SCHEMA FOR TABLE `table_name`
+See Describing Schema for a Table in the Examples section at the end of this
topic.
+
+## Supported Data Types
+
+Text files store information in string format and only support simple data
types. You can use the CREATE [OR REPLACE] SCHEMA command to convert string
data types in text files to the following data types:
+- INTEGER
+- BIGINT
+- DOUBLE
+- FLOAT
+- DECIMAL
+- BOOLEAN
+- VARCHAR
+- TIMESTAMP
+- DATE
+- TIME
+- INTERVAL [YEAR, MONTH, DAY, HOUR, MINUTE, SECOND]
+
+**Note:** Complex data types (arrays and maps) are not supported.
+
+Values are trimmed when converting to any type, except for varchar.
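+
+For example, a hedged sketch that converts several of these types in one
schema definition; the column names are illustrative and do not appear in the
sample files used later in this topic:
+
+    create or replace schema
+    (id int, price double, in_stock boolean, updated timestamp format 'yyyy-MM-dd HH:mm:ss')
+    for table dfs.tmp.`text_table`;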
+
+## Usage Notes
+
+### General Information
+- Schema provisioning only works with tables defined as directories because
Drill must have a place to store the schema file. The directory can contain one
or more files.
+- Text files must have headers. The default extension for delimited text files
with headers is `.csvh`. Note that the column names that appear in the headers
must match the column definitions in the schema.
+- You do not have to enumerate all columns in a file when creating a schema.
You can indicate the columns of interest only.
+- Columns in the defined schema do not have to be in the same order as in the
data file. However, the names must match. The case can differ, for example
“name” and “NAME” are acceptable.
+- Queries on columns with data types that cannot be converted fail with a
`DATA_READ_ERROR`.
+
+### Schema Mode (Column Order)
+The schema mode determines the ordering of columns returned for wildcard (*)
queries. The mode is set through the `drill.strict` property. You can set this
property to true (strict) or false (not strict). If you do not indicate the
mode, the default is false (not strict).
+
+**Not Strict (Default)**
+Columns defined in the schema are projected in the defined order. Columns not
defined in the schema are appended to the defined columns, as shown:
+
+ create or replace schema (id int, start_date date format 'yyyy-MM-dd')
for table dfs.tmp.`text_table` properties ('drill.strict' = 'false');
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_table] |
+ +------+-----------------------------------------+
+
+ select * from dfs.tmp.`text_table`;
+ +------+------------+---------+
+ | id | start_date | name |
+ +------+------------+---------+
+ | 1 | 2019-02-01 | Fred |
+ | 2 | 2018-11-30 | Wilma |
+ | 3 | 2016-01-01 | Pebbles |
+ | 4 | null | Barney |
+ | null | null | Dino |
+ +------+------------+---------+
+
+Note that the “name” column, which was not included in the schema, was appended
to the end of the table.
+
+**Strict**
+Setting the `drill.strict` property to “true” changes the schema mode to
strict, which means that the reader ignores any columns NOT included in the
schema. The query only returns the columns defined in the schema, as shown:
+
+ create or replace schema (id int, start_date date format 'yyyy-MM-dd')
for table dfs.tmp.`text_table` properties ('drill.strict' = 'true');
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_table] |
+ +------+-----------------------------------------+
+
+ select * from dfs.tmp.`text_table`;
+ +------+------------+
+ | id | start_date |
+ +------+------------+
+ | 1 | 2019-02-01 |
+ | 2 | 2018-11-30 |
+ | 3 | 2016-01-01 |
+ | 4 | null |
+ | null | null |
+ +------+------------+
+
+Note that the “name” column, which was not included in the schema, was ignored
and not returned in the result set.
+
+## Including Additional Columns in the Schema
+When you create a schema, you can include columns that do not exist in the
table, and these columns will be projected. This feature ensures that queries
return the correct results whether the files have a specific column or not.
Note that schema mode does not affect the behavior of this feature.
+
+For example, the “comment” column is not in the text_table, but is added when
creating the schema:
+
+ create or replace schema (id int, start_date date format 'yyyy-MM-dd',
comment varchar) for table dfs.tmp.`text_table`;
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_table] |
+ +------+-----------------------------------------+
+
+You can see the “comment” column returned in the result set.
+
+ select * from dfs.tmp.`text_table`;
+ +------+------------+---------+---------+
+ | id | start_date | comment | name |
+ +------+------------+---------+---------+
+ | 1 | 2019-02-01 | null | Fred |
+ | 2 | 2018-11-30 | null | Wilma |
+ | 3 | 2016-01-01 | null | Pebbles |
+ | 4 | null | null | Barney |
+ | null | null | null | Dino |
+ +------+------------+---------+---------+
+
+## Column Modes (Nullable and Non-Nullable Columns)
+If a column in the schema is nullable (allows null values), and the column has
a null value, the column value is returned as null. If the column is required
(not nullable), but contains a null value, Drill returns the default value
provided. If no default value is provided, Drill sets the column value to the
natural default.
+
+For example, if you create a strict schema with two nullable columns (id and
start_date), you can see that the missing values in both cases are null.
+
+ create or replace schema (id int, start_date date format 'yyyy-MM-dd')
for table dfs.tmp.`text_table` properties ('drill.strict' = 'true');
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_table] |
+ +------+-----------------------------------------+
+
+ select * from dfs.tmp.`text_table`;
+ +------+------------+
+ | id | start_date |
+ +------+------------+
+ | 1 | 2019-02-01 |
+ | 2 | 2018-11-30 |
+ | 3 | 2016-01-01 |
+ | 4 | null |
+ | null | null |
+ +------+------------+
+
+Updating the strict schema to have two required columns (id and start_date),
you can see that the natural default was applied: 0 for id and 1970-01-01 for
start_date.
+
+ create or replace schema (id
+ int not null, start_date date not null format 'yyyy-MM-dd') for table
+ dfs.tmp.`text_table` properties ('drill.strict' = 'true');
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_table] |
+ +------+-----------------------------------------+
+
+ select * from dfs.tmp.`text_table`;
+ +----+------------+
+ | id | start_date |
+ +----+------------+
+ | 1 | 2019-02-01 |
+ | 2 | 2018-11-30 |
+ | 3 | 2016-01-01 |
+ | 4 | 1970-01-01 |
+ | 0 | 1970-01-01 |
+ +----+------------+
+
+Adding a default for each of these columns (-1 for id and 2017-01-01 for
start_date), you can see that the columns return the defined default value
instead of the natural default.
+
+ create or replace schema (id
+ int not null default '-1', start_date date not null format 'yyyy-MM-dd'
default
+ '2017-01-01') for table dfs.tmp.`text_table` properties ('drill.strict'
=
+ 'true');
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_table] |
+ +------+-----------------------------------------+
+
+ select * from dfs.tmp.`text_table`;
+ +----+------------+
+ | id | start_date |
+ +----+------------+
+ | 1 | 2019-02-01 |
+ | 2 | 2018-11-30 |
+ | 3 | 2016-01-01 |
+ | 4 | 2017-01-01 |
+ | -1 | 2017-01-01 |
+ +----+------------+
+
+## Handling Policy for Blank Column Values
+It is common for CSV files to have blank column values. The default
+output for blank column values is an empty string (''), as shown:
+
+ select * from dfs.tmp.`text_blank`;
+ +----+--------+------------+
+ | id | amount | start_date |
+ +----+--------+------------+
+ | 1 | 20 | 2019-01-01 |
+ | 2 | | |
+ | 3 | 30 | |
+ +----+--------+------------+
+
+When a schema is defined for columns, the default blank handling policy is
`skip`, which treats blank values as null, as shown:
+
+ create or replace schema (id
+ int, amount double, start_date date format 'yyyy-MM-dd') for table
+ dfs.tmp.`text_blank`;
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_blank] |
+ +------+-----------------------------------------+
+
+ select * from dfs.tmp.`text_blank`;
+ +----+--------+------------+
+ | id | amount | start_date |
+ +----+--------+------------+
+ | 1 | 20.0 | 2019-01-01 |
+ | 2 | null | null |
+ | 3 | 30.0 | null |
+ +----+--------+------------+
+
+If a column is absent from the schema, the default blank handling policy applies.
Note that the blank handling policy is not applicable to varchar columns since
they do not go through the type conversion logic.
+
+You can configure how Drill handles blank column values through the
`drill.blank-as` property when you create schema.
+
+In the following example, you can see the blank handling policy for the
defined schema with the `drill.blank-as` property set to `0` on the “amount”
column:
+
+ create or replace schema (id int, amount double properties
{'drill.blank-as' = '0'}, start_date date format 'yyyy-MM-dd') for table
dfs.tmp.`text_blank`;
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_blank] |
+ +------+-----------------------------------------+
+
+ select * from dfs.tmp.`text_blank`;
+ +----+--------+------------+
+ | id | amount | start_date |
+ +----+--------+------------+
+ | 1 | 20.0 | 2019-01-01 |
+ | 2 | 0.0 | null |
+ | 3 | 30.0 | null |
+ +----+--------+------------+
+
+## Format for Date, Time Conversion
+When you convert string values to date and time data types, include the format
for proper conversion.
+
+You can include the format using the keyword “format,” as shown:
+
+ create or replace schema (start_date date format 'yyyy-MM-dd') for
table dfs.tmp.`text_table`;
+
+Alternatively, you can include the format in the column properties, as shown:
+
+ create or replace schema (start_date date properties {'drill.format' =
'yyyy-MM-dd'}) for table dfs.tmp.`text_table`;
+
+Note that date and time type conversion uses the Joda-Time library, so the
format pattern must comply with the [Joda time supported format
pattern](https://www.joda.org/joda-time/key_format.html). If the format is not
indicated, ISO datetime formats are used:
+
+| **Type** | **Accepted Format** |
+|-----------|--------------------------------------------------------------------------|
+| [Timestamp](https://www.joda.org/joda-time/apidocs/org/joda/time/format/ISODateTimeFormat.html#dateTimeNoMillis--) | yyyy-MM-dd'T'HH:mm:ssZZ |
+| [Date](https://www.joda.org/joda-time/apidocs/org/joda/time/format/ISODateTimeFormat.html#localDateParser--) | date-element = std-date-element | ord-date-element | week-date-element |
+| | std-date-element = yyyy ['-' MM ['-' dd]] |
+| | ord-date-element = yyyy ['-' DDD] |
+| | week-date-element = xxxx '-W' ww ['-' e] |
+| [Time](https://www.joda.org/joda-time/apidocs/org/joda/time/format/ISODateTimeFormat.html#localTimeParser--) | time = ['T'] time-element |
+| | time-element = HH [minute-element] | [fraction] |
+| | minute-element = ':' mm [second-element] | [fraction] |
+| | second-element = ':' ss [fraction] |
+| | fraction = ('.' | ',') digit+ |
+
+
+## Limitations
+None
+
+## Examples
+Examples throughout this topic use the files and directories described in the
following section, Directory and File Setup.
+
+### Directory and File Setup
+
+ [root@doc23 text_table]# pwd
+ /tmp/text_table
+ [root@doc23 text_table]# ls
+ 1.csvh 2.csvh
+
+ [root@doc23 text_table]# cat 1.csvh
+ id,name,start_date
+ 1,Fred,2019-02-01
+ 2,Wilma,2018-11-30
+ 3,Pebbles,2016-01-01
+ 4,Barney
+
+ [root@doc23 text_table]# cat 2.csvh
+ name
+ Dino
+
+ [root@doc23 tmp]# cd text_blank/
+ [root@doc23 text_blank]# ls
+ blank.csvh
+
+ [root@doc23 text_blank]# cat blank.csvh
+ id,amount,start_date
+ 1,20,2019-01-01
+ 2,,
+ 3,30,
+
+
+Query the directory text_table.
+
+ select * from dfs.tmp.`text_table`;
+ +----+---------+------------+
+ | id | name | start_date |
+ +----+---------+------------+
+ | 1 | Fred | 2019-02-01 |
+ | 2 | Wilma | 2018-11-30 |
+ | 3 | Pebbles | 2016-01-01 |
+ | 4 | Barney | |
+ | | Dino | |
+ +----+---------+------------+
+
+Notice that the query plan contains a scan and project:
+
+ | 00-00 Screen
+ 00-01 Project(**=[$0])
+ 00-02 Scan(table=[[dfs, tmp, text_table]],
groupscan=[EasyGroupScan [selectionRoot=file:/tmp/text_table, numFiles=2,
columns=[`**`], files=[file:/tmp/text_table/1.csvh,
file:/tmp/text_table/2.csvh], schema=null]])
+
+Using the sqltypeof and modeof functions, you can see that each column is
defined as a non-nullable string (varchar), and missing columns are defined as
empty strings:
+
+ select sqltypeof(id), modeof(id) from dfs.tmp.`text_table` limit 1;
+ +-------------------+----------+
+ | EXPR$0 | EXPR$1 |
+ +-------------------+----------+
+ | CHARACTER VARYING | NOT NULL |
+ +-------------------+----------+
+
+### Creating a Schema
+Create a defined schema for the text_table directory. When you define schema,
you do not have to enumerate all columns. The columns in the defined schema are
not required to be in the same order as in the data file. Note that the names of
the columns must match, but can differ in case.
+
+Define schema for the id column:
+
+ create schema (id int) for table dfs.tmp.`text_table`;
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_table] |
+ +------+-----------------------------------------+
+
+After you define a schema, you can see the schema file (stored in JSON format)
in the root table directory:
+
+ [root@doc23 text_table]# ls -a
+ . .. 1.csvh 2.csvh .drill.schema ..drill.schema.crc
+
+ [root@doc23 text_table]# cat .drill.schema
+ {
+ "table" : "dfs.tmp.`text_table`",
+ "schema" : {
+ "columns" : [
+ {
+ "name" : "id",
+ "type" : "INT",
+ "mode" : "OPTIONAL"
+ }
+ ]
+ },
+ "version" : 1
+    }
+
+Query the text_table directory to see how the schema is applied:
+
+ select * from dfs.tmp.`text_table`;
+ +------+---------+------------+
+ | id | name | start_date |
+ +------+---------+------------+
+ | 1 | Fred | 2019-02-01 |
+ | 2 | Wilma | 2018-11-30 |
+ | 3 | Pebbles | 2016-01-01 |
+ | 4 | Barney | |
+ | null | Dino | |
+ +------+---------+------------+
+
+After defining the schema on the id column, you can see that the `id` column
type and mode are defined as a nullable integer, while the other column types
were inferred as non-nullable VarChar:
+
+ select sqltypeof(id), modeof(id) from dfs.tmp.`text_table` limit 1;
+ +---------+----------+
+ | EXPR$0 | EXPR$1 |
+ +---------+----------+
+ | INTEGER | NULLABLE |
+ +---------+----------+
+
+Running EXPLAIN PLAN, you can see that type conversion was done while reading
data from source; no additional plan stages were added:
+
+ explain plan for select * from dfs.tmp.`text_table`;
+ | 00-00 Screen
+ 00-01 Project(**=[$0])
+ 00-02 Scan(table=[[dfs, tmp, text_table]],
groupscan=[EasyGroupScan [selectionRoot=file:/tmp/text_table, numFiles=2,
columns=[`**`], files=[file:/tmp/text_table/1.csvh,
file:/tmp/text_table/2.csvh], schema=[TupleSchema [PrimitiveColumnMetadata
[`id` (INT(0, 0):OPTIONAL)]]]]])
+
+### Describing Schema for a Table
+After you create a schema, you can examine it using the DESCRIBE SCHEMA FOR
TABLE command. The schema can be printed in JSON or STATEMENT format. JSON is
the default format if none is indicated in the query. Schema displayed in JSON
format is the same as the JSON stored in the `.drill.schema` file.
+
+ describe schema for table dfs.tmp.`text_table` as JSON;
+
+----------------------------------------------------------------------------------+
+ | schema
|
+
+----------------------------------------------------------------------------------+
+ | {
+ "table" : "dfs.tmp.`text_table`",
+ "schema" : {
+ "columns" : [
+ {
+ "name" : "id",
+ "type" : "INT",
+ "mode" : "OPTIONAL"
+ }
+ ]
+ },
+ "version" : 1
+ } |
+
+----------------------------------------------------------------------------------+
+
+STATEMENT format displays the schema in a form compatible with the CREATE OR
REPLACE SCHEMA command such that it can easily be copied, modified, and
executed.
+
+ describe schema for table dfs.tmp.`text_table` as statement;
+
+--------------------------------------------------------------------------+
+ | schema
|
+
+--------------------------------------------------------------------------+
+ | CREATE OR REPLACE SCHEMA
+ (
+ `id` INT
+ )
+ FOR TABLE dfs.tmp.`text_table`
+ |
+
+--------------------------------------------------------------------------+
+
+### Dropping Schema for a Table
+You can easily drop the schema for a table using the DROP SCHEMA [IF EXISTS]
FOR TABLE \`table_name` command, as shown:
+
+ use dfs.tmp;
+ +------+-------------------------------------+
+ | ok | summary |
+ +------+-------------------------------------+
+ | true | Default schema changed to [dfs.tmp] |
+ +------+-------------------------------------+
+
+ drop schema for table `text_table`;
+ +------+---------------------------------------+
+ | ok | summary |
+ +------+---------------------------------------+
+ | true | Dropped schema for table [text_table] |
+ +------+---------------------------------------+
+
+
+## Troubleshooting
+
+**Schema defined as incorrect data type produces DATA_READ_ERROR**
+Assume that you defined schema on the “name” column as integer, as shown:
+
+ create or replace schema (name int) for table dfs.tmp.`text_table`;
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_table] |
+ +------+-----------------------------------------+
+
+Because the column does not contain integers, a query issued against the
directory returns the DATA_READ_ERROR. The error message includes the line and
value causing the issue:
+
+ select * from dfs.tmp.`text_table`;
+
+ Error: DATA_READ ERROR: Illegal conversion: Column `name` of type INT,
Illegal value `Fred`
+
+ Line 1
+ Record 0
+ Fragment 0:0
+
+ [Error Id: db6afe96-1014-4a98-a034-10e36daa1aa7 on doc23.lab:31010]
(state=,code=0)
+
+**Incorrect syntax**
+Required syntax properties are missing or incorrectly specified. Check the
CREATE SCHEMA syntax to ensure that all required properties were specified.
+
+**Missing backticks**
+Unable to parse CREATE SCHEMA command due to unknown properties and keywords.
If the specified table name or column name contains spaces, enclose the name in
backticks.
+
+**Incorrect data type**
+Parsing has failed on unknown/unsupported data type. Check which data types
are supported by Drill. Check syntax for complex data types.
+
+**Missing table name**
+The table name indicated after the `FOR TABLE` keywords is not present. Check
that the table exists in the specified schema or in the current schema (if the
schema was not specified). If you do not want to bind the schema to a specific
table, use the PATH property.
+
+**Existing schema file**
+A schema file already exists for the table. Use the DROP SCHEMA command to
remove a stale or mistakenly created schema file. Alternatively, you can use the
CREATE OR REPLACE SCHEMA command.
+
+**Lack of permissions**
+Unable to create the schema file in the directory. Verify that you have write
permission on the table’s root directory or the directory specified. If not,
indicate a directory on which you have write permissions in the command syntax.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+