This is an automated email from the ASF dual-hosted git repository.
bridgetb pushed a commit to branch gh-pages
in repository https://gitbox.apache.org/repos/asf/drill.git
The following commit(s) were added to refs/heads/gh-pages by this push:
new 46b33f9 add doc for analyze refresh schema
46b33f9 is described below
commit 46b33f942771d9872ae67235ad05fa57871c421d
Author: Bridget Bevens <[email protected]>
AuthorDate: Fri Apr 26 12:45:50 2019 -0700
add doc for analyze refresh schema
---
.../020-query-profiles.md | 6 +-
.../sql-commands/009-analyze-table.md | 323 ++++++++++
.../sql-commands/011-refresh-table-metadata.md | 282 +++++++++
.../sql-commands/021-create-schema.md | 650 +++++++++++++++++++++
4 files changed, 1258 insertions(+), 3 deletions(-)
diff --git
a/_docs/performance-tuning/identifying-performance-issues/020-query-profiles.md
b/_docs/performance-tuning/identifying-performance-issues/020-query-profiles.md
index 7ce4652..2730da4 100644
---
a/_docs/performance-tuning/identifying-performance-issues/020-query-profiles.md
+++
b/_docs/performance-tuning/identifying-performance-issues/020-query-profiles.md
@@ -1,6 +1,6 @@
---
title: "Query Profiles"
-date: 2019-04-22
+date: 2019-04-26
parent: "Identifying Performance Issues"
---
@@ -57,8 +57,8 @@ The following table lists the warnings, icons, configurable
options related to t
| Warning Message
| Icon
| Related Option(s)
| Description
[...]
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|----------------------------------------------------------------------------
[...]
-| !WARNING: No fragments have made any progress in the last 300 seconds.
(See Last Progress below.)
| None
|
drill.exec.http.profile.warning.progress.threshold [...]
-| !WARNING: Some operators have data spilled to disk. This will result in
performance loss. (See Avg Peak memory and Max Peak Memory below)
| 
Look for the icon in the Avg Peak Memory and Max Peak Memory columns to find
the operators that spilled data to disk. | See Sort-Based and
Hash-Based Memory Constrained operators for related options. | Hovering
the mouse over the icon [...]
+| !WARNING: No fragments have made any progress in the last 300 seconds.
(See Last Progress below.)
| None
[...]
+| !WARNING: Some operators have data spilled to disk. This will result in
performance loss. (See Avg Peak memory and Max Peak Memory below)
| 
Look for the icon in the Avg Peak Memory and Max Peak Memory columns to find
the operators that spilled data to disk. | See [Sort-Based and
Hash-Based Memory Constrained
Operators]({{site.baseurl}}/docs/sort-based-and-hash-based-memory-constr [...]
| !WARNING: Some of the operators spent more time waiting for data than
processing it. (See AVG WAIT TIME as compared to Average Process Time for the
operators below) | 
Look for the icon in the Max Process Time or Max Wait Time columns to locate
the operators that waited or processed too long. |
drill.exec.http.profile.warning.time.skew.min
| Sets the minimum threshold for opera [...]
| See warning for drill.exec.http.profile.warning.time.skew.min
| 
|
drill.exec.http.profile.warning.time.skew.ratio.process
| Defines the threshold ratio for [...]
| See warning for drill.exec.http.profile.warning.time.skew.min
| 
|
drill.exec.http.profile.warning.time.skew.ratio.wait
| Defines the threshold ratio for wa [...]
diff --git a/_docs/sql-reference/sql-commands/009-analyze-table.md
b/_docs/sql-reference/sql-commands/009-analyze-table.md
new file mode 100644
index 0000000..9aa65f9
--- /dev/null
+++ b/_docs/sql-reference/sql-commands/009-analyze-table.md
@@ -0,0 +1,323 @@
+---
+title: "ANALYZE TABLE"
+date: 2019-04-23
+parent: "SQL Commands"
+---
+
+Starting in Drill 1.16, Drill supports the ANALYZE TABLE statement. The
ANALYZE TABLE statement computes statistics on Parquet data stored in tables
and directories. ANALYZE TABLE writes statistics to a JSON file in the
`.stats.drill` directory, for example `/user/table1/.stats.drill/0_0.json`. The
optimizer in Drill uses these statistics to estimate filter, aggregation, and
join cardinalities to create more efficient query plans.
+
+You can run the ANALYZE TABLE statement to calculate statistics for tables,
columns, and directories with Parquet data; however, Drill will not use the
statistics for query planning unless you enable the `planner.statistics.use`
option, as shown:
+
+ SET `planner.statistics.use` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at
`http://<drill-hostname-or-ip>:8047/options`.
+
+## Syntax
+
+The ANALYZE TABLE statement supports the following syntax:
+
+ ANALYZE TABLE [workspace.]table_name COMPUTE STATISTICS [(column1,
column2,...)] [SAMPLE number PERCENT]
+
+
+## Parameters
+
+*workspace*
+Optional. A configured storage plugin and workspace, like `dfs.samples`. For
example, in `dfs.samples`, `dfs` is the file system storage plugin and `samples`
is the workspace configured to point to a directory on the file system.
+
+*table_name*
+The name of the table or directory for which Drill will generate statistics.
+
+*COMPUTE STATISTICS*
+Generates statistics for the table, columns, or directory specified.
+
+*column*
+The name of the column(s) for which Drill will generate statistics.
+
+*SAMPLE*
+Optional. Indicates that compute statistics should run on a subset of the data.
+
+*number PERCENT*
+An integer that specifies the percentage of data on which to compute
statistics. For example, if a table has 100 rows, `SAMPLE 50 PERCENT` indicates
that statistics should be computed on 50 rows. The optimizer selects the rows
at random.
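+
+For example, based on the syntax above, the column list and the SAMPLE clause
can be combined in a single statement. This sketch reuses the nation1 table and
columns from the examples later in this topic:
+
+    ANALYZE TABLE dfs.samples.nation1 COMPUTE STATISTICS (N_NATIONKEY, N_REGIONKEY) SAMPLE 50 PERCENT;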
+
+## Related Command
+
+If you drop a table on which you have run ANALYZE TABLE, the statistics are
automatically removed with the table:
+
+ DROP TABLE [IF EXISTS] [workspace.]name
+
+If you want to remove statistics for a table (and keep the table), you must
remove the directory in which Drill stores the statistics:
+
+ DROP TABLE [IF EXISTS] [workspace.]name/.stats.drill
+
+If you have already issued the ANALYZE TABLE statement against specific
columns, a table, or directory, you must run the DROP TABLE statement with
`/.stats.drill` before you can successfully run the ANALYZE TABLE statement
against the data source again:
+
+ DROP TABLE dfs.samples.`nation1/.stats.drill`;
+
+Note that `/.stats.drill` is the directory to which the JSON file with
statistics is written.
+
+## Usage Notes
+- The ANALYZE TABLE statement can compute statistics for Parquet data stored
in tables, columns, and directories.
+- The user running the ANALYZE TABLE statement must have read and write
permissions on the data source.
+- The optimizer in Drill computes the following types of statistics for each
column:
+ - Rowcount (total number of entries in the table)
+ - Nonnullrowcount (total number of non-null entries in the table)
+ - NDV (total distinct values in the table)
+ - Avgwidth (average width of columns/average number of characters in a
column)
+ - Majortype (data type of the column values)
+ - Histogram (represents the frequency distribution of values (numeric
data) in a column; designed for estimations on data with skewed distribution;
sorts data into “buckets” such that each bucket contains the same number of
rows determined by ceiling(num_rows/n) where n is the number of buckets; the
number of distinct values in each bucket depends on the distribution of the
column's values)
+
+- ANALYZE TABLE can compute statistics on nested scalar columns; however, you
must explicitly state the columns, for example:
+
+ ANALYZE TABLE employee_table COMPUTE STATISTICS
(name.firstname, name.lastname);
+- ANALYZE TABLE can compute statistics at the root directory level, but not at
the partition level.
+- Drill does not compute statistics for complex types (maps, arrays).
+
+## Related Options
+You can set the following options related to the ANALYZE TABLE statement at
the system or session level with the SET (session level) or ALTER SYSTEM SET
(system level) statements, or through the Drill Web UI at
`http://<drill-hostname-or-ip>:8047/options`:
+
+- **planner.statistics.use**
+Enables the query planner to use statistics. When disabled, ANALYZE TABLE
generates statistics, but the query planner will not use the statistics unless
this option is enabled. Disabled (false) by default.
+- **exec.statistics.ndv_accuracy**
+Controls the trade-off between NDV statistic computation memory cost and
accuracy. Controls the amount of memory for estimates. More memory produces
more accurate estimates. The default value should suffice for most scenarios.
Default is 20. Range is 0-30.
+- **exec.statistics.ndv_extrapolation_bf_elements**
+Controls the trade-off between NDV statistics computation memory cost and
sampling extrapolation accuracy. Relates specifically to SAMPLE. The default
value should suffice for most scenarios. Increasing the value requires
additional memory. Default is 1000000.
+- **exec.statistics.ndv_extrapolation_bf_fpprobability**
+Controls the trade-off between NDV statistics computation memory cost and
sampling extrapolation accuracy. Controls the overall accuracy of statistics
when using sampling. Default is 10 percent. Range is 0-100.
+- **exec.statistics.deterministic_sampling**
+Turns deterministic sampling on and off. Relates specifically to SAMPLE.
Default is false.
+- **exec.statistics.tdigest_compression**
+Controls the 'compression' factor for the TDigest algorithm used for histogram
statistics. Controls trade-off between t-digest quantile statistic storage cost
and accuracy. Higher values use more groups (clusters) for the t-digest and
improve accuracy at the expense of extra storage. Positive integer values in
the range [1, 10000]. Default is 100.
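+
+For example, a minimal sketch of setting one of these options at the system
level with ALTER SYSTEM SET (shown here with
`exec.statistics.deterministic_sampling`; substitute the option and value you
need):
+
+    ALTER SYSTEM SET `exec.statistics.deterministic_sampling` = true;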
+
+
+
+## Reserved Keywords
+
+The ANALYZE TABLE statement introduces the following reserved keywords:
+
+ Analyze
+ Compute
+ Estimate
+ Statistics
+ Sample
+
+If you use any of these words in a Drill query, you must enclose the word in
backticks. For example, if you query a table named “estimate,” you would
enclose the word "estimate" in backticks, as shown:
+
+ SELECT * FROM `estimate`;
+
+
+## ANALYZE TABLE Performance
+
+- After you run the ANALYZE TABLE statement, you can view the profile for
ANALYZE in the Drill Web UI. Go to
`http://<drill-hostname-or-ip>:8047/profiles`, and click the ANALYZE TABLE
statement for which you want to view the profile.
+- Should you notice any performance issues, you may want to decrease the value
of the `planner.slice_target` option, as shown in the sketch after this list.
+- Generating statistics on large data sets can unnecessarily consume time and
resources, such as memory and CPU. ANALYZE TABLE can compute statistics on a
sample (subset of the data indicated as a percentage) to limit the amount of
resources needed for computation. Drill still scans the entire data set, but
only computes on the rows selected for sampling. Rows are randomly selected for
the sample. Note that the quality of statistics increases with the sample size.
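+
+The following is a hedged sketch of lowering `planner.slice_target` at the
session level; the value shown is arbitrary and should be tuned for your
environment:
+
+    SET `planner.slice_target` = 10000;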
+
+## Limitations
+
+
+- Drill does not cache statistics.
+- ANALYZE TABLE runs only on directory-based Parquet tables.
+- ANALYZE TABLE cannot do the following:
+ - compute statistics on schema-less file formats, such as text and CSV
+ - provide up-to-date statistics for operational data due to potential
mismatches that can occur between operational updates and manually running
ANALYZE TABLE
+- Running the ANALYZE TABLE statement against multiple files in which some of
the files have null values and others have no null values may return the
following generic Drill error, which is not specific to the ANALYZE command:
+
+ Error: SYSTEM ERROR: IllegalStateException: Failure while
reading vector.
+ Expected vector class of
org.apache.drill.exec.vector.NullableBigIntVector
+ but was holding vector class
org.apache.drill.exec.vector.IntVector, field= [`o_custkey` (INT:REQUIRED)]
+
+  If you encounter this error, run the ANALYZE TABLE statement on each file
with null values individually instead of running the statement against all the
files at once.
+
+- Running the ANALYZE TABLE statement against a table with a metadata cache
file inadvertently updates the timestamp on the metadata cache file, which
automatically triggers the REFRESH TABLE METADATA command.
+
+## Examples
+
+These examples use a schema, `dfs.samples`, which points to the `/home`
directory. The `/home` directory contains a subdirectory, `parquet`, which
contains the `nation.parquet` and `region.parquet` files. You can access these
Parquet files in the `sample-data` directory of your Drill installation.
+
+ [root@doc23 parquet]# pwd
+ /home/parquet
+
+ [root@doc23 parquet]# ls
+ nation.parquet region.parquet
+
+Change schemas to use `dfs.samples`:
+
+ use dfs.samples;
+ +-------+------------------------------------------+
+ | ok | summary |
+ +-------+------------------------------------------+
+ | true | Default schema changed to [dfs.samples] |
+ +-------+------------------------------------------+
+
+### Enabling Statistics for Query Planning
+
+You can run the ANALYZE TABLE statement at any time to compute statistics;
however, you must enable the following option if you want Drill to use
statistics during query planning:
+
+ set `planner.statistics.use`=true;
+ +-------+----------------------------------+
+ | ok | summary |
+ +-------+----------------------------------+
+ | true | planner.statistics.use updated. |
+ +-------+----------------------------------+
+
+### Computing Statistics on a Directory
+
+If you want to compute statistics for all Parquet data in a directory, you can
run the ANALYZE TABLE statement against the directory, as shown:
+
+ ANALYZE TABLE `/parquet` COMPUTE STATISTICS;
+ +-----------+----------------------------+
+ | Fragment | Number of records written |
+ +-----------+----------------------------+
+ | 0_0 | 4 |
+ +-----------+----------------------------+
+
+### Computing Statistics on a Table
+
+You can create a table from the data in the `nation.parquet` file, as shown:
+
+ CREATE TABLE nation1 AS SELECT * from `parquet/nation.parquet`;
+ +-----------+----------------------------+
+ | Fragment | Number of records written |
+ +-----------+----------------------------+
+ | 0_0 | 25 |
+ +-----------+----------------------------+
+
+Drill writes the table to the `/home` directory, which is where the
`dfs.samples` workspace points:
+
+ [root@doc23 home]# ls
+ nation1 parquet
+
+Changing to the `nation1` directory, you can see that the table is written as
a parquet file:
+
+ [root@doc23 home]# cd nation1
+ [root@doc23 nation1]# ls
+ 0_0_0.parquet
+
+You can run the ANALYZE TABLE statement on a subset of columns in the table to
generate statistics for those columns only, as shown:
+
+ ANALYZE TABLE dfs.samples.nation1 COMPUTE STATISTICS (N_NATIONKEY,
N_REGIONKEY);
+ +-----------+----------------------------+
+ | Fragment | Number of records written |
+ +-----------+----------------------------+
+ | 0_0 | 2 |
+ +-----------+----------------------------+
+
+Or, you can run the ANALYZE TABLE statement on the entire table if you want
statistics generated for all columns in the table:
+
+ ANALYZE TABLE dfs.samples.nation1 COMPUTE STATISTICS;
+ +-----------+----------------------------+
+ | Fragment | Number of records written |
+ +-----------+----------------------------+
+ | 0_0 | 4 |
+ +-----------+----------------------------+
+
+### Computing Statistics on a SAMPLE
+You can also run ANALYZE TABLE on a percentage of the data in a table using
the SAMPLE command, as shown:
+
+ ANALYZE TABLE dfs.samples.nation1 COMPUTE STATISTICS SAMPLE 50 PERCENT;
+ +-----------+----------------------------+
+ | Fragment | Number of records written |
+ +-----------+----------------------------+
+ | 0_0 | 4 |
+ +-----------+----------------------------+
+
+### Storing Statistics
+When you generate statistics, a statistics directory (`.stats.drill`) is
created with a JSON file that contains the statistical data.
+
+For tables, the `.stats.drill` directory is nested within the table directory.
For example, if you ran ANALYZE TABLE against a table named “nation1,” you
could access the statistics file in:
+
+ [root@doc23 home]# cd nation1/.stats.drill
+ [root@doc23 .stats.drill]# ls
+ 0_0.json
+
+For directories, a new directory is written with the same name as the
directory on which you ran ANALYZE TABLE, with `.stats.drill` appended. For
example, if you ran ANALYZE TABLE against a directory named “parquet,” you
could access the statistics file in:
+
+ [root@doc23 home]# cd parquet.stats.drill
+ [root@doc23 parquet.stats.drill]# ls
+ 0_0.json
+
+You can query the statistics file, as shown in the following two examples:
+
+ SELECT * FROM dfs.samples.`parquet.stats.drill`;
+
+--------------------+----------------------------------------------------------------------------------+
+ | statistics_version | directories
|
+
+--------------------+----------------------------------------------------------------------------------+
+ | v1 |
[{"computed":"2019-04-23","columns":[{"column":"`R_REGIONKEY`","majortype":{"type":"BIGINT","mode":"REQUIRED"},"schema":1.0,"rowcount":5.0,"nonnullrowcount":5.0,"ndv":5,"avgwidth":8.0,"histogram":{"category":"numeric-equi-depth","numRowsPerBucket":1,"buckets":[1.0,0.0,0.0,2.9999999999999996,2.0,4.0]}},{"column":"`R_NAME`","majortype":{"type":"VARCHAR","mode":"REQUIRED"},"schema":1.0,"rowcount":5.0,"nonnullrowcount":5.0,"ndv":5,"avgwidth":6.8,"histogram":{"buckets"
[...]
+
+--------------------+----------------------------------------------------------------------------------+
+
+
+
+ SELECT t.directories.columns[0].ndv as ndv,
t.directories.columns[0].rowcount as rc, t.directories.columns[0].non
nullrowcount AS nnrc, t.directories.columns[0].histogram as histogram FROM
dfs.samples.`parquet.stats.drill` t;
+
+-----+-----+------+----------------------------------------------------------------------------------+
+ | ndv | rc | nnrc | histogram
|
+
+-----+-----+------+----------------------------------------------------------------------------------+
+ | 5 | 5.0 | 5.0 |
{"category":"numeric-equi-depth","numRowsPerBucket":1,"buckets":[1.0,0.0,0.0,2.9999999999999996,2.0,4.0]}
|
+
+-----+-----+------+----------------------------------------------------------------------------------+
+
+### Dropping Statistics
+
+If you want to compute statistics on a table or directory that you have
already run the ANALYZE TABLE statement against, you must first drop the
statistics before you can run the ANALYZE TABLE statement on the table again.
+
+The following example demonstrates how to drop statistics on a table:
+
+ DROP TABLE dfs.samples.`parquet/.stats.drill`;
+ +-------+-------------------------------------+
+ | ok | summary |
+ +-------+-------------------------------------+
+ | true | Table [parquet/.stats.drill] dropped |
+ +-------+-------------------------------------+
+
+The following example demonstrates how to drop statistics on a directory:
+
+ DROP TABLE dfs.samples.`/parquet.stats.drill`;
+ +-------+------------------------------------+
+ | ok | summary |
+ +-------+------------------------------------+
+ | true | Table [parquet.stats.drill] dropped |
+ +-------+------------------------------------+
+
+When you drop statistics, the statistics directory no longer exists for the
table:
+
+ [root@doc23 home]# cd parquet/.stats.drill
+ -bash: cd: parquet/.stats.drill: No such file or directory
+
+ SELECT * FROM dfs.samples.`parquet/.stats.drill`;
+ Error: VALIDATION ERROR: From line 1, column 15 to line 1, column 17:
Object 'parquet/.stats.drill' not found within 'dfs.samples'
+ [Error Id: 0b9a0c35-f058-4e0a-91d5-034d095393d7 on doc23.lab:31010]
(state=,code=0)
+
+## Troubleshooting
+
+Typical errors you may get when running ANALYZE TABLE result from running the
statement against an individual file or against a data source other than
Parquet, as shown in the following examples:
+
+**Running ANALYZE TABLE on a file.**
+
+ ANALYZE TABLE `/parquet/nation.parquet` COMPUTE STATISTICS;
+
+--------+----------------------------------------------------------------------------------+
+ | ok | summary
|
+
+--------+----------------------------------------------------------------------------------+
+ | false | Table /parquet/nation.parquet is not supported by ANALYZE.
Support is currently limited to directory-based Parquet tables. |
+
+--------+----------------------------------------------------------------------------------+
+
+
+**Running ANALYZE TABLE on a data source other than Parquet.**
+
+ ANALYZE TABLE nation1_json COMPUTE STATISTICS;
+
+--------+----------------------------------------------------------------------------------+
+ | ok | summary
|
+
+--------+----------------------------------------------------------------------------------+
+ | false | Table nation1_json is not supported by ANALYZE. Support is
currently limited to directory-based Parquet tables. |
+
+--------+----------------------------------------------------------------------------------+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/_docs/sql-reference/sql-commands/011-refresh-table-metadata.md
b/_docs/sql-reference/sql-commands/011-refresh-table-metadata.md
new file mode 100644
index 0000000..d678619
--- /dev/null
+++ b/_docs/sql-reference/sql-commands/011-refresh-table-metadata.md
@@ -0,0 +1,282 @@
+---
+title: "REFRESH TABLE METADATA"
+date: 2019-04-23
+parent: "SQL Commands"
+---
+Run the REFRESH TABLE METADATA command on Parquet tables and directories to
generate a metadata cache file. REFRESH TABLE METADATA collects metadata from
the footers of Parquet files and writes the metadata to a metadata file
(`.drill.parquet_file_metadata.v4`) and a summary file
(`.drill.parquet_summary_metadata.v4`). The planner uses the metadata cache
file to prune extraneous data during the query planning phase. Run the REFRESH
TABLE METADATA command if planning time is a significant [...]
+
+Starting in Drill 1.11, Drill stores the paths to Parquet files as relative
paths instead of absolute paths. You can move partitioned Parquet directories
from one location in the distributed file system to another without issuing the
REFRESH TABLE METADATA command to rebuild the Parquet metadata cache files; the
metadata remains valid in the new location.
+
+## Syntax
+
+The REFRESH TABLE METADATA command supports the following syntax:
+
+ REFRESH TABLE METADATA [ COLUMNS ( column1, column2...) | NONE ]
table_path
+
+
+## Parameters
+*COLUMNS*
+Optional. Introduced in Drill 1.16. Use to indicate the columns on which the
command should refresh metadata. When omitted, metadata is collected for all
the columns.
+
+*column*
+Required when using the COLUMNS clause. The columns for which metadata will be
stored.
+
+*NONE*
+Optional. Introduced in Drill 1.16. When used, the REFRESH command does not
refresh any metadata; however, a summary file is generated and contains a
summary section (ColumnTypeInfo) that lists all the columns with their data
types.
+
+*table_path*
+Required. The name of the table or directory for which Drill will refresh
metadata.
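+
+For example, based on the syntax above, the following statements refresh all
metadata, refresh metadata for specific columns, and generate only the summary
file. The `dfs.samples.parquet` table path and column names are placeholders
taken from the examples later in this topic; substitute your own:
+
+    REFRESH TABLE METADATA dfs.samples.`parquet`;
+    REFRESH TABLE METADATA COLUMNS (R_REGIONKEY, R_NAME) dfs.samples.`parquet`;
+    REFRESH TABLE METADATA NONE dfs.samples.`parquet`;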
+
+## Related Command
+Run the [EXPLAIN]({{site.baseurl}}/docs/explain/) command to determine the
query execution time. If the query execution time is the most significant time
factor, running REFRESH TABLE METADATA will not improve query performance.
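+
+For example, a minimal sketch (the table path is a placeholder):
+
+    EXPLAIN PLAN FOR SELECT * FROM dfs.samples.`parquet`;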
+
+
+## Usage Notes
+
+### Metadata Storage
+- Drill traverses directories for Parquet files and gathers the metadata from
the footer of the files. Drill stores the collected metadata in a metadata
cache file, `.drill.parquet_metadata`, at each directory level.
+- The metadata cache file stores metadata for files in that directory, as well
as the metadata for the files in the subdirectories.
+- For each row group in a Parquet file, the metadata cache file stores the
column names in the row group and the column statistics, such as the min/max
values and null count.
+- If the Parquet data is updated, for example data is added to a file, Drill
automatically refreshes the Parquet metadata when you issue the next query
against the Parquet data.
+
+### Refreshing Columns
+- Starting in Drill 1.16, you can run the REFRESH TABLE METADATA command on
specific columns. When you run the command on specific columns, metadata is
refreshed for the indicated columns only.
+- Refreshing the metadata for sorted or partitioned columns improves the
planning time for queries that filter on these columns. The query planner uses
the min and max column statistics to determine which data meets the filter
criteria. The planner can prune files and row groups that do not meet the
filter criteria, which reduces the amount of time it takes the query planner to
scan the metadata during query planning. Refreshing column metadata (versus
entire tables) also reduces the siz [...]
+
+### Automatic Refresh
+
+- You only have to run the REFRESH TABLE METADATA command against a table once
to generate the initial metadata cache file. Thereafter, Drill automatically
refreshes stale cache files when you issue queries against the table. An
automatic refresh is triggered when data is modified. The query planner uses the
timestamp of the cache file and table to determine if the cache file is stale.
+- An automatic refresh updates the metadata cache file the same way the
REFRESH TABLE METADATA command did when it was last issued against a table. For
example, if you ran REFRESH TABLE METADATA on col1 and col2 in table t1, the
next time a query is issued against table t1, the query planner checks the
timestamp on the cache file and the table, t1. If the timestamp on the data is
later than the timestamp on the metadata cache file, automatic refresh is
triggered and metadata is refreshed [...]
+- Note that the elapsed time of a query that triggers an automatic refresh can
be greater than that of subsequent queries that use the metadata. To avoid
this, you can manually run the REFRESH TABLE METADATA command.
+
+### Query Planning and Execution
+
+- Drill reads the metadata cache file during query planning, which improves
the query planning time.
+- Parquet metadata caching has no effect on query execution time. At execution
time, Drill reads the files. Metadata caching will not improve query
performance if the time it takes to execute the query is greater than the time
used to plan a query.
+- Run [EXPLAIN]({{site.baseurl}}/docs/explain/) for the query to determine if
query execution time is the most significant time factor for the query. Compare
the query execution time to the overall time for the query to determine whether
metadata caching would be useful.
+
+### Not Supported
+- Parquet metadata caching does not benefit queries on Hive tables, HBase
tables, or text files. Drill only uses the Hive metastore to query Parquet
files when a query is issued against the Hive storage plugin.
+
+
+## Related Options
+You can set the following options related to the REFRESH TABLE METADATA
command at the system or session level with the SET (session level) or ALTER
SYSTEM SET (system level) statements, or through the Drill Web UI at
`http://<drill-hostname-or-ip>:8047/options`:
+
+- **planner.store.parquet.rowgroup.filter.pushdown.enabled**
+Enables filter pushdown optimization for Parquet files. Drill reads the file
metadata, stored in the footer, to eliminate row groups based on the filter
condition. Default is true. (Drill 1.9+)
+- **planner.store.parquet.rowgroup.filter.pushdown.threshold**
+Sets the number of row groups that a table can have. You can increase the
threshold if the filter can prune many row groups. However, if this setting is
too high, the filter evaluation overhead increases. Base this setting on the
data set. Reduce this setting if the planning time is significant or you do not
see any benefit at runtime. Default is 10000. (Drill 1.9+)
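+
+For example, a hedged sketch of raising the row group threshold at the session
level (the value is arbitrary; base it on your data set):
+
+    SET `planner.store.parquet.rowgroup.filter.pushdown.threshold` = 20000;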
+
+## Limitations
+Currently, Drill does not support runtime rowgroup pruning.
+
+<!--
+## Examples
+These examples use a schema, `dfs.samples`, which points to the `/home`
directory. The `/home` directory contains a subdirectory, `parquet`, which
+contains the `nation.parquet` and a subdirectory, `dir1` with the
`region.parquet` file. You can access the `nation.parquet` and `region.parquet`
Parquet files in the `sample-data` directory of your Drill installation.
+
+ [root@doc23 dir1]# pwd
+ /home/parquet/dir1
+
+ [root@doc23 parquet]# ls
+ dir1 nation.parquet
+
+ [root@doc23 dir1]# ls
+ region.parquet
+
+Change schemas to use `dfs.samples`:
+
+ use dfs.samples;
+ +-------+------------------------------------------+
+ | ok | summary |
+ +-------+------------------------------------------+
+ | true | Default schema changed to [dfs.samples] |
+ +-------+------------------------------------------+
+
+### Running REFRESH TABLE METADATA on a Directory
+Running the REFRESH TABLE METADATA command on the `parquet` directory
generates metadata cache files at each directory level.
+
+ REFRESH TABLE METADATA parquet;
+ +-------+---------------------------------------------------+
+ | ok | summary |
+ +-------+---------------------------------------------------+
+ | true | Successfully updated metadata for table parquet. |
+ +-------+---------------------------------------------------+
+
+When looking at the `parquet` directory and `dir1` subdirectory, you can see
that a metadata cache file was created at each level:
+
+ [root@doc23 parquet]# ls -la
+ drwxr-xr-x 2 root root 95 Mar 18 17:49 dir1
+ -rw-r--r-- 1 root root 2642 Mar 18 17:52 .drill.parquet_metadata
+ -rw-r--r-- 1 root root 32 Mar 18 17:52 ..drill.parquet_metadata.crc
+ -rwxr-xr-x 1 root root 1210 Mar 13 13:32 nation.parquet
+
+ [root@doc23 dir1]# ls -la
+ -rw-r--r-- 1 root root 1235 Mar 18 17:52 .drill.parquet_metadata
+ -rw-r--r-- 1 root root 20 Mar 18 17:52 ..drill.parquet_metadata.crc
+ -rwxr-xr-x 1 root root 455 Mar 18 17:41 region.parquet
+
+The following sections compare the content of the metadata cache file in the
`parquet` and `dir1` directories:
+
+**Content of the metadata cache file in the directory named `parquet` that
contains the nation.parquet file and subdirectory `dir1`.**
+
+
+ [root@doc23 parquet]# cat .drill.parquet_metadata
+ {
+ "metadata_version" : "3.3",
+ "columnTypeInfo" : {
+ "`N_COMMENT`" : {
+ "name" : [ "N_COMMENT" ],
+ "primitiveType" : "BINARY",
+ "originalType" : "UTF8",
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ },
+ "`N_NATIONKEY`" : {
+ "name" : [ "N_NATIONKEY" ],
+ "primitiveType" : "INT64",
+ "originalType" : null,
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ },
+ "`R_REGIONKEY`" : {
+ "name" : [ "R_REGIONKEY" ],
+ "primitiveType" : "INT64",
+ "originalType" : null,
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ },
+ "`R_COMMENT`" : {
+ "name" : [ "R_COMMENT" ],
+ "primitiveType" : "BINARY",
+ "originalType" : "UTF8",
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ },
+ "`N_REGIONKEY`" : {
+ "name" : [ "N_REGIONKEY" ],
+ "primitiveType" : "INT64",
+ "originalType" : null,
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ },
+ "`R_NAME`" : {
+ "name" : [ "R_NAME" ],
+ "primitiveType" : "BINARY",
+ "originalType" : "UTF8",
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ },
+ "`N_NAME`" : {
+ "name" : [ "N_NAME" ],
+ "primitiveType" : "BINARY",
+ "originalType" : "UTF8",
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ }
+ },
+ "files" : [ {
+ "path" : "dir1/region.parquet",
+ "length" : 455,
+ "rowGroups" : [ {
+ "start" : 4,
+ "length" : 250,
+ "rowCount" : 5,
+ "hostAffinity" : {
+ "localhost" : 1.0
+ },
+ "columns" : [ ]
+ } ]
+ }, {
+ "path" : "nation.parquet",
+ "length" : 1210,
+ "rowGroups" : [ {
+ "start" : 4,
+ "length" : 944,
+ "rowCount" : 25,
+ "hostAffinity" : {
+ "localhost" : 1.0
+ },
+ "columns" : [ ]
+ } ]
+ } ],
+ "directories" : [ "dir1" ],
+ "drillVersion" : "1.16.0-SNAPSHOT"
+    }
+
+**Content of the directory named `dir1` that contains the `region.parquet`
file and no subdirectories.**
+
+ [root@doc23 dir1]# cat .drill.parquet_metadata
+ {
+ "metadata_version" : "3.3",
+ "columnTypeInfo" : {
+ "`R_REGIONKEY`" : {
+ "name" : [ "R_REGIONKEY" ],
+ "primitiveType" : "INT64",
+ "originalType" : null,
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ },
+ "`R_COMMENT`" : {
+ "name" : [ "R_COMMENT" ],
+ "primitiveType" : "BINARY",
+ "originalType" : "UTF8",
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ },
+ "`R_NAME`" : {
+ "name" : [ "R_NAME" ],
+ "primitiveType" : "BINARY",
+ "originalType" : "UTF8",
+ "precision" : 0,
+ "scale" : 0,
+ "repetitionLevel" : 0,
+ "definitionLevel" : 0
+ }
+ },
+ "files" : [ {
+ "path" : "region.parquet",
+ "length" : 455,
+ "rowGroups" : [ {
+ "start" : 4,
+ "length" : 250,
+ "rowCount" : 5,
+ "hostAffinity" : {
+ "localhost" : 1.0
+ },
+ "columns" : [ ]
+ } ]
+ } ],
+ "directories" : [ ],
+ "drillVersion" : "1.16.0-SNAPSHOT"
+ }
+
+### Verifying that the Planner is Using the Metadata Cache File
+
+When the planner uses metadata cache files, the query plan includes the
`usedMetadataFile` flag. You can access the query plan in the Drill Web UI, by
clicking on the query in the Profiles page, or by running the EXPLAIN PLAN FOR
command, as shown:
+
+ EXPLAIN PLAN FOR SELECT * FROM parquet;
+
+ | 00-00 Screen
+ 00-01 Project(**=[$0])
+ 00-02 Scan(table=[[dfs, samples, parquet]],
groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/home/parquet]],
selectionRoot=/home/parquet, numFiles=1, numRowGroups=2, usedMetadataFile=true,
cacheFileRoot=/home/parquet, columns=[`**`]]])
+ |...
+
+-->
diff --git a/_docs/sql-reference/sql-commands/021-create-schema.md
b/_docs/sql-reference/sql-commands/021-create-schema.md
new file mode 100644
index 0000000..f236735
--- /dev/null
+++ b/_docs/sql-reference/sql-commands/021-create-schema.md
@@ -0,0 +1,650 @@
+---
+title: "CREATE OR REPLACE SCHEMA"
+date: 2019-04-25
+parent: "SQL Commands"
+---
+
+Starting in Drill 1.16, you can define a schema for text files using the
CREATE OR REPLACE SCHEMA command. Running this command generates a hidden
.drill.schema file in the table’s root directory. The .drill.schema file stores
the schema definition in JSON format. Drill uses the schema file at runtime if
the exec.storage.enable_v3_text_reader and store.table.use_schema_file options
are enabled. Alternatively, you can create the schema file manually. When
created manually, the file conten [...]
+
+## Syntax
+
+The CREATE OR REPLACE SCHEMA command supports the following syntax:
+
+ CREATE [OR REPLACE] SCHEMA
+ [LOAD 'file:///path/to/file']
+ [(column_name data_type nullability format default properties
{prop='val', ...})]
+ [FOR TABLE `table_name`]
+ [PATH 'file:///schema_file_path/schema_file_name']
+ [PROPERTIES ('key1'='value1', 'key2'='value2', ...)]
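+
+For example, a minimal sketch that combines several of these clauses, echoing
the examples later in this topic (PATH is omitted because it cannot be combined
with FOR TABLE):
+
+    create or replace schema
+    (id int not null default '-1', start_date date format 'yyyy-MM-dd')
+    for table dfs.tmp.`text_table`
+    properties ('drill.strict' = 'true');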
+
+## Parameters
+
+*OR REPLACE*
+Drops the existing schema and replaces it with the new schema. Only supported
when using FOR TABLE. Not supported when using PATH; this restriction prevents
malicious deletion of arbitrary files. You must manually delete any schema file
created in a custom location.
+
+*LOAD*
+Loads the raw schema (a list of column names with their attributes) from a
file. You must indicate the path to the file after the LOAD keyword. Note that
columns must either be listed inline or provided through the LOAD clause; at
least one of the two is required for successful schema creation.
+
+*column_name*
+Name of the column for which schema is created. Case-insensitive.
+
+*data_type*
+Data type defined for the column. See Supported Data Types.
+
+*format*
+Sets the format for date and time data types when converting from string.
+
+*default*
+Sets a default value for non-nullable columns, such that queries return the
default value instead of null.
+
+*properties*
+Keyword to include optional properties. See Related Properties below.
+
+*property*
+Name of the property applied to the column or table.
+
+*value*
+Value set for the indicated property.
+
+*table_name*
+Name of the table associated with the schema being created or replaced.
Enclose the table name in backticks if there are spaces after the FOR TABLE
keywords. If the table does not exist, the command fails and schema is not
created. If you indicate the table name without schema, the table is assumed to
be in the current workspace, and you must specify the PATH property. If you
indicate FOR TABLE and PATH, or you do not indicate either, the CREATE SCHEMA
command fails. In this case, the ta [...]
+
+*PATH*
+Path to the schema file. You must indicate the path to the file after the PATH
keyword.
+
+*properties*
+List of properties as key-value pairs in parentheses.
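+
+For example, a hedged sketch of creating a schema in a custom location with
PATH instead of FOR TABLE; the file path is a hypothetical placeholder, and OR
REPLACE is omitted because it is not supported with PATH:
+
+    create schema (id int, name varchar) path 'file:///tmp/schemas/text_table.schema';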
+
+## Related Options
+
+You must enable the following options for Drill to use the schema created
during query execution:
+
+ set `exec.storage.enable_v3_text_reader` = true;
+ +------+---------------------------------------------+
+ | ok | summary |
+ +------+---------------------------------------------+
+ | true | exec.storage.enable_v3_text_reader updated. |
+ +------+---------------------------------------------+
+
+ set `store.table.use_schema_file` = true;
+ +------+--------------------------------------+
+ | ok | summary |
+ +------+--------------------------------------+
+ | true | store.table.use_schema_file updated. |
+ +------+--------------------------------------+
+
+## Related Properties
+
+When you create a schema, you can set the following properties within the
CREATE [OR REPLACE] SCHEMA command:
+
+**drill.strict**
+A table property that determines the ordering of columns returned for wildcard
(*) queries. Accepts a value of true or false. See Schema Mode (Column Order).
+
+**drill.format**
+A column property that ensures proper conversion when converting string values
to date and time data types. See Format for Date, Time Conversion.
+
+**drill.default**
+A column property that sets non-nullable columns to a “default” value when
creating the schema. See Column Modes (Nullable and Non-Nullable Columns).
+
+**drill.blank-as**
+A property that sets how Drill handles blank column values. Accepts the
following values:
+- **null**: If the column is nullable, treat the blank as null. If
non-nullable, leave the blank unchanged.
+- **0**: Replace blanks with the value "0" for numeric types.
+- **skip**: Skip blank values. This sets the column to its default value: NULL
for nullable columns, the default value for non-nullable columns.
+- If left empty, blanks have no special meaning. A blank is parsed as any
other string, which typically produces an error.
+
+See Handling Policy for Blank Column Values.
+
+### Setting Properties
+Include properties after the “properties” keyword, as shown in the following
example where the date format is set to `'yyyy-MM-dd'` through the
`drill.format` column property:
+
+ create or replace schema (start_date date properties {'drill.format' =
'yyyy-MM-dd'}) for table dfs.tmp.`text_table`;
+
+Alternatively, you can use “default” and “format” as keywords when creating
schema, as shown in the following examples:
+
+Setting the default for the non-nullable column “id” to -1 using the keyword
“default”:
+
+ create or replace schema (id int not null default '-1') for table
dfs.tmp.`text_table`;
+
+When you query the text_table, all blank values in the “id” column return a
value of -1.
+
+### Storing Properties
+The defined schema and configured properties are stored and reflected in the
schema file, `.drill.schema`, which you can see when you run DESCRIBE SCHEMA
FOR TABLE.
+
+ describe schema for table dfs.tmp.`text_table`;
+
+----------------------------------------------------------------------------------+
+ | schema
|
+
+----------------------------------------------------------------------------------+
+ | {
+ "table" : "dfs.tmp.`text_table`",
+ "schema" : {
+ "columns" : [
+ {
+ "name" : "id",
+ "type" : "INT",
+ "mode" : "REQUIRED",
+ "properties" : {
+ "drill.default" : "-1"
+ }
+ },
+ {
+ "name" : "start_date",
+ "type" : "DATE",
+ "mode" : "REQUIRED",
+ "properties" : {
+ "drill.format" : "yyyy-MM-dd",
+ "drill.default" : "2017-01-01"
+ }
+ }
+ ],
+ "properties" : {
+ "drill.strict" : "true"
+ }
+ },
+ "version" : 1
+ } |
+
+----------------------------------------------------------------------------------+
+
+
+## Related Commands
+
+
+ DROP SCHEMA [IF EXISTS] FOR TABLE `table_name`
+See Dropping Schema for a Table in the Examples section at the end of this
topic.
+
+
+ DESCRIBE SCHEMA FOR TABLE `table_name`
+See Describing Schema for a Table in the Examples section at the end of this
topic.
+
+## Supported Data Types
+
+Text files store information in string format and only support simple data
types. You can use the CREATE [OR REPLACE] SCHEMA command to convert string
data types in text files to the following data types:
+- INTEGER
+- BIGINT
+- DOUBLE
+- FLOAT
+- DECIMAL
+- BOOLEAN
+- VARCHAR
+- TIMESTAMP
+- DATE
+- TIME
+- INTERVAL [YEAR, MONTH, DAY, HOUR, MINUTE, SECOND]
+
+**Note:** Complex data types (arrays and maps) are not supported.
+
+Values are trimmed when converting to any type, except for varchar.
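+
+For example, a hedged sketch that converts several of these types in one
schema definition; the column names are illustrative and do not appear in the
sample files used later in this topic:
+
+    create or replace schema
+    (id int, price double, in_stock boolean, updated timestamp format 'yyyy-MM-dd HH:mm:ss')
+    for table dfs.tmp.`text_table`;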
+
+## Usage Notes
+
+### General Information
+- Schema provisioning only works with tables defined as directories because
Drill must have a place to store the schema file. The directory can contain one
or more files.
+- Text files must have headers. The default extension for delimited text files
with headers is `.csvh`. Note that the column names that appear in the headers
must match the column definitions in the schema.
+- You do not have to enumerate all columns in a file when creating a schema.
You can indicate the columns of interest only.
+- Columns in the defined schema do not have to be in the same order as in the
data file. However, the names must match. The case can differ, for example
“name” and “NAME” are acceptable.
+- Queries on columns with data types that cannot be converted fail with a
`DATA_READ_ERROR`.
+
+### Schema Mode (Column Order)
+The schema mode determines the ordering of columns returned for wildcard (*)
queries. The mode is set through the `drill.strict` property. You can set this
property to true (strict) or false (not strict). If you do not indicate the
mode, the default is false (not strict).
+
+**Not Strict (Default)**
+Columns defined in the schema are projected in the defined order. Columns not
defined in the schema are appended to the defined columns, as shown:
+
+ create or replace schema (id int, start_date date format 'yyyy-MM-dd')
for table dfs.tmp.`text_table` properties ('drill.strict' = 'false');
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_table] |
+ +------+-----------------------------------------+
+
+ select * from dfs.tmp.`text_table`;
+ +------+------------+---------+
+ | id | start_date | name |
+ +------+------------+---------+
+ | 1 | 2019-02-01 | Fred |
+ | 2 | 2018-11-30 | Wilma |
+ | 3 | 2016-01-01 | Pebbles |
+ | 4 | null | Barney |
+ | null | null | Dino |
+ +------+------------+---------+
+
+Note that the “name” column, which was not included in the schema, was appended
to the end of the table.
+
+**Strict**
+Setting the `drill.strict` property to “true” changes the schema mode to
strict, which means that the reader ignores any columns NOT included in the
schema. The query only returns the columns defined in the schema, as shown:
+
+ create or replace schema (id int, start_date date format 'yyyy-MM-dd')
for table dfs.tmp.`text_table` properties ('drill.strict' = 'true');
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_table] |
+ +------+-----------------------------------------+
+
+ select * from dfs.tmp.`text_table`;
+ +------+------------+
+ | id | start_date |
+ +------+------------+
+ | 1 | 2019-02-01 |
+ | 2 | 2018-11-30 |
+ | 3 | 2016-01-01 |
+ | 4 | null |
+ | null | null |
+ +------+------------+
+
+Note that the “name” column, which was not included in the schema, was ignored
and not returned in the result set.
+
+## Including Additional Columns in the Schema
+When you create a schema, you can include columns that do not exist in the
table, and these columns will be projected. This feature ensures that queries
return the correct results whether the files have a specific column or not.
Note that schema mode does not affect the behavior of this feature.
+
+For example, the “comment” column is not in the text_table, but is added when
creating the schema:
+
+ create or replace schema (id int, start_date date format 'yyyy-MM-dd',
comment varchar) for table dfs.tmp.`text_table`;
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_table] |
+ +------+-----------------------------------------+
+
+You can see the “comment” column returned in the result set.
+
+ select * from dfs.tmp.`text_table`;
+ +------+------------+---------+---------+
+ | id | start_date | comment | name |
+ +------+------------+---------+---------+
+ | 1 | 2019-02-01 | null | Fred |
+ | 2 | 2018-11-30 | null | Wilma |
+ | 3 | 2016-01-01 | null | Pebbles |
+ | 4 | null | null | Barney |
+ | null | null | null | Dino |
+ +------+------------+---------+---------+
+
+## Column Modes (Nullable and Non-Nullable Columns)
+If a column in the schema is nullable (allows null values), and the column has
a null value, the column value is returned as null. If the column is required
(not nullable), but contains a null value, Drill returns the default value
provided. If no default value is provided, Drill sets the column value to the
natural default.
+
+For example, if you create a strict schema with two nullable columns (id and
start_date), you can see that the missing values in both cases are null.
+
+ create or replace schema (id int, start_date date format 'yyyy-MM-dd')
for table dfs.tmp.`text_table` properties ('drill.strict' = 'true');
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_table] |
+ +------+-----------------------------------------+
+
+ select * from dfs.tmp.`text_table`;
+ +------+------------+
+ | id | start_date |
+ +------+------------+
+ | 1 | 2019-02-01 |
+ | 2 | 2018-11-30 |
+ | 3 | 2016-01-01 |
+ | 4 | null |
+ | null | null |
+ +------+------------+
+
+Updating the strict schema to have two required columns (id and start_date),
you can see that the natural default was applied: 0 for id and 1970-01-01 for
start_date.
+
+ create or replace schema (id
+ int not null, start_date date not null format 'yyyy-MM-dd') for table
+ dfs.tmp.`text_table` properties ('drill.strict' = 'true');
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_table] |
+ +------+-----------------------------------------+
+
+ select * from dfs.tmp.`text_table`;
+ +----+------------+
+ | id | start_date |
+ +----+------------+
+ | 1 | 2019-02-01 |
+ | 2 | 2018-11-30 |
+ | 3 | 2016-01-01 |
+ | 4 | 1970-01-01 |
+ | 0 | 1970-01-01 |
+ +----+------------+
+
+Adding a default for each of these columns (-1 for id and 2017-01-01 for
start_date), you can see that the columns return the defined default value
instead of the natural default.
+
+ create or replace schema (id
+ int not null default '-1', start_date date not null format 'yyyy-MM-dd'
default
+ '2017-01-01') for table dfs.tmp.`text_table` properties ('drill.strict'
=
+ 'true');
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_table] |
+ +------+-----------------------------------------+
+
+ select * from dfs.tmp.`text_table`;
+ +----+------------+
+ | id | start_date |
+ +----+------------+
+ | 1 | 2019-02-01 |
+ | 2 | 2018-11-30 |
+ | 3 | 2016-01-01 |
+ | 4 | 2017-01-01 |
+ | -1 | 2017-01-01 |
+ +----+------------+
+
+## Handling Policy for Blank Column Values
+It is common for CSV files to have blank column values. The default
+output for blank column values is an empty string (''), as shown:
+
+ select * from dfs.tmp.`text_blank`;
+ +----+--------+------------+
+ | id | amount | start_date |
+ +----+--------+------------+
+ | 1 | 20 | 2019-01-01 |
+ | 2 | | |
+ | 3 | 30 | |
+ +----+--------+------------+
+
+When a schema is defined for columns, the default blank handling policy is
`skip`, which treats blank values as null, as shown:
+
+ create or replace schema (id
+ int, amount double, start_date date format 'yyyy-MM-dd') for table
+ dfs.tmp.`text_blank`;
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_blank] |
+ +------+-----------------------------------------+
+
+ select * from dfs.tmp.`text_blank`;
+ +----+--------+------------+
+ | id | amount | start_date |
+ +----+--------+------------+
+ | 1 | 20.0 | 2019-01-01 |
+ | 2 | null | null |
+ | 3 | 30.0 | null |
+ +----+--------+------------+
+
+If a column is absent from the schema, the default blank handling policy applies.
Note that the blank handling policy is not applicable to varchar columns since
they do not go through the type conversion logic.
+
+You can configure how Drill handles blank column values through the
`drill.blank-as` property when you create schema.
+
+In the following example, you can see the blank handling policy for the
defined schema with the `drill.blank-as` property set to `0` on the “amount”
column:
+
+ create or replace schema (id int, amount double properties
{'drill.blank-as' = '0'}, start_date date format 'yyyy-MM-dd') for table
dfs.tmp.`text_blank`;
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_blank] |
+ +------+-----------------------------------------+
+
+ select * from dfs.tmp.`text_blank`;
+ +----+--------+------------+
+ | id | amount | start_date |
+ +----+--------+------------+
+ | 1 | 20.0 | 2019-01-01 |
+ | 2 | 0.0 | null |
+ | 3 | 30.0 | null |
+ +----+--------+------------+
+
+## Format for Date, Time Conversion
+When you convert string values to date and time data types, include the format
for proper conversion.
+
+You can include the format using the keyword “format,” as shown:
+
+ create or replace schema (start_date date format 'yyyy-MM-dd') for
table dfs.tmp.`text_table`;
+
+Alternatively, you can include the format in the column properties, as shown:
+
+ create or replace schema (start_date date properties {'drill.format' =
'yyyy-MM-dd'}) for table dfs.tmp.`text_table`;
+
+Note that date and time type conversion uses the Joda-Time library, so the
format pattern must comply with the [Joda time supported format
pattern](https://www.joda.org/joda-time/key_format.html). If the format is not
indicated, ISO datetime formats are used:
+
+| **Type** | **Accepted Format** |
+|-----------|--------------------------------------------------------------------------|
+| [Timestamp](https://www.joda.org/joda-time/apidocs/org/joda/time/format/ISODateTimeFormat.html#dateTimeNoMillis--) | yyyy-MM-dd'T'HH:mm:ssZZ |
+| [Date](https://www.joda.org/joda-time/apidocs/org/joda/time/format/ISODateTimeFormat.html#localDateParser--) | date-element = std-date-element | ord-date-element | week-date-element |
+| | std-date-element = yyyy ['-' MM ['-' dd]] |
+| | ord-date-element = yyyy ['-' DDD] |
+| | week-date-element = xxxx '-W' ww ['-' e] |
+| [Time](https://www.joda.org/joda-time/apidocs/org/joda/time/format/ISODateTimeFormat.html#localTimeParser--) | time = ['T'] time-element |
+| | time-element = HH [minute-element] | [fraction] |
+| | minute-element = ':' mm [second-element] | [fraction] |
+| | second-element = ':' ss [fraction] |
+| | fraction = ('.' | ',') digit+ |
+
+
+## Limitations
+None
+
+## Examples
+Examples throughout this topic use the files and directories described in the
following section, Directory and File Setup.
+
+### Directory and File Setup
+
+ [root@doc23 text_table]# pwd
+ /tmp/text_table
+ [root@doc23 text_table]# ls
+ 1.csvh 2.csvh
+
+ [root@doc23 text_table]# cat 1.csvh
+ id,name,start_date
+ 1,Fred,2019-02-01
+ 2,Wilma,2018-11-30
+ 3,Pebbles,2016-01-01
+ 4,Barney
+
+ [root@doc23 text_table]# cat 2.csvh
+ name
+ Dino
+
+ [root@doc23 tmp]# cd text_blank/
+ [root@doc23 text_blank]# ls
+ blank.csvh
+
+ [root@doc23 text_blank]# cat blank.csvh
+ id,amount,start_date
+ 1,20,2019-01-01
+ 2,,
+ 3,30,
+
+
+Query the directory text_table.
+
+ select * from dfs.tmp.`text_table`;
+ +----+---------+------------+
+ | id | name | start_date |
+ +----+---------+------------+
+ | 1 | Fred | 2019-02-01 |
+ | 2 | Wilma | 2018-11-30 |
+ | 3 | Pebbles | 2016-01-01 |
+ | 4 | Barney | |
+ | | Dino | |
+ +----+---------+------------+
+
+Notice that the query plan contains a scan and project:
+
+ | 00-00 Screen
+ 00-01 Project(**=[$0])
+ 00-02 Scan(table=[[dfs, tmp, text_table]],
groupscan=[EasyGroupScan [selectionRoot=file:/tmp/text_table, numFiles=2,
columns=[`**`], files=[file:/tmp/text_table/1.csvh,
file:/tmp/text_table/2.csvh], schema=null]])
+
+Using the sqltypeof and modeof functions, you can see that each column is
defined as a non-nullable string (varchar), and missing columns are defined as
empty strings:
+
+ select sqltypeof(id), modeof(id) from dfs.tmp.`text_table` limit 1;
+ +-------------------+----------+
+ | EXPR$0 | EXPR$1 |
+ +-------------------+----------+
+ | CHARACTER VARYING | NOT NULL |
+ +-------------------+----------+
+
+### Creating a Schema
+Create a defined schema for the text_table directory. When you define schema,
you do not have to enumerate all columns. The columns in the defined schema are
not required to be in the same order as in the data file. Note that the names of
the columns must match, but can differ in case.
+
+Define schema for the id column:
+
+ create schema (id int) for table dfs.tmp.`text_table`;
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_table] |
+ +------+-----------------------------------------+
+
+After you define a schema, you can see the schema file (stored in JSON format)
in the root table directory:
+
+ [root@doc23 text_table]# ls -a
+ . .. 1.csvh 2.csvh .drill.schema ..drill.schema.crc
+
+ [root@doc23 text_table]# cat .drill.schema
+ {
+ "table" : "dfs.tmp.`text_table`",
+ "schema" : {
+ "columns" : [
+ {
+ "name" : "id",
+ "type" : "INT",
+ "mode" : "OPTIONAL"
+ }
+ ]
+ },
+ "version" : 1
+    }
+
+Query the text_table directory to see how the schema is applied:
+
+ select * from dfs.tmp.`text_table`;
+ +------+---------+------------+
+ | id | name | start_date |
+ +------+---------+------------+
+ | 1 | Fred | 2019-02-01 |
+ | 2 | Wilma | 2018-11-30 |
+ | 3 | Pebbles | 2016-01-01 |
+ | 4 | Barney | |
+ | null | Dino | |
+ +------+---------+------------+
+
+After defining the schema on the id column, you can see that the `id` column
type and mode are defined as a nullable integer, while the other column types
were inferred as non-nullable VarChar:
+
+ select sqltypeof(id), modeof(id) from dfs.tmp.`text_table` limit 1;
+ +---------+----------+
+ | EXPR$0 | EXPR$1 |
+ +---------+----------+
+ | INTEGER | NULLABLE |
+ +---------+----------+
+
+Running EXPLAIN PLAN, you can see that type conversion was done while reading
data from source; no additional plan stages were added:
+
+ explain plan for select * from dfs.tmp.`text_table`;
+ | 00-00 Screen
+ 00-01 Project(**=[$0])
+ 00-02 Scan(table=[[dfs, tmp, text_table]],
groupscan=[EasyGroupScan [selectionRoot=file:/tmp/text_table, numFiles=2,
columns=[`**`], files=[file:/tmp/text_table/1.csvh,
file:/tmp/text_table/2.csvh], schema=[TupleSchema [PrimitiveColumnMetadata
[`id` (INT(0, 0):OPTIONAL)]]]]])
+
+### Describing Schema for a Table
+After you create a schema, you can examine it using the DESCRIBE SCHEMA FOR
TABLE command. The schema can be printed in JSON or STATEMENT format. JSON is
the default format if none is indicated in the query. Schema displayed in JSON
format is the same as the JSON stored in the `.drill.schema` file.
+
+ describe schema for table dfs.tmp.`text_table` as JSON;
+
+----------------------------------------------------------------------------------+
+ | schema
|
+
+----------------------------------------------------------------------------------+
+ | {
+ "table" : "dfs.tmp.`text_table`",
+ "schema" : {
+ "columns" : [
+ {
+ "name" : "id",
+ "type" : "INT",
+ "mode" : "OPTIONAL"
+ }
+ ]
+ },
+ "version" : 1
+ } |
+
+----------------------------------------------------------------------------------+
+
+STATEMENT format displays the schema in a form compatible with the CREATE OR
REPLACE SCHEMA command such that it can easily be copied, modified, and
executed.
+
+ describe schema for table dfs.tmp.`text_table` as statement;
+
+--------------------------------------------------------------------------+
+ | schema
|
+
+--------------------------------------------------------------------------+
+ | CREATE OR REPLACE SCHEMA
+ (
+ `id` INT
+ )
+ FOR TABLE dfs.tmp.`text_table`
+ |
+
+--------------------------------------------------------------------------+
+
+### Dropping Schema for a Table
+You can easily drop the schema for a table using the DROP SCHEMA [IF EXISTS]
FOR TABLE \`table_name` command, as shown:
+
+ use dfs.tmp;
+ +------+-------------------------------------+
+ | ok | summary |
+ +------+-------------------------------------+
+ | true | Default schema changed to [dfs.tmp] |
+ +------+-------------------------------------+
+
+ drop schema for table `text_table`;
+ +------+---------------------------------------+
+ | ok | summary |
+ +------+---------------------------------------+
+ | true | Dropped schema for table [text_table] |
+ +------+---------------------------------------+
+
+
+## Troubleshooting
+
+**Schema defined as incorrect data type produces DATA_READ_ERROR**
+Assume that you defined schema on the “name” column as integer, as shown:
+
+ create or replace schema (name int) for table dfs.tmp.`text_table`;
+ +------+-----------------------------------------+
+ | ok | summary |
+ +------+-----------------------------------------+
+ | true | Created schema for [dfs.tmp.text_table] |
+ +------+-----------------------------------------+
+
+Because the column does not contain integers, a query issued against the
directory returns the DATA_READ_ERROR. The error message includes the line and
value causing the issue:
+
+ select * from dfs.tmp.`text_table`;
+
+ Error: DATA_READ ERROR: Illegal conversion: Column `name` of type INT,
Illegal value `Fred`
+
+ Line 1
+ Record 0
+ Fragment 0:0
+
+ [Error Id: db6afe96-1014-4a98-a034-10e36daa1aa7 on doc23.lab:31010]
(state=,code=0)
+
+**Incorrect syntax**
+Required syntax properties are missing or incorrectly specified. Check the
CREATE SCHEMA syntax to ensure that all required properties were specified.
+
+**Missing backticks**
+Unable to parse CREATE SCHEMA command due to unknown properties and keywords.
If the specified table name or column name contains spaces, enclose the name in
backticks.
+
+**Incorrect data type**
+Parsing has failed on unknown/unsupported data type. Check which data types
are supported by Drill. Check syntax for complex data types.
+
+**Missing table name**
+The table name indicated after the `FOR TABLE` keywords is not present. Check
that the table exists in the specified schema or in the current schema (if the
schema was not specified). If you do not want to bind the schema to a specific
table, use the PATH property.
+
+**Existing schema file**
+A schema file already exists for the table. Use the DROP SCHEMA command to
remove a stale or mistakenly created schema file. Alternatively, you can use the
CREATE OR REPLACE SCHEMA command.
+
+**Lack of permissions**
+Unable to create the schema file in the directory. Verify that you have write
permission on the table’s root directory or the directory specified. If not,
indicate a directory on which you have write permissions in the command syntax.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+