This is an automated email from the ASF dual-hosted git repository.
volodymyr pushed a commit to branch gh-pages
in repository https://gitbox.apache.org/repos/asf/drill.git
The following commit(s) were added to refs/heads/gh-pages by this push:
new 0e17eea Update docs for Metastore to point that all format plugins
are supported
0e17eea is described below
commit 0e17eea19aca27b88c98778fcfb7057a45501ab9
Author: Volodymyr Vysotskyi <[email protected]>
AuthorDate: Tue Mar 17 19:56:23 2020 +0200
Update docs for Metastore to point that all format plugins are supported
---
.../035-plugin-configuration-basics.md | 36 ++++
.../plugins/080-rdbms-storage-plugin.md | 72 ++++---
.../plugins/114-image-metadata-format-plugin.md | 84 ++++-----
.../drill-metastore/010-using-drill-metastore.md | 207 +++++++++++++++++++--
.../005-querying-a-file-system-introduction.md | 6 +-
.../query-a-file-system/009-querying-avro-files.md | 28 ++-
.../007-analyze-table-refresh-metadata.md | 17 +-
.../sql-commands/021-create-schema.md | 69 ++++++-
images/7671b34d6e8a4d050f75278f10f1a08.jpg | Bin 0 -> 45877 bytes
9 files changed, 419 insertions(+), 100 deletions(-)
diff --git a/_docs/connect-a-data-source/035-plugin-configuration-basics.md
b/_docs/connect-a-data-source/035-plugin-configuration-basics.md
index 3e1d866..79a8d82 100644
--- a/_docs/connect-a-data-source/035-plugin-configuration-basics.md
+++ b/_docs/connect-a-data-source/035-plugin-configuration-basics.md
@@ -133,6 +133,8 @@ The following table describes the attributes you configure
for storage plugins i
You set the formats attributes, such as skipFirstLine, in the `formats` area
of the storage plugin configuration. When setting attributes for text files,
such as CSV, you also need to set the `sys.options` property
`exec.storage.enable_new_text_reader` to true (the default). For more
information and examples of using formats for text files, see ["Text Files:
CSV, TSV, PSV"]({{site.baseurl}}{{site.baseurl}}/docs/text-files-csv-tsv-psv/).
+# Table Function Parameters
+
## Using the Formats Attributes as Table Function Parameters
In Drill version 1.4 and later, you can also set the formats attributes
defined above on a per query basis. To pass parameters to the format plugin,
use the table function syntax:
@@ -147,6 +149,40 @@ fieldDelimiter => ',', extractHeader => true))``
For more information about format plugin configuration see ["Text Files: CSV,
TSV, PSV"]({{site.baseurl}}{{site.baseurl}}/docs/text-files-csv-tsv-psv/).
+## Specifying the Schema as Table Function Parameter
+
+Table schemas normally reside in the root folder of each table. You can also
specify a schema for an individual query
+ using a table function and specifying the `SCHEMA` property. You can combine
the schema with format plugin properties.
+ The syntax is similar to the [CREATE OR REPLACE
SCHEMA]({{site.baseurl}}/docs/create-or-replace-schema/#syntax):
+
+```
+SELECT a, b FROM TABLE (table_name(
+SCHEMA => 'inline=(column_name data_type [nullability] [format] [default] [properties {prop='val', ...}])'))
+```
+
+You can specify the schema inline within the query. For example:
+
+```
+select * from table(dfs.tmp.`text_table`(
+schema => 'inline=(col1 date properties {`drill.format` = `yyyy-MM-dd`})
+properties {`drill.strict` = `false`}'))
+```
+
+Alternatively, you can specify the path to a schema file. For example:
+
+```
+select * from table(dfs.tmp.`text_table`(schema => 'path=`/tmp/my_schema`'))
+```
+
+The following example demonstrates applying a provided schema along with format plugin table function parameters.
+Suppose that you have a CSV file with headers and a custom extension, `csvh-test`. You can combine the schema with format plugin properties:
+
+```
+select * from table(dfs.tmp.`cars.csvh-test`(type => 'text',
+fieldDelimiter => ',', extractHeader => true,
+schema => 'inline=(col1 date)'))
+```
+
## Using Other Attributes
The configuration of other attributes, such as `size.calculator.enabled` in
the `hbase` plugin and `configProps` in the `hive` plugin, are
implementation-dependent and beyond the scope of this document.
diff --git a/_docs/connect-a-data-source/plugins/080-rdbms-storage-plugin.md
b/_docs/connect-a-data-source/plugins/080-rdbms-storage-plugin.md
index ea84e87..f3d4bad 100644
--- a/_docs/connect-a-data-source/plugins/080-rdbms-storage-plugin.md
+++ b/_docs/connect-a-data-source/plugins/080-rdbms-storage-plugin.md
@@ -1,6 +1,6 @@
---
title: "RDBMS Storage Plugin"
-date: 2018-12-08
+date: 2020-03-17
parent: "Connect a Data Source"
---
Apache Drill supports querying a number of RDBMS instances. This allows you to
connect your traditional databases to your Drill cluster so you can have a
single view of both your relational and NoSQL datasources in a single system.
@@ -9,14 +9,25 @@ As with any source, Drill supports joins within and between
all systems. Drill a
## Using the RDBMS Storage Plugin
-Drill is designed to work with any relational datastore that provides a JDBC
driver. Drill is actively tested with Postgres, MySQL, Oracle, MSSQL and Apache
Derby. For each system, you will follow three basic steps for setup:
+Drill is designed to work with any relational datastore that provides a JDBC
driver. Drill is actively tested with
+ Postgres, MySQL, Oracle, MSSQL, Apache Derby and H2. For each system, you
will follow three basic steps for setup:
1. [Install Drill]({{ site.baseurl
}}/docs/installing-drill-in-embedded-mode), if you do not already have it
installed.
- 2. Copy your database's JDBC driver into the jars/3rdparty directory.
(You'll need to do this on every node.)
+ 2. Copy your database's JDBC driver into the `jars/3rdparty` directory.
(You'll need to do this on every node.)
3. Restart Drill. See [Starting Drill in Distributed
Mode]({{site.baseurl}}/docs/starting-drill-in-distributed-mode/).
- 4. Add a new storage configuration to Drill through the Web UI. Example
configurations for [Oracle](#Example-Oracle-Configuration), [SQL
Server](#Example-SQL-Server-Configuration),
[MySQL](#Example-MySQL-Configuration) and
[Postgres](#Example-Postgres-Configuration) are provided below.
-
-**Example: Working with MySQL**
+ 4. Add a new storage configuration to Drill through the Web UI. Example
configurations for [Oracle](#example-oracle-configuration), [SQL
Server](#example-sql-server-configuration),
[MySQL](#example-mysql-configuration) and
[Postgres](#example-postgres-configuration) are provided below.
+
+## Setting data source parameters in the storage plugin configuration
+
+Starting from Drill 1.18.0, the JDBC storage plugin supports a new configuration property, `sourceParameters`, which
+ allows setting the data source parameters described in [HikariCP](https://github.com/brettwooldridge/HikariCP#configuration-knobs-baby).
+ Parameters with incorrect names, or parameter values of an incorrect data type or with illegal values, will cause
+ the storage plugin to fail to start.
+
+See the [Example of Postgres Configuration with `sourceParameters` configuration property](#example-of-postgres-configuration-with-sourceparameters-configuration-property)
+section for an example of its usage.
+
+### Example: Working with MySQL
Drill communicates with MySQL through the JDBC driver using the configuration
that you specify in the Web UI or through the [REST
API]({{site.baseurl}}/docs/plugin-configuration-basics/#storage-plugin-rest-api).
@@ -29,7 +40,7 @@ To configure the JDBC storage plugin:
1. On the Storage tab, enter a name in **New Storage Plugin**. For example,
enter `myplugin`.
Each configuration registered with Drill must have a distinct name. Names
are case-sensitive.
- {% include startnote.html %}The URL differs depending on your installation
and configuration. See the [example configurations](#Example-Configurations)
below for examples.{% include endnote.html %}
+ {% include startnote.html %}The URL differs depending on your installation
and configuration. See the example configurations below for examples.{% include
endnote.html %}
1. Click **Create**.
1. In Configuration, set the required properties using JSON formatting as
shown in the following example. Change the properties to match your
environment.
@@ -62,7 +73,7 @@ You can use the performance_schema database, which is
installed with MySQL to qu
## Example Configurations
-**Example Oracle Configuration**
+### Example Oracle Configuration
Download and install Oracle's Thin
[ojdbc7.12.1.0.2.jar](http://www.oracle.com/technetwork/database/features/jdbc/default-2280470.html)
driver and copy it to all nodes in your cluster.
@@ -70,10 +81,10 @@ Download and install Oracle's Thin
[ojdbc7.12.1.0.2.jar](http://www.oracle.com/t
type: "jdbc",
enabled: true,
driver: "oracle.jdbc.OracleDriver",
- url:"jdbc:oracle:thin:user/[email protected]:1521/ORCL"
+ url: "jdbc:oracle:thin:user/[email protected]:1521/ORCL"
}
-**Example SQL Server Configuration**
+### Example SQL Server Configuration
For SQL Server, Drill has been tested with Microsoft's
[sqljdbc41.4.2.6420.100.jar](https://www.microsoft.com/en-US/download/details.aspx?id=11774)
driver. Copy this jar file to all Drillbits.
@@ -83,12 +94,12 @@ For SQL Server, Drill has been tested with Microsoft's
[sqljdbc41.4.2.6420.100.
type: "jdbc",
enabled: true,
driver: "com.microsoft.sqlserver.jdbc.SQLServerDriver",
- url:"jdbc:sqlserver://1.2.3.4:1433;databaseName=mydatabase",
- username:"user",
- password:"password"
+ url: "jdbc:sqlserver://1.2.3.4:1433;databaseName=mydatabase",
+ username: "user",
+ password: "password"
}
-**Example MySQL Configuration**
+### Example MySQL Configuration
For MySQL, Drill has been tested with MySQL's
[mysql-connector-java-5.1.37-bin.jar](http://dev.mysql.com/downloads/connector/j/)
driver. Copy this to all nodes.
@@ -96,14 +107,15 @@ For MySQL, Drill has been tested with MySQL's
[mysql-connector-java-5.1.37-bin.j
type: "jdbc",
enabled: true,
driver: "com.mysql.jdbc.Driver",
- url:"jdbc:mysql://1.2.3.4",
- username:"user",
- password:"password"
+ url: "jdbc:mysql://1.2.3.4",
+ username: "user",
+ password: "password"
}
-**Example Postgres Configuration**
+### Example Postgres Configuration
-For Postgres, Drill has been tested with Postgres's
[9.1-901-1.jdbc4](http://central.maven.org/maven2/org/postgresql/postgresql/)
driver (any recent driver should work). Copy this driver file to all nodes.
+Drill is tested with the Postgres driver version
[42.2.11](https://mvnrepository.com/artifact/org.postgresql/postgresql) (any
recent driver should work).
+ Download and copy this driver jar to the `jars/3rdparty` folder on all nodes.
{% include startnote.html %}You'll need to provide a database name as part of
your JDBC connection string for Drill to correctly expose Postgres tables.{%
include endnote.html %}
@@ -111,9 +123,9 @@ For Postgres, Drill has been tested with Postgres's
[9.1-901-1.jdbc4](http://cen
type: "jdbc",
enabled: true,
driver: "org.postgresql.Driver",
- url:"jdbc:postgresql://1.2.3.4/mydatabase",
- username:"user",
- password:"password"
+ url: "jdbc:postgresql://1.2.3.4/mydatabase",
+ username: "user",
+ password: "password"
}
You may need to qualify a table name with a schema name for Drill to return
data. For example, when querying a table named ips, you must issue the query
against public.ips, as shown in the following example:
@@ -142,4 +154,20 @@ You may need to qualify a table name with a schema name
for Drill to return data
| 2 | 1.2.3.5 |
+-------+----------+
+### Example of Postgres Configuration with `sourceParameters` configuration
property
+ {
+ type: "jdbc",
+ enabled: true,
+ driver: "org.postgresql.Driver",
+ url: "jdbc:postgresql://1.2.3.4/mydatabase?defaultRowFetchSize=2",
+ username: "user",
+ password: "password",
+ sourceParameters: {
+ "minimumIdle": 5,
+ "autoCommit": false,
+ "connectionTestQuery": "select version() as postgresql_version",
+ "dataSource.cachePrepStmts": true,
+ "dataSource.prepStmtCacheSize": 250
+ }
+ }
diff --git
a/_docs/connect-a-data-source/plugins/114-image-metadata-format-plugin.md
b/_docs/connect-a-data-source/plugins/114-image-metadata-format-plugin.md
index 27c53e4..313116f 100644
--- a/_docs/connect-a-data-source/plugins/114-image-metadata-format-plugin.md
+++ b/_docs/connect-a-data-source/plugins/114-image-metadata-format-plugin.md
@@ -1,6 +1,6 @@
---
title: "Image Metadata Format Plugin"
-date: 2018-06-13 18:31:03 UTC
+date: 2020-03-17
parent: "Connect a Data Source"
---
@@ -39,7 +39,7 @@ To configure Drill to read image metadata, you must modify
the extensions sectio
**Note:** The result does not include file names, but you can use [implicit
columns]({{site.baseurl}}/docs/querying-a-file-system-introduction/#implicit-columns)
to get file names, full paths, fully qualified names, and file suffixes.
-##Attributes
+## Attributes
The following table lists configuration attributes:
@@ -49,54 +49,53 @@ fileSystemMetadata|true|Set to true to extract filesystem
metadata including the
descriptive|true|Set to true to extract metadata in a human-readable string
format. Set false to extract metadata in a machine-readable typed format.
timeZone|null|Specify the time zone to interpret the timestamp with no time
zone information. If the timestamp includes the time zone information, this
value is ignored. If null is set, the local time zone is used.
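+
+For example, a `formats` entry that sets these attributes might look like the following sketch (the extensions list is illustrative):
+
+    "image": {
+      "type": "image",
+      "extensions": ["jpg", "jpeg", "png"],
+      "fileSystemMetadata": true,
+      "descriptive": true,
+      "timeZone": null
+    }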
-##Examples
+## Examples
+
+To follow along with the examples, start by downloading the following image to your `/tmp` directory.
+
+![]({{ site.baseurl }}/images/7671b34d6e8a4d050f75278f10f1a08.jpg)
A Drill query on a JPEG file with the property descriptive: true
- 0: jdbc:drill:zk=local> select FileName, * from
dfs.`4349313028_f69ffa0257_o.jpg`;
-
+----------+----------+--------------+--------+------------+-------------+--------------+----------+-----------+------------+-----------+----------+----------+------------+-----------+------------+-----------------+-----------------+------+------+----------+------------+------------------+-----+---------------+-----------+------+---------+----------+
- | FileName | FileSize | FileDateTime | Format | PixelWidth |
PixelHeight | BitsPerPixel | DPIWidth | DPIHeight | Orientaion | ColorMode |
HasAlpha | Duration | VideoCodec | FrameRate | AudioCodec | AudioSampleSize |
AudioSampleRate | JPEG | JFIF | ExifIFD0 | ExifSubIFD | Interoperability | GPS
| ExifThumbnail | Photoshop | IPTC | Huffman | FileType |
-
+----------+----------+--------------+--------+------------+-------------+--------------+----------+-----------+------------+-----------+----------+----------+------------+-----------+------------+-----------------+-----------------+------+------+----------+------------+------------------+-----+---------------+-----------+------+---------+----------+
- | 4349313028_f69ffa0257_o.jpg | 257213 bytes | Fri Mar 09 12:09:34
+08:00 2018 | JPEG | 1199 | 800 | 24 | 96 | 96 | Unknown (0) | RGB | false |
00:00:00 | Unknown | 0 | Unknown | 0 | 0 |
{"CompressionType":"Baseline","DataPrecision":"8 bits","ImageHeight":"800
pixels","ImageWidth":"1199 pixels","NumberOfComponents":"3","Component1":"Y
component: Quantization table 0, Sampling factors 2 horiz/2
vert","Component2":"Cb component: Quantization table 1, Sampling factors 1
horiz/1 vert" [...]
-
+----------+----------+--------------+--------+------------+-------------+--------------+----------+-----------+------------+-----------+----------+----------+------------+-----------+------------+-----------------+-----------------+------+------+----------+------------+------------------+-----+---------------+-----------+------+---------+----------+
-
+ select FileName, * from dfs.tmp.`7671b34d6e8a4d050f75278f10f1a08.jpg`;
+
+-------------------------------------+-------------+---------------------------------+--------+------------+-------------+--------------+-------------+----------+-----------+-----------+----------+----------+------------+-----------+------------+-----------------+-----------------+----------------------------------------------------------------------------------+------------------------------------------------------------------------------+---------------------------------------
[...]
+ | FileName | FileSize |
FileDateTime | Format | PixelWidth | PixelHeight | BitsPerPixel |
Orientaion | DPIWidth | DPIHeight | ColorMode | HasAlpha | Duration |
VideoCodec | FrameRate | AudioCodec | AudioSampleSize | AudioSampleRate |
JPEG |
JpegComment |
[...]
+
+-------------------------------------+-------------+---------------------------------+--------+------------+-------------+--------------+-------------+----------+-----------+-----------+----------+----------+------------+-----------+------------+-----------------+-----------------+----------------------------------------------------------------------------------+------------------------------------------------------------------------------+---------------------------------------
[...]
+ | 7671b34d6e8a4d050f75278f10f1a08.jpg | 45877 bytes | Tue Mar 17
21:37:09 +02:00 2020 | JPEG | 604 | 453 | 24 |
Unknown (0) | 0 | 0 | RGB | false | 00:00:00 | Unknown
| 0 | Unknown | 0 | 0 |
{"CompressionType":"Baseline","DataPrecision":"8 bits","ImageHeight":"453
pixels","ImageWidth":"604 pixels","NumberOfComponents":"3","Component1":"Y
component: Quantization table 0, Sampling factors 2 [...]
+
+-------------------------------------+-------------+---------------------------------+--------+------------+-------------+--------------+-------------+----------+-----------+-----------+----------+----------+------------+-----------+------------+-----------------+-----------------+----------------------------------------------------------------------------------+------------------------------------------------------------------------------+---------------------------------------
[...]
A Drill query on a JPEG file with the property descriptive: false
- 0: jdbc:drill:zk=local> select FileName, * from
dfs.`4349313028_f69ffa0257_o.jpg`;
-
+----------+----------+--------------+--------+------------+-------------+--------------+----------+-----------+------------+-----------+----------+----------+------------+-----------+------------+-----------------+-----------------+------+------+----------+------------+------------------+-----+---------------+-----------+------+---------+----------+
- | FileName | FileSize | FileDateTime | Format | PixelWidth |
PixelHeight | BitsPerPixel | DPIWidth | DPIHeight | Orientaion | ColorMode |
HasAlpha | Duration | VideoCodec | FrameRate | AudioCodec | AudioSampleSize |
AudioSampleRate | JPEG | JFIF | ExifIFD0 | ExifSubIFD | Interoperability | GPS
| ExifThumbnail | Photoshop | IPTC | Huffman | FileType |
-
+----------+----------+--------------+--------+------------+-------------+--------------+----------+-----------+------------+-----------+----------+----------+------------+-----------+------------+-----------------+-----------------+------+------+----------+------------+------------------+-----+---------------+-----------+------+---------+----------+
- | 4349313028_f69ffa0257_o.jpg | 257213 | 2018-03-09 04:09:34.0 | JPEG |
1199 | 800 | 24 | 96.0 | 96.0 | 0 | RGB | false | 0 | Unknown | 0.0 | Unknown |
0 | 0.0 |
{"CompressionType":0,"DataPrecision":8,"ImageHeight":800,"ImageWidth":1199,"NumberOfComponents":3,"Component1":{"ComponentId":1,"HorizontalSamplingFactor":2,"VerticalSamplingFactor":2,"QuantizationTableNumber":0},"Component2":{"ComponentId":2,"HorizontalSamplingFactor":1,"VerticalSamplingFactor":1,"QuantizationTableNumber
[...]
-
+----------+----------+--------------+--------+------------+-------------+--------------+----------+-----------+------------+-----------+----------+----------+------------+-----------+------------+-----
+ select FileName, * from dfs.tmp.`7671b34d6e8a4d050f75278f10f1a08.jpg`;
+
+-------------------------------------+----------+-----------------------+--------+------------+-------------+--------------+------------+----------+-----------+-----------+----------+----------+------------+-----------+------------+-----------------+-----------------+----------------------------------------------------------------------------------+------------------------------------------------------------------------------+-----------------------------------------------------
[...]
+ | FileName | FileSize | FileDateTime
| Format | PixelWidth | PixelHeight | BitsPerPixel | Orientaion | DPIWidth |
DPIHeight | ColorMode | HasAlpha | Duration | VideoCodec | FrameRate |
AudioCodec | AudioSampleSize | AudioSampleRate |
JPEG |
JpegComment |
JFIF [...]
+
+-------------------------------------+----------+-----------------------+--------+------------+-------------+--------------+------------+----------+-----------+-----------+----------+----------+------------+-----------+------------+-----------------+-----------------+----------------------------------------------------------------------------------+------------------------------------------------------------------------------+-----------------------------------------------------
[...]
+ | 7671b34d6e8a4d050f75278f10f1a08.jpg | 45877 | 2020-03-17
19:37:09.0 | JPEG | 604 | 453 | 24 | 0 |
0.0 | 0.0 | RGB | false | 0 | Unknown | 0.0
| Unknown | 0 | 0.0 |
{"CompressionType":0,"DataPrecision":8,"ImageHeight":453,"ImageWidth":604,"NumberOfComponents":3,"Component1":{"ComponentId":1,"HorizontalSamplingFactor":2,"VerticalSamplingFactor":2,"QuantizationTableNumber":0},"C
[...]
+
+-------------------------------------+----------+-----------------------+--------+------------+-------------+--------------+------------+----------+-----------+-----------+----------+----------+------------+-----------+------------+-----------------+-----------------+----------------------------------------------------------------------------------+------------------------------------------------------------------------------+-----------------------------------------------------
[...]
Retrieving GPS location data from the Exif metadata for the use of GIS
functions.
- 0: jdbc:drill:zk=local> select t.GPS.GPSLatitude as lat,
t.GPS.GPSLongitude as lon from dfs.`4349313028_f69ffa0257_o.jpg` t;
- +--------------------+----------------------+
- | lat | lon |
- +--------------------+----------------------+
- | 47.53777313232332 | -122.03510284423795 |
- +--------------------+----------------------+
-
-Retrieving the images that are larger than 640 x 480 pixels.
-
- 0: jdbc:drill:zk=local> select FileName, PixelWidth, PixelHeight from
dfs.`/images/*.png` where PixelWidth >= 640 and PixelHeight >= 480;
- +--------------------------+-------------+--------------+
- | FileName | PixelWidth | PixelHeight |
- +--------------------------+-------------+--------------+
- | 1.png | 2788 | 1758 |
- | 1500x500.png | 1500 | 500 |
- | 2.png | 2788 | 1758 |
- | 9784873116914_1.png | 874 | 1240 |
- | Driven-Example-Load.png | 1208 | 970 |
- | features-diagram.png | 1170 | 644 |
- | hal1.png | 1223 | 772 |
- | hal2.png | 1184 | 768 |
- | image-3.png | 1200 | 771 |
- | image-4.png | 1200 | 771 |
- | image002.png | 1689 | 695 |
- +--------------------------+-------------+--------------+
+ select t.GPS.GPSLatitude as lat, t.GPS.GPSLongitude as lon from
dfs.tmp.`7671b34d6e8a4d050f75278f10f1a08.jpg` t;
+ +-------------------+--------------------+
+ | lat | lon |
+ +-------------------+--------------------+
+ | 50.46355547157135 | 30.508668422733077 |
+ +-------------------+--------------------+
+
+Download all `png` images from the [Logos]({{ site.baseurl }}/images/logos/) page and place them in the `/tmp/logos`
+ directory to follow along with the next example.
+
+The following example query retrieves the images that are smaller than 640 x 480 pixels:
+
+ select FileName, PixelWidth, PixelHeight from dfs.tmp.`logos` where
PixelWidth < 640 and PixelHeight < 480;
+ +---------------------+------------+-------------+
+ | FileName | PixelWidth | PixelHeight |
+ +---------------------+------------+-------------+
+ | redbusLogo.png | 500 | 325 |
+ | option3-io-logo.png | 194 | 76 |
+ | sanchezLogo.png | 235 | 85 |
+ | IORA_NUS.png | 375 | 253 |
+ +---------------------+------------+-------------+
## Supported File Formats
@@ -798,6 +797,3 @@ EPS.ImageWidth|INTEGER|Width of the image in pixels
EPS.ImageHeight|INTEGER|Height of the image in pixels
EPS.ColorType|INTEGER|Color type: `1` (Bitmap/grayscale), `2` (Lab), `3` (RGB) or `4` (CMYK)
EPS.RamSize|INTEGER|Ram Size to keep the image in bytes
-
-
-
diff --git
a/_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
b/_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
index e12d31b..2d4a28c 100644
--- a/_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
+++ b/_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
@@ -1,14 +1,17 @@
---
title: "Using Drill Metastore"
parent: "Drill Metastore"
-date: 2020-03-03
+date: 2020-03-17
---
Drill 1.17 introduces the Drill Metastore which stores the table schema and
table statistics. Statistics allow Drill to better create optimal query plans.
-The Metastore is a Beta feature; it is subject to change. We encourage you to
try it and provide feedback.
-Because the Metastore is in Beta, the SQL commands and Metastore formats may
change in the next release.
-{% include startnote.html %}In Drill 1.17, this feature is supported for
Parquet tables only and is disabled by default.{% include endnote.html %}
+The Metastore is a beta feature and is subject to change.
+In particular, the SQL commands and Metastore format may change in response to user experience and feedback.
+{% include startnote.html %}
+In Drill 1.17, the Metastore supports only tables in Parquet format. The feature is disabled by default.
+In Drill 1.18, the Metastore supports all format plugins of the file system storage plugin (except MaprDB). The feature is still disabled by default.
+{% include endnote.html %}
## Drill Metastore introduction
@@ -103,13 +106,13 @@ Schema information and summary statistics also computed
and stored for table seg
The detailed metadata schema is described
[here](https://github.com/apache/drill/tree/master/metastore/metastore-api#metastore-tables).
You can try out the metadata to get a sense of what is available, by using the
- [Inspect the Metastore using `INFORMATION_SCHEMA`
tables]({{site.baseurl}}/docs/using-drill-metastore/#inspect-the-metastore-using-information_schema-tables)
tutorial.
+ [Inspect the Metastore using `INFORMATION_SCHEMA`
tables](#inspect-the-metastore-using-information_schema-tables) tutorial.
Every table described by the Metastore may be a bare file or one or more files
that reside in one or more directories.
If a table consists of a single directory or file, then it is non-partitioned.
The single directory can contain any number of files.
Larger tables tend to have subdirectories. Each subdirectory is a partition, and such a table is called "partitioned".
-Please refer to [Exposing Drill Metastore metadata through
`INFORMATION_SCHEMA`
tables]({{site.baseurl}}/docs/using-drill-metastore/#exposing-drill-metastore-metadata-through-information_schema-tables)
+Please refer to [Exposing Drill Metastore metadata through `INFORMATION_SCHEMA` tables](#exposing-drill-metastore-metadata-through-information_schema-tables)
+for information on how to query partition and segment metadata.
A traditional database divides tables into schemas and tables.
@@ -117,6 +120,45 @@ Drill can connect to any number of data sources, each of
which may have its own
As a result, the Metastore labels tables with a combination of (plugin
configuration name, workspace name, table name).
Note that before renaming any of these items, you must delete the table's metadata entry and recreate it after renaming.
+### Using schema provisioning feature with Drill Metastore
+
+The Drill Metastore holds both schema and statistics information for a table.
The `ANALYZE` command can infer the table
+ schema for well-defined tables (such as many Parquet tables). Some tables are
too complex or variable for Drill's
+ schema inference to work well. For example, JSON tables often omit fields or
have long runs of nulls so that Drill
+ cannot determine column types. In these cases, you can specify the correct
schema based on your knowledge of the
+ table's structure. You specify a schema in the `ANALYZE` command using the
+ [Schema
provisioning]({{site.baseurl}}/docs/plugin-configuration-basics/#specifying-the-schema-as-table-function-parameter)
syntax.
+
+Please refer to [Provisioning schema for Drill
Metastore](#provisioning-schema-for-drill-metastore) for examples of usage.
+
+### Schema priority
+
+Drill uses metadata during both query planning and execution. Drill gives you
multiple ways to provide a schema.
+
+When you run the `ANALYZE TABLE` command, Drill determines the table schema to be stored in the Metastore using the following rules, in priority order:
+
+* A schema provided in the table function.
+* A schema file, created with `CREATE OR REPLACE SCHEMA`, in the table root
directory.
+* Schema inferred from file data.
+
+To plan a query, Drill requires information about your file partitions (if
any) and about row and column cardinality.
+Drill does not use the provided schema for planning because the schema does not include this metadata. Instead, at plan time Drill
+obtains metadata from one of the following, again in priority order:
+
+* The Drill Metastore, if available.
+* Inferred from file data. Drill scans the table's directory structure to
identify partitions.
+ Drill estimates row counts based on the file size. Drill uses default
estimates for column cardinality.
+
+At query execution time, a schema tells Drill the shape of your data and how
that data should be converted to Drill's SQL types.
+Your choices for execution-time schema, in priority order, are:
+
+* With a table function:
+ - specify an inline schema
+ - specify the path to the schema file.
+* With a schema file, created with `CREATE OR REPLACE SCHEMA`, in the table
root directory.
+* Using the schema from the Drill Metastore, if available.
+* Infer the schema directly from file data.
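+
+For example, this sketch (the table and column names are illustrative) supplies the highest-priority execution-time schema inline through a table function:
+
+```
+SELECT * FROM table(dfs.tmp.`text_table`(schema => 'inline=(col1 DATE)'));
+```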
+
### Related Session/System Options
The Metastore provides a number of options to fit your environment. The
default options are fine in most cases.
@@ -179,7 +221,7 @@ Incremental analysis will compute metadata only for files
and partitions changed
The command will return the following message if table statistics are
up-to-date:
```
-apache drill (dfs.tmp)> analyze table lineitem refresh metadata;
+ANALYZE TABLE `lineitem` REFRESH METADATA;
+-------+---------------------------------------------------------+
| ok | summary |
+-------+---------------------------------------------------------+
@@ -250,6 +292,9 @@ A table can be divided into directories, called
"partitions". The `PARTITIONS` t
- Applies to tables stored as Parquet files and only when stored in the `DFS`
storage plugin.
- Disabled by default. You must enable this feature through the
`metastore.enabled` system/session option.
+### Limitations of the 1.18 release
+ - Applies to all file system storage plugin formats except for MaprDB.
+
### Cheat sheet of `ANALYZE TABLE` commands
- Add a new table with `ANALYZE TABLE dfs.tmp.lineitem REFRESH METADATA`
command.
@@ -315,7 +360,7 @@ Run the [ANALYZE
TABLE]({{site.baseurl}}/docs/analyze-table-refresh-metadata) co
be computed and stored into the Drill Metastore:
```
-apache drill> ANALYZE TABLE dfs.tmp.lineitem REFRESH METADATA;
+ANALYZE TABLE dfs.tmp.lineitem REFRESH METADATA;
+------+-------------------------------------------------------------+
| ok | summary |
+------+-------------------------------------------------------------+
@@ -339,7 +384,7 @@ Now that we've collected table metadata, we can use it when
we query the table,
Rerun [ANALYZE TABLE]({{site.baseurl}}/docs/analyze-table-refresh-metadata)
command on the `lineitem` table:
```
-apache drill> ANALYZE TABLE dfs.tmp.lineitem REFRESH METADATA;
+ANALYZE TABLE dfs.tmp.lineitem REFRESH METADATA;
+-------+---------------------------------------------------------+
| ok | summary |
+-------+---------------------------------------------------------+
@@ -353,7 +398,7 @@ apache drill> ANALYZE TABLE dfs.tmp.lineitem REFRESH
METADATA;
Run the following query to inspect `lineitem` table metadata from `TABLES`
table stored in the Metastore:
```
-apache drill> SELECT * FROM INFORMATION_SCHEMA.`TABLES` WHERE
TABLE_NAME='lineitem';
+SELECT * FROM INFORMATION_SCHEMA.`TABLES` WHERE TABLE_NAME='lineitem';
+---------------+--------------+------------+------------+--------------+---------------+----------+-----------------------+
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | TABLE_TYPE | TABLE_SOURCE |
LOCATION | NUM_ROWS | LAST_MODIFIED_TIME |
+---------------+--------------+------------+------------+--------------+---------------+----------+-----------------------+
@@ -365,7 +410,7 @@ apache drill> SELECT * FROM INFORMATION_SCHEMA.`TABLES`
WHERE TABLE_NAME='lineit
To obtain columns with their types and descriptions within the `lineitem`
table, run the following query:
```
-apache drill> SELECT * FROM INFORMATION_SCHEMA.`COLUMNS` WHERE
TABLE_NAME='lineitem';
+SELECT * FROM INFORMATION_SCHEMA.`COLUMNS` WHERE TABLE_NAME='lineitem';
+---------------+--------------+------------+-----------------+------------------+----------------+-------------+-------------------+--------------------------+------------------------+-------------------+-------------------------+---------------+--------------------+---------------+--------------------+-------------+---------------+-----------+--------------+---------------------------------------------+-----------+-------------------+-----------+
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME |
ORDINAL_POSITION | COLUMN_DEFAULT | IS_NULLABLE | DATA_TYPE |
CHARACTER_MAXIMUM_LENGTH | CHARACTER_OCTET_LENGTH | NUMERIC_PRECISION |
NUMERIC_PRECISION_RADIX | NUMERIC_SCALE | DATETIME_PRECISION | INTERVAL_TYPE |
INTERVAL_PRECISION | COLUMN_SIZE | COLUMN_FORMAT | NUM_NULLS | MIN_VAL |
MAX_VAL | NDV | EST_NUM_NON_NULLS |
IS_NESTED |
+---------------+--------------+------------+-----------------+------------------+----------------+-------------+-------------------+--------------------------+------------------------+-------------------+-------------------------+---------------+--------------------+---------------+--------------------+-------------+---------------+-----------+--------------+---------------------------------------------+-----------+-------------------+-----------+
@@ -381,7 +426,7 @@ apache drill> SELECT * FROM INFORMATION_SCHEMA.`COLUMNS`
WHERE TABLE_NAME='linei
The sample `lineitem` table has two partitions. The `PARTITIONS` table
contains an entry for each directory:
```
-apache drill (information_schema)> SELECT * FROM
INFORMATION_SCHEMA.`PARTITIONS` WHERE TABLE_NAME='lineitem';
+SELECT * FROM INFORMATION_SCHEMA.`PARTITIONS` WHERE TABLE_NAME='lineitem';
+---------------+--------------+------------+--------------+---------------+---------------------+------------------+-----------------+------------------+-----------------------+
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | METADATA_KEY | METADATA_TYPE |
METADATA_IDENTIFIER | PARTITION_COLUMN | PARTITION_VALUE | LOCATION |
LAST_MODIFIED_TIME |
+---------------+--------------+------------+--------------+---------------+---------------------+------------------+-----------------+------------------+-----------------------+
@@ -398,7 +443,7 @@ Once we are done exploring metadata we can drop the
metadata for the `lineitem`
Table metadata may be dropped using `ANALYZE TABLE DROP METADATA` command:
```
-apache drill> ANALYZE TABLE dfs.tmp.lineitem DROP METADATA;
+ANALYZE TABLE dfs.tmp.lineitem DROP METADATA;
+------+----------------------------------------+
| ok | summary |
+------+----------------------------------------+
@@ -417,7 +462,7 @@ Next let's gather metadata for a subset of the columns in
the `lineitem` table.
For the case when metadata for several columns should be computed and stored
into the Metastore, the following command may be used:
```
-apache drill (information_schema)> ANALYZE TABLE dfs.tmp.lineitem
COLUMNS(l_orderkey, l_partkey) REFRESH METADATA;
+ANALYZE TABLE dfs.tmp.lineitem COLUMNS(l_orderkey, l_partkey) REFRESH METADATA;
+------+-------------------------------------------------------------+
| ok | summary |
+------+-------------------------------------------------------------+
@@ -430,7 +475,7 @@ Now, check, that metadata is collected only for specified
columns (`MIN_VAL`, `M
columns are present:
```
-apache drill (information_schema)> SELECT * FROM INFORMATION_SCHEMA.`COLUMNS`
WHERE TABLE_NAME='lineitem';
+SELECT * FROM INFORMATION_SCHEMA.`COLUMNS` WHERE TABLE_NAME='lineitem';
+---------------+--------------+------------+-----------------+------------------+----------------+-------------+-------------------+--------------------------+------------------------+-------------------+-------------------------+---------------+--------------------+---------------+--------------------+-------------+---------------+-----------+---------+---------+-----------+-------------------+-----------+
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME |
ORDINAL_POSITION | COLUMN_DEFAULT | IS_NULLABLE | DATA_TYPE |
CHARACTER_MAXIMUM_LENGTH | CHARACTER_OCTET_LENGTH | NUMERIC_PRECISION |
NUMERIC_PRECISION_RADIX | NUMERIC_SCALE | DATETIME_PRECISION | INTERVAL_TYPE |
INTERVAL_PRECISION | COLUMN_SIZE | COLUMN_FORMAT | NUM_NULLS | MIN_VAL |
MAX_VAL | NDV | EST_NUM_NON_NULLS | IS_NESTED |
+---------------+--------------+------------+-----------------+------------------+----------------+-------------+-------------------+--------------------------+------------------------+-------------------+-------------------------+---------------+--------------------+---------------+--------------------+-------------+---------------+-----------+---------+---------+-----------+-------------------+-----------+
@@ -442,3 +487,135 @@ apache drill (information_schema)> SELECT * FROM
INFORMATION_SCHEMA.`COLUMNS` WH
+---------------+--------------+------------+-----------------+------------------+----------------+-------------+-------------------+--------------------------+------------------------+-------------------+-------------------------+---------------+--------------------+---------------+--------------------+-------------+---------------+-----------+---------+---------+-----------+-------------------+-----------+
17 rows selected (0.183 seconds)
```
+
+### Provisioning schema for Drill Metastore
+
+#### Directory and File Setup
+
+Ensure you have configured the file system storage plugin as described here:
+ [Connecting Drill to a File
System]({{site.baseurl}}/docs/file-system-storage-plugin/#connecting-drill-to-a-file-system).
+
+Set `store.format` to `csvh`:
+
+```
+SET `store.format`='csvh';
++------+-----------------------+
+| ok | summary |
++------+-----------------------+
+| true | store.format updated. |
++------+-----------------------+
+```
+
+Create a text table based on the sample `/tpch/nation.parquet` table from the `cp` plugin:
+
+```
+CREATE TABLE dfs.tmp.text_nation AS
+ (SELECT *
+ FROM cp.`/tpch/nation.parquet`);
++----------+---------------------------+
+| Fragment | Number of records written |
++----------+---------------------------+
+| 0_0 | 25 |
++----------+---------------------------+
+```
+
+Query the table `text_nation`:
+
+```
+SELECT typeof(n_nationkey),
+ typeof(n_name),
+ typeof(n_regionkey),
+ typeof(n_comment)
+FROM dfs.tmp.text_nation
+LIMIT 1;
++---------+---------+---------+---------+
+| EXPR$0 | EXPR$1 | EXPR$2 | EXPR$3 |
++---------+---------+---------+---------+
+| VARCHAR | VARCHAR | VARCHAR | VARCHAR |
++---------+---------+---------+---------+
+```
+
+Notice that the query plan contains a group scan with `usedMetastore = false`:
+
+```
+00-00 Screen : rowType = RecordType(ANY EXPR$0, ANY EXPR$1, ANY EXPR$2, ANY
EXPR$3): rowcount = 1.0, cumulative cost = {25.1 rows, 109.1 cpu, 2247.0 io,
0.0 network, 0.0 memory}, id = 160
+00-01 Project(EXPR$0=[TYPEOF($0)], EXPR$1=[TYPEOF($1)],
EXPR$2=[TYPEOF($2)], EXPR$3=[TYPEOF($3)]) : rowType = RecordType(ANY EXPR$0,
ANY EXPR$1, ANY EXPR$2, ANY EXPR$3): rowcount = 1.0, cumulative cost = {25.0
rows, 109.0 cpu, 2247.0 io, 0.0 network, 0.0 memory}, id = 159
+00-02 SelectionVectorRemover : rowType = RecordType(ANY n_nationkey,
ANY n_name, ANY n_regionkey, ANY n_comment): rowcount = 1.0, cumulative cost =
{24.0 rows, 93.0 cpu, 2247.0 io, 0.0 network, 0.0 memory}, id = 158
+00-03 Limit(fetch=[1]) : rowType = RecordType(ANY n_nationkey, ANY
n_name, ANY n_regionkey, ANY n_comment): rowcount = 1.0, cumulative cost =
{23.0 rows, 92.0 cpu, 2247.0 io, 0.0 network, 0.0 memory}, id = 157
+00-04 Scan(table=[[dfs, tmp, text_nation]],
groupscan=[EasyGroupScan [... schema=null, usedMetastore=false...
+```
+
+#### Compute table metadata and store in the Drill Metastore
+
+Enable Drill Metastore:
+
+```
+SET `metastore.enabled` = true;
+```
+
+Specify the table schema when running the `ANALYZE` query:
+
+```
+ANALYZE TABLE table(dfs.tmp.`text_nation` (type=>'text', fieldDelimiter=>',',
extractHeader=>true,
+ schema=>'inline=(
+ `n_nationkey` INT not null,
+ `n_name` VARCHAR not null,
+ `n_regionkey` INT not null,
+ `n_comment` VARCHAR not null)'
+ )) REFRESH METADATA;
++------+----------------------------------------------------------------+
+| ok | summary |
++------+----------------------------------------------------------------+
+| true | Collected / refreshed metadata for table [dfs.tmp.text_nation] |
++------+----------------------------------------------------------------+
+```
+
+#### Inspect the Metastore using INFORMATION_SCHEMA tables
+
+Run the following query to inspect the `text_nation` table schema stored in the Metastore:
+
+```
+SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.`COLUMNS` WHERE
TABLE_NAME='text_nation';
++-------------+-------------------+
+| COLUMN_NAME | DATA_TYPE |
++-------------+-------------------+
+| n_nationkey | INTEGER |
+| n_name | CHARACTER VARYING |
+| n_regionkey | INTEGER |
+| n_comment | CHARACTER VARYING |
++-------------+-------------------+
+```
+
+Ensure that this schema is applied to the table:
+
+```
+SELECT typeof(n_nationkey),
+ typeof(n_name),
+ typeof(n_regionkey),
+ typeof(n_comment)
+FROM dfs.tmp.text_nation
+LIMIT 1;
++--------+---------+--------+---------+
+| EXPR$0 | EXPR$1 | EXPR$2 | EXPR$3 |
++--------+---------+--------+---------+
+| INT | VARCHAR | INT | VARCHAR |
++--------+---------+--------+---------+
+```
+
+With the provided schema, Drill can also compute numeric aggregates on the converted columns:
+
+```
+select sum(n_nationkey) from dfs.tmp.`text_nation`;
++--------+
+| EXPR$0 |
++--------+
+| 300 |
++--------+
+```
+
+The query plan now contains the schema from the Metastore and a group scan with `usedMetastore = true`:
+
+```
+00-00 Screen : rowType = RecordType(ANY EXPR$0): rowcount = 1.0, cumulative
cost = {45.1 rows, 287.1 cpu, 2247.0 io, 0.0 network, 0.0 memory}, id = 3129
+00-01 Project(EXPR$0=[$0]) : rowType = RecordType(ANY EXPR$0): rowcount =
1.0, cumulative cost = {45.0 rows, 287.0 cpu, 2247.0 io, 0.0 network, 0.0
memory}, id = 3128
+00-02 StreamAgg(group=[{}], EXPR$0=[SUM($0)]) : rowType =
RecordType(ANY EXPR$0): rowcount = 1.0, cumulative cost = {44.0 rows, 286.0
cpu, 2247.0 io, 0.0 network, 0.0 memory}, id = 3127
+00-03 Scan(table=[[dfs, tmp, text_nation]], groupscan=[EasyGroupScan
... schema=..., usedMetastore=true]]) ...
+```
diff --git
a/_docs/query-data/query-a-file-system/005-querying-a-file-system-introduction.md
b/_docs/query-data/query-a-file-system/005-querying-a-file-system-introduction.md
index 436cc75..641127c 100644
---
a/_docs/query-data/query-a-file-system/005-querying-a-file-system-introduction.md
+++
b/_docs/query-data/query-a-file-system/005-querying-a-file-system-introduction.md
@@ -27,9 +27,9 @@ Drill supports the following file types:
* Tab-separated values (TSV, type: text)
* Pipe-separated values (PSV, type: text)
* Structured data files:
- * Avro (type: avro) (This file type is experimental. See [Querying Avro
Files]({{site.baseurl}}/docs/querying-avro-files/).)
- * JSON (type: json)
- * Parquet (type: parquet)
+ * [Avro]({{site.baseurl}}/docs/querying-avro-files/) (type: avro)
+ * [JSON]({{site.baseurl}}/docs/querying-json-files/) (type: json)
+ * [Parquet]({{site.baseurl}}/docs/querying-parquet-files/) (type: parquet)
The extensions for these file types must match the configuration settings for
your registered storage plugins. For example, PSV files may be defined with a
diff --git a/_docs/query-data/query-a-file-system/009-querying-avro-files.md
b/_docs/query-data/query-a-file-system/009-querying-avro-files.md
index 394d82e..0529346 100644
--- a/_docs/query-data/query-a-file-system/009-querying-avro-files.md
+++ b/_docs/query-data/query-a-file-system/009-querying-avro-files.md
@@ -3,5 +3,29 @@ title: "Querying Avro Files"
date: 2019-04-16
parent: "Querying a File System"
---
-
-The Avro format is experimental at this time. There are known issues when
querying Avro files.
+
+Drill supports files in the [Avro](https://avro.apache.org/) format.
+Starting from Drill 1.18, the Avro format supports the [Schema
provisioning]({{site.baseurl}}/docs/create-or-replace-schema/#usage-notes)
feature.
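+
+For example, a sketch (the table and schema file paths are illustrative) that applies a provisioned schema to an Avro table through a table function:
+
+```
+select * from table(dfs.tmp.`my_table.avro`(schema => 'path=`/tmp/my_schema`'))
+```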
+
+#### Preparing example data
+
+To follow along with this example, download the [sample data file](https://github.com/apache/drill/blob/master/exec/java-exec/src/test/resources/avro/map_string_to_long.avro)
+ to your `/tmp` directory.
+
+#### Selecting data from Avro files
+
+We can query all data from the `map_string_to_long.avro` file:
+
+```
+select * from dfs.tmp.`map_string_to_long.avro`
+```
+
+The query returns the following results:
+
+```
++-----------------+
+| mapf |
++-----------------+
+| {"ki":1,"ka":2} |
++-----------------+
+```
\ No newline at end of file
diff --git
a/_docs/sql-reference/sql-commands/007-analyze-table-refresh-metadata.md
b/_docs/sql-reference/sql-commands/007-analyze-table-refresh-metadata.md
index 8f6b087..8e285ef 100644
--- a/_docs/sql-reference/sql-commands/007-analyze-table-refresh-metadata.md
+++ b/_docs/sql-reference/sql-commands/007-analyze-table-refresh-metadata.md
@@ -1,7 +1,7 @@
---
title: "ANALYZE TABLE REFRESH METADATA"
parent: "SQL Commands"
-date: 2020-03-03
+date: 2020-03-17
---
Starting from Drill 1.17, you can store table metadata (including schema and
computed statistics) into Drill Metastore.
@@ -34,10 +34,17 @@ The name of the table or directory for which Drill will
collect table metadata.
Table function parameters. This syntax is only available since Drill 1.18.
Example of table function parameters usage:
- table(dfs.`table_name` (type => 'parquet', autoCorrectCorruptDates =>
true))
-
-For detailed information, please refer to
- [Using the Formats Attributes as Table Function
Parameters]({{site.baseurl}}/docs/plugin-configuration-basics/#using-the-formats-attributes-as-table-function-parameters)
+ table(dfs.tmp.`text_nation` (type=>'text', fieldDelimiter=>',',
extractHeader=>true,
+ schema=>'inline=(
+ `n_nationkey` INT not null,
+ `n_name` VARCHAR not null,
+ `n_regionkey` INT not null,
+ `n_comment` VARCHAR not null)'
+ ))
+
+Please refer to
+ [Specifying the Schema as Table Function Parameter]({{site.baseurl}}/docs/plugin-configuration-basics/#specifying-the-schema-as-table-function-parameter)
+ for details.
*COLUMNS (col1, col2, ...)*
Optional names of the column(s) for which Drill will compute and store
statistics. The stored schema will include all
diff --git a/_docs/sql-reference/sql-commands/021-create-schema.md
b/_docs/sql-reference/sql-commands/021-create-schema.md
index 260eb38..8c6fcb2 100644
--- a/_docs/sql-reference/sql-commands/021-create-schema.md
+++ b/_docs/sql-reference/sql-commands/021-create-schema.md
@@ -4,14 +4,24 @@ date: 2019-05-31
parent: "SQL Commands"
---
-Starting in Drill 1.16, you can define a schema for text files using the
CREATE OR REPLACE SCHEMA command. Schema is only available for tables
represented by a directory. To use this feature with a single file, put the
file inside a directory, and use the directory name to query the table.
+Starting in Drill 1.16, you can define a schema for text files. Drill places a schema file in the root directory of your text table, so the schema feature works only for tables represented by a directory. If you have a single-file table, create a directory to hold that file and the schema file.
-In Drill 1.16, this feature is in preview status and disabled by default. You
can enable this feature by setting the `exec.storage.enable_v3_text_reader` and
`store.table.use_schema_file` system/session options to true. The feature is
currently only available for text (CSV) files.
+In Drill 1.17, the provided schema feature is disabled by default. Enable it
by setting the `store.table.use_schema_file` system/session option to true:
+
+```
+ALTER SESSION SET `store.table.use_schema_file` = true
+```
+
+Next, create the schema using the `CREATE OR REPLACE SCHEMA` command as described in the [Syntax](#syntax) section.
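+
+For example, a minimal sketch (the column names are illustrative):
+
+```
+CREATE OR REPLACE SCHEMA (`id` INT, `name` VARCHAR) FOR TABLE dfs.tmp.`text_table`;
+```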
Running this command generates a hidden `.drill.schema` file in the table’s
root directory. The `.drill.schema` file stores the schema definition in JSON
format. Alternatively, you can create the schema file manually. If created
manually, the file content must comply with the structure recognized by Drill.
The end of this topic provides
[examples]({{site.baseurl}}/docs/create-or-replace-schema/#examples) that show
how the feature is used. You may want to review this section before reading the
reference material.
+As described in [Specifying the Schema as Table Function Parameter]({{site.baseurl}}/docs/plugin-configuration-basics/#specifying-the-schema-as-table-function-parameter),
+ you can also use a table function to apply a schema to an individual query. Or, you can place the
+ table function within a view and query the table through the view.
+
Please post your experience and suggestions to the
"[user]([email protected])" mailing list.
@@ -69,6 +79,7 @@ In Drill 1.16, you must enable the following options for
Drill to use the schema
**exec.storage.enable_v3_text_reader**
Enables the preview "version 3" of the text (CSV) file reader. The V3 text
reader is the only reader in Drill 1.16 that supports file schemas.
+In Drill 1.17, this option is enabled by default.
**store.table.use_schema_file**
Enables the use of the schema file mechanism.
@@ -209,13 +220,23 @@ Values are trimmed when converting to any type, except
for varchar.
## Usage Notes
-### General Information
-- Schema provisioning only works with tables defined as directories because
Drill must have a place to store the schema file. The directory can contain one
or more files.
+### General Information
+- Schema provisioning works only with the file system (dfs-based) storage
plugins. It works by placing a file `.drill.schema` in the root folder of
tables defined as a directory. The directory can contain any number of files
(even just one) in addition to the schema file.
- Text files must have headers. The default extension for delimited text files
with headers is `.csvh`. Note that the column names that appear in the headers
match column definitions in the schema.
- You do not have to enumerate all columns in a file when creating a schema.
You can indicate the columns of interest only.
- Columns in the defined schema do not have to be in the same order as in the
data file.
- Column names must match. The case can differ, for example “name” and “NAME”
are acceptable.
-- Queries on columns with data types that cannot be converted fail with a
`DATA_READ_ERROR`.
+
+Drill is unique in that it infers table schema at runtime. However, sometimes
schema inference can fail when Drill
+ cannot infer the correct types. For example, Drill treats all fields in a
text file as text. Drill may not be able
+ to determine the type of fields in JSON files if the fields are missing or
set to `null` in the first few records
+ in the file. Drill issues a `DATA_READ_ERROR` when runtime schema inference
fails.
+
+When Drill cannot correctly infer the schema, you can instead use your
knowledge of the file layout to tell Drill
+ the proper schema to use. Schema provisioning is the feature you use to
specify the schema.
+ You can provide a schema for the file as a whole using the [`CREATE OR
REPLACE SCHEMA` command](#syntax) or for
+ a single query using a [table
function]({{site.baseurl}}/docs/plugin-configuration-basics/#table-function-parameters).
+ Please see [Specifying the Schema as Table Function
Parameter]({{site.baseurl}}/docs/plugin-configuration-basics/#specifying-the-schema-as-table-function-parameter)
for details.
### Schema Mode (Column Order)
The schema mode determines the set of columns returned for wildcard (*)
queries and the ordering of those columns. The mode is set through the
`drill.strict` property. You can set this property to true (strict) or false
(not strict). If you do not indicate the mode, the default is false (not
strict).
@@ -446,7 +467,10 @@ Note that date, time type conversion uses the Joda time
library, thus the format
## Limitations
-This feature is currently in the alpha phase (preview, experimental) for Drill
1.16 and only applies to text (CSV) files in this release. You must enable this
feature through the `exec.storage.enable_v3_text_reader` and
`store.table.use_schema_file` system/session options.
+Schema provisioning works with selected readers. If you develop a format plugin, you must use the
+ Enhanced Vector Framework (rather than the "classic" techniques) to enable schema support.
+
+To use schema provisioning, you must first enable it with the
`store.table.use_schema_file` option.
## Examples
Examples throughout this topic use the files and directories described in the
following section, Directory and File Setup.
@@ -569,7 +593,7 @@ Running EXPLAIN PLAN, you can see that type conversion was
done while reading da
00-02 Scan(table=[[dfs, tmp, text_table]],
groupscan=[EasyGroupScan [selectionRoot=file:/tmp/text_table, numFiles=2,
columns=[`**`], files=[file:/tmp/text_table/1.csvh,
file:/tmp/text_table/2.csvh], schema=[TupleSchema [PrimitiveColumnMetadata
[`id` (INT(0, 0):OPTIONAL)]]]]])
### Describing Schema for a Table
-After you create schema, you can examine the schema using the DESCRIBE SCHEMA
FOR TABLE command. Schema can print to JSON or STATEMENT format. JSON format is
the default if no format is indicated in the query. Schema displayed in JSON
format is the same as the JSON format in the `.drill.schema` file.
+You can verify the provided schema using the [`DESCRIBE SCHEMA FOR TABLE`
command](#related-commands). This command can format the schema in two formats.
The `JSON` format is the same as the contents of the `.drill.schema` file
stored in your table directory.
describe schema for table dfs.tmp.`text_table` as JSON;
+----------------------------------------------------------------------------------+
@@ -590,7 +614,7 @@ After you create schema, you can examine the schema using
the DESCRIBE SCHEMA FO
} |
+----------------------------------------------------------------------------------+
-STATEMENT format displays the schema in a form compatible with the CREATE OR
REPLACE SCHEMA command such that it can easily be copied, modified, and
executed.
+You can also use the `STATEMENT` format to recover the SQL statement to
recreate the schema. You can easily copy, reuse or edit this statement to
change the schema or reuse the statement for other files.
describe schema for table dfs.tmp.`text_table` as statement;
+--------------------------------------------------------------------------+
@@ -604,8 +628,35 @@ STATEMENT format displays the schema in a form compatible
with the CREATE OR REP
|
+--------------------------------------------------------------------------+
+### Altering Schema for a Table
+Use the `ALTER SCHEMA` command to update your table schema. The command can add or replace columns,
+or update properties for the table or individual columns. Syntax:
+
+ ALTER SCHEMA
+ (FOR TABLE dfs.tmp.nation | PATH '/tmp/schema.json')
+ ADD [OR REPLACE]
+ [COLUMNS (col1 int, col2 varchar)]
+ [PROPERTIES ('prop1'='val1', 'prop2'='val2')]
+
+`ALTER SCHEMA` modifies an existing schema file; it will fail if the schema
file does not exist.
+(Use `CREATE SCHEMA` to create a new schema file.)
+
+To prevent accidental changes, the `ALTER SCHEMA ... ADD` command will fail if
the requested column or property
+ already exists. Use the `OR REPLACE` clause to modify an existing column or
property.
+
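+A minimal example (the column and property names are illustrative):
+
+    ALTER SCHEMA FOR TABLE dfs.tmp.`text_table`
+    ADD OR REPLACE
+    COLUMNS (`comment` varchar)
+    PROPERTIES ('drill.strict'='false')
+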
+You can remove columns or properties with the `ALTER SCHEMA ... REMOVE` command:
+
+ ALTER SCHEMA
+ (FOR TABLE dfs.tmp.nation | PATH '/tmp/schema.json')
+ REMOVE
+ [COLUMNS (col1 int, col2 varchar)]
+ [PROPERTIES ('prop1'='val1', 'prop2'='val2')]
+
+The command fails if the schema file does not exist. It silently ignores a request to remove a column or
+ property that does not exist.
+
### Dropping Schema for a Table
-You can easily drop the schema for a table using the DROP SCHEMA [IF EXISTS]
FOR TABLE \`table_name` command, as shown:
+You can easily drop the schema for a table using the ``DROP SCHEMA [IF EXISTS]
FOR TABLE `table_name` `` command, as shown:
use dfs.tmp;
+------+-------------------------------------+
diff --git a/images/7671b34d6e8a4d050f75278f10f1a08.jpg
b/images/7671b34d6e8a4d050f75278f10f1a08.jpg
new file mode 100644
index 0000000..ea16a44
Binary files /dev/null and b/images/7671b34d6e8a4d050f75278f10f1a08.jpg differ