[drill] branch gh-pages updated: edit refresh and schema docs

bridgetb Mon, 29 Apr 2019 13:49:35 -0700

This is an automated email from the ASF dual-hosted git repository.

bridgetb pushed a commit to branch gh-pages
in repository https://gitbox.apache.org/repos/asf/drill.git



The following commit(s) were added to refs/heads/gh-pages by this push:
     new cb22336  edit refresh and schema docs
cb22336 is described below

commit cb22336129c6edf72f60747b2950da7d91f90d3d
Author: Bridget Bevens <[email protected]>
AuthorDate: Mon Apr 29 13:48:17 2019 -0700

    edit refresh and schema docs
---
 .../sql-commands/011-refresh-table-metadata.md     | 321 +++++++++++----------
 .../sql-commands/021-create-schema.md              |   8 +-
 2 files changed, 170 insertions(+), 159 deletions(-)

diff --git a/_docs/sql-reference/sql-commands/011-refresh-table-metadata.md 
b/_docs/sql-reference/sql-commands/011-refresh-table-metadata.md
index d678619..2af0d0b 100644
--- a/_docs/sql-reference/sql-commands/011-refresh-table-metadata.md
+++ b/_docs/sql-reference/sql-commands/011-refresh-table-metadata.md
@@ -1,6 +1,6 @@
 ---
 title: "REFRESH TABLE METADATA"
-date: 2019-04-23
+date: 2019-04-29
 parent: "SQL Commands"
 ---
 Run the REFRESH TABLE METADATA command on Parquet tables and directories to 
generate a metadata cache file. REFRESH TABLE METADATA collects metadata from 
the footers of Parquet files and writes the metadata to a metadata file 
(`.drill.parquet_file_metadata.v4`) and a summary file 
(`.drill.parquet_summary_metadata.v4`). The planner uses the metadata cache 
file to prune extraneous data during the query planning phase. Run the REFRESH 
TABLE METADATA command if planning time is a significant [...]
@@ -34,7 +34,7 @@ Run the [EXPLAIN]({{site.baseurl}}/docs/explain/) command to 
determine the query
 ## Usage Notes  
 
 ### Metadata Storage  
-- Drill traverses directories for Parquet files and gathers the metadata from 
the footer of the files. Drill stores the collected metadata in a metadata 
cache file, `.drill.parquet_metadata`, at each directory level.  
+- Drill traverses directories for Parquet files and gathers the metadata from 
the footer of the files. Drill stores the collected metadata in a metadata 
cache file, `.drill.parquet_file_metadata.v4`, a summary file, 
`.drill.parquet_summary_metadata.v4`, and a directories file, 
`.drill.parquet_metadata_directories`, at each directory level.  
 - The metadata cache file stores metadata for files in that directory, as well 
as the metadata for the files in the subdirectories.  
 - For each row group in a Parquet file, the metadata cache file stores the 
column names in the row group and the column statistics, such as the min/max 
values and null count.  
 - If the Parquet data is updated, for example data is added to a file, Drill 
automatically  refreshes the Parquet metadata when you issue the next query 
against the Parquet data.  
@@ -70,22 +70,31 @@ Sets the number of row groups that a table can have. You 
can increase the thresh
 ## Limitations
 Currently, Drill does not support runtime rowgroup pruning. 
 
-<!--
-## Examples  
-These examples use a schema, `dfs.samples`, which points to the `/home` 
directory. The `/home` directory contains a subdirectory, `parquet`, which
-contains the `nation.parquet` and a subdirectory, `dir1` with the 
`region.parquet` file. You can access the `nation.parquet` and `region.parquet` 
Parquet files in the `sample-data` directory of your Drill installation.  
 
-       [root@doc23 dir1]# pwd
-       /home/parquet/dir1
-        
+## Examples  
+These examples use a schema, `dfs.samples`, which points to the `/tmp` 
directory. The `/tmp` directory contains the following subdirectories and files 
used in the examples:  
+
+       [root@doc23 parquet1]# pwd
+       /tmp/parquet1
+       
+       [root@doc23 parquet1]# ls
+       parquet
+       
+       [root@doc23 parquet1]# cd parquet
+       
        [root@doc23 parquet]# ls
-       dir1  nation.parquet
-        
-       [root@doc23 dir1]# ls
-       region.parquet  
+       nation.parquet  test
+       
+       [root@doc23 parquet]# cd test
+       
+       [root@doc23 test]# ls
+       nation.parquet
+
+**Note:** You can access the sample `nation.parquet` file in the `sample-data` 
directory of your Drill installation.
 
-Change schemas to use `dfs.samples`:
  
+Switch to the `dfs.samples` schema:
+
        use dfs.samples;
        +-------+------------------------------------------+
        |  ok   |                 summary                     |
@@ -93,37 +102,113 @@ Change schemas to use `dfs.samples`:
        | true  | Default schema changed to [dfs.samples]  |
        +-------+------------------------------------------+  
 
-### Running REFRESH TABLE METADATA on a Directory  
-Running the REFRESH TABLE METADATA command on the `parquet` directory 
generates metadata cache files at each directory level.  
-
-       REFRESH TABLE METADATA parquet;  
-       +-------+---------------------------------------------------+
-       |  ok   |                       summary                         |
-       +-------+---------------------------------------------------+
-       | true  | Successfully updated metadata for table parquet.  |
-       +-------+---------------------------------------------------+  
-
-When looking at the `parquet` directory and `dir1` subdirectory, you can see 
that a metadata cache file was created at each level:
-
+### Running REFRESH TABLE METADATA on a Directory
+Running the REFRESH TABLE METADATA command on the “parquet1” directory 
generates metadata cache files at each directory level.
+
+       apache drill (dfs.samples)> REFRESH TABLE METADATA parquet1;
+       +------+---------------------------------------------------+
+       |  ok  |                      summary                      |
+       +------+---------------------------------------------------+
+       | true | Successfully updated metadata for table parquet1. |
+       +------+---------------------------------------------------+
+
+When looking at the “parquet1” directory and its subdirectories, you can see 
that hidden metadata cache and summary files were created at each level:
+
+**Note:** The CRC files are Cyclic Redundancy Check checksum files used to 
verify the data integrity of other files. 
+
+       [root@doc23 parquet1]# ls -la
+       total 36
+       drwxr-xr-x   3 root root  284 Apr 29 11:46 .
+       drwxrwxrwt. 51 root root 8192 Apr 29 11:44 ..
+       -rw-r--r--   1 root root 1037 Apr 29 11:46 
.drill.parquet_file_metadata.v4
+       -rw-r--r--   1 root root   20 Apr 29 11:46 
..drill.parquet_file_metadata.v4.crc
+       -rw-r--r--   1 root root   51 Apr 29 11:46 
.drill.parquet_metadata_directories
+       -rw-r--r--   1 root root   12 Apr 29 11:46 
..drill.parquet_metadata_directories.crc
+       -rw-r--r--   1 root root 1334 Apr 29 11:46 
.drill.parquet_summary_metadata.v4
+       -rw-r--r--   1 root root   20 Apr 29 11:46 
..drill.parquet_summary_metadata.v4.crc
+       drwxr-xr-x   3 root root  212 Apr 29 11:30 parquet  
+       
+       [root@doc23 parquet1]# cd parquet
        [root@doc23 parquet]# ls -la
-       drwxr-xr-x   2 root root   95 Mar 18 17:49 dir1
-       -rw-r--r--   1 root root 2642 Mar 18 17:52 .drill.parquet_metadata
-       -rw-r--r--   1 root root   32 Mar 18 17:52 ..drill.parquet_metadata.crc
-       -rwxr-xr-x   1 root root 1210 Mar 13 13:32 nation.parquet
-        
-       [root@doc23 dir1]# ls -la
-       -rw-r--r-- 1 root root 1235 Mar 18 17:52 .drill.parquet_metadata
-       -rw-r--r-- 1 root root   20 Mar 18 17:52 ..drill.parquet_metadata.crc
-       -rwxr-xr-x 1 root root  455 Mar 18 17:41 region.parquet  
-
-The following sections compare the content of the metadata cache file in  the 
`parquet` and `dir1` directories:  
+       total 20
+       drwxr-xr-x 3 root root  212 Apr 29 11:30 .
+       drwxr-xr-x 3 root root  284 Apr 29 11:46 ..
+       -rw-r--r-- 1 root root 1021 Apr 29 11:46 .drill.parquet_file_metadata.v4
+       -rw-r--r-- 1 root root   16 Apr 29 11:46 
..drill.parquet_file_metadata.v4.crc
+       -rw-r--r-- 1 root root 1315 Apr 29 11:46 
.drill.parquet_summary_metadata.v4
+       -rw-r--r-- 1 root root   20 Apr 29 11:46 
..drill.parquet_summary_metadata.v4.crc
+       -rwxr-xr-x 1 root root 1210 Apr 29 11:23 nation.parquet
+       drwxr-xr-x 2 root root  200 Apr 29 11:46 test
+       
+       [root@doc23 test]# ls -la
+       total 20
+       drwxr-xr-x 2 root root  200 Apr 29 11:46 .
+       drwxr-xr-x 3 root root  212 Apr 29 11:30 ..
+       -rw-r--r-- 1 root root  517 Apr 29 11:46 .drill.parquet_file_metadata.v4
+       -rw-r--r-- 1 root root   16 Apr 29 11:46 
..drill.parquet_file_metadata.v4.crc
+       -rw-r--r-- 1 root root 1308 Apr 29 11:46 
.drill.parquet_summary_metadata.v4
+       -rw-r--r-- 1 root root   20 Apr 29 11:46 
..drill.parquet_summary_metadata.v4.crc
+       -rwxr-xr-x 1 root root 1210 Apr 29 11:23 nation.parquet  
+
+Looking at the `.drill.parquet_file_metadata.v4` file in the `/tmp/parquet1` 
directory, you can see that the file contains the paths to the Parquet files in 
the subdirectories, as well as metadata for those files: 
+
+       [root@doc23 parquet1]# cat .drill.parquet_file_metadata.v4
+       {
+         "files" : [ {
+           "path" : "parquet/test/nation.parquet",
+           "length" : 1210,
+           "rowGroups" : [ {
+             "start" : 4,
+             "length" : 944,
+             "rowCount" : 25,
+             "hostAffinity" : {
+               "localhost" : 1.0
+             },
+             "columns" : [ {
+               "name" : [ "N_NATIONKEY" ],
+               "nulls" : -1
+             }, {
+               "name" : [ "N_NAME" ],
+               "nulls" : -1
+             }, {
+               "name" : [ "N_REGIONKEY" ],
+               "nulls" : -1
+             }, {
+               "name" : [ "N_COMMENT" ],
+               "nulls" : -1
+             } ]
+           } ]
+         }, {
+           "path" : "parquet/nation.parquet",
+           "length" : 1210,
+           "rowGroups" : [ {
+             "start" : 4,
+             "length" : 944,
+             "rowCount" : 25,
+             "hostAffinity" : {
+               "localhost" : 1.0
+             },
+             "columns" : [ {
+               "name" : [ "N_NATIONKEY" ],
+               "nulls" : -1
+             }, {
+               "name" : [ "N_NAME" ],
+               "nulls" : -1
+             }, {
+               "name" : [ "N_REGIONKEY" ],
+               "nulls" : -1
+             }, {
+               "name" : [ "N_COMMENT" ],
+               "nulls" : -1
+             } ]
+           } ]
+         } ]
 
-**Content of the metadata cache file in the directory named `parquet` that 
contains the nation.parquet file and subdirectory `dir1`.**  
 
+Looking at the `.drill.parquet_summary_metadata.v4` file in the `parquet1` 
directory, you can see information about each of the columns in the files and 
the list of subdirectories and interesting columns (useful when indicating 
columns in the REFRESH TABLE METADATA command):  
 
-       [root@doc23 parquet]# cat .drill.parquet_metadata
+       [root@doc23 parquet1]# cat .drill.parquet_summary_metadata.v4
        {
-         "metadata_version" : "3.3",
          "columnTypeInfo" : {
            "`N_COMMENT`" : {
              "name" : [ "N_COMMENT" ],
@@ -132,7 +217,9 @@ The following sections compare the content of the metadata 
cache file in  the `p
              "precision" : 0,
              "scale" : 0,
              "repetitionLevel" : 0,
-             "definitionLevel" : 0
+             "definitionLevel" : 0,
+             "totalNullCount" : -1,
+             "isInteresting" : true
            },
            "`N_NATIONKEY`" : {
              "name" : [ "N_NATIONKEY" ],
@@ -141,25 +228,9 @@ The following sections compare the content of the metadata 
cache file in  the `p
              "precision" : 0,
              "scale" : 0,
              "repetitionLevel" : 0,
-             "definitionLevel" : 0
-           },
-           "`R_REGIONKEY`" : {
-             "name" : [ "R_REGIONKEY" ],
-             "primitiveType" : "INT64",
-             "originalType" : null,
-             "precision" : 0,
-             "scale" : 0,
-             "repetitionLevel" : 0,
-             "definitionLevel" : 0
-           },
-           "`R_COMMENT`" : {
-             "name" : [ "R_COMMENT" ],
-             "primitiveType" : "BINARY",
-             "originalType" : "UTF8",
-             "precision" : 0,
-             "scale" : 0,
-             "repetitionLevel" : 0,
-             "definitionLevel" : 0
+             "definitionLevel" : 0,
+             "totalNullCount" : -1,
+             "isInteresting" : true
            },
            "`N_REGIONKEY`" : {
              "name" : [ "N_REGIONKEY" ],
@@ -168,16 +239,9 @@ The following sections compare the content of the metadata 
cache file in  the `p
              "precision" : 0,
              "scale" : 0,
              "repetitionLevel" : 0,
-             "definitionLevel" : 0
-           },
-           "`R_NAME`" : {
-             "name" : [ "R_NAME" ],
-             "primitiveType" : "BINARY",
-             "originalType" : "UTF8",
-             "precision" : 0,
-             "scale" : 0,
-             "repetitionLevel" : 0,
-             "definitionLevel" : 0
+             "definitionLevel" : 0,
+             "totalNullCount" : -1,
+             "isInteresting" : true
            },
            "`N_NAME`" : {
              "name" : [ "N_NAME" ],
@@ -186,97 +250,44 @@ The following sections compare the content of the 
metadata cache file in  the `p
              "precision" : 0,
              "scale" : 0,
              "repetitionLevel" : 0,
-             "definitionLevel" : 0
+             "definitionLevel" : 0,
+             "totalNullCount" : -1,
+             "isInteresting" : true
            }
          },
-         "files" : [ {
-           "path" : "dir1/region.parquet",
-           "length" : 455,
-           "rowGroups" : [ {
-             "start" : 4,
-             "length" : 250,
-             "rowCount" : 5,
-             "hostAffinity" : {
-               "localhost" : 1.0
-             },
-             "columns" : [ ]
-           } ]
-         }, {
-           "path" : "nation.parquet",
-           "length" : 1210,
-           "rowGroups" : [ {
-             "start" : 4,
-             "length" : 944,
-             "rowCount" : 25,
-             "hostAffinity" : {
-               "localhost" : 1.0
-             },
-             "columns" : [ ]
-           } ]
-         } ],
-         "directories" : [ "dir1" ],
-         "drillVersion" : "1.16.0-SNAPSHOT"  
+         "directories" : [ "parquet/test", "parquet" ],
+         "drillVersion" : "1.16.0-SNAPSHOT",
+         "totalRowCount" : 50,
+         "allColumnsInteresting" : true,
+         "metadata_version" : "4"  
 
-**Content of the directory named `dir1` that contains the `region.parquet` 
file and no subdirectories.**  
-
-       [root@doc23 dir1]# cat .drill.parquet_metadata
-       {
-         "metadata_version" : "3.3",
-         "columnTypeInfo" : {
-               "`R_REGIONKEY`" : {
-               "name" : [ "R_REGIONKEY" ],
-               "primitiveType" : "INT64",
-               "originalType" : null,
-               "precision" : 0,
-               "scale" : 0,
-               "repetitionLevel" : 0,
-               "definitionLevel" : 0
-               },
-               "`R_COMMENT`" : {
-               "name" : [ "R_COMMENT" ],
-               "primitiveType" : "BINARY",
-               "originalType" : "UTF8",
-               "precision" : 0,
-               "scale" : 0,
-               "repetitionLevel" : 0,
-               "definitionLevel" : 0
-               },
-               "`R_NAME`" : {
-               "name" : [ "R_NAME" ],
-               "primitiveType" : "BINARY",
-             "originalType" : "UTF8",
-               "precision" : 0,
-               "scale" : 0,
-               "repetitionLevel" : 0,
-               "definitionLevel" : 0
-               }
-         },
-         "files" : [ {
-               "path" : "region.parquet",
-               "length" : 455,
-               "rowGroups" : [ {
-               "start" : 4,
-               "length" : 250,
-               "rowCount" : 5,
-               "hostAffinity" : {
-               "localhost" : 1.0
-               },
-               "columns" : [ ]
-               } ]
-         } ],
-         "directories" : [ ],
-         "drillVersion" : "1.16.0-SNAPSHOT"
-       }  
-
-### Verifying that the Planner is Using the Metadata Cache File 
+### Verifying that the Planner is Using the Metadata Cache or Summary Files
 
 When the planner uses metadata cache files, the query plan includes the 
`usedMetadataFile` flag. You can access the query plan in the Drill Web UI, by 
clicking on the query in the Profiles page, or by running the EXPLAIN PLAN FOR 
command, as shown:
 
-       EXPLAIN PLAN FOR SELECT * FROM parquet;  
- 
+       apache drill (dfs.samples)> explain plan for select * from parquet1;
+       
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
+       |                                       text                            
           |                                       json                         
              |
+       
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
        | 00-00    Screen
        00-01      Project(**=[$0])
-       00-02      Scan(table=[[dfs, samples, parquet]], 
groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/home/parquet]], 
selectionRoot=/home/parquet, numFiles=1, numRowGroups=2, usedMetadataFile=true, 
cacheFileRoot=/home/parquet, columns=[`**`]]])
-       |... 
+       00-02        Scan(table=[[dfs, samples, parquet1]], 
groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/parquet1]], 
selectionRoot=/tmp/parquet1, numFiles=1, numRowGroups=2, usedMetadataFile=true, 
cacheFileRoot=/tmp/parquet1, columns=[`**`]]])  
+        |   
+
+When you run the EXPLAIN command with a COUNT(*) query, as shown, you can see 
that the query planner uses the summary file and avoids reading the 
larger metadata cache file. The query plan includes the 
`usedMetadataSummaryFile` flag.
+
+       apache drill (dfs.samples)> explain plan for select count(*) from 
parquet1;
+       
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
+       |                                       text                            
           |                                       json                         
              |
+       
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
+       | 00-00    Screen
+       00-01      Project(EXPR$0=[$0])
+       00-02        DirectScan(groupscan=[files = 
[file:/tmp/parquet1/.drill.parquet_summary_metadata.v4], numFiles = 1, 
usedMetadataSummaryFile = true, DynamicPojoRecordReader{records = [[50]]}])
+        | 
+
+       
+       
+
+
+
 
--->    
diff --git a/_docs/sql-reference/sql-commands/021-create-schema.md 
b/_docs/sql-reference/sql-commands/021-create-schema.md
index f236735..21ee180 100644
--- a/_docs/sql-reference/sql-commands/021-create-schema.md
+++ b/_docs/sql-reference/sql-commands/021-create-schema.md
@@ -1,10 +1,10 @@
 ---
 title: "CREATE OR REPLACE SCHEMA"
-date: 2019-04-25
+date: 2019-04-29
 parent: "SQL Commands"
 ---
 
-Starting in Drill 1.16, you can define a schema for text files using the 
CREATE OR REPLACE SCHEMA command. Running this command generates a hidden 
.drill.schema file in the table’s root directory. The .drill.schema file stores 
the schema definition in JSON format. Drill uses the schema file at runtime if 
the exec.storage.enable_v3_text_reader and store.table.use_schema_file options 
are enabled. Alternatively, you can create the schema file manually. When 
created manually, the file conten [...]
+Starting in Drill 1.16, you can define a schema for text files using the 
CREATE OR REPLACE SCHEMA command. Running this command generates a hidden 
`.drill.schema` file in the table’s root directory. The `.drill.schema` file 
stores the schema definition in JSON format. Drill uses the schema file at 
runtime if the `exec.storage.enable_v3_text_reader` and 
`store.table.use_schema_file` options are enabled. Alternatively, you can 
create the schema file manually. If created manually, the file  [...]
 
 ##Syntax
 
@@ -187,7 +187,7 @@ Values are trimmed when converting to any type, except for 
varchar.
 ### Schema Mode (Column Order)
 The schema mode determines the ordering of columns returned for wildcard (*) 
queries. The mode is set through the `drill.strict` property. You can set this 
property to true (strict) or false (not strict). If you do not indicate the 
mode, the default is false (not strict).  
 
-**Not Strict (Default)**
+**Not Strict (Default)**  
 Columns defined in the schema are projected in the defined order. Columns not 
defined in the schema are appended to the defined columns, as shown:  
 
        create or replace schema (id int, start_date date format 'yyyy-MM-dd') 
for table dfs.tmp.`text_table` properties ('drill.strict' = 'false');
@@ -210,7 +210,7 @@ Columns defined in the schema are projected in the defined 
order. Columns not de
  
 Note that the “name” column, which was not included in the schema was appended 
to the end of the table.
 
-**Strict**
+**Strict**  
 Setting the `drill.strict` property  to “true” changes the schema mode to 
strict, which means that the reader ignores any columns NOT included in the 
schema. The query only returns the columns defined in the schema, as shown:
  
        create or replace schema (id int, start_date date format 'yyyy-MM-dd') 
for table dfs.tmp.`text_table` properties ('drill.strict' = 'true');
