[GitHub] [drill] vvysotskyi commented on a change in pull request #1986: Additional changes for Drill Metastore docs

GitBox Tue, 03 Mar 2020 07:30:26 -0800

vvysotskyi commented on a change in pull request #1986: Additional changes for 
Drill Metastore docs
URL: https://github.com/apache/drill/pull/1986#discussion_r387070626


 ##########
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##########
 @@ -160,19 +199,78 @@ Drill uses the Metastore in several places. When you run 
a query with multiple d
  then Drill will assume that existing metadata is invalid and wouldn't use it. 
Periodically rerun `ANALYZE TABLE` so
  that Drill can use table metadata when possible.
 
-### Limitations
+### Exposing Drill Metastore metadata through `INFORMATION_SCHEMA` tables
+
+Drill exposes some Metastore tables metadata through `INFORMATION_SCHEMA` 
tables.
+Note, that Metastore metadata will be exposed to the info schema, only if 
Metastore is enabled. If it is disabled, info
+ tables won't contain Metastore metadata.
+
+`TABLES` table includes the set of tables on which you have run `ANALYZE 
TABLE`.
+Description of Metastore-specific columns:
+
+|Column name            |Type       |Nullable   |Description                   
                                                                         |
+|-----------------------|-----------|-----------|-------------------------------------------------------------------------------------------------------|
+|`TABLE_SOURCE`         |VARCHAR    |YES        |Table data type: `PARQUET`, 
`CSV`, `JSON`                                                              |
+|`LOCATION`             |VARCHAR    |YES        |Table location: `/tmp/nation` 
                                                                         |
+|`NUM_ROWS`             |BIGINT     |YES        |The total number of rows in 
all files of the table. Null if not known                                  |
+|`LAST_MODIFIED_TIME`   |TIMESTAMP  |YES        |Timestamp of the 
most-recently modified file within the table. Updated on each `ANALYZE TABLE` 
run.    |
 
-This feature is currently in the beta phase (preview, experimental) for Drill 
1.17 and only applies to Parquet
- tables in this release. You must enable this feature through the 
`metastore.enabled` system/session option.
+The `COLUMNS` table describes the columns within each table. Only those 
columns listed in the `COLUMNS` clause of the
+ `ANALYZE TABLE` statement appear in this table.
 
-## Examples
+|Column name            |Type       |Nullable   |Description                   
                                                                                
                                                                     |
+|-----------------------|-----------|-----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+|`COLUMN_DEFAULT`       |VARCHAR    |YES        |Column default value.         
                                                                                
                                                                     |
+|`COLUMN_FORMAT`        |VARCHAR    |YES        |Usually applicable for date 
time columns: `yyyy-MM-dd`. See [Format for Date, Time 
Conversion]({{site.baseurl}}/docs/create-or-replace-schema/#format-for-date-time-conversion).
   |
+|`NUM_NULLS`            |BIGINT     |YES        |The number of rows which 
contain nulls for this column.                                                  
                                                                          |
+|`MIN_VAL`              |VARCHAR    |YES        |The minimum value of the 
column expressed as a string. For example: `'-273'`.                            
                                                                          |
+|`MAX_VAL`              |VARCHAR    |YES        |The maximum value of the 
column expressed as a string. For example: `'100500'`.                          
                                                                          |
+|`NDV`                  |FLOAT8     |YES        |Number of distinct values in 
column, expressed in Double.                                                    
                                                                      |
+|`EST_NUM_NON_NULLS`    |FLOAT8     |YES        |Estimated number of non null 
values, expressed in Double.                                                    
                                                                      |
+|`IS_NESTED`            |BIT        |NO         |If column is nested. Nested 
columns are extracted from columns with struct type.                            
                                                                       |
 
-Examples throughout this topic use the files and directories described in the 
following section, Directory, and File Setup.
+A table can be divided into directories, called "partitions". The `PARTITIONS` 
table contains an entry for each directory.
+
+|Column name            |Type       |Nullable   |Description                   
                                                                                
                                                                     |
+|-----------------------|-----------|-----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+|`TABLE_CATALOG`        |VARCHAR    |YES        |Table catalog (currently we 
have only one catalog): `DRILL`.                                                
                                                                       |
+|`TABLE_SCHEMA`         |VARCHAR    |YES        |Table schema: `dfs.tmp`.      
                                                                                
                                                                     |
+|`TABLE_NAME`           |VARCHAR    |YES        |Table name: `nation`.         
                                                                                
                                                                     |
+|`METADATA_KEY`         |VARCHAR    |YES        |Top level segment key, the 
same for all nested segments and partitions: `part_int=3`.                      
                                                                        |
+|`METADATA_TYPE`        |VARCHAR    |YES        |`SEGMENT` or `PARTITION`. 
Partition here corresponds to "Drill partition", though segment corresponds to 
data parts like partitions in general case, for example, Hive partition.  |
+|`METADATA_IDENTIFIER`  |VARCHAR    |YES        |Current metadata identifier: 
`part_int=3/part_varchar=g`. It is unique value for segment or partition within 
the table.                                                            |
+|`PARTITION_COLUMN`     |VARCHAR    |YES        |Partition column name: 
`part_varchar`.                                                                 
                                                                            |
+|`PARTITION_VALUE`      |VARCHAR    |YES        |Partition column value: `g`.  
                                                                                
                                                                     |
+|`LOCATION`             |VARCHAR    |YES        |Segment location, `null` for 
partitions: `/tmp/nation/part_int=3`.                                           
                                                                      |
+|`LAST_MODIFIED_TIME`   |TIMESTAMP  |YES        |Last modification time.       
                                                                                
                                                                     |
+
+### Limitations of the 1.17 release
+
+ - Applies to tables stored as Parquet files and only when stored in the `DFS` 
storage plugin.
+ - Disabled by default. You must enable this feature through the 
`metastore.enabled` system/session option.
+
+### Cheat sheet of `ANALYZE TABLE` commands
+
+ - Add a new table with `ANALYZE TABLE dfs.tmp.lineitem REFRESH METADATA` 
command.
+ - When table data (but not schema) changes, run `ANALYZE TABLE 
dfs.tmp.lineitem REFRESH METADATA` command.
+ - When the table schema changes, run `ANALYZE TABLE dfs.tmp.lineitem COLUMNS 
(col1, col2, ...) REFRESH METADATA` command.
+ - If partitions are added or removed, run `ANALYZE TABLE dfs.tmp.lineitem 
REFRESH METADATA` command.
+ - Remove table metadata by submitting `ANALYZE TABLE dfs.tmp.lineitem DROP 
METADATA` command.
+
+## Tutorial
+
+Examples throughout this topic use the files and directories described in the 
following section `Directory and File Setup`.
 
 ### Directory and File Setup
 
+The following examples are written for local file system, but Drill Metastore 
supports collecting metadata for tables
+ placed in any file system, supported by Drill and works for both embedded and 
distributed Drill modes.
 
 Review comment:
   Thanks, fixed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] vvysotskyi commented on a change in pull request #1986: Additional changes for Drill Metastore docs

Reply via email to