vvysotskyi commented on a change in pull request #1986: Additional changes for Drill Metastore docs URL: https://github.com/apache/drill/pull/1986#discussion_r387070626
########## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ########## @@ -160,19 +199,78 @@ Drill uses the Metastore in several places. When you run a query with multiple d then Drill will assume that existing metadata is invalid and wouldn't use it. Periodically rerun `ANALYZE TABLE` so that Drill can use table metadata when possible. -### Limitations +### Exposing Drill Metastore metadata through `INFORMATION_SCHEMA` tables + +Drill exposes some Metastore tables metadata through `INFORMATION_SCHEMA` tables. +Note, that Metastore metadata will be exposed to the info schema, only if Metastore is enabled. If it is disabled, info + tables won't contain Metastore metadata. + +`TABLES` table includes the set of tables on which you have run `ANALYZE TABLE`. +Description of Metastore-specific columns: + +|Column name |Type |Nullable |Description | +|-----------------------|-----------|-----------|-------------------------------------------------------------------------------------------------------| +|`TABLE_SOURCE` |VARCHAR |YES |Table data type: `PARQUET`, `CSV`, `JSON` | +|`LOCATION` |VARCHAR |YES |Table location: `/tmp/nation` | +|`NUM_ROWS` |BIGINT |YES |The total number of rows in all files of the table. Null if not known | +|`LAST_MODIFIED_TIME` |TIMESTAMP |YES |Timestamp of the most-recently modified file within the table. Updated on each `ANALYZE TABLE` run. | -This feature is currently in the beta phase (preview, experimental) for Drill 1.17 and only applies to Parquet - tables in this release. You must enable this feature through the `metastore.enabled` system/session option. +The `COLUMNS` table describes the columns within each table. Only those columns listed in the `COLUMNS` clause of the + `ANALYZE TABLE` statement appear in this table. -## Examples +|Column name |Type |Nullable |Description | +|-----------------------|-----------|-----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +|`COLUMN_DEFAULT` |VARCHAR |YES |Column default value. | +|`COLUMN_FORMAT` |VARCHAR |YES |Usually applicable for date time columns: `yyyy-MM-dd`. See [Format for Date, Time Conversion]({{site.baseurl}}/docs/create-or-replace-schema/#format-for-date-time-conversion). | +|`NUM_NULLS` |BIGINT |YES |The number of rows which contain nulls for this column. | +|`MIN_VAL` |VARCHAR |YES |The minimum value of the column expressed as a string. For example: `'-273'`. | +|`MAX_VAL` |VARCHAR |YES |The maximum value of the column expressed as a string. For example: `'100500'`. | +|`NDV` |FLOAT8 |YES |Number of distinct values in column, expressed in Double. | +|`EST_NUM_NON_NULLS` |FLOAT8 |YES |Estimated number of non null values, expressed in Double. | +|`IS_NESTED` |BIT |NO |If column is nested. Nested columns are extracted from columns with struct type. | -Examples throughout this topic use the files and directories described in the following section, Directory, and File Setup. +A table can be divided into directories, called "partitions". The `PARTITIONS` table contains an entry for each directory. + +|Column name |Type |Nullable |Description | +|-----------------------|-----------|-----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +|`TABLE_CATALOG` |VARCHAR |YES |Table catalog (currently we have only one catalog): `DRILL`. | +|`TABLE_SCHEMA` |VARCHAR |YES |Table schema: `dfs.tmp`. | +|`TABLE_NAME` |VARCHAR |YES |Table name: `nation`. | +|`METADATA_KEY` |VARCHAR |YES |Top level segment key, the same for all nested segments and partitions: `part_int=3`. | +|`METADATA_TYPE` |VARCHAR |YES |`SEGMENT` or `PARTITION`. Partition here corresponds to "Drill partition", though segment corresponds to data parts like partitions in general case, for example, Hive partition. | +|`METADATA_IDENTIFIER` |VARCHAR |YES |Current metadata identifier: `part_int=3/part_varchar=g`. It is unique value for segment or partition within the table. | +|`PARTITION_COLUMN` |VARCHAR |YES |Partition column name: `part_varchar`. | +|`PARTITION_VALUE` |VARCHAR |YES |Partition column value: `g`. | +|`LOCATION` |VARCHAR |YES |Segment location, `null` for partitions: `/tmp/nation/part_int=3`. | +|`LAST_MODIFIED_TIME` |TIMESTAMP |YES |Last modification time. | + +### Limitations of the 1.17 release + + - Applies to tables stored as Parquet files and only when stored in the `DFS` storage plugin. + - Disabled by default. You must enable this feature through the `metastore.enabled` system/session option. + +### Cheat sheet of `ANALYZE TABLE` commands + + - Add a new table with `ANALYZE TABLE dfs.tmp.lineitem REFRESH METADATA` command. + - When table data (but not schema) changes, run `ANALYZE TABLE dfs.tmp.lineitem REFRESH METADATA` command. + - When the table schema changes, run `ANALYZE TABLE dfs.tmp.lineitem COLUMNS (col1, col2, ...) REFRESH METADATA` command. + - If partitions are added or removed, run `ANALYZE TABLE dfs.tmp.lineitem REFRESH METADATA` command. + - Remove table metadata by submitting `ANALYZE TABLE dfs.tmp.lineitem DROP METADATA` command. + +## Tutorial + +Examples throughout this topic use the files and directories described in the following section `Directory and File Setup`. ### Directory and File Setup +The following examples are written for local file system, but Drill Metastore supports collecting metadata for tables + placed in any file system, supported by Drill and works for both embedded and distributed Drill modes. Review comment: Thanks, fixed. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services