[GitHub] [drill] paul-rogers commented on a change in pull request #1953: Add docs for Drill Metastore

GitBox Tue, 14 Jan 2020 21:07:13 -0800

paul-rogers commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r366688559


 ##########
 File path: 
_docs/sql-reference/sql-commands/007-analyze-table-refresh-metadata.md
 ##########
 @@ -0,0 +1,158 @@
+---
+title: "ANALYZE TABLE REFRESH METADATA"
+parent: "SQL Commands"
+date: 2020-01-13
+---
+
+Starting from Drill 1.17, you can store table metadata (including schema and 
computed statistics) into Drill Metastore.
+This metadata will be used when querying a table for more optimal plan 
creation.
+
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+To enable Drill Metastore usage, the following option `metastore.enabled` 
should be set to `true`, as shown:
+
+       SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://<drill-hostname-or-ip-address>:8047/options`.
+
+## Syntax
+
+The ANALYZE TABLE REFRESH METADATA statement supports the following syntax:
+
+       ANALYZE TABLE [table_name] [COLUMNS {(col1, col2, ...) | NONE}]
+       REFRESH METADATA ['level' LEVEL]
+       [{COMPUTE | ESTIMATE} | STATISTICS [(column1, column2, ...)]
+       [ SAMPLE number PERCENT ]]
+
+## Parameters
+
+*table_name*
+The name of the table or directory for which Drill will collect table 
metadata. If the table does not exist, or the table
+ is temporary, the command fails and metadata is not collected and stored.
+
+*COLUMNS (col1, col2, ...)*
+Optional names of the column(s) for which Drill will generate and store 
metadata. the Stored schema will include all table columns.
+
+*COLUMNS NONE*
+Specifies to ignore collecting and storing metadata for all table columns.
+
+*level*
+Optional varchar literal which specifies maximum level depth for collecting 
metadata.
+Possible values: `TABLE`, `SEGMENT`, `PARTITION`, `FILE`, `ROW_GROUP`, `ALL`. 
Default is `ALL`.
+
+*COMPUTE*
+Generates statistics for the table to be stored into the Metastore.
+If statistics usage is disabled (`planner.enable_statistics` is set to 
`false`), an error will be thrown when this clause is specified.
+
+*ESTIMATE*
+Generates estimated statistics for the table to be stored into the Metastore. 
Currently is not supported.
+
+*(column1, column2, ...)*
+The name of the column(s) for which Drill will generate statistics.
+
+*SAMPLE*
+Optional. Indicates that compute statistics should run on a subset of the data.
+
+*number PERCENT*  
+An integer that specifies the percentage of data on which to compute 
statistics. For example, if a table has 100 rows, `SAMPLE 50 PERCENT` indicates 
that statistics should be computed on 50 rows. The optimizer selects the rows 
at random. 
+
+## Related Options
+
+- **metastore.enabled**
+Enables Drill Metastore usage to be able to store table metadata during 
ANALYZE TABLE commands execution and to be able
+ to read table metadata during regular queries execution or when querying some 
INFORMATION_SCHEMA tables. Default is `false`.
+- **metastore.metadata.store.depth_level**
+Specifies maximum level depth for collecting metadata. Default is `'ALL'`.
+- **metastore.retrieval.retry_attempts**
+Specifies the number of attempts for retrying query planning after detecting 
that query metadata is changed.
+If the number of retries was exceeded, query will be planned without metadata 
information from the Metastore. Default is 5.
+- **metastore.metadata.fallback_to_file_metadata**
+Allows using file metadata cache for the case when required metadata is absent 
in the Metastore. Default is true.
+- **metastore.metadata.use_schema**
+Enables schema usage, stored to the Metastore. Default is `true`.
+- **metastore.metadata.use_statistics**
+Enables statistics usage, stored in the Metastore, at the planning stage. 
Default is `true`.
+- **metastore.metadata.ctas.auto-collect**
+Specifies whether schema and / or column statistics will be automatically 
collected for every table after CTAS and CTTAS.
+This option is not active for now. Default is `'NONE'`.
+- **drill.exec.storage.implicit.last_modified_time.column.label**
+Sets the implicit column name for the last modified time (`lmt`) column. For 
internal usage when producing Metastore analyze.
+- **drill.exec.storage.implicit.row_group_index.column.label**
+Sets the implicit column name for the row group index (`rgi`) column. For 
internal usage when producing Metastore analyze.
+- **drill.exec.storage.implicit.row_group_length.column.label**
+Sets the implicit column name for the row group length (`rgl`) column. For 
internal usage when producing Metastore analyze.
+- **drill.exec.storage.implicit.row_group_start.column.label**
+Sets the implicit column name for the row group start (`rgs`) column. For 
internal usage when producing Metastore analyze.
+
+## Related Commands
+
+To drop table metadata from the Metastore, the following command may be used:
+
+       ANALYZE TABLE [table_name] DROP [METADATA|STATISTICS] [IF EXISTS]
+
+It will not throw an exception for absent table metadata if `IF EXISTS` clause 
was specified.
+
+       ANALYZE TABLE [workspace.]table_name COMPUTE STATISTICS [(column1, 
column2,...)] [SAMPLE number PERCENT]
+       
+See [ANALYZE TABLE COMPUTE 
STATISTICS]({{site.baseurl}}/docs/analyze-table-compute-statistics). 
+
+## Usage Notes
+
+### General Information
+
+- Currently `ANALYZE TABLE REFRESH METADATA` statement can compute and store 
metadata only for Parquet tables within `dfs` storage plugins.
+- For the case when `ANALYZE TABLE REFRESH METADATA` command is executed for 
the first time, whole table metadata will be collected and stored into 
Metastore.
+If analyze was already executed for the table, and table data wasn't changed, 
all further analyze commands wouldn't trigger table analyzing and message that 
table metadata is up to date will be returned.
 
 Review comment:
   Drill 1.17 supports the Metastore and `ANALYZE TABLE` only for tables stored 
as Parquet files and only when stored in the DFS storage plugin. (Does this 
work for local files?)
   
   The first time you execute `ANALYZE TABLE` for a table, Drill will scan the 
entire tables (all files.) When you next issue the same command, Drill will 
scan only those files added since the previous run. The command will return the 
message (what?) if table statistics are up-to-date.
   
   (Will drill automatically remove files that have been removed since the 
previous scan?
   
   Please explain this in detail. Do we place a timestamp in the metadata? At 
what point do we set the timestamp? At the start or end of the run? Or, do we 
have a per-file timestamp? How do we detect if a table is new? Modification 
timestamp? Creation timestamp? What do we compare the timestamp against?)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

[GitHub] [drill] paul-rogers commented on a change in pull request #1953: Add docs for Drill Metastore

Reply via email to