arina-ielchiieva commented on a change in pull request #1886: DRILL-7273: Introduce operators for handling metadata
URL: https://github.com/apache/drill/pull/1886#discussion_r344774705
##########
File path: docs/dev/MetastoreAnalyze.md
##########
@@ -60,14 +64,57 @@ The following tables are populated with table metadata from the metastore:
  - `LOCATION` - segment location, `null` for partitions: `/tmp/nation/part_int=3`
  - `LAST_MODIFIED_TIME` - last modification time
 
-# Metastore-related options
+# Metastore related options
 
- - `metastore.enabled` Enables Drill Metastore usage to be able to store table metadata during `ANALYZE TABLE` commands
+ - `metastore.enabled` - enables Drill Metastore usage to be able to store table metadata during `ANALYZE TABLE` commands
   execution and to be able to read table metadata during regular queries execution or when querying some `INFORMATION_SCHEMA` tables.
- - `metastore.metadata.store.depth_level` Specifies maximum level depth for collecting metadata.
- - `metastore.metadata.use_schema` Enables schema usage, stored to the Metastore.
- - `metastore.metadata.use_statistics` Enables statistics usage, stored in the Metastore, at the planning stage.
- - `metastore.metadata.fallback_to_file_metadata` Allows using file metadata cache for the case when required metadata is absent in the Metastore.
- - `metastore.retrieval.retry_attempts` Specifies the number of attempts for retrying query planning after detecting that query metadata is changed.
+ - `metastore.metadata.store.depth_level` - specifies maximum level depth for collecting metadata.
+   Possible values : `TABLE`, `SEGMENT`, `PARTITION`, `FILE`, `ROW_GROUP`, `ALL`.
+ - `metastore.metadata.use_schema` - enables schema usage, stored to the Metastore.
+ - `metastore.metadata.use_statistics` - enables statistics usage, stored in the Metastore, at the planning stage.
+ - `metastore.metadata.fallback_to_file_metadata` - allows using file metadata cache for the case when required metadata is absent in the Metastore.
+ - `metastore.retrieval.retry_attempts` - specifies the number of attempts for retrying query planning after detecting that query metadata is changed.
   If the number of retries was exceeded, query will be planned without metadata information from the Metastore.
-
\ No newline at end of file
+
+# Analyze operators description
+
+Entry point for `ANALYZE` command is `MetastoreAnalyzeTableHandler` class. It creates plan which includes some
+Metastore specific operators for collecting metadata.
+
+`MetastoreAnalyzeTableHandler` uses `AnalyzeInfoProvider` for providing the information
+required for building a suitable plan for collecting metadata.
+Every group scan should provide each `AnalyzeInfoProvider` implementation and annotate implementation class with
+`AnalyzeInfoProviderTemplate` annotation to load it dynamically when executing analyze.
+
+Analyze specific operators:
+ - `MetadataAggBatch` - operator which adds aggregate calls for all incoming table columns to calculate required
+   metadata and produces aggregations. If aggregation is performed on top of another aggregation,
+   required aggregate calls for merging metadata will be added.
+ - `MetadataHandlerBatch` - operator responsible for handling metadata returned by incoming aggregate operators and
+   fetching required metadata form the metastore to produce further aggregations.
+ - `MetadataControllerBatch` - responsible for converting obtained metadata, fetching absent metadata from the metastore
+   and storing resulting metadata into the metastore.
+
+`MetastoreAnalyzeTableHandler` forms plan depending on segments count in the following form:
+
+```
+MetadataControllerBatch
+  ...
+    MetadataHandlerBatch
+      MetadataAggBatch(dir0, ...)
+        MetadataHandlerBatch
+          MetadataAggBatch(dir0, dir1, ...)
+            MetadataHandlerBatch
+              MetadataAggBatch(dir0, dir1, fqn, ...)
+                Scan(DYNAMIC_STAR **, ANY fqn, ...)
+```
+
+The lowest `MetadataAggBatch` creates required aggregate calls for every (or interesting only) table columns
+and produces aggregations with grouping by segment columns that correspond to specific table level.
+`MetadataHandlerBatch` above it populates batch with additional information about metadata type and other info.
+`MetadataAggBatch` above merges metadata calculated before to obtain metadata for parent metadata levels and also stores incoming data to populate it to the metastore later.
+
+`MetadataControllerBatch` obtains all calculated metadata, converts it to the suitable form and sends it to the metastore.
+
+For the case of incremental analyze, `MetastoreAnalyzeTableHandler` creates Scan with updated files only
+and provides `MetadataHandlerBatch` with information about metadata which should be fetched from the metastore, so existing actual metadata wouldn't be recalculated.

Review comment:
  Metastore

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
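
A note on the plan diagram quoted in the diff: the nesting can be read as one `MetadataHandlerBatch` / `MetadataAggBatch` pair per grouping level, from the first segment column down to all segment columns plus `fqn`. The Python sketch below is only a toy model of that shape and is not Drill code. `plan_shape` and `render` are hypothetical helper names, and the operators hidden behind `...` in the diagram are left out because the diff does not say what they are.

```python
def plan_shape(segment_columns):
    """Return operator names, top-down, for a toy model of the analyze plan."""
    # Grouping keys per aggregation level, shallowest first:
    # [dir0], [dir0, dir1], ..., and finally all segment columns plus fqn.
    levels = [list(segment_columns[:i]) for i in range(1, len(segment_columns) + 1)]
    levels.append(list(segment_columns) + ["fqn"])

    plan = ["MetadataControllerBatch"]
    for keys in levels:
        # Each level is a handler sitting on top of an aggregation.
        plan.append("MetadataHandlerBatch")
        plan.append("MetadataAggBatch(" + ", ".join(keys) + ", ...)")
    plan.append("Scan(DYNAMIC_STAR **, ANY fqn, ...)")
    return plan


def render(plan):
    # Indent every operator two spaces deeper than its parent.
    return "\n".join("  " * depth + op for depth, op in enumerate(plan))


print(render(plan_shape(["dir0", "dir1"])))
```

Called with `["dir0", "dir1"]`, the sketch lists the same operators as the diagram, minus the elided ones; a table with more segment levels would simply add more handler/aggregation pairs.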
