[
https://issues.apache.org/jira/browse/IMPALA-11050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Quanlong Huang updated IMPALA-11050:
------------------------------------
Epic Link: IMPALA-11532
> Skip file metadata reloading in AlterPartition event from event processor in
> catalogd
> -------------------------------------------------------------------------------------
>
> Key: IMPALA-11050
> URL: https://issues.apache.org/jira/browse/IMPALA-11050
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog
> Reporter: Sourabh Goyal
> Assignee: Sourabh Goyal
> Priority: Major
> Fix For: Impala 4.1.0
>
>
> An HdfsPartition in catalogd is a collection of files, and each file is
> represented by a FileDescriptor. A FileDescriptor contains:
> 1. Relative path of the file
> 2. Length of the file
> 3. Compression info (for example, GZIP)
> 4. Modification time of the file
> 5. Block info for the file. Each block has an offset, a length, and
> diskIds
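
As a rough illustration of the structure listed above, here is a minimal Java sketch. The class names BlockInfo and FileDescriptorSketch, the field names, and the totalBlockLength() helper are hypothetical and chosen for readability. They are not Impala's actual FileDescriptor implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified model of the fields listed above.
// This is NOT Impala's actual FileDescriptor class.
final class BlockInfo {
    final long offset;            // byte offset of the block within the file
    final long length;            // block length in bytes
    final List<Integer> diskIds;  // disk ids associated with the block

    BlockInfo(long offset, long length, List<Integer> diskIds) {
        this.offset = offset;
        this.length = length;
        this.diskIds = Collections.unmodifiableList(new ArrayList<>(diskIds));
    }
}

final class FileDescriptorSketch {
    final String relativePath;     // path relative to the partition location
    final long length;             // file length in bytes
    final String compression;      // e.g. "GZIP" or "NONE"
    final long modificationTimeMs; // last modification time
    final List<BlockInfo> blocks;  // block info for this file

    FileDescriptorSketch(String relativePath, long length, String compression,
                         long modificationTimeMs, List<BlockInfo> blocks) {
        this.relativePath = relativePath;
        this.length = length;
        this.compression = compression;
        this.modificationTimeMs = modificationTimeMs;
        this.blocks = Collections.unmodifiableList(new ArrayList<>(blocks));
    }

    // Sums the lengths of all blocks (illustrative helper only).
    long totalBlockLength() {
        long total = 0;
        for (BlockInfo b : blocks) {
            total += b.length;
        }
        return total;
    }
}
```

Every field here comes from listing files in the filesystem, which is why rebuilding these descriptors is the expensive part of a partition reload.
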
> When the event processor processes an AlterPartitionEvent, it currently
> reloads the partition together with its file metadata. Reloading file
> metadata is relatively expensive because it involves listing files in the
> underlying filesystem. From the Impala shell, an alter partition is
> triggered via ALTER TABLE <table_name> PARTITION <partition_spec>
> <operation>. Here operation can be:
> # Update stats
> # Drop stats
> # Set file format
> # Set row format
> # Set table properties
> # Unset table properties
> # Set serde properties
> # Unset serde properties
> # Set cached <hdfs-pool-name>
> # Unset cached <hdfs-pool-name>
> # Set location
>
> *For transactional tables:*
> If incremental refresh is enabled, the event processor reloads file
> metadata at the CommitTxn event. Since there is no way to know whether the
> CommitTxn event was caused by alter_partition or by some other operation,
> file metadata reloading cannot be skipped.
> *For external tables:*
> Of the operations above, any operation that affects the underlying storage
> descriptor of a partition should trigger file metadata reloading.
> Operations 3, 4, 7, 8 and 11 (set file format, set row format, set and
> unset serde properties, set location) are such operations.
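
The classification above can be written down as a small lookup. The following Java sketch is only illustrative: the enum AlterPartitionOp and the method needsFileMetadataReload() are hypothetical names, and the event processor may not see the operation type directly. It encodes the external-table rule from this description and nothing more. SET_CACHED and UNSET_CACHED are marked as not needing a reload only because they are not in the list above; whether cached partitions should always be reloaded is an open question in this issue.

```java
// Hypothetical enum mirroring the numbered operations in this description.
// The declaration order matches the list, so ordinal() + 1 is the list number.
enum AlterPartitionOp {
    UPDATE_STATS(false),
    DROP_STATS(false),
    SET_FILE_FORMAT(true),
    SET_ROW_FORMAT(true),
    SET_TBL_PROPERTIES(false),
    UNSET_TBL_PROPERTIES(false),
    SET_SERDE_PROPERTIES(true),
    UNSET_SERDE_PROPERTIES(true),
    SET_CACHED(false),    // assumption: see the open question on cached partitions
    UNSET_CACHED(false),  // assumption: see the open question on cached partitions
    SET_LOCATION(true);

    // True when the operation changes the partition's storage descriptor.
    private final boolean affectsStorageDescriptor;

    AlterPartitionOp(boolean affectsStorageDescriptor) {
        this.affectsStorageDescriptor = affectsStorageDescriptor;
    }

    // External tables only: reload file metadata when the storage
    // descriptor is affected, otherwise the reload can be skipped.
    boolean needsFileMetadataReload() {
        return affectsStorageDescriptor;
    }
}
```

For example, AlterPartitionOp.UPDATE_STATS.needsFileMetadataReload() is false, while AlterPartitionOp.SET_LOCATION.needsFileMetadataReload() is true.
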
>
> *How to detect a change in the storage descriptor of a partition:*
> The HMS partition object received in an alter_partition event contains a
> metastore.api.StorageDescriptor object. This object has fields such as:
> * List<FieldSchema> cols
> * String location
> * String inputFormat
> * String outputFormat
> * Boolean compressed
> * int numBuckets
> * SerdeInfo serdeInfo
> * List<String> bucketCols
> * List<Order> sortCols
> * Map<String, String> params
>
> Fetch the HMS partition object from the alterPartition event and compare
> its storage descriptor properties with the corresponding properties of the
> already cached partition object.
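
The comparison described above could look roughly like the following Java sketch. To stay self-contained it does not use the Hive metastore classes. SdSnapshot is a hypothetical stand-in that holds a subset of the StorageDescriptor fields listed earlier, and storageDescriptorChanged() is a hypothetical helper name. Which fields actually need to be compared is an assumption here, not something this issue settles.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a subset of metastore.api.StorageDescriptor.
final class SdSnapshot {
    final String location;
    final String inputFormat;
    final String outputFormat;
    final boolean compressed;
    final int numBuckets;
    final String serializationLib;         // taken from SerdeInfo
    final Map<String, String> serdeParams; // taken from SerdeInfo

    SdSnapshot(String location, String inputFormat, String outputFormat,
               boolean compressed, int numBuckets, String serializationLib,
               Map<String, String> serdeParams) {
        this.location = location;
        this.inputFormat = inputFormat;
        this.outputFormat = outputFormat;
        this.compressed = compressed;
        this.numBuckets = numBuckets;
        this.serializationLib = serializationLib;
        this.serdeParams = new HashMap<>(serdeParams); // defensive copy
    }
}

final class SdComparator {
    private SdComparator() {}

    // Returns true when any compared field differs between the cached
    // partition and the partition carried by the event. In that case the
    // file metadata should be reloaded; otherwise the reload can be skipped.
    static boolean storageDescriptorChanged(SdSnapshot cached, SdSnapshot fromEvent) {
        boolean same =
            Objects.equals(cached.location, fromEvent.location)
            && Objects.equals(cached.inputFormat, fromEvent.inputFormat)
            && Objects.equals(cached.outputFormat, fromEvent.outputFormat)
            && cached.compressed == fromEvent.compressed
            && cached.numBuckets == fromEvent.numBuckets
            && Objects.equals(cached.serializationLib, fromEvent.serializationLib)
            && Objects.equals(cached.serdeParams, fromEvent.serdeParams);
        return !same;
    }
}
```

With this shape, a stats-only alter leaves every compared field equal, so the helper returns false and the file listing can be skipped. A changed location or serde property makes it return true.
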
> {*}Unknowns{*}:
> # If a partition is cached in HDFS, should we always reload its file
> metadata (irrespective of the operations mentioned above) to get the most
> up-to-date block locations?
>
> cc - [~vihangk1] [~stigahuang] [~hsnusonic]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)