[ 
https://issues.apache.org/jira/browse/IMPALA-11050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-11050:
------------------------------------
    Epic Link: IMPALA-11532

> Skip file metadata reloading in AlterPartition event from event processor in 
> catalogd
> -------------------------------------------------------------------------------------
>
>                 Key: IMPALA-11050
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11050
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: Sourabh Goyal
>            Assignee: Sourabh Goyal
>            Priority: Major
>             Fix For: Impala 4.1.0
>
>
> An HdfsPartition in catalogd is a collection of files, and each file is 
> represented by a FileDescriptor (fd). An fd contains:
> 1. Relative path of the file
> 2. Length of the file
> 3. Compression info (for example, GZIP)
> 4. Modification time of the file
> 5. Block info for the file. Each block has an offset, a length, and 
> disk IDs
> When the event processor processes an AlterPartitionEvent, it currently 
> reloads the partition, including its file metadata. Reloading file 
> metadata is relatively expensive because it involves listing files in 
> the underlying filesystem. From the Impala shell, an alter partition is 
> triggered via ALTER TABLE <table_name> PARTITION <partition_spec> 
> <operation>, where the operation can be: 
>  # Update stats
>  # Drop stats
>  # Set file format
>  # Set row format
>  # Set table properties
>  # Unset table properties
>  # Set serde properties
>  # Unset serde properties
>  # Set cached <hdfs-pool-name>
>  # Unset cached <hdfs-pool-name>
>  # Set location
>  
> *For transactional tables:*
> If incremental refresh is enabled, the event processor reloads file 
> metadata at the CommitTxn event. Since there is no way to know whether 
> the CommitTxn event was due to alter_partition or some other event, file 
> metadata reloading cannot be skipped. 
> *For external tables:* 
> Of the operations above, any operation that affects the underlying storage 
> descriptor of a partition should trigger file metadata reloading. 
> These are operations 3, 4, 7, 8, and 11. 
>  
> *How to detect change in file descriptor of a partition:*
> The HMS partition object received in the alter_partition event contains a 
> metastore.api.StorageDescriptor object. This object has fields such as: 
>  * List<FieldSchema> cols
>  * String location 
>  * String inputFormat
>  * String outputFormat
>  * boolean compressed
>  * int numBuckets
>  * SerDeInfo serdeInfo
>  * List<String> bucketCols
>  * List<Order> sortCols
>  * Map<String, String> parameters
>  
> Fetch the HMS partition object from the alterPartition event and compare its 
> storage descriptor properties with the corresponding properties of the 
> already cached partition object. 
> {*}Unknowns{*}: 
>  # If a partition is cached in HDFS, should we always reload its file 
> metadata (irrespective of the operations mentioned above) to get the most 
> up-to-date block locations? 
>  
> cc - [~vihangk1] [~stigahuang] [~hsnusonic] 
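
The storage descriptor comparison proposed in the description could be sketched roughly as below. This is only an illustration, not the actual catalogd change: `Sd` is a hypothetical stand-in for the subset of metastore.api.StorageDescriptor fields being compared, `needsFileMetadataReload` is an invented helper name, and the choice of which fields to compare (location, input/output format, serde library and serde parameters) is an assumption based on operations 3, 4, 7, 8 and 11.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Objects;

// Sketch only: Sd is a hypothetical stand-in for a subset of the fields of
// metastore.api.StorageDescriptor. It is not the real HMS class.
final class PartitionSdCompare {

  static final class Sd {
    final String location;
    final String inputFormat;
    final String outputFormat;
    final String serdeLib;                  // assumed to be taken from serdeInfo
    final Map<String, String> serdeParams;  // assumed to be taken from serdeInfo

    Sd(String location, String inputFormat, String outputFormat,
        String serdeLib, Map<String, String> serdeParams) {
      this.location = location;
      this.inputFormat = inputFormat;
      this.outputFormat = outputFormat;
      this.serdeLib = serdeLib;
      this.serdeParams = serdeParams;
    }
  }

  // Returns true when the storage descriptor from the event differs from the
  // cached one in any field that this sketch assumes is relevant to file
  // metadata. A missing descriptor is treated conservatively as "reload".
  static boolean needsFileMetadataReload(Sd cached, Sd fromEvent) {
    if (cached == null || fromEvent == null) return true;
    boolean same = Objects.equals(cached.location, fromEvent.location)
        && Objects.equals(cached.inputFormat, fromEvent.inputFormat)
        && Objects.equals(cached.outputFormat, fromEvent.outputFormat)
        && Objects.equals(cached.serdeLib, fromEvent.serdeLib)
        && Objects.equals(cached.serdeParams, fromEvent.serdeParams);
    return !same;
  }

  public static void main(String[] args) {
    Map<String, String> params = Collections.singletonMap("field.delim", ",");
    Sd cached = new Sd("/warehouse/t/p=1", "TextInputFormat",
        "TextOutputFormat", "LazySimpleSerDe", params);
    // Same storage descriptor, e.g. after a stats-only alter partition.
    Sd statsOnly = new Sd("/warehouse/t/p=1", "TextInputFormat",
        "TextOutputFormat", "LazySimpleSerDe", params);
    // Location changed, e.g. after a set location operation.
    Sd moved = new Sd("/warehouse/t_new/p=1", "TextInputFormat",
        "TextOutputFormat", "LazySimpleSerDe", params);

    System.out.println(needsFileMetadataReload(cached, statsOnly));
    System.out.println(needsFileMetadataReload(cached, moved));
  }
}
```

Whether partitions cached in HDFS should bypass such a check and always reload is the open question listed under Unknowns, so the sketch does not attempt to cover it.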


