Thanks, Dimitris!
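For anyone reading this thread later, a minimal sketch of the two statements being compared (the database, table, and partition names below are made up for illustration):

```sql
-- After data files are added/deleted/modified, or partitions are added
-- through Hive, a per-table REFRESH is enough; it reloads file metadata
-- incrementally:
REFRESH etl_db.events;

-- The reload can also be limited to a single partition:
REFRESH etl_db.events PARTITION (dt='2017-09-11');

-- When a table was created outside Impala, or the HDFS balancer has moved
-- blocks, fall back to the heavier INVALIDATE METADATA:
INVALIDATE METADATA etl_db.events;
```

REFRESH keeps the cached table object and only re-lists files, while INVALIDATE METADATA discards the cached metadata entirely and forces a full reload on the next access, which is why it is the more expensive of the two.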

At 2017-09-12 01:15:46, "Dimitris Tsirogiannis" <[email protected]> 
wrote:
>Hi Quanlong,
>
>You're pretty much correct. REFRESH can handle the majority of external
>metadata modifications (adding/dropping files/partitions, etc) and
>INVALIDATE METADATA should be used in the two use cases you mention. I am
>sorry you had to look at the code to figure that out. I checked our
>documentation (https://www.cloudera.com/documentation/enterprise/
>latest/topics/impala_refresh.html) and I see that some parts are not as
>explicit as they should be. I filed a docs JIRA (
>https://issues.apache.org/jira/browse/IMPALA-5918).
>
>Thanks
>Dimitris
>
>On Mon, Sep 11, 2017 at 5:55 AM, Quanlong Huang <[email protected]>
>wrote:
>
>> Hi all,
>>
>>
>> I used to think that the REFRESH statement is just an incremental metadata
>> reload that can't detect file deletion or modification, so we should use
>> INVALIDATE METADATA for those cases.
>> However, one of my friends told me that they always use the REFRESH
>> statement in their ETL pipeline, whether adding new files or replacing all
>> of a table's files. They never use INVALIDATE METADATA and haven't
>> encountered any errors.
>>
>>
>> I realized my understanding was wrong and dug into the code. I found
>> comments in HdfsTable.java saying that in these two cases we should use
>> INVALIDATE METADATA instead of REFRESH:
>> - An ALTER TABLE ADD PARTITION or dynamic partition insert is executed
>> through Hive. This does not update the lastDdlTime.
>> - The HDFS rebalancer is executed. This changes the block locations but
>> doesn't update the mtime (file modification time).
>> However, in my experiments, all manual changes made in Hive or HDFS only
>> required a REFRESH statement: for example, modifying or deleting files
>> under an existing partition, or adding partitions in Hive via
>> ALTER TABLE ADD PARTITION.
>>
>>
>> In HdfsTable#refreshFileMetadata, all manual changes (add/delete/modify)
>> to data files can be detected, and the file descriptors will be updated.
>> Thus, the comments above are wrong. There are only two cases in which we
>> should use INVALIDATE METADATA:
>> - When new tables are created outside Impala
>> - When block locations are changed by the HDFS balancer (this matters for
>> increasing local reads)
>> I hope you can correct me if I'm wrong.
>>
>>
>> Thanks,
>> Quanlong