Thanks, Dimitris!
At 2017-09-12 01:15:46, "Dimitris Tsirogiannis" <[email protected]> wrote:
>Hi Quanlong,
>
>You're pretty much correct. REFRESH can handle the majority of external
>metadata modifications (adding/dropping files/partitions, etc.) and
>INVALIDATE METADATA should be used in the two use cases you mention. I am
>sorry you had to look at the code to figure that out. I checked our
>documentation (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_refresh.html)
>and I see that some parts are not as explicit as they should be. I filed a
>docs JIRA (https://issues.apache.org/jira/browse/IMPALA-5918).
>
>Thanks
>Dimitris
>
>On Mon, Sep 11, 2017 at 5:55 AM, Quanlong Huang <[email protected]>
>wrote:
>
>> Hi all,
>>
>> I used to think that the REFRESH statement is just an incremental
>> metadata reload and can't detect file deletion or modification, so we
>> should use INVALIDATE METADATA for those cases.
>> However, one of my friends told me that they always use the REFRESH
>> statement in their ETL pipeline, whether adding new files or replacing
>> all of a table's files. They never use INVALIDATE METADATA and haven't
>> encountered any errors.
>>
>> I realized my understanding was wrong and dug into the code. I found
>> comments in HdfsTable.java saying that in these two cases we should use
>> INVALIDATE METADATA instead of REFRESH:
>> 1. An ALTER TABLE ADD PARTITION or dynamic partition insert is executed
>> through Hive. This does not update the lastDdlTime.
>> 2. The HDFS rebalancer is executed. This changes the block locations but
>> doesn't update the mtime (file modification time).
>> However, in my experiments, for all manual changes made in Hive or HDFS,
>> we just need to issue a REFRESH statement: for example, modifying or
>> deleting files under an existing partition, or adding partitions in Hive
>> via ALTER TABLE ADD PARTITION.
>>
>> In HdfsTable#refreshFileMetadata, all manual changes (add/delete/modify)
>> to data files can be detected, and the file descriptors will be updated.
>> Thus, the previous comments are wrong. There are only two cases where we
>> should use INVALIDATE METADATA:
>> 1. When new tables are created outside Impala
>> 2. When block locations are changed by the HDFS balancer (this matters
>> for increasing local reads)
>> Hope you can correct me if I'm wrong.
>>
>> Thanks,
>> Quanlong
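For readers following along: the mtime-based reconciliation that the thread attributes to HdfsTable#refreshFileMetadata can be sketched roughly as below. This is a hypothetical illustration, not Impala's actual Java code; the function and variable names are invented. It also shows why the HDFS balancer case escapes REFRESH: the balancer moves blocks without changing a file's path or mtime, so nothing in this comparison ever flags the file for reload.

```python
def refresh_file_metadata(cached, listed):
    """Sketch of per-partition REFRESH reconciliation (illustrative only).

    cached: dict mapping file path -> mtime currently held in the catalog.
    listed: dict mapping file path -> mtime from a fresh directory listing.
    Returns (new_cache, reloaded), where reloaded is the set of paths whose
    file descriptors (block locations, sizes) would be reloaded.
    """
    reloaded = set()
    new_cache = {}
    for path, mtime in listed.items():
        old_mtime = cached.get(path)
        if old_mtime is None or old_mtime != mtime:
            # New file, or an existing file rewritten in place: reload it.
            reloaded.add(path)
        new_cache[path] = mtime
    # Deleted files are simply absent from `listed`, so they drop out of
    # the cache without any explicit handling.
    return new_cache, reloaded

# Example: b.parq was overwritten, d.parq was added, c.parq was deleted.
cached = {"/t/p1/a.parq": 100, "/t/p1/b.parq": 100, "/t/p1/c.parq": 100}
listed = {"/t/p1/a.parq": 100, "/t/p1/b.parq": 200, "/t/p1/d.parq": 150}
cache, reloaded = refresh_file_metadata(cached, listed)
```

Under this sketch, `reloaded` contains only `b.parq` and `d.parq`, and `c.parq` falls out of the cache, which matches the observation in the thread that adds, deletes, and in-place modifications are all picked up by REFRESH alone.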
