[ https://issues.apache.org/jira/browse/IMPALA-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe McDonnell closed IMPALA-6830.
---------------------------------
    Resolution: Not A Bug

This is behavior that we expect from the file handle cache, which was enabled 
in Impala 2.12. 

When Hive does an insert overwrite, it often uses a deterministic naming 
scheme for the files, so it overwrites file X with different data. Due to 
file handle caching, Impala still holds an HDFS file handle for the file 
named X. Impala does not know the file has changed, so it continues to use 
the file handle it already has open. For a while after the overwrite, reads 
through this handle continue to see the old version. An HDFS file handle can 
hold a regular UNIX file handle, which continues to point at the old OS file 
(the OS keeps that file around as long as the UNIX file handle is open). HDFS 
does notice when an HDFS file is deleted or overwritten, and it will 
invalidate the HDFS file handle. After that happens, Impala will get an error 
and then see the new version of the file.
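
The UNIX part of this can be reproduced outside of HDFS. The following is a 
minimal sketch in Python, assuming a POSIX filesystem; it only illustrates 
the OS-level behavior and is not Impala or HDFS code. The function name 
demo_stale_descriptor and the file contents are made up for illustration.

```python
import os
import tempfile


def demo_stale_descriptor():
    """Return (old_view, new_view) after a file is replaced under the same name."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "X")
        with open(path, "w") as f:
            f.write("1,abc,2\n")

        # Long-lived handle, standing in for what a handle cache would hold.
        cached = open(path)

        # Overwrite by writing a new file and renaming it over the old name,
        # so the name X now refers to a different underlying file.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            f.write("100,ddd,200\n")
        os.replace(tmp, path)

        # The old handle still reads the old file; the OS keeps it alive
        # until the last open handle is closed.
        old_view = cached.read()
        cached.close()

        # A fresh open by name resolves to the new file.
        with open(path) as f:
            new_view = f.read()
        return old_view, new_view


if __name__ == "__main__":
    print(demo_stale_descriptor())
```

The handle opened before the rename keeps returning the old row, while a 
fresh open of the same name returns the new one.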

Refreshing the table causes Impala to notice that the file has changed (and 
has a different mtime). It will not use a cached file handle with a different 
mtime, so it opens a new HDFS file handle and sees the new data. Queries that 
are already running might finish with the old handle, but new queries will 
use the new handle.
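
The effect of the mtime check can be sketched with a toy model. This is a 
simplified illustration in Python, not Impala's actual implementation: the 
class FileHandleCache, the dictionaries storage and metadata, and the 
function simulate are all hypothetical names, and the invalidation that HDFS 
eventually performs on its own is not modeled.

```python
class FileHandleCache:
    """Toy cache of open handles keyed by (path, mtime). Hypothetical names."""

    def __init__(self):
        self._handles = {}
        self.opens = 0

    def get(self, path, mtime, storage):
        key = (path, mtime)
        if key not in self._handles:
            # Cache miss: "open" the file and capture what it contains now.
            self.opens += 1
            self._handles[key] = storage[path][1]
        return self._handles[key]


def simulate():
    storage = {"X": (1, "1,abc,2")}   # path -> (mtime, contents) on disk
    metadata = {"X": 1}               # mtime the query engine believes
    cache = FileHandleCache()
    results = []

    def query():
        return cache.get("X", metadata["X"], storage)

    results.append(query())                # first query opens a handle
    storage["X"] = (2, "100,ddd,200")      # overwrite under the same name
    results.append(query())                # same (path, mtime) key: old handle
    metadata["X"] = storage["X"][0]        # refresh picks up the new mtime
    results.append(query())                # new key: new handle, new data
    return results, cache.opens


if __name__ == "__main__":
    print(simulate())
```

In this model the second query still returns the old row because its lookup 
key has not changed; only after the refresh updates the mtime does the lookup 
miss and open a new handle.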

> HdfsScanner gets stale data when Hive table is overwritten
> ----------------------------------------------------------
>
>                 Key: IMPALA-6830
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6830
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Quanlong Huang
>            Assignee: Joe McDonnell
>            Priority: Major
>
> In the minicluster:
> {code:bash}
> hive> create table tmp_parq (a int, b string, c int) stored as parquet;
> hive> insert overwrite table tmp_parq select 1, "abc", 2;
> impala> select * from tmp_parq;
> +---+-----+---+
> | a | b   | c |
> +---+-----+---+
> | 1 | abc | 2 |
> +---+-----+---+
> hive> insert overwrite table tmp_parq select 100, "ddd", 200;
> # Impala still gets old results:
> impala> select * from tmp_parq;
> +---+-----+---+
> | a | b   | c |
> +---+-----+---+
> | 1 | abc | 2 |
> +---+-----+---+
> # It can be fixed after REFRESH
> impala> refresh tmp_parq;
> impala> select * from tmp_parq;
> +-----+-----+-----+
> | a   | b   | c   |
> +-----+-----+-----+
> | 100 | ddd | 200 |
> +-----+-----+-----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
