[ 
https://issues.apache.org/jira/browse/IMPALA-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881431#comment-16881431
 ] 

ASF subversion and git services commented on IMPALA-8663:
---------------------------------------------------------

Commit fc974f944a9266e68e6f1694eecdc2160fd52582 in impala's branch 
refs/heads/master from Vihang Karajgaonkar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=fc974f9 ]

IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

The FileMetadataLoader is used to load file information when a table
is loaded. By default, it lists all the files in the table/partition
directory. Currently, it only skips files whose names are invalid
(hidden files, names starting with "_", etc.). However, it does not
skip directories which are temporary or hidden. When Hive inserts data
into a table, it creates a temporary staging directory, which is a
hidden directory under the table location. When the insert completes,
the staging directory is removed. But if a refresh is issued during
that window, FileMetadataLoader picks up the files in the staging
directory as well. Not only can this cause temporarily invalid
results, it also leaves the table in a bad state once the temporary
directories are removed. The only workaround in such a case is to
issue a refresh on the table again.

This patch adds logic to FileMetadataLoader to ignore such temporary
staging directories. Unfortunately, Hadoop does not provide an API
which can recursively list files in a directory while skipping certain
directories. This patch adds a new FilterIterator which wraps the
existing listFiles, listStatus and RecursingIterator to drop files in
hidden directories from the listing result.
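
The wrapping approach described above can be sketched roughly as
follows. This is an illustration only, not the code from the patch:
the class and method names are hypothetical, a plain java.util.Iterator
over path strings stands in for Hadoop's RemoteIterator<FileStatus>,
and the rule for what counts as an ignored directory (a name starting
with "." or "_tmp") is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical sketch of a filtering iterator; not Impala's implementation. */
final class HiddenDirFilterSketch {

  /** Assumed rule: a directory is ignored if its name starts with "." or "_tmp". */
  static boolean isIgnoredDir(String dirName) {
    return dirName.startsWith(".") || dirName.startsWith("_tmp");
  }

  /** True if any directory between baseDir and the file itself is ignored. */
  static boolean isInIgnoredDirectory(String baseDir, String filePath) {
    String prefix = baseDir.endsWith("/") ? baseDir : baseDir + "/";
    if (!filePath.startsWith(prefix)) return false;
    String[] parts = filePath.substring(prefix.length()).split("/");
    // The last component is the file name; only the directories are checked.
    for (int i = 0; i < parts.length - 1; i++) {
      if (isIgnoredDir(parts[i])) return true;
    }
    return false;
  }

  /** Wraps a listing and silently drops entries under ignored directories. */
  static final class FilterIterator implements Iterator<String> {
    private final String baseDir;
    private final Iterator<String> delegate;
    private String next; // buffered look-ahead element, null if none

    FilterIterator(String baseDir, Iterator<String> delegate) {
      this.baseDir = baseDir;
      this.delegate = delegate;
    }

    @Override
    public boolean hasNext() {
      // Advance the wrapped iterator until an acceptable entry is found.
      while (next == null && delegate.hasNext()) {
        String candidate = delegate.next();
        if (!isInIgnoredDirectory(baseDir, candidate)) next = candidate;
      }
      return next != null;
    }

    @Override
    public String next() {
      if (!hasNext()) throw new NoSuchElementException();
      String result = next;
      next = null;
      return result;
    }
  }

  /** Convenience helper: applies the filter to a whole listing. */
  static List<String> filter(String baseDir, List<String> listing) {
    List<String> kept = new ArrayList<>();
    Iterator<String> it = new FilterIterator(baseDir, listing.iterator());
    while (it.hasNext()) kept.add(it.next());
    return kept;
  }

  public static void main(String[] args) {
    List<String> listing = Arrays.asList(
        "/warehouse/t/year=2019/000000_0",
        "/warehouse/t/.hive-staging_hive_1/-ext-10000/000000_0",
        "/warehouse/t/_tmp/base_1/part-0");
    // Prints only the entries that are not under an ignored directory.
    filter("/warehouse/t", listing).forEach(System.out::println);
  }
}
```

The look-ahead buffer keeps hasNext() accurate even when the remaining
entries of the wrapped listing are all filtered out, which is the main
reason to wrap the iterator rather than filter inside the caller's
loop.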

Also, the existing code to recover partitions implements its own
recursion logic which includes path validation. This already skips such
hidden directories since they do not conform to the partition spec. The
patch makes a minor modification to this method: it calls
listStatusIterator directly instead of going through
FileSystemUtil#listStatus, which now uses the filtering remote
iterator.
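
To illustrate why partition path validation already excludes staging
directories, here is a minimal sketch. It assumes Hive-style key=value
directory names and uses a hypothetical helper; the actual validation
in the recover-partitions code may differ in its details.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of partition path validation; not Impala's code. */
final class PartitionPathSketch {

  /**
   * Returns true if relativeDir is a chain of key=value components whose
   * keys match partitionKeys in order, e.g. "year=2019/month=7".
   */
  static boolean conformsToPartitionSpec(List<String> partitionKeys, String relativeDir) {
    String[] parts = relativeDir.split("/");
    if (parts.length != partitionKeys.size()) return false;
    for (int i = 0; i < parts.length; i++) {
      int eq = parts[i].indexOf('=');
      // A component without a non-empty key before '=' cannot be a partition.
      if (eq <= 0) return false;
      String key = parts[i].substring(0, eq);
      if (!key.equalsIgnoreCase(partitionKeys.get(i))) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    List<String> keys = Arrays.asList("year", "month");
    System.out.println(conformsToPartitionSpec(keys, "year=2019/month=7"));
    System.out.println(conformsToPartitionSpec(keys, ".hive-staging_hive_1/-ext-10000"));
  }
}
```

A directory such as .hive-staging_hive_1 has no key=value form, so it
fails the check without needing any explicit hidden-directory rule.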

Testing:
1. Added new tests and modified related existing ones to cover
interesting cases.
2. Ran concurrent inserts from Hive while issuing refresh in a loop on
the Impala side. Earlier this would cause the table to go into a bad
state. Now it works fine for the staging directories. It still runs
into a FileNotFoundException from the impalad when there are insert
overwrite statements in Hive.

Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Reviewed-on: http://gerrit.cloudera.org:8080/13665
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Reviewed-by: Vihang Karajgaonkar <vih...@cloudera.com>


> FileMetadataLoader should skip listing files in hidden and tmp directories
> --------------------------------------------------------------------------
>
>                 Key: IMPALA-8663
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8663
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Vihang Karajgaonkar
>            Assignee: Vihang Karajgaonkar
>            Priority: Critical
>              Labels: catalog-v2, impala-acid
>
> Currently, the file metadata loader recursively lists the table and partition
> directories to get the FileStatus objects. For each FileStatus we ignore
> hidden files in {{FileSystemUtil.isValidDataFile()}}. However, that is not
> sufficient. For instance, if Hive is inserting data into a table when the
> refresh is called, the staging directory may be present within the
> table directory. This staging directory is a hidden directory named
> {{.hive-staging_*}}. It may contain files which are not themselves hidden
> (hidden meaning a name starting with a . or _). Such files should be
> considered temporary and should not be treated as valid data files.
>  
> Another instance where we see this happen is in transactional tables, which
> have a {{.manifest}} file located in a {{_tmp}} directory within the table
> directory. This file should also be skipped and not considered a valid
> data file.



