Vihang Karajgaonkar has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/13665


Change subject: IMPALA-8663 : [WIP] FileMetadataLoader should skip listing 
files in hidden and tmp directories
......................................................................

IMPALA-8663 : [WIP] FileMetadataLoader should skip listing files in hidden and 
tmp directories

The FileMetadataLoader is used to load the file information in when the
table is loaded. By default, it lists all the files in the
table/partition directory. Currently, it only skips the filenames which
are invalid (hidden files and ones starting with "_" etc). However, it
does not skip the directories which are temporary or hidden. In case of
Hive when data is inserted into a table, it creates a temporary staging
directory which is a hidden directory under the table location. When the
insert in hive is completed, such staging directories are removed. But
if there is a refresh called during that time, FileMetadataLoader will
add the files in the staging directory as well. Not only this could
cause temporary invalid results but it causes table to go in a bad state
when these temporary directories are removed. The only work-around in
such a case to issue a refresh on the table again.

This patch adds logic in the filemetadataloader to ignore such temporary
staging directories. Unfortunately, hadoop does not provide a API which
can recursively list files in a directory and skip certain directories.
This patch addes this logic of filtering into existing RecursingIterator
in FileSystemUtil.

Testing:
1. Added a new test in FilemetadataloaderTest and modified existing ones
in AcidUtils
2. Ran concurrent inserts from Hive while issuing refresh in a loop on
Impala side. Earlier this would cause the table to go into a bad state.
Now, it works fine for the staging directories. However, still need to
figure out if we should skip listing _copy_ files which Hive creates
when the insert is in progress

Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
---
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
3 files changed, 103 insertions(+), 17 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/13665/2
--
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 2
Gerrit-Owner: Vihang Karajgaonkar <[email protected]>
Gerrit-Reviewer: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>

Reply via email to