[
https://issues.apache.org/jira/browse/IMPALA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312199#comment-17312199
]
ASF subversion and git services commented on IMPALA-10579:
----------------------------------------------------------
Commit 1d839e423e51b05314e3dbfd790cb1fa7fc82d98 in impala's branch
refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1d839e4 ]
IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
HDFS FileSystem provides a listStatusIterator() API for listing remote
storage using a RemoteIterator. We use it to list files when loading
table file metadata.
It's not guaranteed that a RemoteIterator remains usable after its hasNext()
or next() throws an IOException. We should stop the loop in this case.
Otherwise, we may go into an infinite loop.
Without HADOOP-16685, it's also not guaranteed that
FileSystem.listStatusIterator() will throw a FileNotFoundException when
the path doesn't exist.
This patch refactors the file listing iterators so we don't need to
depend on these two assumptions. The basic idea is:
- On one side, we should not depend on another RemoteIterator's behavior
after an exception.
- On the other side, we try to make our own iterators more robust against
transient sub-directories, so table loading won't fail because of them.
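
The second point above can be illustrated with a minimal sketch. It is not the code from the patch: DirLister, RecursiveListingSketch and the trailing-slash convention for directories are hypothetical stand-ins, and the choice to let a missing root directory propagate as an error is an assumption made for illustration.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a file system listing call; not a Hadoop API. */
interface DirLister {
  /** Returns child paths of dir; by convention here, directory paths end with "/". */
  List<String> list(String dir) throws IOException;
}

final class RecursiveListingSketch {
  /** Collects all files under root, skipping subdirectories that vanish mid-listing. */
  static List<String> listFiles(DirLister fs, String root) throws IOException {
    List<String> files = new ArrayList<>();
    // Assumption: a missing root is a real error, so FileNotFoundException propagates.
    collect(fs, fs.list(root), files);
    return files;
  }

  private static void collect(DirLister fs, List<String> children, List<String> out)
      throws IOException {
    for (String child : children) {
      if (!child.endsWith("/")) {
        out.add(child);
        continue;
      }
      List<String> grandChildren;
      try {
        grandChildren = fs.list(child);
      } catch (FileNotFoundException e) {
        // Transient subdirectory (for example one removed by a concurrent insert):
        // skip it instead of failing the whole load. Other IOExceptions propagate.
        continue;
      }
      collect(fs, grandChildren, out);
    }
  }
}
```

Each directory is listed exactly once and a failed listing is never retried, so the loop cannot spin on a broken listing.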
Tests:
- Loop test_insert_stress.py 100 times. Verified that the non-existing
subdirs are skipped and inserts are stable under high concurrency.
Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Reviewed-on: http://gerrit.cloudera.org:8080/17171
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Deadloop in table metadata loading when using an invalid RemoteIterator
> -----------------------------------------------------------------------
>
> Key: IMPALA-10579
> URL: https://issues.apache.org/jira/browse/IMPALA-10579
> Project: IMPALA
> Issue Type: Bug
> Components: Catalog
> Affects Versions: Impala 3.4.0
> Reporter: Quanlong Huang
> Assignee: Quanlong Huang
> Priority: Critical
>
> The file listing thread in catalogd will go into a dead loop if it gets a
> RemoteIterator on a non-existing path. The first call of
> RemoteIterator.hasNext() will throw a FileNotFoundException. However, this
> exception will be caught and the loop will continue, which results in a dead
> loop. Related code:
> [https://github.com/apache/impala/blob/d89c04bf806682d3449c566ce979632bd2ac5b29/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java#L789-L814]
> {code:java}
> static class FilterIterator implements RemoteIterator<FileStatus> {
> ...
> public boolean hasNext() throws IOException {
> ...
> while (curFile_ == null) {
> FileStatus next;
> try {
> if (!baseIterator_.hasNext()) return false; // <---- throws FileNotFoundException
> ...
> next = baseIterator_.next();
> } catch (FileNotFoundException ex) {
> ...
> LOG.warn(ex.getMessage());
> continue; // <--------- catch the exception and continue into a dead loop
> }
> if (!isInIgnoredDirectory(startPath_, next)) {
> curFile_ = next;
> return true;
> }
> }
> return true;
> }
> {code}
> *When does the path being loaded not exist?*
> It happens when the metadata (table/partition location) in HMS still has the
> path, but it has actually been removed from the storage.
> *When will Impala get such an invalid RemoteIterator?*
> For FileSystem implementations that don't override the
> FileSystem#listStatusIterator() interface, e.g. S3AFileSystem before
> HADOOP-17281, AzureBlobFileSystem, and GoogleHadoopFileSystem.
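
The dead loop in the quoted snippet comes from retrying a base iterator that keeps throwing. Below is a minimal sketch of a filtering iterator that treats the first FileNotFoundException as the end of the listing. SimpleRemoteIterator and StoppingFilterIterator are hypothetical names used so the example is self-contained; this is not the code from FileSystemUtil, and whether the real fix returns false or rethrows is not stated in this message. The sketch also assumes the base iterator never yields null.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

/** Hypothetical stand-in for org.apache.hadoop.fs.RemoteIterator. */
interface SimpleRemoteIterator<T> {
  boolean hasNext() throws IOException;
  T next() throws IOException;
}

/** Sketch: a filtering iterator that stops instead of retrying a failed base iterator. */
final class StoppingFilterIterator<T> implements SimpleRemoteIterator<T> {
  private final SimpleRemoteIterator<T> base;
  private final Predicate<T> keep;
  private T current = null;          // next element to hand out, if any
  private boolean exhausted = false; // set once the base ends or fails

  StoppingFilterIterator(SimpleRemoteIterator<T> base, Predicate<T> keep) {
    this.base = base;
    this.keep = keep;
  }

  @Override
  public boolean hasNext() throws IOException {
    while (current == null && !exhausted) {
      T next;
      try {
        if (!base.hasNext()) {
          exhausted = true;
          break;
        }
        next = base.next();
      } catch (FileNotFoundException e) {
        // No 'continue' here: the base iterator may throw on every call.
        // Treat the failure as the end of the listing and never touch it again.
        exhausted = true;
        break;
      }
      if (keep.test(next)) current = next;
    }
    return current != null;
  }

  @Override
  public T next() throws IOException {
    if (!hasNext()) throw new NoSuchElementException();
    T result = current;
    current = null;
    return result;
  }
}
```

With a base iterator whose hasNext() always throws FileNotFoundException, hasNext() on the wrapper returns false after a single call to the base and does not call it again.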