[ https://issues.apache.org/jira/browse/IMPALA-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575378#comment-17575378 ]
ASF subversion and git services commented on IMPALA-11469:
----------------------------------------------------------
Commit abcb62b676539b85c7c428ed385177f591de3492 in impala's branch
refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=abcb62b67 ]
IMPALA-11469: Make prefix of ignored staging dirs configurable
External systems like Hive or Spark write temporary or "non-data"
files in the table location. Catalogd skips them when loading file
metadata. However, the prefixes are currently hard-coded. We recently found
that Spark Streaming generates a _spark_metadata dir, which is not
handled correctly.
To avoid future code changes when interacting with more systems, this patch
adds a new startup flag, ignored_dir_prefix_list, for catalogd. It's a
comma-separated list of prefixes of ignored dirs. Currently, the
default value is ".,_tmp.,_spark_metadata". Users can add more in the
future.
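The prefix check described above can be sketched as follows. This is a
minimal illustration in Python, not the Java code from the patch: the
function names parse_prefix_list and is_ignored_dir are hypothetical, and
the matching rule (a dir is ignored when its name starts with any listed
prefix) is an assumption drawn from the description of the flag.

```python
# Default taken from the commit message; everything else here is illustrative.
DEFAULT_IGNORED_DIR_PREFIX_LIST = ".,_tmp.,_spark_metadata"


def parse_prefix_list(flag_value):
    """Split the comma-separated flag value into non-empty prefixes."""
    return [p.strip() for p in flag_value.split(",") if p.strip()]


def is_ignored_dir(dir_name, prefixes):
    """Return True if dir_name starts with any of the configured prefixes.

    Assumption: matching is a plain string-prefix test on the dir name.
    """
    return any(dir_name.startswith(p) for p in prefixes)


if __name__ == "__main__":
    prefixes = parse_prefix_list(DEFAULT_IGNORED_DIR_PREFIX_LIST)
    for name in ["_spark_metadata", ".hive-staging_1", "_tmp.abc", "year=2022"]:
        print(name, is_ignored_dir(name, prefixes))
```

With a list-valued flag, supporting another external system would only
need an extra entry in the flag value rather than a code change.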
Tests:
- Add a case for _spark_metadata in FileSystemUtilTest
Change-Id: I108bfa823281a35d28932f7ccce0b12a0c5af57d
Reviewed-on: http://gerrit.cloudera.org:8080/18811
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Ignore _spark_metadata folder in table location
> -----------------------------------------------
>
> Key: IMPALA-11469
> URL: https://issues.apache.org/jira/browse/IMPALA-11469
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: Matthias Wies
> Assignee: Quanlong Huang
> Priority: Major
>
> When Spark Streaming is used to write Parquet files out to an external table,
> a folder _spark_metadata is created within the directory of the table. Hive
> is capable of dealing with this directory, but Impala trips on it.
> So REFRESH TABLE won't work as it sees a directory with data Impala cannot
> cope with. A SELECT will also not work as it trips on the _spark_metadata
> folder.
> The issue was found in CDP 7.1.7 SP1, but I suspect it is in all versions.
> Regards Matthias
--
This message was sent by Atlassian Jira
(v8.20.10#820010)