[
https://issues.apache.org/jira/browse/IMPALA-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17574469#comment-17574469
]
Quanlong Huang commented on IMPALA-11469:
-----------------------------------------
It seems the name of the Spark metadata folder is hard-coded:
https://stackoverflow.com/questions/50847512/how-to-change-the-location-of-spark-metadata-directory
Currently, we only ignore dirs whose names start with "." or "_tmp.":
https://github.com/apache/impala/blob/0b9bead70084f2ed1a55ca38ceb7b3ebe30eebce/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java#L846-L863
It makes sense to make the ignored dirs configurable.
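A minimal sketch of what a configurable check could look like, assuming the ignored prefixes arrive as a comma-separated string. The class name IgnoredDirFilter, its methods, and the configuration format are hypothetical and are not taken from Impala's FileSystemUtil; only the "." and "_tmp." defaults come from the current behaviour described above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch only; this is not Impala's actual FileSystemUtil code. */
final class IgnoredDirFilter {
  private final List<String> prefixes;

  /** Parses a comma-separated prefix list, e.g. ".,_tmp.,_spark_metadata". */
  IgnoredDirFilter(String commaSeparatedPrefixes) {
    this.prefixes = Arrays.stream(commaSeparatedPrefixes.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());
  }

  /** Returns true if the directory name starts with any configured prefix. */
  boolean isIgnoredDir(String dirName) {
    for (String prefix : prefixes) {
      if (dirName.startsWith(prefix)) return true;
    }
    return false;
  }

  /**
   * Returns true if any parent directory in a path relative to the table
   * location is ignored. The last component is treated as the file name
   * and is not checked.
   */
  boolean isInIgnoredDirectory(String relativePath) {
    String[] parts = relativePath.split("/");
    for (int i = 0; i < parts.length - 1; i++) {
      if (isIgnoredDir(parts[i])) return true;
    }
    return false;
  }
}
```

With the prefix list ".,_tmp.,_spark_metadata", a file such as _spark_metadata/0 would be skipped while year=2022/part-0.parquet would still be loaded.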
> Ignore _spark_metadata folder in table location
> -----------------------------------------------
>
> Key: IMPALA-11469
> URL: https://issues.apache.org/jira/browse/IMPALA-11469
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: Matthias Wies
> Assignee: Quanlong Huang
> Priority: Major
>
> When spark streaming is used to write parquet files out to an external table
> a folder _spark_metadata is created within the directory of the table. Hive
> is capable of dealing with this directory, but Impala trips on it.
> So REFRESH TABLE won't work as it sees a directory with data Impala cannot
> cope with. A SELECT will also not work as it trips on the _spark_metadata
> folder.
> The issue was found in CDP 7.1.7 SP1, but I suspect it is in all versions.
> Regards Matthias