[
https://issues.apache.org/jira/browse/SPARK-23641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrei Badea updated SPARK-23641:
---------------------------------
Description:
We have an application deployed in yarn-cluster mode.
At some point, the application invokes
{noformat}
spark.sql("LOAD DATA INPATH some/relative/path ...")
{noformat}
in an attempt to add that directory to a Hive table. The relative path should
be interpreted relative to the home directory of the user who ran the Spark
application (this is what the Hive shell does).
The command runs without failing, but the directory is not added to the table.
Investigation showed that
{{org.apache.spark.sql.execution.command.LoadDataCommand}} attempts to make the
path absolute by prepending {{s"/user/${System.getProperty("user.name")}"}}.
Since the application was deployed in yarn-cluster mode, the value of the
{{user.name}} property is "yarn". This is illustrated by the following message
in the driver logs:
{noformat}
INFO metadata.Hive: No sources specified to move: hdfs://namenode:8020/user/yarn/some/relative/path
{noformat}
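To make the mismatch concrete, here is a minimal sketch of the prefixing behavior described above. It is not the actual {{LoadDataCommand}} code: the object and method names are hypothetical, and "alice" stands in for whichever user submitted the application. Only the {{/user/<user.name>}} prefix and the "yarn" value come from this report.

```scala
// Sketch only: mimics the reported "/user/<user.name>" prefixing,
// not the real LoadDataCommand implementation.
object UserNamePrefixSketch {
  // Relative paths get the prefix; absolute paths are left alone.
  def makeAbsolute(path: String, userName: String): String =
    if (path.startsWith("/")) path else s"/user/$userName/$path"

  def main(args: Array[String]): Unit = {
    val relative = "some/relative/path"

    // What the driver computes in yarn-cluster mode, where user.name is "yarn".
    println(makeAbsolute(relative, "yarn"))  // /user/yarn/some/relative/path

    // What the submitting user expects ("alice" is a hypothetical user).
    println(makeAbsolute(relative, "alice")) // /user/alice/some/relative/path

    // Whatever the JVM running this sketch reports as user.name.
    println(makeAbsolute(relative, System.getProperty("user.name")))
  }
}
```

The two results point at different HDFS directories, which would be consistent with Hive finding no sources to move.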
Interestingly, the same Spark application writes the data to the relative path
(prior to calling LOAD DATA), and in that case the path is made absolute as
expected. The write uses {{Path.makeQualified()}}, which resolves the path
against {{FileSystem.getWorkingDirectory}}, which by default is
{{FileSystem.getHomeDirectory}} (and that is apparently initialized early
enough, on the machine from which the application is submitted).
> Wrong username when making relative path to Hive LOAD DATA absolute
> -------------------------------------------------------------------
>
> Key: SPARK-23641
> URL: https://issues.apache.org/jira/browse/SPARK-23641
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Andrei Badea
> Priority: Major