[
https://issues.apache.org/jira/browse/SPARK-44976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
YUBI LEE updated SPARK-44976:
-----------------------------
Description:
SPARK-6558 changed {{Utils.getCurrentUserName()}} to return the short name
instead of the full principal name.
As a result, the {{hadoop.security.auth_to_local}} rules on a non-kerberized
HDFS namenode are not respected.
For example, I use two HDFS clusters: one is kerberized and the other is not.
On the non-kerberized cluster, I defined a rule that adds a prefix to the
username when someone accesses it from the kerberized cluster.
{code}
<property>
<name>hadoop.security.auth_to_local</name>
<value xml:space="preserve">
RULE:[1:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/
RULE:[2:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/
DEFAULT</value>
</property>
{code}
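To make the effect of these rules concrete, here is a minimal Python sketch of how such a mapping behaves. It is a simplified model written for illustration, not Hadoop's actual implementation: the function name {{to_local}} and the {{RULES}} table are hypothetical, and the {{DEFAULT}} rule is reduced to "keep the first component". It shows why a full principal such as {{eub@EXAMPLE.COM}} becomes {{_ex_eub}}, while a name that has already been shortened to {{eub}} has no realm left for the rules to match.

```python
import re

# Simplified model of the two RULE entries above (hypothetical table layout):
# (component count, format string, match regex, sed pattern, sed replacement)
RULES = [
    (1, "$1@$0", r".*@EXAMPLE.COM", r"(.+)@.*", r"_ex_\1"),
    (2, "$1@$0", r".*@EXAMPLE.COM", r"(.+)@.*", r"_ex_\1"),
]


def to_local(name):
    """Map a Kerberos principal or a short name to a local user name."""
    if "@" not in name:
        # Already a simple name: there is no realm, so no rule can match
        # and the name passes through unchanged.
        return name
    primary, realm = name.rsplit("@", 1)
    components = primary.split("/")
    for count, fmt, match, pattern, repl in RULES:
        if len(components) != count:
            continue
        # $0 is the realm, $1..$n are the principal components.
        base = fmt.replace("$0", realm)
        for i, comp in enumerate(components, start=1):
            base = base.replace(f"${i}", comp)
        if re.fullmatch(match, base):
            return re.sub(pattern, repl, base, count=1)
    # Stand-in for DEFAULT (assumption): keep only the first component.
    return components[0]


if __name__ == "__main__":
    for user in ("eub@EXAMPLE.COM", "eub/host.example.com@EXAMPLE.COM", "eub"):
        print(user, "->", to_local(user))
```

Under this model, which name the client sends decides the owner: the full principal gets the {{_ex_}} prefix, and the short name does not.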
However, if I submit a Spark job with the keytab and principal options, the
ownership of the HDFS directory and files is inconsistent.
(I changed some words for privacy.)
{code}
$ hdfs dfs -ls hdfs:///user/eub/some/path/20230510/23
Found 52 items
-rw-rw-rw- 3 _ex_eub hdfs 0 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/_SUCCESS
-rw-r--r-- 3 eub hdfs 134418857 2023-05-11 00:15 hdfs:///user/eub/some/path/20230510/23/part-00000-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
-rw-r--r-- 3 eub hdfs 153410049 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-00001-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
-rw-r--r-- 3 eub hdfs 157260989 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-00002-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
-rw-r--r-- 3 eub hdfs 156222760 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-00003-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
{code}
Another interesting point is that if I submit the Spark job without the keytab
and principal options, authenticating with {{kinit}} instead, the
{{hadoop.security.auth_to_local}} rule is not applied at all.
{code}
$ hdfs dfs -ls hdfs:///user/eub/output/
Found 3 items
-rw-rw-r--+ 3 eub hdfs 0 2023-08-25 12:31 hdfs:///user/eub/output/_SUCCESS
-rw-rw-r--+ 3 eub hdfs 512 2023-08-25 12:31 hdfs:///user/eub/output/part-00000.gz
-rw-rw-r--+ 3 eub hdfs 574 2023-08-25 12:31 hdfs:///user/eub/output/part-00001.gz
{code}
I finally found that if I submit the Spark job with the {{--principal}} and
{{--keytab}} options, the UGI is different.
(refer to
https://github.com/apache/spark/blob/2583bd2c16a335747895c0843f438d0966f47ecd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L905).
Only the {{_SUCCESS}} file and the output directory, which are created by the
driver (application master side), respect {{hadoop.security.auth_to_local}} on
the non-kerberized namenode, and only if the {{--principal}} and {{--keytab}}
options are provided.
Regardless of whether HDFS files or directories are created by an executor or
by the driver, they should respect the {{hadoop.security.auth_to_local}} rule
and have the same owner.
This issue is related to https://issues.apache.org/jira/browse/SPARK-6558.
> Utils.getCurrentUserName should return the full principal name
> --------------------------------------------------------------
>
> Key: SPARK-44976
> URL: https://issues.apache.org/jira/browse/SPARK-44976
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.3, 3.3.3, 3.4.1
> Reporter: YUBI LEE
> Priority: Major