[
https://issues.apache.org/jira/browse/HIVE-27723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772076#comment-17772076
]
László Bodor commented on HIVE-27723:
-------------------------------------
merged to master, thanks [~ayushsaxena] and [~dkuzmenko] for all the comments
on this jira!
> Prevent localizing the same original file more than once if symlinks are
> present
> --------------------------------------------------------------------------------
>
> Key: HIVE-27723
> URL: https://issues.apache.org/jira/browse/HIVE-27723
> Project: Hive
> Issue Type: Improvement
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We already calculate SHA hashes for the files to be localized. In some
> setups, the hive-exec jar is symlinked, so the same file gets localized
> more than once:
> {code}
> [root@lbodor-hiveontez-4 ~]# sudo -u hive hdfs dfs -ls -R /tmp/hive/hive/_tez_session_dir
> drwx------ - hive supergroup 0 2023-09-20 12:13 /tmp/hive/hive/_tez_session_dir/0febf6f5-bacc-4055-b22b-e621c59cd1d6
> drwx------ - hive supergroup 0 2023-09-20 12:19 /tmp/hive/hive/_tez_session_dir/0febf6f5-bacc-4055-b22b-e621c59cd1d6/.tez
> drwx------ - hive supergroup 0 2023-09-20 11:58 /tmp/hive/hive/_tez_session_dir/0febf6f5-bacc-4055-b22b-e621c59cd1d6-resources
> -rw-r--r-- 3 hive supergroup 78366781 2023-09-20 11:58 /tmp/hive/hive/_tez_session_dir/0febf6f5-bacc-4055-b22b-e621c59cd1d6-resources/hive-exec-3.1.3000.7.2.18.0-334.jar
> -rw-r--r-- 3 hive supergroup 78366781 2023-09-20 11:58 /tmp/hive/hive/_tez_session_dir/0febf6f5-bacc-4055-b22b-e621c59cd1d6-resources/hive-exec.jar
> drwx------ - hive supergroup 0 2023-09-20 11:58 /tmp/hive/hive/_tez_session_dir/21686e3c-2a00-457b-b84f-1a8db37699d1
> drwx------ - hive supergroup 0 2023-09-20 12:04 /tmp/hive/hive/_tez_session_dir/21686e3c-2a00-457b-b84f-1a8db37699d1/.tez
> drwx------ - hive supergroup 0 2023-09-20 11:58 /tmp/hive/hive/_tez_session_dir/21686e3c-2a00-457b-b84f-1a8db37699d1-resources
> -rw-r--r-- 3 hive supergroup 78366781 2023-09-20 11:58 /tmp/hive/hive/_tez_session_dir/21686e3c-2a00-457b-b84f-1a8db37699d1-resources/hive-exec-3.1.3000.7.2.18.0-334.jar
> -rw-r--r-- 3 hive supergroup 78366781 2023-09-20 11:58 /tmp/hive/hive/_tez_session_dir/21686e3c-2a00-457b-b84f-1a8db37699d1-resources/hive-exec.jar
> drwx------ - hive supergroup 0 2023-09-20 11:58 /tmp/hive/hive/_tez_session_dir/40c7fb13-cfa1-4377-8d40-7e19503fbdad
> drwx------ - hive supergroup 0 2023-09-20 13:13 /tmp/hive/hive/_tez_session_dir/40c7fb13-cfa1-4377-8d40-7e19503fbdad/.tez
> drwx------ - hive supergroup 0 2023-09-20 11:58 /tmp/hive/hive/_tez_session_dir/40c7fb13-cfa1-4377-8d40-7e19503fbdad-resources
> -rw-r--r-- 3 hive supergroup 78366781 2023-09-20 11:58 /tmp/hive/hive/_tez_session_dir/40c7fb13-cfa1-4377-8d40-7e19503fbdad-resources/hive-exec-3.1.3000.7.2.18.0-334.jar
> -rw-r--r-- 3 hive supergroup 78366781 2023-09-20 11:58 /tmp/hive/hive/_tez_session_dir/40c7fb13-cfa1-4377-8d40-7e19503fbdad-resources/hive-exec.jar
> drwx------ - hive supergroup 0 2023-09-20 11:58 /tmp/hive/hive/_tez_session_dir/5c48d6ab-ed8c-49c9-afe0-465de82c9c57
> drwx------ - hive supergroup 0 2023-09-20 12:04 /tmp/hive/hive/_tez_session_dir/5c48d6ab-ed8c-49c9-afe0-465de82c9c57/.tez
> drwx------ - hive supergroup 0 2023-09-20 11:58 /tmp/hive/hive/_tez_session_dir/5c48d6ab-ed8c-49c9-afe0-465de82c9c57-resources
> -rw-r--r-- 3 hive supergroup 78366781 2023-09-20 11:58 /tmp/hive/hive/_tez_session_dir/5c48d6ab-ed8c-49c9-afe0-465de82c9c57-resources/hive-exec-3.1.3000.7.2.18.0-334.jar
> -rw-r--r-- 3 hive supergroup 78366781 2023-09-20 11:58 /tmp/hive/hive/_tez_session_dir/5c48d6ab-ed8c-49c9-afe0-465de82c9c57-resources/hive-exec.jar
> {code}
> With a huge number of sessions, we cannot afford the overhead of copying
> these files to HDFS and localizing them to all containers twice.
> The root cause can be solved by removing the symlinks to the same hive-exec
> jar, -however, as we're already calculating SHA for the files, it's so easy
> to take care of the duplications in the localization codepath, and this
> takes care of any accidental duplications- so if all symlinks point to the
> same jar, resolving those before passing the Path objects to the
> localization codepath would simply solve this issue.
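
The idea described above, resolving symlinks before the paths reach the localization codepath, can be illustrated with a minimal sketch. This is not the merged Hive patch: it uses only `java.nio.file` on local paths, and the class name `SymlinkDedup` and the helper `resolveAndDedup` are hypothetical names chosen for illustration. It assumes a filesystem and user that allow creating symbolic links.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SymlinkDedup {

    /**
     * Hypothetical helper: resolves every candidate to its real target
     * (following symlinks) and drops duplicates, keeping first-seen order.
     */
    static List<Path> resolveAndDedup(List<Path> candidates) throws IOException {
        Set<Path> resolved = new LinkedHashSet<>();
        for (Path candidate : candidates) {
            // toRealPath() follows symbolic links by default.
            resolved.add(candidate.toRealPath());
        }
        return new ArrayList<>(resolved);
    }

    public static void main(String[] args) throws IOException {
        // Demo layout: one versioned jar plus an unversioned symlink to it.
        Path dir = Files.createTempDirectory("dedup-demo");
        Path jar = Files.write(dir.resolve("hive-exec-1.0.jar"), new byte[] {1, 2, 3});
        Path link = Files.createSymbolicLink(dir.resolve("hive-exec.jar"), jar);

        List<Path> toLocalize = resolveAndDedup(List.of(jar, link));
        System.out.println("files to localize: " + toLocalize);
    }
}
```

With both the versioned jar and its symlink passed in, the returned list holds a single entry, so only one copy would be handed on for upload and localization.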