[ https://issues.apache.org/jira/browse/TEZ-4415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564569#comment-17564569 ]
László Bodor commented on TEZ-4415:
-----------------------------------
Thanks for reporting this, [~preaudc].
I'm not really familiar with Hadoop archives. Are you planning to work on this one?
As far as I can see, the missing files are created in the reducer's configure():
https://github.com/apache/hadoop/blob/9b1d3579b483069d0a211cb0b29c1f25013684dd/hadoop-tools/hadoop-archives/src/main/java/org/apache/hadoop/tools/HadoopArchives.java#L747
I believe this should be called even when this MapReduce job is submitted to Tez via [YARNRunner|https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/client/YARNRunner.java] (I guess YARNRunner is used here as a ClientProtocol when you define mapreduce.framework.name=yarn-tez).
> Hadoop archives created with Tez miss index files
> -------------------------------------------------
>
> Key: TEZ-4415
> URL: https://issues.apache.org/jira/browse/TEZ-4415
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.9.2
> Reporter: Christophe Préaud
> Priority: Minor
>
> When a hadoop archive is created with Tez, the _index and _masterindex files
> are not created:
> {code:java}
> # create hadoop archive with Tez
> hadoop archive -D mapreduce.framework.name=yarn-tez -archiveName data.har -p /user/preaudc/data /user/preaudc
> (...)
> 22/05/23 13:04:39 INFO client.TezClient: Tez Client Version: [ component=tez-api, version=0.9.2, revision=10cb3519bd34389210e6511a2ba291b52dcda081, SCM-URL=scm:git:https://gitbox.apache.org/repos/asf/tez.git, buildTime=2019-03-19T20:44:07Z ]
> (...)
> # _index and _masterindex files are not created
> hdfs dfs -ls /user/preaudc/data.har
> Found 2 items
> -rw-r--r-- 3 preaudc preaudc 0 2022-05-23 13:06 /user/preaudc/data.har/_SUCCESS
> -rw-r--r-- 3 preaudc preaudc 2537147461 2022-05-23 13:06 /user/preaudc/data.har/part-0
> # the hadoop archive is thus unreadable
> hdfs dfs -ls har:/user/preaudc/data.har
> ls: Invalid path for the Har Filesystem. No index file in har:/user/preaudc/data.har
> {code}
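
For anyone trying to confirm whether an archive is affected, here is a minimal sketch of a check based on the listing in the report. The function name har_missing_index_files is hypothetical (it is not part of Hadoop or Tez), and it assumes that each entry printed by hdfs dfs -ls sits on a single line ending with the file path.

```shell
#!/bin/sh
# Hypothetical helper, not part of Hadoop or Tez.
# Reads the output of `hdfs dfs -ls <archive>.har` on stdin and prints the
# name of each index file that does not appear in the listing.
har_missing_index_files() {
  listing=$(cat)
  for name in _index _masterindex; do
    # Match the name as the last path component of a listed entry.
    if ! printf '%s\n' "$listing" | grep -q "/${name}\$"; then
      printf '%s\n' "$name"
    fi
  done
}
```

It could be used as hdfs dfs -ls /user/preaudc/data.har | har_missing_index_files; empty output would mean both index files are listed.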