[
https://issues.apache.org/jira/browse/BEAM-11329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anonymous updated BEAM-11329:
-----------------------------
Status: Triage Needed (was: Resolved)
> HDFS not deduplicating identical configuration paths.
> -----------------------------------------------------
>
> Key: BEAM-11329
> URL: https://issues.apache.org/jira/browse/BEAM-11329
> Project: Beam
> Issue Type: Bug
> Components: io-java-hadoop-file-system
> Reporter: Kyle Weaver
> Assignee: Yuhong Cheng
> Priority: P3
> Fix For: 2.28.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Originally reported by Yuhong on the dev list:
> https://lists.apache.org/thread.html/r6a61c94e6d14aa9e8b56ff4919c0bea17fceada446d1193d19fd9ed2%40%3Cdev.beam.apache.org%3E
> Caused by: java.lang.IllegalArgumentException: The HadoopFileSystemRegistrar currently only supports at most a single Hadoop configuration.
>         at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:141) ~[beam-vendor-guava-26_0-jre-0.1.jar:?]
>         at org.apache.beam.sdk.io.hdfs.HadoopFileSystemRegistrar.fromOptions(HadoopFileSystemRegistrar.java:60) ~[beam-sdks-java-io-hadoop-file-system-3.2250.5.jar:?]
>         at org.apache.beam.sdk.io.FileSystems.verifySchemesAreUnique(FileSystems.java:496) ~[beam-sdks-java-core-3.2250.5.jar:?]
>         at org.apache.beam.sdk.io.FileSystems.setDefaultPipelineOptions(FileSystems.java:486) ~[beam-sdks-java-core-3.2250.5.jar:?]
>         at org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:47) ~[beam-sdks-java-core-3.2250.5.jar:?]
>         at org.apache.beam.sdk.Pipeline.create(Pipeline.java:149) ~[beam-sdks-java-core-3.2250.5.jar:?]
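> For readers unfamiliar with the Guava call at the top of the trace: Preconditions.checkArgument throws IllegalArgumentException when its condition is false. A minimal stand-alone reproduction of the failing size check follows; this is illustrative only, not Beam's actual code (the message string is copied from the trace above):

```java
import java.util.List;

public class PreconditionDemo {
  // Stand-in for Guava's Preconditions.checkArgument, which throws
  // IllegalArgumentException when the condition is false.
  static void checkArgument(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }

  public static void main(String[] args) {
    // Two entries for the same configuration survive the Set-based dedup,
    // so the at-most-one check fails.
    List<String> configurations = List.of("core-site.xml", "core-site.xml");
    checkArgument(configurations.size() <= 1,
        "The HadoopFileSystemRegistrar currently only supports at most a single Hadoop configuration.");
  }
}
```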
> I tried to debug and printed some logs using:
>     List<Configuration> configurations =
>         pipelineOpts.as(HadoopFileSystemOptions.class).getHdfsConfiguration();
>     LOG.info("print hdfsConfiguration for testing: " + configurations.toString());
> 2020-11-19 18:02:26.289 [main] HelloBeam [INFO] print hdfsConfiguration for
> testing:
> [Configuration:
> /export/content/lid/apps/samza-yarn-nodemanager/1d5c39c31bb33e3dd8e8149168167870328a014b/genConfig/core-site.xml,
>
> Configuration:
> /export/content/lid/apps/samza-yarn-nodemanager/1d5c39c31bb33e3dd8e8149168167870328a014b/genConfig/core-site.xml]
> As you can see, hdfsConfiguration is a list containing two identical elements, which is what triggers the error.
> I noticed that the configurations are generated from HADOOP_CONF_DIR and YARN_CONF_DIR, and that a Set is used to deduplicate the two directories. However, in my test environment the two dirs are:
> HADOOP_CONF_DIR=/export/content/lid/apps/samza-yarn-nodemanager/1d5c39c31bb33e3dd8e8149168167870328a014b/bin/../genConfig/
> YARN_CONF_DIR=/export/content/lid/apps/samza-yarn-nodemanager/1d5c39c31bb33e3dd8e8149168167870328a014b/bin/../genConfig
> HADOOP_CONF_DIR has a trailing '/', so the two strings compare as unequal, the Set keeps both, and the same configuration is added twice.
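> One way to see the failure mode (and a possible fix) is to normalize the directory strings before deduplicating. The sketch below is illustrative only, with made-up paths that mirror the report (same directory, one spelled with "bin/.." and a trailing slash); it is not Beam's actual code:

```java
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

public class ConfDirDedup {
  public static void main(String[] args) {
    // Hypothetical values mirroring the report: the same directory, once
    // with and once without a trailing slash.
    String hadoopConfDir = "/opt/app/bin/../genConfig/";
    String yarnConfDir = "/opt/app/bin/../genConfig";

    // As raw strings the two values are distinct, so a Set keeps both and
    // the duplicate configuration slips through.
    Set<String> raw = new LinkedHashSet<>();
    raw.add(hadoopConfDir);
    raw.add(yarnConfDir);
    System.out.println("raw entries: " + raw.size()); // prints 2

    // Normalizing first (resolving ".." and dropping the trailing slash)
    // collapses both values to the same key, so the Set deduplicates them.
    Set<String> normalized = new LinkedHashSet<>();
    normalized.add(Paths.get(hadoopConfDir).normalize().toString());
    normalized.add(Paths.get(yarnConfDir).normalize().toString());
    System.out.println("normalized entries: " + normalized.size()); // prints 1
  }
}
```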
--
This message was sent by Atlassian Jira
(v8.20.10#820010)