Thanks for the detailed investigation, Yuhong. This definitely sounds like a
bug: the code means to deduplicate identical paths, but it uses String
equality rather than Path equality. I filed a JIRA issue in case someone
wants to work on fixing it: https://issues.apache.org/jira/browse/BEAM-11329
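
For anyone who picks this up: the fix is roughly to normalize each conf dir
to a Path before deduplicating, so that a trailing '/' or a 'bin/..' segment
no longer makes two spellings of the same directory look distinct. A minimal
sketch, assuming the registrar gathers the env var values into a Set as
Yuhong describes (the names below are illustrative, not the registrar's
actual fields):

import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

// Deduplicate candidate conf dirs by their normalized absolute Path, so that
// ".../bin/../genConfig/" and ".../bin/../genConfig" collapse to one entry.
Set<Path> confDirs = new LinkedHashSet<>();
for (String envVar : new String[] {"HADOOP_CONF_DIR", "YARN_CONF_DIR"}) {
  String dir = System.getenv(envVar);
  if (dir != null && !dir.isEmpty()) {
    confDirs.add(Paths.get(dir).toAbsolutePath().normalize());
  }
}
// In an environment like Yuhong's, this Set ends up with a single entry, so
// only one Hadoop Configuration is built and the precondition no longer fires.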

On Mon, Nov 23, 2020 at 5:32 PM 成雨虹 <mabelyuhong0...@gmail.com> wrote:

> Hi Beam,
>
> I am new to Beam on Spark and recently got an error:
>
>
> Caused by: java.lang.IllegalArgumentException: The HadoopFileSystemRegistrar currently only supports at most a single Hadoop configuration.
>       at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:141) ~[beam-vendor-guava-26_0-jre-0.1.jar:?]
>       at org.apache.beam.sdk.io.hdfs.HadoopFileSystemRegistrar.fromOptions(HadoopFileSystemRegistrar.java:60) ~[beam-sdks-java-io-hadoop-file-system-3.2250.5.jar:?]
>       at org.apache.beam.sdk.io.FileSystems.verifySchemesAreUnique(FileSystems.java:496) ~[beam-sdks-java-core-3.2250.5.jar:?]
>       at org.apache.beam.sdk.io.FileSystems.setDefaultPipelineOptions(FileSystems.java:486) ~[beam-sdks-java-core-3.2250.5.jar:?]
>       at org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:47) ~[beam-sdks-java-core-3.2250.5.jar:?]
>       at org.apache.beam.sdk.Pipeline.create(Pipeline.java:149) ~[beam-sdks-java-core-3.2250.5.jar:?]
>
>
> I tried to debug it by logging the configuration:
>
> List<Configuration> configurations =
>     pipelineOpts.as(HadoopFileSystemOptions.class).getHdfsConfiguration();
> LOG.info("print hdfsConfiguration for testing: " + configurations.toString());
>
>
> 2020-11-19 18:02:26.289 [main] HelloBeam [INFO] print hdfsConfiguration for testing:
>
> [Configuration: /export/content/lid/apps/samza-yarn-nodemanager/1d5c39c31bb33e3dd8e8149168167870328a014b/genConfig/core-site.xml,
>  Configuration: /export/content/lid/apps/samza-yarn-nodemanager/1d5c39c31bb33e3dd8e8149168167870328a014b/genConfig/core-site.xml]
>
>
> As you can see, the hdfsConfiguration is a list containing two identical
> elements, which causes the error.
>
> I noticed that the configurations are generated from HADOOP_CONF_DIR and
> YARN_CONF_DIR, and the class uses a Set to deduplicate them. However, in my
> test environment the two dirs are:
>
>
> HADOOP_CONF_DIR=/export/content/lid/apps/samza-yarn-nodemanager/1d5c39c31bb33e3dd8e8149168167870328a014b/bin/../genConfig/
>
> YARN_CONF_DIR=/export/content/lid/apps/samza-yarn-nodemanager/1d5c39c31bb33e3dd8e8149168167870328a014b/bin/../genConfig
>
>
> HADOOP_CONF_DIR ends with a '/', so the two values compare as different
> Strings and the same configuration gets added twice.
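>
> A quick way to see the comparison that goes wrong (paths shortened here for
> readability):
>
> String hadoopConfDir = "/export/.../bin/../genConfig/";
> String yarnConfDir   = "/export/.../bin/../genConfig";
> // String equality sees two distinct entries, so the Set keeps both:
> hadoopConfDir.equals(yarnConfDir);  // false
> // Normalized Paths compare equal, which is the intended dedup:
> java.nio.file.Paths.get(hadoopConfDir).normalize()
>     .equals(java.nio.file.Paths.get(yarnConfDir).normalize());  // true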
>
>
> I am not sure whether this is expected behavior or a bug that should be fixed.
>
>
> Thanks in advance. I hope to hear from you soon.
>
>
> Best,
>
> Yuhong
