[ https://issues.apache.org/jira/browse/BEAM-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16185845#comment-16185845 ]

Tim Robertson edited comment on BEAM-2457 at 9/29/17 1:43 PM:
--------------------------------------------------------------

I got to the bottom of this for my case.  The TL;DR is to make sure you have this 
when shading the über jar:
{code}
<transformers>
  <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
</transformers>
{code}

A service loader is used to register the {{FileSystemRegistrar}}.  You can see 
which are registered using:
{code}
    // List every FileSystemRegistrar visible to the ServiceLoader on the classpath
    Set<FileSystemRegistrar> registrars =
      Sets.newTreeSet(ReflectHelpers.ObjectsClassComparator.INSTANCE);
    registrars.addAll(Lists.newArrayList(
      ServiceLoader.load(FileSystemRegistrar.class, ReflectHelpers.findClassLoader())));

    // Print the class of each registrar that was found
    for (FileSystemRegistrar reg : registrars) {
      System.out.println(reg.getClass());
    }
{code}
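
If the über jar is assembled correctly, that loop should print both registrars, matching the merged service file shown further down:
{code}
class org.apache.beam.sdk.io.LocalFileSystemRegistrar
class org.apache.beam.sdk.io.hdfs.HadoopFileSystemRegistrar
{code}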

Assuming you have built an über jar to submit, what gets loaded is defined by the 
classes listed in the 
{{/META-INF/services/org.apache.beam.sdk.io.FileSystemRegistrar}} file (you can 
expand your jar and take a look; there is also a quick check after the merged 
output below).  When several dependencies provide that file, only the first one on 
the build path wins, so the HDFS registrar may never be used.  Merging the service 
files at build time with the Maven Shade plugin looks like this:
{code}
      <!-- Shade the project into an über jar to send to Spark -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <configuration>
          <createDependencyReducedPom>false</createDependencyReducedPom>
          <filters>
            <filter>
              <artifact>*:*</artifact>
              <excludes>
                <exclude>META-INF/*.SF</exclude>
                <exclude>META-INF/*.DSA</exclude>
                <exclude>META-INF/*.RSA</exclude>
              </excludes>
            </filter>
          </filters>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <shadedArtifactAttached>true</shadedArtifactAttached>
              <shadedClassifierName>shaded</shadedClassifierName>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
{code} 

With this done, the merged service file will contain:
{code}
org.apache.beam.sdk.io.LocalFileSystemRegistrar
org.apache.beam.sdk.io.hdfs.HadoopFileSystemRegistrar
{code}
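
To verify without fully expanding the jar, something like this works (the jar name is just a placeholder for your shaded artifact):
{code}
unzip -p target/my-pipeline-shaded.jar META-INF/services/org.apache.beam.sdk.io.FileSystemRegistrar
{code}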

I hope this helps someone else.  For a Cloudera CDH environment (presumably 
Hortonworks too), when running on a gateway machine the rest of the Hadoop 
config should be picked up automatically and nothing else should be needed.




> Error: "Unable to find registrar for hdfs" - need to prevent/improve error 
> message
> ----------------------------------------------------------------------------------
>
>                 Key: BEAM-2457
>                 URL: https://issues.apache.org/jira/browse/BEAM-2457
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-core
>    Affects Versions: 2.0.0
>            Reporter: Stephen Sisk
>            Assignee: Flavio Fiszman
>
> I've noticed a number of user reports where jobs are failing with the error 
> message "Unable to find registrar for hdfs": 
> * 
> https://stackoverflow.com/questions/44497662/apache-beamunable-to-find-registrar-for-hdfs/44508533?noredirect=1#comment76026835_44508533
> * 
> https://lists.apache.org/thread.html/144c384e54a141646fcbe854226bb3668da091c5dc7fa2d471626e9b@%3Cuser.beam.apache.org%3E
> * 
> https://lists.apache.org/thread.html/e4d5ac744367f9d036a1f776bba31b9c4fe377d8f11a4b530be9f829@%3Cuser.beam.apache.org%3E
>  
> This isn't too many reports, but it is the only time I can recall so many 
> users reporting the same error message in such a short amount of time. 
> We believe the problem is one of two things: 
> 1) bad uber jar creation
> 2) incorrect HDFS configuration
> However, it's highly possible this could have some other root cause. 
> It seems like it'd be useful to:
> 1) Follow up with the above reports to see if they've resolved the issue, and 
> if so what fixed it. There may be another root cause out there.
> 2) Improve the error message to include more information about how to resolve 
> it
> 3) See if we can improve detection of the error cases to give more specific 
> information (specifically, if HDFS is misconfigured, can we detect that 
> somehow and tell the user exactly that?)
> 4) update documentation


