I am trying to use Kinesis as a source for Spark Streaming and have run into a
dependency issue that I can't resolve without making my own custom Spark
build. The issue is that Spark transitively depends
on org.apache.httpcomponents:httpclient:jar:4.1.2 (I think via
libfb303, pulled in by hbase and hive-serde), whereas the AWS SDK depends
on org.apache.httpcomponents:httpclient:jar:4.2. When I package and run my
Spark Streaming application, I get the following:

Caused by: java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
        at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
        at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:114)
        at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:99)
        at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
        at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
        at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:181)
        at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
        at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103)
        at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:136)
        at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:117)
        at com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.<init>(AmazonKinesisAsyncClient.java:132)
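
For context, a minimal sbt build along these lines is enough to hit the clash
(the version strings below are illustrative placeholders, not copied from my
actual build):

    // build.sbt (sketch)
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-streaming"             % "1.1.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.1.0",
      // the AWS SDK wants httpclient 4.2+
      "com.amazonaws"    %  "aws-java-sdk"                % "1.8.3"
    )

    // Forcing the newer client into my own jar, e.g.
    //   dependencyOverrides += "org.apache.httpcomponents" % "httpclient" % "4.2"
    // doesn't seem to help at runtime, since as far as I can tell the 4.1.2
    // copy that ships in the Spark assembly is what actually gets loaded on
    // the executors -- hence the custom build.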

I can create a custom Spark build with
org.apache.httpcomponents:httpclient:jar:4.2 included in the assembly, but I
was wondering whether this is something the Spark devs have noticed and are
looking to resolve in upcoming releases. Here are my thoughts on this issue:

Containers that allow running custom user code often have to resolve
dependency conflicts between the framework's dependencies and the user
code's. Here is how I have seen some frameworks resolve the issue:
1. Provide a child-first class loader: Some JEE containers provide a
child-first class loader that loads classes from user code first (a rough
sketch of what I mean is included after this list). I don't think this
approach completely solves the problem, as the framework then becomes
susceptible to class mismatch errors.
2. Fold all dependencies into a sub-package: This approach involves
relocating (shading) all dependencies into a project-specific sub-package
(like spark.dependencies); a sketch of the kind of relocation I mean also
follows the list. This approach is tedious because it involves building
custom versions of all dependencies (and their transitive dependencies).
3. Use something like OSGi: Some frameworks have successfully used OSGi to
manage dependencies between modules. The challenge with this approach is
OSGifying the framework and hiding OSGi's complexities from the end user.
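
For the first option, the sort of child-first loader I have in mind is
roughly this (a bare-bones sketch, not production code):

    import java.net.{URL, URLClassLoader}

    // "Child-first" class loader: look in the child's own jars before
    // delegating to the parent, so user-supplied versions win over the
    // framework's copies.
    class ChildFirstClassLoader(urls: Array[URL], parent: ClassLoader)
        extends URLClassLoader(urls, parent) {

      override def loadClass(name: String, resolve: Boolean): Class[_] =
        getClassLoadingLock(name).synchronized {
          val alreadyLoaded = findLoadedClass(name)
          val clazz =
            if (alreadyLoaded != null) alreadyLoaded
            else {
              try findClass(name)                 // child (user code) first
              catch {
                case _: ClassNotFoundException =>
                  super.loadClass(name, resolve)  // fall back to the parent
              }
            }
          if (resolve) resolveClass(clazz)
          clazz
        }
    }

A real implementation would also have to pin java.* and the framework's own
API packages to the parent, which is exactly where the class mismatch errors
I mentioned creep in.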
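And for the second option, taking httpclient as the example, the relocation
would look something like the following, assuming a build plugin that
supports shading (e.g. sbt-assembly's shade rules or maven-shade-plugin);
this is purely illustrative, not how Spark's build is actually set up:

    // build.sbt (sketch): rewrite org.apache.http.* into a Spark-private
    // package inside the assembly, so it can no longer clash with the
    // user's httpclient.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("org.apache.http.**" -> "spark.dependencies.org.apache.http.@1").inAll
    )

The tedium is that every dependency that exposes such classes (and every
transitive dependency referring to them) needs the same treatment.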

My personal preference is OSGi (or at least some support for OSGi), but I
would love to hear what the Spark devs are thinking in terms of resolving
this problem.

Thanks,
Aniket
