I am trying to use Kinesis as a source for Spark Streaming and have run into a dependency issue that I can't resolve without making my own custom Spark build. The issue is that Spark transitively depends on org.apache.httpcomponents:httpclient:jar:4.1.2 (I think via libfb303, pulled in by hbase and hive-serde), whereas the AWS SDK depends on org.apache.httpcomponents:httpclient:jar:4.2. When I package and run my Spark Streaming application, I get the following:
Caused by: java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
        at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
        at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:114)
        at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:99)
        at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
        at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
        at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:181)
        at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
        at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103)
        at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:136)
        at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:117)
        at com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.<init>(AmazonKinesisAsyncClient.java:132)

I can create a custom Spark build with org.apache.httpcomponents:httpclient:jar:4.2 included in the assembly, but I was wondering whether this is something the Spark devs have noticed and are looking to resolve in upcoming releases.

Here are my thoughts on the issue. Containers that run custom user code often have to resolve conflicts between the framework's dependencies and the user code's dependencies. Here is how I have seen some frameworks handle this:

1. Provide a child-first class loader: Some JEE containers provide a child-first class loader that loads classes from user code before the container's own. I don't think this approach completely solves the problem, as the framework then becomes susceptible to class mismatch errors. (A rough sketch of what I mean is below, after my signature.)

2. Fold all dependencies into a sub-package: This approach relocates all dependencies into a project-specific sub-package (e.g. spark.dependencies). It is tedious because it involves building custom versions of all dependencies (and their transitive dependencies).

3. Use something like OSGi: Some frameworks have successfully used OSGi to manage dependencies between modules. The challenge with this approach is to OSGify the framework and hide OSGi's complexity from the end user.

My personal preference is OSGi (or at least some support for OSGi), but I would love to hear what the Spark devs are thinking in terms of resolving the problem.

Thanks,
Aniket
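
For reference, here is a minimal sketch of the kind of child-first class loader I mean in option 1. It uses only plain JDK APIs; the class name and comments are mine for illustration and are not part of any Spark API, and this is just a sketch, not something I have battle-tested.

import java.net.{URL, URLClassLoader}

// Minimal child-first class loader sketch (option 1 above).
// Classes are looked up in the given URLs first and only then
// delegated to the parent, so a user-supplied httpclient 4.2 would
// shadow the 4.1.2 copy on the framework's class path.
class ChildFirstClassLoader(urls: Array[URL], parent: ClassLoader)
    extends URLClassLoader(urls, parent) {

  override def loadClass(name: String, resolve: Boolean): Class[_] =
    getClassLoadingLock(name).synchronized {
      // Reuse a class this loader has already defined.
      val alreadyLoaded = findLoadedClass(name)
      val clazz =
        if (alreadyLoaded != null) alreadyLoaded
        else {
          try {
            // Child first: try our own URLs before asking the parent.
            findClass(name)
          } catch {
            case _: ClassNotFoundException =>
              // Not in user code; fall back to normal parent delegation.
              super.loadClass(name, resolve)
          }
        }
      if (resolve) resolveClass(clazz)
      clazz
    }
}

The idea would be for the container to construct such a loader with the user jar's URLs and its own loader as parent, and to run the user's code with it as the thread context class loader; as I said above, the downside is that the framework can then see class mismatch errors when user and framework classes cross the boundary.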