Hi,

We are starting to port the earlier DAS-specific functionality to C5. As part of this, we are planning not to embed the Spark server functionality in the primary binary itself, but rather to run it separately as another script in the same distribution. So when running the server in standalone mode, a centralized script will start the Spark processes and then the main stream processor server. In a clustered setup, we will start the Spark processes separately and use the clustering that is native to Spark, which is currently done by integrating with ZooKeeper.
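As a rough sketch of the above (the script names, paths, hosts, and ports here are hypothetical illustrations, not the actual distribution layout), the centralized startup script could look something like this. The `spark.deploy.recoveryMode` and `spark.deploy.zookeeper.url` properties are Spark's documented settings for ZooKeeper-based standby-master recovery in standalone-cluster mode:

```shell
#!/bin/sh
# Hypothetical centralized startup script; SPARK_HOME, SP_HOME, the ZK
# URL, and the stream-processor script name are assumptions for illustration.
SPARK_HOME=${SPARK_HOME:-/opt/spark}
SP_HOME=${SP_HOME:-/opt/stream-processor}

if [ "$1" = "cluster" ]; then
  # Clustered setup: let Spark do its own native HA clustering via ZooKeeper
  # (SPARK_DAEMON_JAVA_OPTS is the documented way to pass these properties
  # to the standalone master/worker daemons).
  SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
    -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181" \
    "$SPARK_HOME/sbin/start-master.sh"
  "$SPARK_HOME/sbin/start-slave.sh" spark://master1:7077,master2:7077
else
  # Standalone mode: bring up local Spark processes first...
  "$SPARK_HOME/sbin/start-master.sh"
  "$SPARK_HOME/sbin/start-slave.sh" spark://localhost:7077
fi

# ...then start the main stream processor server itself.
exec "$SP_HOME/bin/stream-processor.sh"
```

The point of the `cluster`/standalone switch is that the main binary never embeds Spark; the same distribution just starts the external Spark daemons in whichever topology is needed.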
So basically, for the minimum H/A setup we would need two stream processor nodes plus ZooKeeper to build up the cluster, if we are using Spark. And since with C5 we are not using Hazelcast anyway, we can use ZooKeeper for the other general coordination operations as well, given that it is already a requirement for Spark. We also get the added benefit of avoiding the issues that come with a peer-to-peer coordination library, such as split-brain scenarios.

Aligning with the above approach, we are also considering integrating directly with Solr running externally to the stream processor, rather than doing the indexing in embedded mode. DAS already has a separate indexing mode (profile), so rather than using that, we can use Solr directly. One of the main reasons for using Solr is that it adds functionality on top of base Lucene, such as OOTB support for aggregates etc., which we don't fully have at the moment. So the suggestion is that Solr will also come as a separate profile (script) with the distribution, and it will be started up if indexing scenarios are required for the stream processor; we can start it automatically or selectively. Solr clustering is also done with ZooKeeper, which we will have anyway with the new Spark clustering approach we are using.

The aim of running the non-WSO2 servers externally, without embedding them, is the simplicity it brings to our codebase: we don't have to maintain the integration code required to embed them, and those servers can use their own recommended deployment patterns. For example, Spark isn't designed to be embedded into other servers, so we had to work around a few things to embed and cluster it internally. Upgrading such dependencies also becomes very straightforward, since they are external to the base binary.

Cheers,
Anjana.

--
*Anjana Fernando*
Associate Director / Architect
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
