Re: [Architecture] [C5] Spark/Lucene Integration in Stream Processor
+1 for this approach. This would be a much cleaner way to integrate with Spark. So now, rather than trying to customize Spark to work with our own clustering, we can focus on a more generic approach and then maybe contribute to the community as well! @Nirmal & Suho, I still think we would need Spark binaries in the runtime. It's just that we would not have to meddle with the internals of Spark clustering etc., which we are handling internally at the moment. On Sat, Oct 22, 2016 at 2:48 PM, Sriskandarajah Suhothayan <s...@wso2.com> wrote: > > > On Sat, Oct 22, 2016 at 10:45 AM, Nirmal Fernando <nir...@wso2.com> wrote: > >> >> >> On Fri, Oct 21, 2016 at 2:00 PM, Anjana Fernando <anj...@wso2.com> wrote: >> >>> Hi, >>> >>> So we are starting on porting the earlier DAS specific functionality to >>> C5. And with this, we are planning on not embedding the Spark server >>> functionality into the primary binary itself, but rather running it separately as >>> another script in the same distribution. So basically, when running the >>> server in the standalone mode, from a centralized script, we will start >>> the Spark processes and then the main stream processor server. And in a >>> clustered setup, we will start the Spark processes separately, and do the >>> clustering that is native to it, which is currently done by integrating with >>> ZooKeeper. >>> >> >> Does this mean we still keep Spark binaries inside Stream Processor? If >> not, how are we planning to start a Spark process from Stream Processor? >> > > We don't need to have Spark binaries in Stream Processor, and I believe it's > wrong, as that is not the core functionality of it. But when it comes to > Product Analytics we may ship that. We need to decide on that. > > >>> So basically, for the minimum H/A setup, we would need two stream >>> processing nodes and also ZK to build up the cluster, if we are using Spark >>> also. 
So with C5, since we are not using Hazelcast anyway, for other >>> general coordination operations also we can use ZK, since it is already a >>> requirement for Spark. And we have the added benefit of not getting the >>> issues that come with a peer-to-peer coordination library, such as split >>> brain scenarios. >>> >>> Also, aligning with the above approach, we are considering directly >>> integrating with Solr running external to the stream processor, rather than >>> doing the indexing in the embedded mode. Now also in DAS, we have a >>> separate indexing mode (profile), so rather than using that, we can use >>> Solr directly. One of the main reasons for using this is that it has >>> additional functionality beyond base Lucene, where it provides OOTB functionality >>> such as aggregates etc., which at the moment we don't fully >>> support. So the suggestion is, Solr will also come as a separate >>> profile (script) with the distribution, and this will be started up if the >>> indexing scenarios are required for the stream processor, and we can >>> start it up automatically or selectively. Also, Solr clustering is >>> done with ZK as well, which we will anyway have with the new Spark clustering >>> approach we are using. >>> >>> So the aim of getting the non-WSO2 specific servers out, without >>> embedding them, is the simplicity it provides in our codebase, since we do not >>> have to maintain the integration code that is required to embed them, and >>> those servers can use their own recommended deployment patterns. For example, >>> Spark isn't designed to be embedded into other servers, so we had to mess >>> around with some things to embed and cluster it internally. And also, >>> upgrading dependencies such as that becomes very straightforward, since >>> they are external to the base binary. >>> >>> Cheers, >>> Anjana. >>> -- >>> *Anjana Fernando* >>> Associate Director / Architect >>> WSO2 Inc. | http://wso2.com >>> lean . enterprise . 
middleware >>> >> >> >> >> -- >> >> Thanks & regards, >> Nirmal >> >> Team Lead - WSO2 Machine Learner >> Associate Technical Lead - Data Technologies Team, WSO2 Inc. >> Mobile: +94715779733 >> Blog: http://nirmalfdo.blogspot.com/ >> >> >> > > > -- > > *S. Suhothayan* > Associate Director / Architect & Team Lead of WSO2 Complex Event Processor > *WSO2 Inc. *http://wso2.com > * <
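The startup ordering described in the thread above (coordination first, then the externally-run servers, then the stream processor itself) can be sketched as follows. This is purely illustrative: `startup_order` and the server names are assumptions for the sketch, not the actual distribution scripts.

```python
# Sketch of the centralized standalone startup ordering discussed above.
# The returned list stands in for invoking the per-server scripts that
# would ship alongside the stream processor distribution.
def startup_order(indexing_required: bool) -> list:
    order = ["zookeeper", "spark"]       # ZK first: Spark clustering needs it
    if indexing_required:
        order.append("solr")             # Solr only for indexing scenarios
    order.append("stream-processor")     # the main server comes up last
    return order
```

The point of the ordering is simply that ZK must be reachable before the Spark (and, if used, Solr) processes try to form their clusters.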
[Architecture] WSO2 Data Analytics Server (DAS) 3.1.0 Released!
*WSO2 Data Analytics Server (DAS) 3.1.0 Released!* The WSO2 Data Analytics Server development team is pleased to announce the release of WSO2 Data Analytics Server 3.1.0. WSO2 Data Analytics Server combines real-time, batch, interactive, and predictive (via machine learning) analysis of data into one integrated platform to support the multiple demands of Internet of Things (IoT) solutions, as well as mobile and Web apps. As a part of WSO2’s Analytics Platform, WSO2 DAS introduces a single solution with the ability to build systems and applications that collect and analyze both batch and real-time data to communicate results. It is designed to process millions of events per second, and is therefore capable of handling Big Data volumes and Internet of Things projects. WSO2 DAS is powered by WSO2 Carbon <http://wso2.com/products/carbon/>, the SOA middleware component platform. An open source product, WSO2 Carbon is available under the Apache Software License (v2.0) <http://www.apache.org/licenses/LICENSE-2.0.html>. You can download this distribution from wso2.com/products/data-analytics-server and give it a try. What's New In This Release - Integrating WSO2 Machine Learner features - Supporting incremental data processing - Improved gadget generation wizard - Cross-tenant support - Improved CarbonJDBC connector - Improvements for facet based aggregations - Supporting index based sorting - Supporting Spark on YARN for DAS - Improvements for indexing - Upgrading Spark to 1.6.2 Issues Fixed in This Release - WSO2 DAS 3.1.0 Fixed Issues <https://wso2.org/jira/issues/?filter=13152> Known Issues - WSO2 DAS 3.1.0 Known Issues <https://wso2.org/jira/issues/?filter=13154> *Source and distribution packages:* - http://wso2.com/products/data-analytics-server/ Please download, test, and vote. The README file under the distribution contains a guide and instructions on how to try it out locally. Mailing Lists Join our mailing list and correspond with the developers directly. 
- Developer List : d...@wso2.org | Subscribe | Mail Archive <http://mail.wso2.org/mailarchive/dev/> Reporting Issues We encourage you to report issues, documentation faults and feature requests regarding WSO2 DAS through the public DAS JIRA <https://wso2.org/jira/browse/DAS>. You can use the Carbon JIRA <http://www.wso2.org/jira/browse/CARBON> to report any issues related to the Carbon base framework or associated Carbon components. Discussion Forums Alternatively, questions could be raised on http://stackoverflow.com <http://stackoverflow.com/questions/tagged/wso2>. Support We are committed to ensuring that your enterprise middleware deployment is completely supported from evaluation to production. Our unique approach ensures that all support leverages our open development methodology and is provided by the very same engineers who build the technology. For more details and to take advantage of this unique opportunity please visit http://wso2.com/support. For more information about WSO2 DAS please see wso2.com/products/data-analytics-server. Regards, WSO2 DAS Team -- *Niranda Perera* Software Engineer, WSO2 Inc. Mobile: +94-71-554-8430 Twitter: @n1r44 <https://twitter.com/N1R44> https://pythagoreanscript.wordpress.com/ ___ Architecture mailing list Architecture@wso2.org https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
Re: [Architecture] [Arch] Adding CEP and ML samples to DAS distribution in a consistent way
Great! We can add the CEP samples to DAS 3.1.0 then. On Fri, Aug 5, 2016 at 9:14 PM, Sriskandarajah Suhothayan <s...@wso2.com> wrote: > Dilini tried adding the CEP samples to DAS and it worked as expected; we'll send > you a pull request of all CEP samples to the DAS repo. > > Regards > Suho > > On Fri, Aug 5, 2016 at 12:34 PM, Gihan Anuruddha <gi...@wso2.com> wrote: > >> We discussed this as well. So our plan is to inject the CEP integration tests into >> DAS at product build time. We are not maintaining a separate copy; instead >> we use the same tests that CEP uses. >> >> On Thu, Aug 4, 2016 at 7:33 PM, Sinthuja Ragendran <sinth...@wso2.com> >> wrote: >> >>> Hi, >>> >>> We also need to find a consistent way to maintain the integration tests >>> as well. CEP and ML features are being used in DAS, and there are no >>> integration tests for those components getting executed in the DAS product >>> build. Similarly, there are many UI tests we have in the dashboard server as >>> well, but those are not executed in the products which are using them. As >>> these are the core functionalities of DAS, IMHO we need to execute the >>> testcases for each of these components during the product-das build. >>> >>> Thanks, >>> Sinthuja. >>> >>> On Thu, Aug 4, 2016 at 3:17 PM, Niranda Perera <nira...@wso2.com> wrote: >>> >>>> Hi Suho, >>>> >>>> As per the immediate DAS 310 release, we will continue to keep a local >>>> copy of the samples. I have created a JIRA here [1] to add the suggestion >>>> provided by Isuru. >>>> >>>> Best >>>> >>>> [1] https://wso2.org/jira/browse/DAS-481 >>>> >>>> On Wed, Aug 3, 2016 at 10:02 PM, Sriskandarajah Suhothayan < >>>> s...@wso2.com> wrote: >>>> >>>>> DAS team, how about doing it for this release? >>>>> >>>>> Regards >>>>> Suho >>>>> >>>>> On Wed, Aug 3, 2016 at 6:31 PM, Ramith Jayasinghe <ram...@wso2.com> >>>>> wrote: >>>>> >>>>>> I think we need to ship samples with the product. Otherwise, the first >>>>>> 5-minute experience of users will be negatively affected. 
>>>>>> >>>>>> >>>>>> ___ >>>>>> Architecture mailing list >>>>>> Architecture@wso2.org >>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> *S. Suhothayan* >>>>> Associate Director / Architect & Team Lead of WSO2 Complex Event >>>>> Processor >>>>> *WSO2 Inc. *http://wso2.com >>>>> * <http://wso2.com/>* >>>>> lean . enterprise . middleware >>>>> >>>>> >>>>> *cell: (+94) 779 756 757 <%28%2B94%29%20779%20756%20757> | blog: >>>>> http://suhothayan.blogspot.com/ <http://suhothayan.blogspot.com/>twitter: >>>>> http://twitter.com/suhothayan <http://twitter.com/suhothayan> | linked-in: >>>>> http://lk.linkedin.com/in/suhothayan >>>>> <http://lk.linkedin.com/in/suhothayan>* >>>>> >>>>> ___ >>>>> Architecture mailing list >>>>> Architecture@wso2.org >>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>> >>>>> >>>> >>>> >>>> -- >>>> *Niranda Perera* >>>> Software Engineer, WSO2 Inc. >>>> Mobile: +94-71-554-8430 >>>> Twitter: @n1r44 <https://twitter.com/N1R44> >>>> https://pythagoreanscript.wordpress.com/ >>>> >>>> ___ >>>> Architecture mailing list >>>> Architecture@wso2.org >>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>> >>>> >>> >>> >>> -- >>> *Sinthuja Rajendran* >>> Technical Lead >>> WSO2, Inc.:http://wso2.com >>> >>> Blog: http://sinthu-rajan.blogspot.com/ >>> Mobile: +94774273955 >>> >>> >>> >>> ___ >>> Architecture mailing list >>> Architecture@wso2.org >>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>> >>> >> >> >> -- >> W.G. Gihan Anuruddha >> Senior Software Engineer | WSO2, Inc. >> M: +94772272595 >> >> ___ >> Architecture mailing list >> Architecture@wso2.org >> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >> >> > > > -- > > *S. Suhothayan* > Associate Director / Architect & Team Lead of WSO2 Complex Event Processor > *WSO2 Inc. *http://wso2.com > * <http://wso2.com/>* > lean . enterprise . 
middleware > > > *cell: (+94) 779 756 757 <%28%2B94%29%20779%20756%20757> | blog: > http://suhothayan.blogspot.com/ <http://suhothayan.blogspot.com/>twitter: > http://twitter.com/suhothayan <http://twitter.com/suhothayan> | linked-in: > http://lk.linkedin.com/in/suhothayan <http://lk.linkedin.com/in/suhothayan>* > > ___ > Architecture mailing list > Architecture@wso2.org > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture > > -- *Niranda Perera* Software Engineer, WSO2 Inc. Mobile: +94-71-554-8430 Twitter: @n1r44 <https://twitter.com/N1R44> https://pythagoreanscript.wordpress.com/ ___ Architecture mailing list Architecture@wso2.org https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
Re: [Architecture] [Arch] Adding CEP and ML samples to DAS distribution in a consistent way
Hi Suho, We still have not added the CEP samples to the DAS 3.1.0 release, but we will have to add them in the next iteration. Best On Thu, Aug 4, 2016 at 9:57 PM, Sriskandarajah Suhothayan <s...@wso2.com> wrote: > Hi Niranda, > > Are you guys adding all the CEP samples too? > > Regards > Suho > > On Thu, Aug 4, 2016 at 7:33 PM, Sinthuja Ragendran <sinth...@wso2.com> > wrote: > >> Hi, >> >> We also need to find a consistent way to maintain the integration tests >> as well. CEP and ML features are being used in DAS, and there are no >> integration tests for those components getting executed in the DAS product >> build. Similarly, there are many UI tests we have in the dashboard server as >> well, but those are not executed in the products which are using them. As >> these are the core functionalities of DAS, IMHO we need to execute the >> testcases for each of these components during the product-das build. >> >> Thanks, >> Sinthuja. >> >> On Thu, Aug 4, 2016 at 3:17 PM, Niranda Perera <nira...@wso2.com> wrote: >> >>> Hi Suho, >>> >>> As per the immediate DAS 310 release, we will continue to keep a local >>> copy of the samples. I have created a JIRA here [1] to add the suggestion >>> provided by Isuru. >>> >>> Best >>> >>> [1] https://wso2.org/jira/browse/DAS-481 >>> >>> On Wed, Aug 3, 2016 at 10:02 PM, Sriskandarajah Suhothayan < >>> s...@wso2.com> wrote: >>> >>>> DAS team, how about doing it for this release? >>>> >>>> Regards >>>> Suho >>>> >>>> On Wed, Aug 3, 2016 at 6:31 PM, Ramith Jayasinghe <ram...@wso2.com> >>>> wrote: >>>> >>>>> I think we need to ship samples with the product. Otherwise, the first >>>>> 5-minute experience of users will be negatively affected. >>>>> >>>>> >>>>> ___ >>>>> Architecture mailing list >>>>> Architecture@wso2.org >>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>> >>>>> >>>> >>>> >>>> -- >>>> >>>> *S. Suhothayan* >>>> Associate Director / Architect & Team Lead of WSO2 Complex Event >>>> Processor >>>> *WSO2 Inc. 
*http://wso2.com >>>> * <http://wso2.com/>* >>>> lean . enterprise . middleware >>>> >>>> >>>> *cell: (+94) 779 756 757 <%28%2B94%29%20779%20756%20757> | blog: >>>> http://suhothayan.blogspot.com/ <http://suhothayan.blogspot.com/>twitter: >>>> http://twitter.com/suhothayan <http://twitter.com/suhothayan> | linked-in: >>>> http://lk.linkedin.com/in/suhothayan >>>> <http://lk.linkedin.com/in/suhothayan>* >>>> >>>> ___ >>>> Architecture mailing list >>>> Architecture@wso2.org >>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>> >>>> >>> >>> >>> -- >>> *Niranda Perera* >>> Software Engineer, WSO2 Inc. >>> Mobile: +94-71-554-8430 >>> Twitter: @n1r44 <https://twitter.com/N1R44> >>> https://pythagoreanscript.wordpress.com/ >>> >>> ___ >>> Architecture mailing list >>> Architecture@wso2.org >>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>> >>> >> >> >> -- >> *Sinthuja Rajendran* >> Technical Lead >> WSO2, Inc.:http://wso2.com >> >> Blog: http://sinthu-rajan.blogspot.com/ >> Mobile: +94774273955 >> >> >> >> ___ >> Architecture mailing list >> Architecture@wso2.org >> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >> >> > > > -- > > *S. Suhothayan* > Associate Director / Architect & Team Lead of WSO2 Complex Event Processor > *WSO2 Inc. *http://wso2.com > * <http://wso2.com/>* > lean . enterprise . middleware > > > *cell: (+94) 779 756 757 <%28%2B94%29%20779%20756%20757> | blog: > http://suhothayan.blogspot.com/ <http://suhothayan.blogspot.com/>twitter: > http://twitter.com/suhothayan <http://twitter.com/suhothayan> | linked-in: > http://lk.linkedin.com/in/suhothayan <http://lk.linkedin.com/in/suhothayan>* > > ___ > Architecture mailing list > Architecture@wso2.org > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture > > -- *Niranda Perera* Software Engineer, WSO2 Inc. 
Mobile: +94-71-554-8430 Twitter: @n1r44 <https://twitter.com/N1R44> https://pythagoreanscript.wordpress.com/ ___ Architecture mailing list Architecture@wso2.org https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
[Architecture] [Arch] Adding CEP and ML samples to DAS distribution in a consistent way
Hi all, At the moment we are maintaining the samples of DAS, CEP and ML in their own product repos. Since DAS integrates both CEP and ML, we need to ship these samples with DAS. Currently we do so for the ML samples, but the approach we are using is to keep a local copy of the samples in the product-das repo. This approach is rather problematic, because when there are changes in the original samples, we would also have to reflect those changes manually in the product-das copy. Is there a more consistent way to add samples? Maybe by creating a separate samples feature? Would like to hear from you regarding this. Best -- *Niranda Perera* Software Engineer, WSO2 Inc. Mobile: +94-71-554-8430 Twitter: @n1r44 <https://twitter.com/N1R44> https://pythagoreanscript.wordpress.com/ ___ Architecture mailing list Architecture@wso2.org https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
Re: [Architecture] [Arch] [ML] [DAS] Common directory path for spark conf files in ML and DAS
Hi Nirmal, Yes, this approach should be good for the ML integration. Let me check this and get back to you. +1 for removing the old ML Spark config. It would be more consistent then. Best On Mon, Jul 25, 2016 at 5:46 PM, Nirmal Fernando <nir...@wso2.com> wrote: > Hi Niranda, > > With the ML-DAS integration, we no longer use a separate Spark config but > the same config as DAS. We'll remove this old ML Spark config. > > Please find below an image that depicts how it'll work in a clustered > environment. > > > > > On Mon, Jul 25, 2016 at 12:05 PM, Niranda Perera <nira...@wso2.com> wrote: > >> Hi Nirmal, >> >> Currently, the Spark configurations in ML are in the >> /repository/conf/spark directory, while the DAS Spark configurations >> are in the /repository/analytics/spark directory. >> >> I suggest we move the ML Spark configurations also to the >> /repository/conf/ml/spark directory, because it would be more >> consistent and self-explanatory. >> >> WDYT? >> >> Best >> >> -- >> *Niranda Perera* >> Software Engineer, WSO2 Inc. >> Mobile: +94-71-554-8430 >> Twitter: @n1r44 <https://twitter.com/N1R44> >> https://pythagoreanscript.wordpress.com/ >> > > > > -- > > Thanks & regards, > Nirmal > > Team Lead - WSO2 Machine Learner > Associate Technical Lead - Data Technologies Team, WSO2 Inc. > Mobile: +94715779733 > Blog: http://nirmalfdo.blogspot.com/ > > > -- *Niranda Perera* Software Engineer, WSO2 Inc. Mobile: +94-71-554-8430 Twitter: @n1r44 <https://twitter.com/N1R44> https://pythagoreanscript.wordpress.com/ ___ Architecture mailing list Architecture@wso2.org https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
[Architecture] [Arch] [ML] [DAS] Common directory path for spark conf files in ML and DAS
Hi Nirmal, Currently, the Spark configurations in ML are in the /repository/conf/spark directory, while the DAS Spark configurations are in the /repository/analytics/spark directory. I suggest we move the ML Spark configurations also to the /repository/conf/ml/spark directory, because it would be more consistent and self-explanatory. WDYT? Best -- *Niranda Perera* Software Engineer, WSO2 Inc. Mobile: +94-71-554-8430 Twitter: @n1r44 <https://twitter.com/N1R44> https://pythagoreanscript.wordpress.com/ ___ Architecture mailing list Architecture@wso2.org https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
[Architecture] [Archi] DAS 3.0.1 ML integration approach
Hi all, This is to explain how the ML components will be integrated with DAS from the DAS 3.1.0 release onwards. There are two modes: - Standalone mode - ML components will share the Spark context instance created by the DAS components. Recommended for development and testing environments. - Clustered mode - ML components and DAS components will have separate Spark contexts. This means that the Spark cluster will be separated out and consumed by the two components separately. Important points to note: - In the clustered approach, there will be a clear resource separation between the two components. - ML components will be disabled by default and can be enabled by passing a Java environment variable. - Care should be taken when allocating resources, since there will be two separate Spark contexts in the cluster, and the ML instance needs to be taken out of the analyzer cluster. Future developments: This approach will be changed later on, so that even in the clustered mode, ML components have the option of using the analytics Spark context. Would like to know your input regarding this. Best -- *Niranda Perera* Software Engineer, WSO2 Inc. Mobile: +94-71-554-8430 Twitter: @n1r44 <https://twitter.com/N1R44> https://pythagoreanscript.wordpress.com/ ___ Architecture mailing list Architecture@wso2.org https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
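The two modes above can be sketched in a few lines. This is only an illustration of the standalone-vs-clustered context sharing described in the mail; `spark_contexts` and the context names are assumptions, not the actual DAS/ML implementation.

```python
# Sketch: which Spark contexts exist, depending on the deployment mode and
# whether ML is enabled (ML is disabled by default per the mail above).
def spark_contexts(mode, ml_enabled=False):
    contexts = {"das": "das-spark-context"}   # always created by DAS
    if ml_enabled:
        if mode == "standalone":
            # standalone: ML shares the context DAS created
            contexts["ml"] = contexts["das"]
        elif mode == "clustered":
            # clustered: a separate context, i.e. separate resources
            contexts["ml"] = "ml-spark-context"
    return contexts
```

The clustered branch is what forces the resource-allocation care noted above: two contexts mean two separate claims on the cluster's executors.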
[Architecture] [Archi] [DAS] Changing the schema setting behavior in Spark SQL when using CarbonAnalytics connector
Hi all, This is to inform you that we have made a small change to the existing schema setting behavior in DAS when using the CarbonAnalytics connector in Spark SQL. Let me clarify the approaches here. *Previous approach * Assume that there is a table corresponding to a stream 'abcd' with the schema 'a int, b int, c int, d int'. So, the following queries were available. 1. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd"); --> Infers the schema from the DAL (data access layer) 2. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int"); --> this schema and the existing schema *will be merged and set in the DAL & in Spark* 3. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int, *e int*"); --> because of the schema merge, this is also supported (to define a new field) 4. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int, *_timestamp long*"); --> allows the timestamp to be used for queries Implications: Because of the merge approach in the #3 query, the final order of the schema (which was set in the DAL) was not definite. Ex: (a, b, c, d) merge (a, b, c, d, e) --> (a, b, c, d, e) BUT (a, b, c, d) merge (e, d, c) --> (a, b, e, d, c) This resulted in an issue where we had to put aliases for each field in the insert statements. Ex: INSERT INTO TABLE test SELECT 1, 2, 3, 4, 5; could result in a=1, b=2, ..., d=5 OR a=1, b=2, e=3, d=4, c=5 depending on the merge. So, we had to use aliases: INSERT INTO TABLE test SELECT 1 as a, 2 as b, 3 as c, 4 as d, 5 as e; Because of this undefined nature of the merged schema, we had to fix the position of the special field "_timestamp". So, "_timestamp" was put in as the last element in the merged schema. *New approach * In the new approach, we have separated out the schemas in Spark and the DAL. 
Now, when a user explicitly mentions a schema, the merged schema will be set in the DAL and the given schema will be used in Spark. As per the same example as before, 1. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd"); --> No change. Infers the schema from the DAL (data access layer) 2. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int"); --> This schema and the existing schema will be merged and set in the DAL *only.* This given schema will be used in Spark 3. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int, *e int*"); --> Merged schema will be set in the DAL. This given schema will be used in Spark 4. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int, *_timestamp long*"); --> allows the timestamp to be used for queries So, now there's no ambiguity in the schema setting. If you set a schema in Spark SQL as 'a int, b int, c int, d int', then that will be the final schema in the Spark runtime. Ideally, this change should not conflict with the current samples and analytics4x implementations. Just wanted to keep you guys informed. Best -- *Niranda Perera* Software Engineer, WSO2 Inc. Mobile: +94-71-554-8430 Twitter: @n1r44 <https://twitter.com/N1R44> https://pythagoreanscript.wordpress.com/ ___ Architecture mailing list Architecture@wso2.org https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
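The DAL/Spark schema split described in the thread above can be illustrated with a small sketch. This is plain Python and purely illustrative of the merge semantics; `merge_for_dal` is not an actual DAS function.

```python
def merge_for_dal(existing, given):
    """Union of the two schemas: fields already in the existing schema keep
    their positions (types updated in place), and brand-new fields from the
    given schema are appended at the end."""
    merged = dict(existing)          # name -> type, insertion-ordered
    for name, typ in given:
        merged[name] = typ           # update in place, or append if new
    return list(merged.items())

# The stream 'abcd' example from the mail, with a new field 'e' added:
existing = [("a", "int"), ("b", "int"), ("c", "int"), ("d", "int")]
given = [("a", "int"), ("b", "int"), ("c", "int"),
         ("d", "int"), ("e", "int")]

dal_schema = merge_for_dal(existing, given)   # merged schema, set in the DAL only
spark_schema = given                          # the user's schema, used verbatim in Spark
```

With this split, query #3 still grows the DAL schema with the new field, but the Spark runtime schema is exactly what the user wrote, so INSERT statements no longer depend on an undefined merge order.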
Re: [Architecture] [DAS] Overhauling the Spark JDBC Connector
will be testing against all DBs supported by DAS >>>>>>>> over the >>>>>>>> following days. The connector is expected to be shipped with the DAS >>>>>>>> 3.1.0 >>>>>>>> release. >>>>>>>> >>>>>>>> Thoughts welcome. >>>>>>>> >>>>>>>> [1] https://github.com/wso2/carbon-analytics/pull/187 >>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> -- >>>>>>>> Gokul Balakrishnan >>>>>>>> Senior Software Engineer, >>>>>>>> WSO2, Inc. http://wso2.com >>>>>>>> M +94 77 5935 789 | +44 7563 570502 >>>>>>>> >>>>>>>> >>>>>>>> ___ >>>>>>>> Architecture mailing list >>>>>>>> Architecture@wso2.org >>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Thanks & Regards, >>>>>>> >>>>>>> Inosh Goonewardena >>>>>>> Associate Technical Lead- WSO2 Inc. >>>>>>> Mobile: +94779966317 >>>>>>> >>>>>>> ___ >>>>>>> Architecture mailing list >>>>>>> Architecture@wso2.org >>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Gokul Balakrishnan >>>>>> Senior Software Engineer, >>>>>> WSO2, Inc. http://wso2.com >>>>>> M +94 77 5935 789 | +44 7563 570502 >>>>>> >>>>>> >>>>>> ___ >>>>>> Architecture mailing list >>>>>> Architecture@wso2.org >>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Thanks & regards, >>>>> Nirmal >>>>> >>>>> Team Lead - WSO2 Machine Learner >>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc. >>>>> Mobile: +94715779733 >>>>> Blog: http://nirmalfdo.blogspot.com/ >>>>> >>>>> >>>>> >>>> >>>> >>>> -- >>>> *Anjana Fernando* >>>> Senior Technical Lead >>>> WSO2 Inc. | http://wso2.com >>>> lean . enterprise . middleware >>>> >>> >>> >>> >>> -- >>> Gokul Balakrishnan >>> Senior Software Engineer, >>> WSO2, Inc. http://wso2.com >>> M +94 77 5935 789 | +44 7563 570502 >>> >>> >> >> >> -- >> Gokul Balakrishnan >> Senior Software Engineer, >> WSO2, Inc. 
http://wso2.com >> M +94 77 5935 789 | +44 7563 570502 >> >> >> ___ >> Architecture mailing list >> Architecture@wso2.org >> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >> >> > > > -- > Dulitha Wijewantha (Chan) > Software Engineer - Mobile Development > WSO2 Inc > Lean.Enterprise.Middleware > * ~Email duli...@wso2.com <duli...@wso2mobile.com>* > * ~Mobile +94712112165 <%2B94712112165>* > * ~Website dulitha.me <http://dulitha.me>* > * ~Twitter @dulitharw <https://twitter.com/dulitharw>* > *~Github @dulichan <https://github.com/dulichan>* > *~SO @chan <http://stackoverflow.com/users/813471/chan>* > > ___ > Architecture mailing list > Architecture@wso2.org > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture > > -- *Niranda Perera* Software Engineer, WSO2 Inc. Mobile: +94-71-554-8430 Twitter: @n1r44 <https://twitter.com/N1R44> https://pythagoreanscript.wordpress.com/ ___ Architecture mailing list Architecture@wso2.org https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
Re: [Architecture] Incremental Processing Support in DAS
>>>>>> Here's a quick introduction into that. >>>>>> >>>>>> *Execution*: >>>>>> >>>>>> In the first run of the script, it will process all the data in the >>>>>> given table and store the last processed event timestamp. >>>>>> Then from the next run onwards it would start processing >>>>>> from that stored timestamp. >>>>>> >>>>>> Until the query that contains the data processing part completes, >>>>>> the last processed event timestamp would not be overridden with the new >>>>>> value. >>>>>> This is to ensure that the data processing for the query wouldn't >>>>>> have to be done again if the whole query fails. >>>>>> This is ensured by adding a commit query after the main query. >>>>>> Refer to the Syntax section for the example. >>>>>> >>>>>> *Syntax*: >>>>>> >>>>>> In the Spark script, incremental processing support has to be >>>>>> specified per table; this would happen in the create temporary table >>>>>> line. >>>>>> >>>>>> ex: CREATE TEMPORARY TABLE T1 USING CarbonAnalytics options >>>>>> (tableName "test", >>>>>> *incrementalProcessing "T1,3600");* >>>>>> >>>>>> INSERT INTO T2 SELECT username, age FROM T1 GROUP BY age; >>>>>> >>>>>> INC_TABLE_COMMIT T1; >>>>>> >>>>>> The last line is where it ensures the processing took place >>>>>> successfully, and replaces the lastProcessed timestamp with the new one. >>>>>> >>>>>> *TimeWindow*: >>>>>> >>>>>> To do the incremental processing, the user has to provide the time >>>>>> window per which the data would be processed. >>>>>> In the above example, the data would be summarized in *1 hour *time >>>>>> windows. >>>>>> >>>>>> *WindowOffset*: >>>>>> >>>>>> Events that belong to a previously processed time window might >>>>>> arrive late. To account for that, we have added an optional parameter that >>>>>> allows us to >>>>>> process the immediately previous time windows as well (acts like a >>>>>> buffer). 
>>>>>> ex: If this is set to 1, apart from the to-be-processed data, data >>>>>> related to the previously processed time window will also be taken for >>>>>> processing. >>>>>> >>>>>> >>>>>> *Limitations*: >>>>>> >>>>>> Currently, multiple time windows cannot be specified per temporary >>>>>> table in the same script. >>>>>> It would have to be done using different temporary tables. >>>>>> >>>>>> >>>>>> >>>>>> *Future Improvements:* >>>>>> - Add aggregation function support for incremental processing >>>>>> >>>>>> >>>>>> Thanks, >>>>>> Sachith >>>>>> -- >>>>>> Sachith Withana >>>>>> Software Engineer; WSO2 Inc.; http://wso2.com >>>>>> E-mail: sachith AT wso2.com >>>>>> M: +94715518127 >>>>>> Linked-In: <http://goog_416592669> >>>>>> https://lk.linkedin.com/in/sachithwithana >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Srinath Perera, Ph.D. >>>>>http://people.apache.org/~hemapani/ >>>>>http://srinathsview.blogspot.com/ >>>>> >>>> >>>> >>>> >>>> -- >>>> Sachith Withana >>>> Software Engineer; WSO2 Inc.; http://wso2.com >>>> E-mail: sachith AT wso2.com >>>> M: +94715518127 >>>> Linked-In: <http://goog_416592669> >>>> https://lk.linkedin.com/in/sachithwithana >>>> >>>> ___ >>>> Architecture mailing list >>>> Architecture@wso2.org >>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>> >>>> >>> >>> >>> -- >>> Thanks & Regards, >>> >>> Inosh Goonewardena >>> Associate Technical Lead- WSO2 Inc. >>> Mobile: +94779966317 >>> >> >> >> >> -- >> Sachith Withana >> Software Engineer; WSO2 Inc.; http://wso2.com >> E-mail: sachith AT wso2.com >> M: +94715518127 >> Linked-In: <http://goog_416592669> >> https://lk.linkedin.com/in/sachithwithana >> > > > > -- > Thanks & Regards, > > Inosh Goonewardena > Associate Technical Lead- WSO2 Inc. > Mobile: +94779966317 > > ___ > Architecture mailing list > Architecture@wso2.org > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture > > -- *Niranda Perera* Software Engineer, WSO2 Inc. 
Mobile: +94-71-554-8430 Twitter: @n1r44 <https://twitter.com/N1R44> https://pythagoreanscript.wordpress.com/ ___ Architecture mailing list Architecture@wso2.org https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
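The commit-after-success semantics described in the incremental processing thread above can be sketched as follows. The class and method names are illustrative, not the actual DAS implementation; only the timestamp bookkeeping mirrors what the mail describes.

```python
class IncrementalTable:
    """Sketch of per-table incremental processing state: the committed
    timestamp only advances after INC_TABLE_COMMIT, so a failed query is
    simply retried over the same range on the next run."""

    def __init__(self, window_secs, offset_windows=0):
        self.window = window_secs        # timeWindow, e.g. 3600 for 1-hour windows
        self.offset = offset_windows     # windowOffset buffer for late events
        self.last_processed = None       # committed timestamp (None before first run)
        self._pending = None

    def range_to_process(self, now):
        """First run: everything. Later runs: from the committed timestamp,
        minus the optional buffer of already-processed windows."""
        if self.last_processed is None:
            start = 0
        else:
            start = self.last_processed - self.offset * self.window
        self._pending = now
        return start, now

    def commit(self):
        """INC_TABLE_COMMIT: only now does last_processed move forward."""
        self.last_processed = self._pending

t = IncrementalTable(window_secs=3600, offset_windows=1)
first = t.range_to_process(now=7200)     # first run: process all data
t.commit()
second = t.range_to_process(now=10800)   # re-covers one earlier window as a buffer
```

If the main query fails before `commit()`, `last_processed` is untouched and the next run reprocesses the same range, which is exactly the guarantee the thread describes.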
Re: [Architecture] [Dev] Carbon Spark JDBC connector
Hi Gihan, are we talking about incremental processing here? insert into/overwrite queries will normally be used to push analyzed data into summary tables. In Spark jargon, 'insert overwrite table' means completely deleting the table and recreating it. I'm a bit confused about the meaning of 'overwrite' with respect to the previous 2.5.0 Hive scripts; were we doing an update there? rgds

On Tue, Aug 11, 2015 at 7:58 PM, Gihan Anuruddha gi...@wso2.com wrote:

Hi Niranda, Are we going to solve those limitations before the GA? Especially limitation no. 2. Over time we can have a stat table with thousands of records, so are we going to remove all the records and reinsert them every time that Spark script runs? Regards, Gihan

On Tue, Aug 11, 2015 at 7:13 AM, Niranda Perera nira...@wso2.com wrote:

Hi all, we have implemented a custom Spark JDBC connector to be used in the Carbon environment. This enables the following:

1. Temporary tables can now be created in the Spark environment by specifying an analytics datasource (configured by analytics-datasources.xml) and a table.
2. Spark uses a "SELECT 1 FROM $table LIMIT 1" query to check the existence of a table, and the LIMIT query is not provided by all DBs. With the new connector, this query can now be provided as a config. (This config is still WIP.)
3. Adding new Spark dialects for various DBs (WIP).

The idea is to test this for the following DBs: mysql, h2, mssql, oracle, postgres, db2. I have loosely tested the connector with MySQL, and I would like the APIM team to use it with the API usage stats use case and provide us some feedback.

This connector can be accessed as follows (docs are still not updated; I will do that ASAP):

create temporary table temp_table using CarbonJDBC options (dataSource "datasource name", tableName "table name");
select * from temp_table;
insert into/overwrite table temp_table <some select statement>;

Known limitations:
1. When creating a temp table, it should already be created in the underlying datasource.
2. insert overwrite table deletes the existing table and creates it again.

Would be very grateful if you could use this connector in your current JDBC use cases and provide us with feedback.

best
--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
--
W.G. Gihan Anuruddha
Senior Software Engineer | WSO2, Inc.
M: +94772272595
--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
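[Editor's note] To make limitation no. 2 concrete, here is a minimal, hypothetical sketch (plain Python + sqlite3, not the actual CarbonJDBC code; the table and helper names are illustrative) of the difference between the append and overwrite behaviours discussed above:

```python
import sqlite3

# Hypothetical sketch of the two write modes, using plain sqlite3
# (NOT the CarbonJDBC connector). "summary" and the helper names
# are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE summary (key INTEGER, value REAL)")
conn.execute("INSERT INTO summary VALUES (1, 10.0)")

def insert_into(conn, rows):
    # "insert into table": appends to the existing table
    conn.executemany("INSERT INTO summary VALUES (?, ?)", rows)

def insert_overwrite(conn, rows):
    # "insert overwrite table": drops and recreates the table, then
    # inserts -- the behaviour behind limitation no. 2
    conn.execute("DROP TABLE summary")
    conn.execute("CREATE TABLE summary (key INTEGER, value REAL)")
    conn.executemany("INSERT INTO summary VALUES (?, ?)", rows)

insert_into(conn, [(2, 20.0)])
print(conn.execute("SELECT COUNT(*) FROM summary").fetchone()[0])  # 2

insert_overwrite(conn, [(3, 30.0)])
print(conn.execute("SELECT COUNT(*) FROM summary").fetchone()[0])  # 1
```

This is why Gihan's concern matters: with thousands of rows in a stat table, every overwrite rewrites the whole table rather than updating it in place.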
[Architecture] [Dev] Carbon Spark JDBC connector
Hi all, we have implemented a custom Spark JDBC connector to be used in the Carbon environment. This enables the following:

1. Temporary tables can now be created in the Spark environment by specifying an analytics datasource (configured by analytics-datasources.xml) and a table.
2. Spark uses a "SELECT 1 FROM $table LIMIT 1" query to check the existence of a table, and the LIMIT query is not provided by all DBs. With the new connector, this query can now be provided as a config. (This config is still WIP.)
3. Adding new Spark dialects for various DBs (WIP).

The idea is to test this for the following DBs: mysql, h2, mssql, oracle, postgres, db2. I have loosely tested the connector with MySQL, and I would like the APIM team to use it with the API usage stats use case and provide us some feedback.

This connector can be accessed as follows (docs are still not updated; I will do that ASAP):

create temporary table temp_table using CarbonJDBC options (dataSource "datasource name", tableName "table name");
select * from temp_table;
insert into/overwrite table temp_table <some select statement>;

Known limitations:
1. When creating a temp table, it should already be created in the underlying datasource.
2. insert overwrite table deletes the existing table and creates it again.

Would be very grateful if you could use this connector in your current JDBC use cases and provide us with feedback.

best
--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
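[Editor's note] As a side note on point 2 above, the per-dialect existence-check idea can be sketched roughly as follows. This is an illustration only: the dictionary, function names, and the Oracle/MSSQL query forms are assumptions, not the actual CarbonJDBC configuration (sqlite3 stands in for a JDBC connection because it happens to accept the MySQL-style LIMIT form).

```python
import sqlite3

# Illustrative per-dialect existence-probe queries, in the spirit of the
# configurable "SELECT 1 FROM $table LIMIT 1" described above. The dict,
# names, and the Oracle/MSSQL forms are assumptions, not connector config.
EXISTENCE_QUERY = {
    "mysql":  "SELECT 1 FROM {table} LIMIT 1",
    "oracle": "SELECT 1 FROM {table} WHERE ROWNUM = 1",
    "mssql":  "SELECT TOP 1 1 FROM {table}",
}

def table_exists(conn, table, dialect="mysql"):
    # Run the dialect's probe query; any DB error means "no such table"
    try:
        conn.execute(EXISTENCE_QUERY[dialect].format(table=table))
        return True
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")  # sqlite accepts the MySQL-style LIMIT form
conn.execute("CREATE TABLE t (x INT)")
print(table_exists(conn, "t"))        # True
print(table_exists(conn, "missing"))  # False
```

A real connector would presumably run the configured probe query over the datasource's own connection; the point is only that the query string has to vary per dialect, which is why making it a config entry helps.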
[Architecture] [Dev] [DAS] Upgrading Spark 1.4.0 - 1.4.1 in DAS
Hi all, this is to inform that we will be upgrading Spark from 1.4.0 to 1.4.1. On the outset, the upgrade does not have any API changes or dependency upgrades; therefore, the version bump should not affect the current components. rgds
--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
Re: [Architecture] [DAS] Changing the name of Message Console
created a JIRA to track this [1] [1] https://wso2.org/jira/browse/BAM-2123 On Wed, Jul 15, 2015 at 4:50 PM, Maninda Edirisooriya mani...@wso2.com wrote: +1 for data explorer which will also make sense for the people that were familiar with BAM Cassandra Explorer. Unless we are going to support other query languages like Siddhi or SQL it is good to keep the name spark console instead of query analyzer or query console. *Maninda Edirisooriya* Senior Software Engineer *WSO2, Inc.*lean.enterprise.middleware. *Blog* : http://maninda.blogspot.com/ *E-mail* : mani...@wso2.com *Skype* : @manindae *Twitter* : @maninda On Mon, Jul 13, 2015 at 1:40 PM, Seshika Fernando sesh...@wso2.com wrote: My vote is for data explorer for message console and to keep spark console for spark console. seshi On Mon, Jul 13, 2015 at 10:36 AM, Anjana Fernando anj...@wso2.com wrote: Hi, +1 for Data Explorer for message console. The name Spark Console is fine the way it is now. Cheers, Anjana. On Sun, Jul 12, 2015 at 7:59 AM, Niranda Perera nira...@wso2.com wrote: Hi all, DAS currently ships a UI component named 'message console'. it can be used to browse data inside the DAS tables. IMO this name message console, is misleading. for a person who's new to DAS would not know the exact use of it just by reading the name. I suggest a more self-explanatory name such as, 'data explorer', 'data navigator' etc WDYT? -- *Niranda Perera* Software Engineer, WSO2 Inc. Mobile: +94-71-554-8430 Twitter: @n1r44 https://twitter.com/N1R44 https://pythagoreanscript.wordpress.com/ -- *Anjana Fernando* Senior Technical Lead WSO2 Inc. | http://wso2.com lean . enterprise . 
middleware
--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
[Architecture] [DAS] Changing the name of Message Console
Hi all, DAS currently ships a UI component named 'message console'. It can be used to browse data inside the DAS tables. IMO the name 'message console' is misleading; a person who's new to DAS would not know the exact use of it just by reading the name. I suggest a more self-explanatory name such as 'data explorer', 'data navigator', etc. WDYT?
--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
Re: [Architecture] [DAS] Changing the name of Message Console
@suho +1
@iranga I think you are referring to the spark console. Do you think we should change the name from 'spark console' to 'query analyzer'?

On Sun, Jul 12, 2015 at 8:28 AM, Sriskandarajah Suhothayan s...@wso2.com wrote:

+1 for data explorer, and I believe it's in line with the primary usecase of that. Suho

On Sat, Jul 11, 2015 at 10:46 PM, Iranga Muthuthanthri ira...@wso2.com wrote:

+1. Since the category mostly falls under interactive analytics and is more about querying data, I suggest Query Analyzer.

On Sun, Jul 12, 2015 at 7:59 AM, Niranda Perera nira...@wso2.com wrote:

Hi all, DAS currently ships a UI component named 'message console'. It can be used to browse data inside the DAS tables. IMO the name 'message console' is misleading; a person who's new to DAS would not know the exact use of it just by reading the name. I suggest a more self-explanatory name such as 'data explorer', 'data navigator', etc. WDYT?
--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
--
Thanks & Regards
Iranga Muthuthanthri
(M) 0777-255773
Team Product Management
--
*S. Suhothayan*
Technical Lead, Team Lead of WSO2 Complex Event Processor
WSO2 Inc. http://wso2.com
lean . enterprise . middleware
cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/ | twitter: http://twitter.com/suhothayan | linked-in: http://lk.linkedin.com/in/suhothayan
--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
Re: [Architecture] [POC] Performance evaluation of Hive vs Shark
Hi David, Could you point me to an example where SparkSQL is used in Stratio Deep? Rgds

On Mon, Dec 15, 2014 at 2:20 PM, David Morales dmora...@stratio.com wrote:

Hi there, For sure, the new release does support SparkSQL, so you can use SparkSQL and Stratio Deep together just out of the box. About Crossdata, it's not itself related to Spark but can use Spark-Deep. It's an interactive SQL like Hive, for example. Regards.

2014-12-12 21:29 GMT+01:00 Niranda Perera nira...@wso2.com:

Hi David, I have been going through the Deep-Spark examples. It looks very promising. On a follow-up query, does Deep-Spark/deep-cassandra support SQL-like operations on RDDs (like SparkSQL)? Example (from the Datastax Cassandra connector demos):

object SQLDemo extends DemoApp {
  val cc = new CassandraSQLContext(sc)
  CassandraConnector(conf).withSessionDo { session =>
    session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
    session.execute("DROP TABLE IF EXISTS test.sql_demo")
    session.execute("CREATE TABLE test.sql_demo (key INT PRIMARY KEY, grp INT, value DOUBLE)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (1, 1, 1.0)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (2, 1, 2.5)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (3, 1, 10.0)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (4, 2, 4.0)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (5, 2, 2.2)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (6, 2, 2.8)")
  }
  val rdd = cc.cassandraSql("SELECT grp, max(value) AS mv FROM test.sql_demo GROUP BY grp ORDER BY mv")
  rdd.collect().foreach(println) // [2, 4.0] [1, 10.0]
  sc.stop()
}

I also read about Stratio Crossdata. Does Crossdata serve this purpose?
Rgds On Tue, Dec 2, 2014 at 11:14 PM, David Morales dmora...@stratio.com wrote: Hi¡ Please, check the develop branch if you want to see a more realistic view of our development path. Last commit was about two hours ago :) Stratio Deep is one of our core modules so there is a core team in Stratio fully devoted to spark + noSQL integration. In these last months, for example, we have added mongoDB, ElasticSearch and Aerospike to Stratio Deep, so you can talk to these databases from Spark just like you do with HDFS. Furthermore, we are working on more backends, such as neo4j or couchBase, for example. About our benchmarks, you can check out some results in this link: http://www.stratio.com/deep-vs-datastax/ Please, keep in mind that spark integration with a datastore could be done in two ways: HCI or native. We are now working on improving native integration because it's quite more performant. In this way, we are just working on some other tests with even more impressive results. Here you can find a technical overview of all our platform. http://www.slideshare.net/Stratio/stratio-platform-overview-v41 Regards 2014-12-02 11:14 GMT+01:00 Niranda Perera nira...@wso2.com: Hi David, Sorry to re-initiate this thread. But may I know if you have done any benchmarking on Datastax Spark cassandra connector and Stratio Deep-spark cassandra integration? Would love to take a look at it. I recently checked deep-spark github repo and noticed that there is no activity since Oct 29th. May I know what your future plans on this particular project? Cheers On Tue, Aug 26, 2014 at 9:12 PM, David Morales dmora...@stratio.com wrote: Yes, it is already included in our benchmarks. It could be a nice idea to share our findings, let me talk about it here. Meanwhile, you can ask us any question by using my mail or this thread, we are glad to help you. Best regards. 2014-08-24 15:49 GMT+02:00 Niranda Perera nira...@wso2.com: Hi David, Thank you for your detailed reply. 
It was great to hear about Stratio-Deep and I must say, it looks very interesting. Storage handlers for databases such as Cassandra, MongoDB, etc. would be very helpful. We will definitely look into Stratio-Deep. I came across the Datastax Spark-Cassandra connector (https://github.com/datastax/spark-cassandra-connector). Have you done any comparison between your implementation and Datastax's connector? And, yes, please do share the performance results with us once they're ready. On a different note, is there any way for us to interact with the Stratio dev community, in the form of dev mailing lists etc., so that we could mutually share our findings? Best regards On Fri, Aug 22, 2014 at 2:07 PM, David Morales dmora...@stratio.com wrote: Hi there, *1. About the size of deployments.* It depends on your use case... especially when you combine Spark with a datastore. We usually deploy Spark with Cassandra or MongoDB, instead of using HDFS for example. Spark will be faster if you put
Re: [Architecture] [POC] Performance evaluation of Hive vs Shark
Hi David, I have been going through the Deep-Spark examples. It looks very promising. On a follow-up query, does Deep-Spark/deep-cassandra support SQL-like operations on RDDs (like SparkSQL)? Example (from the Datastax Cassandra connector demos):

object SQLDemo extends DemoApp {
  val cc = new CassandraSQLContext(sc)
  CassandraConnector(conf).withSessionDo { session =>
    session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
    session.execute("DROP TABLE IF EXISTS test.sql_demo")
    session.execute("CREATE TABLE test.sql_demo (key INT PRIMARY KEY, grp INT, value DOUBLE)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (1, 1, 1.0)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (2, 1, 2.5)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (3, 1, 10.0)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (4, 2, 4.0)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (5, 2, 2.2)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (6, 2, 2.8)")
  }
  val rdd = cc.cassandraSql("SELECT grp, max(value) AS mv FROM test.sql_demo GROUP BY grp ORDER BY mv")
  rdd.collect().foreach(println) // [2, 4.0] [1, 10.0]
  sc.stop()
}

I also read about Stratio Crossdata. Does Crossdata serve this purpose? Rgds

On Tue, Dec 2, 2014 at 11:14 PM, David Morales dmora...@stratio.com wrote:

Hi! Please, check the develop branch if you want to see a more realistic view of our development path. Last commit was about two hours ago :) Stratio Deep is one of our core modules, so there is a core team in Stratio fully devoted to Spark + NoSQL integration. In these last months, for example, we have added MongoDB, ElasticSearch and Aerospike to Stratio Deep, so you can talk to these databases from Spark just like you do with HDFS. Furthermore, we are working on more backends, such as Neo4j or CouchBase, for example.
About our benchmarks, you can check out some results in this link: http://www.stratio.com/deep-vs-datastax/ Please, keep in mind that spark integration with a datastore could be done in two ways: HCI or native. We are now working on improving native integration because it's quite more performant. In this way, we are just working on some other tests with even more impressive results. Here you can find a technical overview of all our platform. http://www.slideshare.net/Stratio/stratio-platform-overview-v41 Regards 2014-12-02 11:14 GMT+01:00 Niranda Perera nira...@wso2.com: Hi David, Sorry to re-initiate this thread. But may I know if you have done any benchmarking on Datastax Spark cassandra connector and Stratio Deep-spark cassandra integration? Would love to take a look at it. I recently checked deep-spark github repo and noticed that there is no activity since Oct 29th. May I know what your future plans on this particular project? Cheers On Tue, Aug 26, 2014 at 9:12 PM, David Morales dmora...@stratio.com wrote: Yes, it is already included in our benchmarks. It could be a nice idea to share our findings, let me talk about it here. Meanwhile, you can ask us any question by using my mail or this thread, we are glad to help you. Best regards. 2014-08-24 15:49 GMT+02:00 Niranda Perera nira...@wso2.com: Hi David, Thank you for your detailed reply. It was great to hear about Stratio-Deep and I must say, it looks very interesting. Storage handlers for databases such Cassandra, MongoDB etc would be very helpful. We will definitely look up on Stratio-Deep. I came across with the Datastax Spark-Cassandra connector ( https://github.com/datastax/spark-cassandra-connector ). Have you done any comparison with your implementation and Datastax's connector? And, yes, please do share the performance results with us once it's ready. 
On a different note, is there any way for us to interact with Stratio dev community, in the form of dev mail lists etc, so that we could mutually share our findings? Best regards On Fri, Aug 22, 2014 at 2:07 PM, David Morales dmora...@stratio.com wrote: Hi there, *1. About the size of deployments.* It depends on your use case... specially when you combine spark with a datastore. We use to deploy spark with cassandra or mongodb, instead of using HDFS for example. Spark will be faster if you put the data in memory, so if you need a lot of speed (interactive queries, for example), you should have enough memory. *2. About storage handlers.* We have developed the first tight integration between Cassandra and Spark, called Stratio Deep, announced in the first spark summit. You can check Stratio Deep out here: https://github.com/Stratio/stratio-deep (open, apache2 license). *Deep is a thin integration layer between Apache Spark and several NoSQL datastores. We actually support Apache Cassandra and MongoDB, but in the near
Re: [Architecture] [POC] Performance evaluation of Hive vs Shark
Hi David, Sorry to re-initiate this thread. But may I know if you have done any benchmarking on Datastax Spark cassandra connector and Stratio Deep-spark cassandra integration? Would love to take a look at it. I recently checked deep-spark github repo and noticed that there is no activity since Oct 29th. May I know what your future plans on this particular project? Cheers On Tue, Aug 26, 2014 at 9:12 PM, David Morales dmora...@stratio.com wrote: Yes, it is already included in our benchmarks. It could be a nice idea to share our findings, let me talk about it here. Meanwhile, you can ask us any question by using my mail or this thread, we are glad to help you. Best regards. 2014-08-24 15:49 GMT+02:00 Niranda Perera nira...@wso2.com: Hi David, Thank you for your detailed reply. It was great to hear about Stratio-Deep and I must say, it looks very interesting. Storage handlers for databases such Cassandra, MongoDB etc would be very helpful. We will definitely look up on Stratio-Deep. I came across with the Datastax Spark-Cassandra connector ( https://github.com/datastax/spark-cassandra-connector ). Have you done any comparison with your implementation and Datastax's connector? And, yes, please do share the performance results with us once it's ready. On a different note, is there any way for us to interact with Stratio dev community, in the form of dev mail lists etc, so that we could mutually share our findings? Best regards On Fri, Aug 22, 2014 at 2:07 PM, David Morales dmora...@stratio.com wrote: Hi there, *1. About the size of deployments.* It depends on your use case... specially when you combine spark with a datastore. We use to deploy spark with cassandra or mongodb, instead of using HDFS for example. Spark will be faster if you put the data in memory, so if you need a lot of speed (interactive queries, for example), you should have enough memory. *2. 
About storage handlers.* We have developed the first tight integration between Cassandra and Spark, called Stratio Deep, announced in the first Spark Summit. You can check Stratio Deep out here: https://github.com/Stratio/stratio-deep (open, Apache2 license). *Deep is a thin integration layer between Apache Spark and several NoSQL datastores. We actually support Apache Cassandra and MongoDB, but in the near future we will add support for several other datastores.* Datastax announced its own driver for Spark at the last Spark Summit, but we have been working on our solution for almost a year. Furthermore, we are working to extend this solution in order to work also with other databases... MongoDB integration is completed right now and ElasticSearch will be ready in a few weeks. And that is not all: we have also developed an integration of Cassandra and Lucene for indexing data (open source, Apache2). *Stratio Cassandra is a fork of Apache Cassandra (http://cassandra.apache.org/) where index functionality has been extended to provide near-real-time search such as ElasticSearch or Solr, including full-text search (http://en.wikipedia.org/wiki/Full_text_search) capabilities and free multivariable search. It is achieved through an Apache Lucene (http://lucene.apache.org/) based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data.* We will publish some benchmarks in two weeks, so I will share our results here if you are interested. If you are more interested in distributed file systems, you should take a look at Tachyon: http://tachyon-project.org/index.html *3. Spark - Hive compatibility* Spark will support anything with the Hadoop InputFormat interface. *4. Performance* We are working a lot with Cassandra and MongoDB and the performance is quite nice. We are finishing right now some benchmarks comparing Hadoop + HDFS vs Spark + HDFS vs Spark + Cassandra (using Stratio Deep and even our fork of Cassandra).
Let me please share these results with you when they are ready, ok? Regards.

2014-08-22 7:53 GMT+02:00 Niranda Perera nira...@wso2.com:

Hi Srinath, Yes, I am working on deploying it on a multi-node cluster with the DEBS dataset. I will keep architecture@ posted on the progress.

Hi David, Thank you very much for the detailed insight you've provided. A few quick questions:
1. Do you have experience in using storage handlers in Spark?
2. Would a storage handler used in Hive be directly compatible with Spark?
3. How do you grade the performance of Spark with other databases such as Cassandra, HBase, H2, etc.?

Thank you very much again for your interest. Look forward to hearing from you. Regards

On Thu, Aug 21, 2014 at 7:02 PM, Srinath Perera srin...@wso2.com wrote:

Niranda, we need to test Spark in multi-node mode before making a decision. Spark is very fast, I think there is no doubt about that. We need to make sure it's stable. David, thanks
[Architecture] POODLE Vulnerability (SSL 3.0) in WSO2 Carbon 3.0 Products
Hi all, This follows Prabath's blogpost on the POODLE attack and disabling SSLv3 in WSO2 Carbon 4.2.0 based products [1]. I was trying to disable SSLv3 in Carbon 3.0 as per the blogpost, but ran into the following problems:

1. I could not find a catalina-server.xml file in Carbon 3.0. Is there a different configuration file which governs the SSL protocol version in Carbon 3.0? AFAIK Carbon 3.0 uses Tomcat 5.5 (pls correct me if I am wrong!), which supports SSLv1, v2 and v3.

2. In the axis2.xml file of Carbon 3.0 products (ESB), I could not find a transportReceiver configuration element for PassThroughHttpSSLListener. Was this introduced after Carbon 3.0?

Would be very grateful if you could help me out on this. Cheers

[1] http://blog.facilelogin.com/2014/10/poodle-attack-and-disabling-ssl-v3-in_18.html
--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
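[Editor's note] For reference, on Carbon 4.2.0 based products the fix in [1] boils down to restricting the HTTPS connector to TLS. Below is a sketch of the relevant connector attributes; the exact attribute set is Tomcat-version dependent, and Carbon 3.0's Tomcat 5.5 predates sslEnabledProtocols, which is part of why the question above arises. Port and protocol values here are illustrative.

```xml
<!-- Hypothetical sketch: in catalina-server.xml (Carbon 4.2.0 / Tomcat 7),
     restricting the HTTPS connector to TLS disables SSLv3.
     Carbon 3.0's Tomcat 5.5 configuration (server.xml) only has the
     sslProtocol attribute, so this exact fix may not apply there. -->
<Connector protocol="org.apache.coyote.http11.Http11NioProtocol"
           port="9443"
           scheme="https"
           secure="true"
           SSLEnabled="true"
           sslProtocol="TLS"
           sslEnabledProtocols="TLSv1,TLSv1.1,TLSv1.2" />
```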
[Architecture] [Dev] [Suggestion] List of ports used by WSO2 products
Hi, Is there a list of ports used by WSO2 products by default? IMO it would be better if we could have one, because it would make it easier to set up Security Group Rules while spawning support cloud instances. Would like to know your opinion about this! Cheers
--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
Re: [Architecture] [Dev] [Suggestion] List of ports used by WSO2 products
Thanks Nuwan. Yes, I was looking for this, my bad!

On Wed, Oct 22, 2014 at 12:00 PM, Nuwan Silva nuw...@wso2.com wrote:

Yes, looking for something like [1]? [1] https://docs.wso2.com/display/shared/Default+Ports+of+WSO2+Products

On Wed, Oct 22, 2014 at 11:48 AM, Niranda Perera nira...@wso2.com wrote:

Hi, Is there a list of ports used by WSO2 products by default? IMO it would be better if we could have one, because it would make it easier to set up Security Group Rules while spawning support cloud instances. Would like to know your opinion about this! Cheers
--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
--
*Nuwan Silva*
*Senior Software Engineer - QA*
Mobile: +94779804543
WSO2 Inc.
lean . enterprise . middleware
http://www.wso2.com
--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
Re: [Architecture] BAM Performance tests
Re: [Architecture] [POC] Performance evaluation of Hive vs Shark
Hi Srinath, Yes, I am working on deploying it on a multi-node cluster with the DEBS dataset. I will keep architecture@ posted on the progress.

Hi David, Thank you very much for the detailed insight you've provided. A few quick questions:
1. Do you have experience in using storage handlers in Spark?
2. Would a storage handler used in Hive be directly compatible with Spark?
3. How do you grade the performance of Spark with other databases such as Cassandra, HBase, H2, etc.?

Thank you very much again for your interest. Look forward to hearing from you. Regards

On Thu, Aug 21, 2014 at 7:02 PM, Srinath Perera srin...@wso2.com wrote:

Niranda, we need to test Spark in multi-node mode before making a decision. Spark is very fast, I think there is no doubt about that. We need to make sure it's stable. David, thanks for a detailed email! How big (nodes) is the Spark setup you guys are running? --Srinath

On Thu, Aug 21, 2014 at 1:34 PM, David Morales dmora...@stratio.com wrote:

Sorry for disturbing this thread, but I think I can help clarify a few things (we were attending the last Spark Summit, we were also speakers there, and we are working very close to Spark).

*Hive/Shark and others benchmark* You can find a nice comparison and benchmark on this site: https://amplab.cs.berkeley.edu/benchmark/

*Shark and SparkSQL* SparkSQL is the natural replacement for Shark, but SparkSQL is still young at this moment. If you are looking for Hive compatibility, you have to execute SparkSQL with a specific context. Quoted from the Spark website: *Note that Spark SQL currently uses a very basic SQL parser. Users that want a more complete dialect of SQL should look at the HiveSQL support provided by HiveContext.* So, only note that SparkSQL is a work in progress. If you want SparkSQL you have to run a SparkSQLContext; if you want Hive, you will have a different context...

*Spark - Hadoop: the future* Most Hadoop distributions are including Spark: Cloudera, Hortonworks, MapR...
and they are contributing to migrating the whole Hadoop ecosystem to Spark. Spark is quite a bit more than Map/Reduce, as you can read here: http://gigaom.com/2014/06/28/4-reasons-why-spark-could-jolt-hadoop-into-hyperdrive/

*Spark Streaming / Spark SQL* Spark Streaming is built on Spark and provides stream processing through an abstraction called DStreams (a collection of RDDs in a window of time). There are some efforts to make Spark SQL compatible with Spark Streaming (something similar to Trident for Storm), as you can see here: *StreamSQL (https://github.com/thunderain-project/StreamSQL) is a POC project based on Spark that combines the power of Catalyst and Spark Streaming, to offer the ability to run SQL on top of a DStream. It keeps the same semantics as Spark SQL and offers a SchemaDStream on top of DStream, so you don't need to do tricky things like extracting an RDD to register it as a table. The other parts are the same as Spark.* So, you can apply SQL to a data stream, but it is very simple at the moment; you can expect a bunch of improvements in this area in the coming months (I guess Spark SQL will work on Spark Streaming streams before the end of this year).

*Spark Streaming / Spark SQL and CEP* There is no relationship at this moment between (your absolutely amazing) Siddhi CEP and Spark. As far as I know, you are working on doing distributed CEP with Storm and Siddhi. We are currently working on an interactive CEP built with Kafka + Spark Streaming + Siddhi, with features such as an API, an interactive shell, built-in statistics and auditing, and built-in functions (save2cassandra, save2mongo, save2elasticsearch...). If you are interested we can talk about this project; I think it would be a nice idea! Anyway, I don't think Spark SQL will evolve into something like a CEP.
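David's description of a DStream as "a collection of RDDs in a window of time" can be illustrated without Spark at all. Below is a minimal stdlib Python sketch of the micro-batching idea (all names here are invented for illustration; this is not the Spark Streaming API):

```python
from itertools import groupby

def dstream(events, batch_interval):
    """Group a timestamped event stream into micro-batches, the way a
    DStream is a sequence of RDDs, one per window of time."""
    keyfn = lambda e: e[0] // batch_interval
    # groupby requires its input sorted by the grouping key
    for _, batch in groupby(sorted(events, key=keyfn), key=keyfn):
        yield [value for _, value in batch]

# Events as (timestamp_seconds, value); 2-second batch interval.
events = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
# A per-batch reduction, analogous to an operation applied to each RDD.
batches = [sum(b) for b in dstream(events, batch_interval=2)]
print(batches)  # [3, 7, 5]
```

The point of the abstraction is that the same per-batch function runs on every window, which is why SQL over a DStream (as in StreamSQL) amounts to re-running a query per micro-batch.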
Patterns and sequences, for example, would be very complex to do with Spark Streaming (at least for now). Thanks.

2014-08-21 6:18 GMT+02:00 Sriskandarajah Suhothayan s...@wso2.com: On Wed, Aug 20, 2014 at 1:36 PM, Niranda Perera nira...@wso2.com wrote: @Maninda, +1 for suggesting Spark SQL. Quoting Databricks: Spark SQL provides state-of-the-art SQL performance and maintains compatibility with Shark/Hive. In particular, like Shark, Spark SQL supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore. [1] But I am not entirely sure whether Spark SQL and Siddhi are comparable, because Spark SQL (like Hive) is designed for batch processing, whereas Siddhi does real-time processing. But if there are implementations where Siddhi is run on top of Spark, it would be very interesting. Yes, Siddhi's current way of operation does not support this, but with partitions we can achieve this to some extent. Suho
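Suho's point about achieving this "with partitions" can be sketched in plain Python: route events by a partition key so that each partition can run its own sequence matcher independently (e.g. on different workers). The event schema and the pattern below are hypothetical, not Siddhi's actual API:

```python
from collections import defaultdict

def detect_pattern(events, pattern=("login_failed", "login_failed", "login_ok")):
    """Naive per-partition sequence detector: True if the event names
    contain the pattern as a contiguous subsequence."""
    names = tuple(e["name"] for e in events)
    return any(names[i:i + len(pattern)] == pattern
               for i in range(len(names) - len(pattern) + 1))

# Partition the stream by user; each partition is independent state,
# which is what makes this style of CEP distributable.
stream = [
    {"user": "alice", "name": "login_failed"},
    {"user": "bob",   "name": "login_ok"},
    {"user": "alice", "name": "login_failed"},
    {"user": "alice", "name": "login_ok"},
]
partitions = defaultdict(list)
for event in stream:
    partitions[event["user"]].append(event)

matches = {user: detect_pattern(evts) for user, evts in partitions.items()}
print(matches)  # {'alice': True, 'bob': False}
```

Cross-partition patterns (a sequence spanning two users) are exactly the case this approach cannot cover, which matches the "to some extent" caveat above.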
Re: [Architecture] [POC] Performance evaluation of Hive vs Shark
Hi Anjana and Srinath, After the discussion I had with Anjana, I researched more on the continuation of the Shark project by Databricks. Here's what I found out:
- Shark was built on the Hive codebase and achieved performance improvements by swapping out the physical execution engine part of Hive. While this approach enabled Shark users to speed up their Hive queries, Shark inherited a large, complicated code base from Hive that made it hard to optimize and maintain. Hence, Databricks has announced that they are halting the development of Shark as of July 2014 (Shark 0.9 will be the last release). [1]
- Shark will be replaced by Spark SQL. It beats Shark in TPC-DS performance (http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html) by almost an order of magnitude. It also supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore. [2]
- The Shark to Spark SQL migration plan is here: http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf
- For legacy Hive and MapReduce users, they have proposed a new 'Hive on Spark' project. [3], [4] But, given the performance enhancement, it is quite certain that Hive and MR will be replaced by engines built on top of Spark (e.g. Spark SQL).

In my opinion there are a few matters to figure out if we are migrating from Hive:
1. Are we changing the query engine only? (Then, we can replace Hive with Shark.)
2. Are we changing the existing Hadoop/MapReduce framework to Spark? (Then, we can replace Hive and Hadoop with Spark and Spark SQL.)
In my opinion, considering the long-term impact and the availability of support, it is best to migrate from Hive/Hadoop to Spark. It is open for discussion! In the meantime, I've already tried Spark SQL, and Databricks' claims of improved performance seem to be true. I will work more on this.
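The first point above, Shark reusing Hive's front end (parser, metastore) while swapping only the physical execution engine, is essentially a pluggable-engine design. A hedged stdlib Python sketch of that separation (all class names invented; the "engines" are trivial stand-ins, not real executors):

```python
# Hypothetical sketch of the front-end / execution-engine split that
# let Shark reuse Hive's parser while replacing the physical engine.
class MapReduceEngine:
    name = "mapreduce"
    def run(self, plan, data):
        return sum(data)          # stand-in for a disk-based batch job

class InMemoryEngine:
    name = "in-memory"
    def run(self, plan, data):
        return sum(data)          # same answer, different physical engine

class QueryFrontEnd:
    """Owns parsing/planning; the engine behind it is pluggable."""
    def __init__(self, engine):
        self.engine = engine
    def execute(self, query, data):
        plan = query.strip().lower()   # stand-in for the planner step
        return self.engine.run(plan, data)

data = [1, 2, 3, 4]
hive_style  = QueryFrontEnd(MapReduceEngine()).execute("SELECT SUM(x)", data)
shark_style = QueryFrontEnd(InMemoryEngine()).execute("SELECT SUM(x)", data)
print(hive_style, shark_style)  # 10 10 (same semantics, swapped engine)
```

The design buys query compatibility for free, but, as the mail notes, it also means inheriting the front end's entire code base, which is what made Shark hard to maintain.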
Cheers [1] http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html [2] http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html [3] https://issues.apache.org/jira/browse/HIVE-7292 [4] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando anj...@wso2.com wrote: Hi Srinath, No, this has not been tested on multiple nodes. I told Niranda in my last mail to test a cluster on the same set of hardware that we are using to test our large data set with Hive. As for the effort to make the change, we still have to figure out the MT aspects of Shark here. Sinthuja was working on making the latest Hive version MT-ready, and most probably we can make the same changes to the Hive version Shark is using. After we do that, the integration should be seamless. And also, as I mentioned earlier here, we are also going to test this with the APIM Hive script, to check if there are any unforeseen incompatibilities. Cheers, Anjana. On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera srin...@wso2.com wrote: This looks great. We need to test Spark with multiple nodes. Did we do that? Please create a few VMs in the performance cloud (talk to Lakmal) and test with at least 5 nodes. We need to make sure it works OK in a distributed setup as well. What does it take to change to Spark? Anjana, how much work is it? --Srinath On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera nira...@wso2.com wrote: Thank you Anjana. Yes, I am working on it. In the meantime, I found this in the Hive documentation [1]. It talks about Hive on Spark, and compares Hive, Shark and Spark SQL at a higher architectural level. Additionally, it is said that the in-memory performance of Shark can be improved by introducing Tachyon [2]. I guess we can consider this later on. Cheers.
[1] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL [2] http://tachyon-project.org/Running-Tachyon-Locally.html On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando anj...@wso2.com wrote: Hi Niranda, Excellent analysis of Hive vs Shark! This gives a lot of insight into how both operate in different scenarios. As the next step, we will need to run this in an actual cluster of computers. Since you've used a subset of the dataset of the 2014 DEBS challenge, we should use the full data set in a clustered environment and check this. Gokul is already working on the Hive based setup for this; after that is done, you can create a Shark cluster on the same hardware and run the tests there, to get a clear comparison of how these two match up in a cluster. Until the setup is ready, do continue with your next steps on checking the RDD support and Spark SQL use. After these are done, we should also do a trial run of our own APIM Hive scripts, migrated to Shark. Cheers, Anjana. On Mon, Aug 11, 2014 at 12:21 PM, Niranda
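Whichever engines end up in the clustered comparison above, single-run timings are noisy (JVM warm-up, OS caches). A minimal stdlib Python sketch of a fairer measurement harness, with placeholder workloads standing in for the actual Hive/Shark queries:

```python
import time
import statistics

def benchmark(fn, warmup=2, runs=5):
    """Run fn a few times untimed to absorb warm-up effects, then
    report the median wall-clock time of several timed runs, which is
    less noisy than a single measurement."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Placeholder workloads; in the real comparison these would submit the
# same query to each engine.
heavy = lambda: sum(i * i for i in range(50_000))
light = lambda: sum(i * i for i in range(5_000))
print(f"heavy: {benchmark(heavy):.6f}s, light: {benchmark(light):.6f}s")
```

For a cluster test the same idea applies per query: warm up the caches first (Shark's in-memory advantage only shows after data is loaded), then compare medians rather than single runs.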
Re: [Architecture] [POC] Performance evaluation of Hive vs Shark
@Maninda, +1 for suggesting Spark SQL. Quoting Databricks: Spark SQL provides state-of-the-art SQL performance and maintains compatibility with Shark/Hive. In particular, like Shark, Spark SQL supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore. [1] But I am not entirely sure whether Spark SQL and Siddhi are comparable, because Spark SQL (like Hive) is designed for batch processing, whereas Siddhi does real-time processing. But if there are implementations where Siddhi is run on top of Spark, it would be very interesting. Spark supports either Hadoop 1 or 2, but I think we should see which is best: MR1 or YARN+MR2. [image: Hadoop Architecture] [2] [1] http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html [2] http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html On Wed, Aug 20, 2014 at 1:13 PM, Lasantha Fernando lasan...@wso2.com wrote: Hi Maninda, On 20 August 2014 12:02, Maninda Edirisooriya mani...@wso2.com wrote: In the case of discontinuation of the Shark project, IMO we should not move to Shark at all. And it seems better to go with Spark SQL, as we are already using Spark for CEP. But I am not sure of the difference between Spark SQL and Siddhi queries on the Spark engine. Currently, we are doing the integration with CEP using Apache Storm, not Spark... :-). Spark Streaming is a possible candidate for integrating with CEP, but we have opted for Storm. I think there has been some independent work on integrating Kafka + Spark Streaming + Siddhi. Please refer to the thread on arch@: [Architecture] A few questions about WSO2 CEP/Siddhi. And we have to figure out how Spark SQL can be used for historical data, and whether it can execute incremental processing by default, which would cover all our existing BAM use cases. On the other hand, Hadoop 2 [1] uses a completely different platform for resource allocation known as YARN. Sometimes this may be more suitable for batch jobs.
[1] https://www.youtube.com/watch?v=RncoVN0l6dc Thanks, Lasantha *Maninda Edirisooriya* Senior Software Engineer *WSO2, Inc. *lean.enterprise.middleware. *Blog* : http://maninda.blogspot.com/ *E-mail* : mani...@wso2.com *Skype* : @manindae *Twitter* : @maninda On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera nira...@wso2.com wrote:
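The batch-versus-incremental question running through this thread (Spark SQL and Hive recompute over stored data; Siddhi updates per event) can be sketched in plain Python. Both styles below compute the same aggregate, but the streaming one keeps O(1) state per event; all names are illustrative, not any product's API:

```python
def batch_average(events):
    """Batch style (Hive / Spark SQL): recompute over the full dataset
    each time the query runs."""
    return sum(events) / len(events)

class RunningAverage:
    """Streaming style (Siddhi): constant-size state, updated per event,
    with the current answer available at any moment."""
    def __init__(self):
        self.total = 0.0
        self.count = 0
    def on_event(self, value):
        self.total += value
        self.count += 1
        return self.total / self.count

events = [4, 8, 6, 2]
stream = RunningAverage()
incremental = [stream.on_event(v) for v in events]
print(batch_average(events), incremental[-1])  # 5.0 5.0
```

The final answers agree; the difference is latency and cost. The batch query must rescan everything on each run, while the incremental form answers instantly after each event, which is why "incremental processing by default" matters for the BAM-style use cases mentioned above.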