Re: [Architecture] [C5] Spark/Lucene Integration in Stream Processor

2016-10-23 Thread Niranda Perera
+1 for this approach. This would be a much cleaner way to integrate with
Spark.
So now, rather than trying to customize Spark to work with our own
clustering, we can focus on a more generic approach and perhaps
contribute it back to the community as well!

@Nirmal & Suho,
I still think we would need the Spark binaries in the runtime. It's just that
we would not have to meddle with the internals of Spark clustering etc.,
which we are handling ourselves at the moment.
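
For reference, a rough sketch of what a driver attaching to such an
externally managed, ZooKeeper-coordinated Spark standalone cluster could look
like (the host names and ZK URL are placeholders, not the actual C5 code):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("stream-processor-analytics")
  // with ZooKeeper-based recovery the driver lists every possible master;
  // ZooKeeper elects and tracks the active one
  .setMaster("spark://node1:7077,node2:7077")
val sc = new SparkContext(conf)
// the externally started master processes carry the HA properties, e.g.
// spark.deploy.recoveryMode=ZOOKEEPER and
// spark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181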

On Sat, Oct 22, 2016 at 2:48 PM, Sriskandarajah Suhothayan <s...@wso2.com>
wrote:

>
>
> On Sat, Oct 22, 2016 at 10:45 AM, Nirmal Fernando <nir...@wso2.com> wrote:
>
>>
>>
>> On Fri, Oct 21, 2016 at 2:00 PM, Anjana Fernando <anj...@wso2.com> wrote:
>>
>>> Hi,
>>>
>>> So we are starting on porting the earlier DAS-specific functionality to
>>> C5. With this, we are planning not to embed the Spark server
>>> functionality in the primary binary itself, but rather to run it
>>> separately as another script in the same distribution. So basically, when
>>> running the server in standalone mode, a centralized script will start the
>>> Spark processes and then the main stream processor server. In a clustered
>>> setup, we will start the Spark processes separately and use the clustering
>>> that is native to Spark, which is currently done by integrating with
>>> ZooKeeper.
>>>
>>
>> Does this mean we still keep Spark binaries inside Stream Processor? If
>> not how are we planning to start a Spark process from Stream Processor?
>>
>
> We don't need to have the Spark binaries in Stream Processor, and I believe
> it would be wrong to do so, as Spark is not its core functionality. But when
> it comes to Product Analytics we may ship them. We need to decide on that.
>
>
>>> So basically, for the minimum H/A setup, if we are using Spark, we would
>>> need two stream processing nodes plus ZK to build up the cluster. And with
>>> C5, since we are not using Hazelcast anyway, we can use ZK for the other
>>> general coordination operations as well, since it is already a requirement
>>> for Spark. We also get the added benefit of avoiding the issues that come
>>> with a peer-to-peer coordination library, such as split-brain scenarios.
>>>
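A minimal sketch of such ZK-based coordination (leader election) using the
Apache Curator recipes; the ZK connect string and znode path are assumptions,
and this is not the actual implementation:

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.leader.LeaderLatch
import org.apache.curator.retry.ExponentialBackoffRetry

val zk = CuratorFrameworkFactory.newClient(
  "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3))
zk.start()
// ZooKeeper's quorum-based election is what avoids the split-brain
// scenarios mentioned above: at most one node holds the latch at a time
val latch = new LeaderLatch(zk, "/stream-processor/coordinator")
latch.start()
latch.await()            // blocks until this node becomes the coordinator
// ... coordinator-only duties ...
latch.close()
zk.close()
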
>>> Also, aligning with the above approach, we are considering integrating
>>> directly with Solr running external to the stream processor, rather than
>>> doing the indexing in embedded mode. DAS already has a separate indexing
>>> mode (profile), so rather than using that, we can use Solr directly. One
>>> of the main reasons for this is that Solr adds functionality on top of
>>> base Lucene, such as OOTB support for aggregates etc., which we don't
>>> fully have at the moment. So the suggestion is that Solr will also come as
>>> a separate profile (script) with the distribution, and it will be started
>>> up when indexing scenarios are required for the stream processor, either
>>> automatically or selectively. Solr clustering is also done with ZK, which
>>> we will have anyway with the new Spark clustering approach we are using.
>>>
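A sketch of what talking to such an external SolrCloud could look like via
SolrJ, using the same ZK ensemble for cluster discovery; the collection name,
fields, and the stats aggregate are illustrative assumptions:

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.CloudSolrClient
import org.apache.solr.common.SolrInputDocument

val solr = new CloudSolrClient.Builder()
  .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()
solr.setDefaultCollection("events")

val doc = new SolrInputDocument()
doc.addField("id", "event-1")
doc.addField("latency_ms", 42)
solr.add(doc)
solr.commit()

// an OOTB aggregate: min/max/mean etc. of a numeric field via the
// stats component, with no custom code on top of Lucene
val query = new SolrQuery("*:*")
query.set("stats", true)
query.set("stats.field", "latency_ms")
println(solr.query(query).getFieldStatsInfo.get("latency_ms"))
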
>>> So the aim of moving the non-WSO2-specific servers out, rather than
>>> embedding them, is the simplicity it brings to our codebase: we no longer
>>> have to maintain the integration code required to embed them, and those
>>> servers can use their own recommended deployment patterns. For example,
>>> Spark isn't designed to be embedded into other servers, so we had to mess
>>> around with some things to embed and cluster it internally. Upgrading such
>>> dependencies also becomes very straightforward, since they're external to
>>> the base binary.
>>>
>>> Cheers,
>>> Anjana.
>>> --
>>> *Anjana Fernando*
>>> Associate Director / Architect
>>> WSO2 Inc. | http://wso2.com
>>> lean . enterprise . middleware
>>>
>>
>>
>>
>> --
>>
>> Thanks & regards,
>> Nirmal
>>
>> Team Lead - WSO2 Machine Learner
>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>> Mobile: +94715779733
>> Blog: http://nirmalfdo.blogspot.com/
>>
>>
>>
>
>
> --
>
> *S. Suhothayan*
> Associate Director / Architect & Team Lead of WSO2 Complex Event Processor
> *WSO2 Inc. *http://wso2.com

[Architecture] WSO2 Data Analytics Server (DAS) 3.1.0 Released!

2016-09-14 Thread Niranda Perera
*WSO2 Data Analytics Server (DAS) 3.1.0 Released!*

WSO2 Data Analytics Server development team is pleased to announce the
release of WSO2 Data Analytics Server 3.1.0.

WSO2 Data Analytics Server combines real-time, batch, interactive, and
predictive (via machine learning) analysis of data into one integrated
platform to support the multiple demands of Internet of Things (IoT)
solutions, as well as mobile and Web apps.

As a part of WSO2's Analytics Platform, WSO2 DAS introduces a single
solution with the ability to build systems and applications that collect
and analyze both batch and real-time data to communicate results. It is
designed to process millions of events per second, and is therefore capable
of handling Big Data volumes and Internet of Things projects.

WSO2 DAS is powered by WSO2 Carbon <http://wso2.com/products/carbon/>, the
SOA middleware component platform.

An open source product, WSO2 Carbon is available under the Apache Software
License (v2.0) <http://www.apache.org/licenses/LICENSE-2.0.html>

You can download this distribution from
wso2.com/products/data-analytics-server and give it a try.


What's New In This Release

   - Integrating WSO2 Machine Learner features
   - Supporting incremental data processing
   - Improved gadget generation wizard
   - Cross-tenant support
   - Improved CarbonJDBC connector
   - Improvements for facet based aggregations
   - Supporting index based sorting
   - Supporting Spark on YARN for DAS
   - Improvements for indexing
   - Upgrading Spark to 1.6.2



Issues Fixed in This Release

   - WSO2 DAS 3.1.0 Fixed Issues
   <https://wso2.org/jira/issues/?filter=13152>

Known Issues

   - WSO2 DAS 3.1.0 Known Issues
   <https://wso2.org/jira/issues/?filter=13154>

*Source and distribution packages:*


   - http://wso2.com/products/data-analytics-server/


Please download, test, and vote. The README file under the distribution
contains a guide and instructions on how to try it out locally.
Mailing Lists

Join our mailing list and correspond with the developers directly.

   - Developer List : d...@wso2.org | Subscribe | Mail Archive
   <http://mail.wso2.org/mailarchive/dev/>

Reporting Issues

We encourage you to report issues, documentation faults and feature
requests regarding WSO2 DAS through the public DAS JIRA
<https://wso2.org/jira/browse/DAS>. You can use the Carbon JIRA
<http://www.wso2.org/jira/browse/CARBON> to report any issues related to
the Carbon base framework or associated Carbon components.
Discussion Forums

Alternatively, questions can be raised on http://stackoverflow.com
<http://stackoverflow.com/questions/tagged/wso2>.
Support

We are committed to ensuring that your enterprise middleware deployment is
completely supported from evaluation to production. Our unique approach
ensures that all support leverages our open development methodology and is
provided by the very same engineers who build the technology.

For more details and to take advantage of this unique opportunity please
visit http://wso2.com/support.
For more information about WSO2 DAS please see
wso2.com/products/data-analytics-server.


Regards,
WSO2 DAS Team


-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture


Re: [Architecture] [Arch] Adding CEP and ML samples to DAS distribution in a consistent way

2016-08-05 Thread Niranda Perera
Great! We can add the CEP samples to DAS 3.1.0 then.



On Fri, Aug 5, 2016 at 9:14 PM, Sriskandarajah Suhothayan <s...@wso2.com>
wrote:

> Dilini tried adding the CEP samples to DAS and it worked as expected; we'll
> send you a pull request with all the CEP samples for the DAS repo.
>
> Regards
> Suho
>
> On Fri, Aug 5, 2016 at 12:34 PM, Gihan Anuruddha <gi...@wso2.com> wrote:
>
>> We discussed this as well. Our plan is to inject the CEP integration tests
>> into DAS at product build time. We are not maintaining a separate copy;
>> instead, we use the same tests that CEP uses.
>>
>> On Thu, Aug 4, 2016 at 7:33 PM, Sinthuja Ragendran <sinth...@wso2.com>
>> wrote:
>>
>>> Hi,
>>>
>>> We also need to find a consistent way to maintain the integration tests.
>>> CEP and ML features are used in DAS, but no integration tests for those
>>> components get executed in the DAS product build. Similarly, there are
>>> many UI tests in the dashboard server, but those are not executed in the
>>> products that use it. As these are core functionalities of DAS, IMHO we
>>> need to execute the test cases for each of these components during the
>>> product-das build.
>>>
>>> Thanks,
>>> Sinthuja.
>>>
>>> On Thu, Aug 4, 2016 at 3:17 PM, Niranda Perera <nira...@wso2.com> wrote:
>>>
>>>> Hi Suho,
>>>>
>>>> For the immediate DAS 3.1.0 release, we will continue to keep a local
>>>> copy of the samples. I have created a JIRA here [1] to add the suggestion
>>>> provided by Isuru.
>>>>
>>>> Best
>>>>
>>>> [1] https://wso2.org/jira/browse/DAS-481
>>>>
>>>> On Wed, Aug 3, 2016 at 10:02 PM, Sriskandarajah Suhothayan <
>>>> s...@wso2.com> wrote:
>>>>
>>>>> DAS team how about doing it for this release ?
>>>>>
>>>>> Regards
>>>>> Suho
>>>>>
>>>>> On Wed, Aug 3, 2016 at 6:31 PM, Ramith Jayasinghe <ram...@wso2.com>
>>>>> wrote:
>>>>>
>>>>>> I think we need to ship the samples with the product; otherwise, the
>>>>>> first 5-minute experience of users will be negatively affected.
>>>>>>
>>>>>>
>>>>>> ___
>>>>>> Architecture mailing list
>>>>>> Architecture@wso2.org
>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> *S. Suhothayan*
>>>>> Associate Director / Architect & Team Lead of WSO2 Complex Event
>>>>> Processor
>>>>> *WSO2 Inc. *http://wso2.com
>>>>> lean . enterprise . middleware
>>>>>
>>>>> *cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/ |
>>>>> twitter: http://twitter.com/suhothayan | linked-in:
>>>>> http://lk.linkedin.com/in/suhothayan*
>>>>>
>>>>> ___
>>>>> Architecture mailing list
>>>>> Architecture@wso2.org
>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> *Niranda Perera*
>>>> Software Engineer, WSO2 Inc.
>>>> Mobile: +94-71-554-8430
>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>> https://pythagoreanscript.wordpress.com/
>>>>
>>>> ___
>>>> Architecture mailing list
>>>> Architecture@wso2.org
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>
>>>>
>>>
>>>
>>> --
>>> *Sinthuja Rajendran*
>>> Technical Lead
>>> WSO2, Inc.:http://wso2.com
>>>
>>> Blog: http://sinthu-rajan.blogspot.com/
>>> Mobile: +94774273955
>>>
>>>
>>>
>>> ___
>>> Architecture mailing list
>>> Architecture@wso2.org
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>
>>>
>>
>>
>> --
>> W.G. Gihan Anuruddha
>> Senior Software Engineer | WSO2, Inc.
>> M: +94772272595
>>
>> ___
>> Architecture mailing list
>> Architecture@wso2.org
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>>
>
>
> --
>
> *S. Suhothayan*
> Associate Director / Architect & Team Lead of WSO2 Complex Event Processor
> *WSO2 Inc. *http://wso2.com
> lean . enterprise . middleware
>
> *cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/ |
> twitter: http://twitter.com/suhothayan | linked-in:
> http://lk.linkedin.com/in/suhothayan*
>
> ___
> Architecture mailing list
> Architecture@wso2.org
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>


-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture


Re: [Architecture] [Arch] Adding CEP and ML samples to DAS distribution in a consistent way

2016-08-05 Thread Niranda Perera
Hi Suho,

We still have not added the CEP samples to the DAS 3.1.0 release, but we
will have to add them in the next iteration.

Best

On Thu, Aug 4, 2016 at 9:57 PM, Sriskandarajah Suhothayan <s...@wso2.com>
wrote:

> Hi Niranda,
>
> Are you guys adding all the CEP samples too?
>
> Regards
> Suho
>
> On Thu, Aug 4, 2016 at 7:33 PM, Sinthuja Ragendran <sinth...@wso2.com>
> wrote:
>
>> Hi,
>>
>> We also need to find a consistent way to maintain the integration tests.
>> CEP and ML features are used in DAS, but no integration tests for those
>> components get executed in the DAS product build. Similarly, there are
>> many UI tests in the dashboard server, but those are not executed in the
>> products that use it. As these are core functionalities of DAS, IMHO we
>> need to execute the test cases for each of these components during the
>> product-das build.
>>
>> Thanks,
>> Sinthuja.
>>
>> On Thu, Aug 4, 2016 at 3:17 PM, Niranda Perera <nira...@wso2.com> wrote:
>>
>>> Hi Suho,
>>>
>>> For the immediate DAS 3.1.0 release, we will continue to keep a local
>>> copy of the samples. I have created a JIRA here [1] to add the suggestion
>>> provided by Isuru.
>>>
>>> Best
>>>
>>> [1] https://wso2.org/jira/browse/DAS-481
>>>
>>> On Wed, Aug 3, 2016 at 10:02 PM, Sriskandarajah Suhothayan <
>>> s...@wso2.com> wrote:
>>>
>>>> DAS team how about doing it for this release ?
>>>>
>>>> Regards
>>>> Suho
>>>>
>>>> On Wed, Aug 3, 2016 at 6:31 PM, Ramith Jayasinghe <ram...@wso2.com>
>>>> wrote:
>>>>
>>>>> I think we need to ship the samples with the product; otherwise, the
>>>>> first 5-minute experience of users will be negatively affected.
>>>>>
>>>>>
>>>>> ___
>>>>> Architecture mailing list
>>>>> Architecture@wso2.org
>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> *S. Suhothayan*
>>>> Associate Director / Architect & Team Lead of WSO2 Complex Event
>>>> Processor
>>>> *WSO2 Inc. *http://wso2.com
>>>> lean . enterprise . middleware
>>>>
>>>> *cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/ |
>>>> twitter: http://twitter.com/suhothayan | linked-in:
>>>> http://lk.linkedin.com/in/suhothayan*
>>>>
>>>> ___
>>>> Architecture mailing list
>>>> Architecture@wso2.org
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>
>>>>
>>>
>>>
>>> --
>>> *Niranda Perera*
>>> Software Engineer, WSO2 Inc.
>>> Mobile: +94-71-554-8430
>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>> https://pythagoreanscript.wordpress.com/
>>>
>>> ___
>>> Architecture mailing list
>>> Architecture@wso2.org
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>
>>>
>>
>>
>> --
>> *Sinthuja Rajendran*
>> Technical Lead
>> WSO2, Inc.:http://wso2.com
>>
>> Blog: http://sinthu-rajan.blogspot.com/
>> Mobile: +94774273955
>>
>>
>>
>> ___
>> Architecture mailing list
>> Architecture@wso2.org
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>>
>
>
> --
>
> *S. Suhothayan*
> Associate Director / Architect & Team Lead of WSO2 Complex Event Processor
> *WSO2 Inc. *http://wso2.com
> lean . enterprise . middleware
>
> *cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/ |
> twitter: http://twitter.com/suhothayan | linked-in:
> http://lk.linkedin.com/in/suhothayan*
>
> ___
> Architecture mailing list
> Architecture@wso2.org
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>


-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture


[Architecture] [Arch] Adding CEP and ML samples to DAS distribution in a consistent way

2016-08-01 Thread Niranda Perera
Hi all,

At the moment we are maintaining the samples of DAS, CEP, and ML in their own
product repos. Since DAS integrates both CEP and ML, we need to ship these
samples with DAS.

Currently we do so for the ML samples, but the approach we are using is to
keep a local copy of the samples in the product-das repo. This approach is
rather problematic, because when there are changes to the original samples,
we have to reflect those changes manually in the product-das copy.

Is there a more consistent way to add samples? Maybe by creating a
separate samples feature?

Would like to hear from you regarding this.

Best

-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture


Re: [Architecture] [Arch] [ML] [DAS] Common directory path for spark conf files in ML and DAS

2016-07-25 Thread Niranda Perera
Hi Nirmal,

Yes, this approach should be good for the ML integration. Let me check this
and get back to you.

+1 for removing the old ML Spark config. It would be more consistent that way.

Best

On Mon, Jul 25, 2016 at 5:46 PM, Nirmal Fernando <nir...@wso2.com> wrote:

> Hi Niranda,
>
> With the ML-DAS integration, we no longer use a separate Spark config; we
> use the same config as DAS. We'll remove the old ML Spark config.
>
> The image below depicts how it'll work in a clustered environment.
>
> [image not preserved in the archive]
>
> On Mon, Jul 25, 2016 at 12:05 PM, Niranda Perera <nira...@wso2.com> wrote:
>
>> Hi Nirmal,
>>
>> Currently, the Spark configurations in ML are in the
>> /repository/conf/spark directory, while the DAS Spark configurations
>> are in the /repository/analytics/spark directory.
>>
>> I suggest we move the ML Spark configurations to the
>> /repository/conf/ml/spark directory as well, because it would be more
>> consistent and self-explanatory.
>>
>> WDYT?
>>
>> Best
>>
>> --
>> *Niranda Perera*
>> Software Engineer, WSO2 Inc.
>> Mobile: +94-71-554-8430
>> Twitter: @n1r44 <https://twitter.com/N1R44>
>> https://pythagoreanscript.wordpress.com/
>>
>
>
>
> --
>
> Thanks & regards,
> Nirmal
>
> Team Lead - WSO2 Machine Learner
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>


-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture


[Architecture] [Arch] [ML] [DAS] Common directory path for spark conf files in ML and DAS

2016-07-25 Thread Niranda Perera
Hi Nirmal,

Currently, the Spark configurations in ML are in the
/repository/conf/spark directory, while the DAS Spark configurations
are in the /repository/analytics/spark directory.

I suggest we move the ML Spark configurations to the
/repository/conf/ml/spark directory as well, because it would be more
consistent and self-explanatory.

WDYT?

Best

-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture


[Architecture] [Archi] DAS 3.0.1 ML integration approach

2016-06-30 Thread Niranda Perera
Hi all,

This is to explain how the ML components will be integrated with DAS from the
DAS 3.1.0 release onwards.

There are two modes:

   - Standalone mode - ML components will share the Spark context instance
   created by the DAS components. Advised for development and testing
   environments.

   - Clustered mode - ML components and DAS components will have separate
   Spark contexts. This means that the Spark cluster will be separated out
   and then consumed by the two components independently (see the sketch
   below).

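A rough sketch of the resource separation in the clustered mode, with two
driver-side configurations against one external cluster (the master URL and
core counts are illustrative assumptions, not the actual product config):

import org.apache.spark.{SparkConf, SparkContext}

val dasConf = new SparkConf().setAppName("das-analytics")
  .setMaster("spark://master:7077")
  .set("spark.cores.max", "8")   // analyzer share of the cluster
val mlConf = new SparkConf().setAppName("ml")
  .setMaster("spark://master:7077")
  .set("spark.cores.max", "4")   // ML share, kept apart from the analyzers
val dasSc = new SparkContext(dasConf)
// a single JVM cannot host two SparkContexts, so the ML context would be
// created from mlConf in a separate JVM/node, matching the point below about
// taking the ML instance out of the analyzer cluster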

Important points to note:

   - In the clustered approach, there will be a clear resource separation
   for the two components.
   - ML components will be disabled by default and can be enabled by
   passing a Java environment variable.
   - Care should be taken when allocating resources, since there will be two
   separate Spark contexts in the cluster, and the ML instance needs to be
   taken out of the analyzer cluster.


Future developments:
This approach will be changed later on, so that even in the clustered mode,
the ML components have the option of using the analytics Spark context.

Would like to hear your input on this.

Best
-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture


[Architecture] [Archi] [DAS] Changing the schema setting behavior in Spark SQL when using CarbonAnalytics connector

2016-06-24 Thread Niranda Perera
Hi all,

This is to inform you that we have made a small change to the existing
schema setting behavior in DAS when using the CarbonAnalytics connector in
SparkSQL.

Let me clarify the approaches here.

*Previous approach *
Assume that there is a table corresponding to a stream 'abcd' with the
schema 'int a, int b, int c, int d'.

So, the following queries were available.

   1. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName
   "abcd");  --> Infers the schema from the DAL (data access layer)
   2. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName
   "abcd", schema "a int, b int, c int, d int");  --> this schema and the
   existing schema *will be merged and set in the DAL & in Spark*
   3. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName
   "abcd", schema "a int, b int, c int, d int, *e int*");  --> because of
   the schema merge, this is also supported (to define a new field)
   4. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName
   "abcd", schema "a int, b int, c int, d int, *_timestamp long*"); -->
   allows the timestamp to be used for queries

Implications:
Because of the merge approach in query #3, the final order of the schema
(which was set in the DAL) was not definite.
Ex: (a, b, c, d) merge (a, b, c, d, e) --> (a, b, c, d, e)
BUT
(a, b, c, d) merge (e, d, c) --> (a, b, e, d, c)
This resulted in an issue where we had to put aliases on each field in the
insert statements.
Ex: INSERT INTO TABLE test SELECT 1, 2, 3, 4, 5; could result in a=1, b=2,
..., d=5 OR a=1, b=2, e=3, d=4, c=5 depending on the merge.
So, we had to use aliases:
INSERT INTO TABLE test SELECT 1 as a, 2 as b, 3 as c, 4 as d, 5 as e;

Because of this undefined nature of the merged schema, we had to fix the
position of the special field "_timestamp", so "_timestamp" was put in as
the last element of the merged schema.

*New approach*

In the new approach, we have separated the schemas in Spark and the DAL. Now,
when a user explicitly mentions a schema, the merged schema will be set in
the DAL and the given schema will be used in Spark.

As per the same example before,

   1. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName
   "abcd");  --> No change. Infers the schema from the DAL (data access layer)
   2. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName
   "abcd", schema "a int, b int, c int, d int");  --> This schema and the
   existing schema will be merged and set in the DAL *only.* This given
   schema will be used in Spark
   3. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName
   "abcd", schema "a int, b int, c int, d int, *e int*");  --> Merged
   schema will be set in DAL. This given schema will be used in Spark
   4. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName
   "abcd", schema "a int, b int, c int, d int, *_timestamp long*"); -->
   allows the timestamp to be used for queries

So, now, there's no ambiguity in the schema setting. If you set a schema in
Spark SQL as 'int a, int b, int c, int d', then that will be the final schema
in the Spark runtime.
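
As a concrete illustration of the new behavior (sqlCtx below stands for the
DAS-provided Spark SQLContext; the name is an assumption for this sketch):

sqlCtx.sql("""CREATE TEMPORARY TABLE test USING CarbonAnalytics
              OPTIONS (tableName "abcd",
                       schema "a int, b int, c int, d int, e int")""")
// the given schema is used verbatim on the Spark side, so a positional
// insert is now deterministic: always a=1, b=2, c=3, d=4, e=5
sqlCtx.sql("INSERT INTO TABLE test SELECT 1, 2, 3, 4, 5")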



Ideally, this change should not conflict with the current samples and
analytics4x implementations. Just wanted to keep you guys informed.

Best

-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture


Re: [Architecture] [DAS] Overhauling the Spark JDBC Connector

2016-06-17 Thread Niranda Perera
 will be testing against all DBs supported by DAS 
>>>>>>>> over the
>>>>>>>> following days. The connector is expected to be shipped with the DAS 
>>>>>>>> 3.1.0
>>>>>>>> release.
>>>>>>>>
>>>>>>>> Thoughts welcome.
>>>>>>>>
>>>>>>>> [1] https://github.com/wso2/carbon-analytics/pull/187
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> --
>>>>>>>> Gokul Balakrishnan
>>>>>>>> Senior Software Engineer,
>>>>>>>> WSO2, Inc. http://wso2.com
>>>>>>>> M +94 77 5935 789 | +44 7563 570502
>>>>>>>>
>>>>>>>>
>>>>>>>> ___
>>>>>>>> Architecture mailing list
>>>>>>>> Architecture@wso2.org
>>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks & Regards,
>>>>>>>
>>>>>>> Inosh Goonewardena
>>>>>>> Associate Technical Lead- WSO2 Inc.
>>>>>>> Mobile: +94779966317
>>>>>>>
>>>>>>> ___
>>>>>>> Architecture mailing list
>>>>>>> Architecture@wso2.org
>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Gokul Balakrishnan
>>>>>> Senior Software Engineer,
>>>>>> WSO2, Inc. http://wso2.com
>>>>>> M +94 77 5935 789 | +44 7563 570502
>>>>>>
>>>>>>
>>>>>> ___
>>>>>> Architecture mailing list
>>>>>> Architecture@wso2.org
>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Thanks & regards,
>>>>> Nirmal
>>>>>
>>>>> Team Lead - WSO2 Machine Learner
>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>> Mobile: +94715779733
>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> *Anjana Fernando*
>>>> Senior Technical Lead
>>>> WSO2 Inc. | http://wso2.com
>>>> lean . enterprise . middleware
>>>>
>>>
>>>
>>>
>>> --
>>> Gokul Balakrishnan
>>> Senior Software Engineer,
>>> WSO2, Inc. http://wso2.com
>>> M +94 77 5935 789 | +44 7563 570502
>>>
>>>
>>
>>
>> --
>> Gokul Balakrishnan
>> Senior Software Engineer,
>> WSO2, Inc. http://wso2.com
>> M +94 77 5935 789 | +44 7563 570502
>>
>>
>> ___
>> Architecture mailing list
>> Architecture@wso2.org
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>>
>
>
> --
> Dulitha Wijewantha (Chan)
> Software Engineer - Mobile Development
> WSO2 Inc
> Lean.Enterprise.Middleware
>  * ~Email   duli...@wso2.com <duli...@wso2mobile.com>*
> *  ~Mobile +94712112165 <%2B94712112165>*
> *  ~Website   dulitha.me <http://dulitha.me>*
> *  ~Twitter @dulitharw <https://twitter.com/dulitharw>*
>   *~Github @dulichan <https://github.com/dulichan>*
>   *~SO @chan <http://stackoverflow.com/users/813471/chan>*
>
> ___
> Architecture mailing list
> Architecture@wso2.org
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>


-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture


Re: [Architecture] Incremental Processing Support in DAS

2016-03-24 Thread Niranda Perera
>>>>>>
>>>>>> Here's a quick introduction into that.
>>>>>>
>>>>>> *Execution*:
>>>>>>
>>>>>> In the first run of the script, it will process all the data in the
>>>>>> given table and store the last-processed event timestamp.
>>>>>> From the next run onwards, it starts processing from that stored
>>>>>> timestamp.
>>>>>>
>>>>>> Until the query that contains the data processing part completes, the
>>>>>> last-processed event timestamp will not be overridden with the new
>>>>>> value. This ensures that, if the whole query fails, the data is
>>>>>> processed again on the next run instead of being skipped.
>>>>>> This is achieved by adding a commit query after the main query;
>>>>>> refer to the Syntax section for an example.
>>>>>>
>>>>>> *Syntax*:
>>>>>>
>>>>>> In the Spark script, incremental processing support has to be
>>>>>> specified per table; this happens in the CREATE TEMPORARY TABLE
>>>>>> line.
>>>>>>
>>>>>> ex: CREATE TEMPORARY TABLE T1 USING CarbonAnalytics options
>>>>>> (tableName "test",
>>>>>> *incrementalProcessing "T1,3600");*
>>>>>>
>>>>>> INSERT INTO T2 SELECT username, age FROM T1 GROUP BY age;
>>>>>>
>>>>>> INC_TABLE_COMMIT T1;
>>>>>>
>>>>>> The last line ensures that the processing took place successfully and
>>>>>> replaces the last-processed timestamp with the new one.
>>>>>>
>>>>>> *TimeWindow*:
>>>>>>
>>>>>> To do the incremental processing, the user has to provide the time
>>>>>> window by which the data will be processed.
>>>>>> In the above example, the data would be summarized in *1 hour* (3600 s)
>>>>>> time windows.
>>>>>>
>>>>>> *WindowOffset*:
>>>>>>
>>>>>> Events that belong to a previously processed time window might arrive
>>>>>> late. To account for that, we have added an optional parameter that
>>>>>> allows the immediately preceding time windows to be processed as well
>>>>>> (it acts like a buffer).
>>>>>> Ex: if this is set to 1, then apart from the to-be-processed data, the
>>>>>> data belonging to the previously processed time window will also be
>>>>>> taken for processing.
>>>>>>
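A worked example of the window/offset arithmetic described above (all values
illustrative): with a 3600 s time window and a window offset of 1, a run whose
last committed window ended at 09:00 re-reads [08:00, 09:00) as a buffer for
late events, in addition to the new [09:00, 10:00) window:

val windowSec  = 3600
val offset     = 1
val lastCommit = 9 * 3600                     // 09:00, end of last window
val from       = lastCommit - offset * windowSec   // 08:00
val to         = lastCommit + windowSec            // 10:00
println(s"processing range: [$from, $to) seconds")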
>>>>>>
>>>>>> *Limitations*:
>>>>>>
>>>>>> Currently, multiple time windows cannot be specified per temporary
>>>>>> table in the same script.
>>>>>> It would have to be done using different temporary tables.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Future Improvements:*
>>>>>> - Add aggregation function support for incremental processing
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Sachith
>>>>>> --
>>>>>> Sachith Withana
>>>>>> Software Engineer; WSO2 Inc.; http://wso2.com
>>>>>> E-mail: sachith AT wso2.com
>>>>>> M: +94715518127
>>>>>> Linked-In: https://lk.linkedin.com/in/sachithwithana
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> 
>>>>> Srinath Perera, Ph.D.
>>>>>http://people.apache.org/~hemapani/
>>>>>http://srinathsview.blogspot.com/
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Sachith Withana
>>>> Software Engineer; WSO2 Inc.; http://wso2.com
>>>> E-mail: sachith AT wso2.com
>>>> M: +94715518127
>>>> Linked-In: https://lk.linkedin.com/in/sachithwithana
>>>>
>>>> ___
>>>> Architecture mailing list
>>>> Architecture@wso2.org
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>>
>>> Inosh Goonewardena
>>> Associate Technical Lead- WSO2 Inc.
>>> Mobile: +94779966317
>>>
>>
>>
>>
>> --
>> Sachith Withana
>> Software Engineer; WSO2 Inc.; http://wso2.com
>> E-mail: sachith AT wso2.com
>> M: +94715518127
>> Linked-In: https://lk.linkedin.com/in/sachithwithana
>>
>
>
>
> --
> Thanks & Regards,
>
> Inosh Goonewardena
> Associate Technical Lead- WSO2 Inc.
> Mobile: +94779966317
>
> ___
> Architecture mailing list
> Architecture@wso2.org
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>


-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture


Re: [Architecture] [Dev] Carbon Spark JDBC connector

2015-08-12 Thread Niranda Perera
Hi Gihan,

Are we talking about incremental processing here? INSERT INTO/OVERWRITE
queries will normally be used to push analyzed data into summary tables.

In Spark jargon, INSERT OVERWRITE TABLE means completely deleting the table
and recreating it. I'm a bit confused by the meaning of 'overwrite' with
respect to the previous 2.5.0 Hive scripts; were they doing an update there?

rgds

On Tue, Aug 11, 2015 at 7:58 PM, Gihan Anuruddha gi...@wso2.com wrote:

 Hi Niranda,

 Are we going to solve those limitations before the GA? Especially
 limitation no. 2: over time we can have a stats table with thousands of
 records, so are we going to remove all the records and reinsert them every
 time the Spark script runs?

 Regards,
 Gihan

 On Tue, Aug 11, 2015 at 7:13 AM, Niranda Perera nira...@wso2.com wrote:

 Hi all,

 we have implemented a custom Spark JDBC connector to be used in the
 Carbon environment.

 this enables the following

    1. Temporary tables can now be created in the Spark environment by
    specifying an analytics datasource (configured by
    analytics-datasources.xml) and a table.
    2. Spark uses the "SELECT 1 FROM $table LIMIT 1" query to check the
    existence of a table, and the LIMIT query is not supported by all DBs.
    With the new connector, this query can be provided as a config. (This
    config is still WIP.)
    3. New Spark dialects are being added for various DBs (WIP).

 the idea is to test this for the following dbs

- mysql
- h2
- mssql
- oracle
- postgres
- db2

 I have loosely tested the connector with MySQL, and I would like the APIM
 team to use it with the API usage stats use-case, and provide us some
 feedback.

 this connector can be accessed as follows. (docs are still not updated. I
 will do that ASAP)

 create temporary table temp_table using CarbonJDBC options (dataSource
 "datasource name", tableName "table name");

 select * from temp_table;

 insert into/overwrite table temp_table <some select statement>;

 known limitations

 1. When creating a temp table, the table should already exist in the
 underlying datasource.
 2. INSERT OVERWRITE TABLE deletes the existing table and creates it
 again.


 would be very grateful if you could use this connector in your current
 JDBC use cases and provide us with feedback.

 best
 --
 *Niranda Perera*
 Software Engineer, WSO2 Inc.
 Mobile: +94-71-554-8430
 Twitter: @n1r44 https://twitter.com/N1R44
 https://pythagoreanscript.wordpress.com/

 ___
 Architecture mailing list
 Architecture@wso2.org
 https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture




 --
 W.G. Gihan Anuruddha
 Senior Software Engineer | WSO2, Inc.
 M: +94772272595

 ___
 Dev mailing list
 d...@wso2.org
 http://wso2.org/cgi-bin/mailman/listinfo/dev




-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 https://twitter.com/N1R44
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture


[Architecture] [Dev] Carbon Spark JDBC connector

2015-08-10 Thread Niranda Perera
Hi all,

we have implemented a custom Spark JDBC connector to be used in the Carbon
environment.

this enables the following

   1. Temporary tables can now be created in the Spark environment by
   specifying an analytics datasource (configured by
   analytics-datasources.xml) and a table.
   2. Spark uses the "SELECT 1 FROM $table LIMIT 1" query to check the
   existence of a table, and the LIMIT query is not supported by all DBs.
   With the new connector, this query can be provided as a config. (This
   config is still WIP.)
   3. New Spark dialects are being added for various DBs (WIP).

the idea is to test this for the following dbs

   - mysql
   - h2
   - mssql
   - oracle
   - postgres
   - db2

I have loosely tested the connector with MySQL, and I would like the APIM
team to use it with the API usage stats use-case, and provide us some
feedback.

this connector can be accessed as follows. (docs are still not updated. I
will do that ASAP)

create temporary table temp_table using CarbonJDBC options (dataSource
"datasource name", tableName "table name");

select * from temp_table;

insert into/overwrite table temp_table <some select statement>;

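A concrete instance of the above, run through the Spark SQLContext (here
called sqlCtx); the datasource name and table names are assumptions for
illustration:

sqlCtx.sql("""create temporary table api_usage using CarbonJDBC
              options (dataSource "WSO2_ANALYTICS_RDBMS",
                       tableName "API_USAGE_SUMMARY")""")
sqlCtx.sql("select * from api_usage").show()
// api_events is assumed to be another registered temporary table; note
// limitation 2 below: the overwrite drops and recreates the target table
sqlCtx.sql("""insert overwrite table api_usage
              select api, count(*) from api_events group by api""")
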
known limitations

   1. When creating a temp table, the table should already exist in the
   underlying datasource.
   2. INSERT OVERWRITE TABLE deletes the existing table and creates it
   again.


would be very grateful if you could use this connector in your current JDBC
use cases and provide us with feedback.

best
-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 https://twitter.com/N1R44
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture


[Architecture] [Dev] [DAS] Upgrading Spark 1.4.0 - 1.4.1 in DAS

2015-07-30 Thread Niranda Perera
Hi all,

This is to inform you that we will be upgrading Spark from 1.4.0 to 1.4.1.

On the outset, the upgrade does not involve any API changes or dependency
upgrades; therefore, the version bump should not affect the current
components.

rgds

-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 https://twitter.com/N1R44
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture


Re: [Architecture] [DAS] Changing the name of Message Console

2015-07-24 Thread Niranda Perera
created a JIRA to track this [1]

[1] https://wso2.org/jira/browse/BAM-2123

On Wed, Jul 15, 2015 at 4:50 PM, Maninda Edirisooriya mani...@wso2.com
wrote:

 +1 for 'data explorer', which will also make sense for the people who were
 familiar with the BAM Cassandra Explorer.
 Unless we are going to support other query languages like Siddhi or SQL, it
 is good to keep the name 'spark console' instead of 'query analyzer' or
 'query console'.


 *Maninda Edirisooriya*
 Senior Software Engineer

 *WSO2, Inc.*lean.enterprise.middleware.

 *Blog* : http://maninda.blogspot.com/
 *E-mail* : mani...@wso2.com
 *Skype* : @manindae
 *Twitter* : @maninda

 On Mon, Jul 13, 2015 at 1:40 PM, Seshika Fernando sesh...@wso2.com
 wrote:

 My vote is for 'data explorer' for the message console, and to keep 'spark
 console' for the Spark console.

 seshi

 On Mon, Jul 13, 2015 at 10:36 AM, Anjana Fernando anj...@wso2.com
 wrote:

 Hi,

 +1 for Data Explorer for message console. The name Spark Console is
 fine the way it is now.

 Cheers,
 Anjana.

 On Sun, Jul 12, 2015 at 7:59 AM, Niranda Perera nira...@wso2.com
 wrote:

 Hi all,

 DAS currently ships a UI component named 'message console'. It can be
 used to browse the data inside the DAS tables.
 IMO the name 'message console' is misleading; a person who's new to
 DAS would not know its exact use just by reading the name.

 I suggest a more self-explanatory name such as 'data explorer', 'data
 navigator', etc.

 WDYT?

 --
 *Niranda Perera*
 Software Engineer, WSO2 Inc.
 Mobile: +94-71-554-8430
 Twitter: @n1r44 https://twitter.com/N1R44
 https://pythagoreanscript.wordpress.com/




 --
 *Anjana Fernando*
 Senior Technical Lead
 WSO2 Inc. | http://wso2.com
 lean . enterprise . middleware

 ___
 Architecture mailing list
 Architecture@wso2.org
 https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture



 ___
 Architecture mailing list
 Architecture@wso2.org
 https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture



 ___
 Architecture mailing list
 Architecture@wso2.org
 https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture




-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 https://twitter.com/N1R44
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture


[Architecture] [DAS] Changing the name of Message Console

2015-07-11 Thread Niranda Perera
Hi all,

DAS currently ships a UI component named 'message console'. It can be used
to browse the data inside the DAS tables.
IMO the name 'message console' is misleading; a person who's new to DAS
would not know its exact use just by reading the name.

I suggest a more self-explanatory name such as 'data explorer', 'data
navigator', etc.

WDYT?

-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 https://twitter.com/N1R44
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture


Re: [Architecture] [DAS] Changing the name of Message Console

2015-07-11 Thread Niranda Perera
@suho +1

@iranga I think you are referring to the Spark console. Do you think we
should change the name from 'spark console' to 'query analyzer'?

On Sun, Jul 12, 2015 at 8:28 AM, Sriskandarajah Suhothayan s...@wso2.com
wrote:

 +1 for 'data explorer'; I believe it's in line with the primary use case
 of that component.

 Suho

 On Sat, Jul 11, 2015 at 10:46 PM, Iranga Muthuthanthri ira...@wso2.com
 wrote:

 +1. Since this category mostly falls under interactive analytics and is
 more related to querying data, I suggest 'Query Analyzer'.

 On Sun, Jul 12, 2015 at 7:59 AM, Niranda Perera nira...@wso2.com wrote:

 Hi all,

 DAS currently ships a UI component named 'message console'. It can be
 used to browse the data inside the DAS tables.
 IMO the name 'message console' is misleading; a person who's new to
 DAS would not know its exact use just by reading the name.

 I suggest a more self-explanatory name such as 'data explorer', 'data
 navigator', etc.

 WDYT?

 --
 *Niranda Perera*
 Software Engineer, WSO2 Inc.
 Mobile: +94-71-554-8430
 Twitter: @n1r44 https://twitter.com/N1R44
 https://pythagoreanscript.wordpress.com/

 ___
 Architecture mailing list
 Architecture@wso2.org
 https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture




 --
 Thanks & Regards

 Iranga Muthuthanthri
 (M) -0777-255773
 Team Product Management


 ___
 Architecture mailing list
 Architecture@wso2.org
 https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture




 --

 *S. Suhothayan*
 Technical Lead & Team Lead of WSO2 Complex Event Processor
 *WSO2 Inc. *http://wso2.com
 lean . enterprise . middleware

 *cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/ |
 twitter: http://twitter.com/suhothayan | linked-in:
 http://lk.linkedin.com/in/suhothayan*

 ___
 Architecture mailing list
 Architecture@wso2.org
 https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture




-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 https://twitter.com/N1R44
https://pythagoreanscript.wordpress.com/
___
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture


Re: [Architecture] [POC] Performance evaluation of Hive vs Shark

2014-12-15 Thread Niranda Perera
Hi David,

Could you point me to an example where SparkSQL is used in Stratio Deep?

Rgds

On Mon, Dec 15, 2014 at 2:20 PM, David Morales dmora...@stratio.com wrote:

 Hi there,

> For sure, the new release does support SparkSQL, so you can use SparkSQL
> and Stratio Deep together just out of the box.

> About Crossdata, it's not itself related to Spark, but it can use
> Spark-Deep. It's an interactive SQL engine like Hive, for example.


 Regards.

 2014-12-12 21:29 GMT+01:00 Niranda Perera nira...@wso2.com:

 Hi David,

 I have been going through the Deep-Spark examples. It looks very
 promising.

 On a follow-up query: does Deep-Spark/Deep-Cassandra support SQL-like
 operations on RDDs (like SparkSQL)?

 Example (from Datastax Cassandra connector demos):

 object SQLDemo extends DemoApp {

   val cc = new CassandraSQLContext(sc)

   CassandraConnector(conf).withSessionDo { session =>
     session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH REPLICATION = " +
       "{'class': 'SimpleStrategy', 'replication_factor': 1 }")
     session.execute("DROP TABLE IF EXISTS test.sql_demo")
     session.execute("CREATE TABLE test.sql_demo (key INT PRIMARY KEY, grp INT, value DOUBLE)")
     session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (1, 1, 1.0)")
     session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (2, 1, 2.5)")
     session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (3, 1, 10.0)")
     session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (4, 2, 4.0)")
     session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (5, 2, 2.2)")
     session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (6, 2, 2.8)")
   }

   val rdd = cc.cassandraSql("SELECT grp, max(value) AS mv FROM test.sql_demo GROUP BY grp ORDER BY mv")
   rdd.collect().foreach(println)  // [2, 4.0] [1, 10.0]

   sc.stop()
 }

 I also read about Stratio Crossdata. Does Crossdata serve this purpose?

 Rgds

 On Tue, Dec 2, 2014 at 11:14 PM, David Morales dmora...@stratio.com
 wrote:

 Hi!

 Please, check the develop branch if you want to see a more realistic
 view of our development path. Last commit was about two hours ago :)

 Stratio Deep is one of our core modules so there is a core team in
 Stratio fully devoted to spark + noSQL integration. In these last months,
 for example, we have added mongoDB, ElasticSearch and Aerospike to Stratio
 Deep, so you can talk to these databases from Spark just like you do with
 HDFS.

 Furthermore, we are working on more backends, such as neo4j or
 couchBase, for example.


 About our benchmarks, you can check out some results in this link:
 http://www.stratio.com/deep-vs-datastax/

 Please, keep in mind that spark integration with a datastore could be
 done in two ways: HCI or native. We are now working on improving native
 integration because it's quite more performant. In this way, we are just
 working on some other tests with even more impressive results.


 Here you can find a technical overview of all our platform.


 http://www.slideshare.net/Stratio/stratio-platform-overview-v41

 Regards

 2014-12-02 11:14 GMT+01:00 Niranda Perera nira...@wso2.com:

 Hi David,

 Sorry to re-initiate this thread, but may I know if you have done any
 benchmarking of the Datastax Spark Cassandra connector against the Stratio
 Deep-Spark Cassandra integration? I would love to take a look at it.

 I recently checked the deep-spark GitHub repo and noticed that there has
 been no activity since Oct 29th. May I know what your future plans are for
 this particular project?

 Cheers

 On Tue, Aug 26, 2014 at 9:12 PM, David Morales dmora...@stratio.com
 wrote:

 Yes, it is already included in our benchmarks.

 It could be a nice idea to share our findings, let me talk about it
 here. Meanwhile, you can ask us any question by using my mail or this
 thread, we are glad to help you.


 Best regards.


 2014-08-24 15:49 GMT+02:00 Niranda Perera nira...@wso2.com:

 Hi David,

 Thank you for your detailed reply.

 It was great to hear about Stratio-Deep and I must say, it looks very
 interesting. Storage handlers for databases such as Cassandra, MongoDB, etc.
 would be very helpful. We will definitely look into Stratio-Deep.

 I came across with the Datastax Spark-Cassandra connector (
 https://github.com/datastax/spark-cassandra-connector ). Have you
 done any comparison with your implementation and Datastax's connector?

 And, yes, please do share the performance results with us once it's
 ready.

 On a different note, is there any way for us to interact with Stratio
 dev community, in the form of dev mail lists etc, so that we could 
 mutually
 share our findings?

 Best regards



 On Fri, Aug 22, 2014 at 2:07 PM, David Morales dmora...@stratio.com
 wrote:

 Hi there,

 *1. About the size of deployments.*

 It depends on your use case... especially when you combine Spark with
 a datastore. We usually deploy Spark with Cassandra or MongoDB, instead of
 using HDFS, for example.
 using HDFS for example.

 Spark will be faster if you put

Re: [Architecture] [POC] Performance evaluation of Hive vs Shark

2014-12-12 Thread Niranda Perera
Hi David,

I have been going through the Deep-Spark examples. It looks very promising.

On a follow-up query: does Deep-Spark/Deep-Cassandra support SQL-like
operations on RDDs (like SparkSQL)?

Example (from Datastax Cassandra connector demos):

object SQLDemo extends DemoApp {

  val cc = new CassandraSQLContext(sc)

  CassandraConnector(conf).withSessionDo { session =>
    session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH REPLICATION = " +
      "{'class': 'SimpleStrategy', 'replication_factor': 1 }")
    session.execute("DROP TABLE IF EXISTS test.sql_demo")
    session.execute("CREATE TABLE test.sql_demo (key INT PRIMARY KEY, grp INT, value DOUBLE)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (1, 1, 1.0)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (2, 1, 2.5)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (3, 1, 10.0)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (4, 2, 4.0)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (5, 2, 2.2)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (6, 2, 2.8)")
  }

  val rdd = cc.cassandraSql("SELECT grp, max(value) AS mv FROM test.sql_demo GROUP BY grp ORDER BY mv")
  rdd.collect().foreach(println)  // [2, 4.0] [1, 10.0]

  sc.stop()
}

I also read about Stratio Crossdata. Does Crossdata serve this purpose?

Rgds

On Tue, Dec 2, 2014 at 11:14 PM, David Morales dmora...@stratio.com wrote:

 Hi!

 Please, check the develop branch if you want to see a more realistic view
 of our development path. Last commit was about two hours ago :)

 Stratio Deep is one of our core modules so there is a core team in Stratio
 fully devoted to spark + noSQL integration. In these last months, for
 example, we have added mongoDB, ElasticSearch and Aerospike to Stratio
 Deep, so you can talk to these databases from Spark just like you do with
 HDFS.

 Furthermore, we are working on more backends, such as neo4j or couchBase,
 for example.


 About our benchmarks, you can check out some results in this link:
 http://www.stratio.com/deep-vs-datastax/

 Please, keep in mind that spark integration with a datastore could be done
 in two ways: HCI or native. We are now working on improving native
 integration because it's quite more performant. In this way, we are just
 working on some other tests with even more impressive results.


 Here you can find a technical overview of all our platform.


 http://www.slideshare.net/Stratio/stratio-platform-overview-v41

 Regards

 2014-12-02 11:14 GMT+01:00 Niranda Perera nira...@wso2.com:

 Hi David,

 Sorry to re-initiate this thread, but may I know if you have done any
 benchmarking of the Datastax Spark Cassandra connector against the Stratio
 Deep-Spark Cassandra integration? I would love to take a look at it.

 I recently checked the deep-spark GitHub repo and noticed that there has
 been no activity since Oct 29th. May I know what your future plans are for
 this particular project?

 Cheers

 On Tue, Aug 26, 2014 at 9:12 PM, David Morales dmora...@stratio.com
 wrote:

 Yes, it is already included in our benchmarks.

 It could be a nice idea to share our findings, let me talk about it
 here. Meanwhile, you can ask us any question by using my mail or this
 thread, we are glad to help you.


 Best regards.


 2014-08-24 15:49 GMT+02:00 Niranda Perera nira...@wso2.com:

 Hi David,

 Thank you for your detailed reply.

 It was great to hear about Stratio-Deep and I must say, it looks very
 interesting. Storage handlers for databases such as Cassandra, MongoDB, etc.
 would be very helpful. We will definitely look into Stratio-Deep.

 I came across with the Datastax Spark-Cassandra connector (
 https://github.com/datastax/spark-cassandra-connector ). Have you done
 any comparison with your implementation and Datastax's connector?

 And, yes, please do share the performance results with us once it's
 ready.

 On a different note, is there any way for us to interact with Stratio
 dev community, in the form of dev mail lists etc, so that we could mutually
 share our findings?

 Best regards



 On Fri, Aug 22, 2014 at 2:07 PM, David Morales dmora...@stratio.com
 wrote:

 Hi there,

 *1. About the size of deployments.*

 It depends on your use case... especially when you combine Spark with a
 datastore. We usually deploy Spark with Cassandra or MongoDB, instead of
 using HDFS, for example.

 Spark will be faster if you put the data in memory, so if you need a
 lot of speed (interactive queries, for example), you should have enough
 memory.


 *2. About storage handlers.*

 We have developed the first tight integration between Cassandra and
 Spark, called Stratio Deep, announced in the first spark summit. You can
 check Stratio Deep out here: https://github.com/Stratio/stratio-deep 
 (open,
 apache2 license).

 *Deep is a thin integration layer between Apache Spark and several
 NoSQL datastores. We actually support Apache Cassandra and MongoDB, but in
 the near

Re: [Architecture] [POC] Performance evaluation of Hive vs Shark

2014-12-02 Thread Niranda Perera
Hi David,

Sorry to re-initiate this thread, but may I know if you have done any
benchmarking of the Datastax Spark Cassandra connector against the Stratio
Deep-Spark Cassandra integration? I would love to take a look at it.

I recently checked the deep-spark GitHub repo and noticed that there has been
no activity since Oct 29th. May I know what your future plans are for this
particular project?

Cheers

On Tue, Aug 26, 2014 at 9:12 PM, David Morales dmora...@stratio.com wrote:

 Yes, it is already included in our benchmarks.

 It could be a nice idea to share our findings, let me talk about it here.
 Meanwhile, you can ask us any question by using my mail or this thread, we
 are glad to help you.


 Best regards.


 2014-08-24 15:49 GMT+02:00 Niranda Perera nira...@wso2.com:

 Hi David,

 Thank you for your detailed reply.

 It was great to hear about Stratio-Deep and I must say, it looks very
 interesting. Storage handlers for databases such as Cassandra, MongoDB, etc.
 would be very helpful. We will definitely look into Stratio-Deep.

 I came across with the Datastax Spark-Cassandra connector (
 https://github.com/datastax/spark-cassandra-connector ). Have you done
 any comparison with your implementation and Datastax's connector?

 And, yes, please do share the performance results with us once it's ready.

 On a different note, is there any way for us to interact with Stratio dev
 community, in the form of dev mail lists etc, so that we could mutually
 share our findings?

 Best regards



 On Fri, Aug 22, 2014 at 2:07 PM, David Morales dmora...@stratio.com
 wrote:

 Hi there,

 *1. About the size of deployments.*

 It depends on your use case... especially when you combine Spark with a
 datastore. We usually deploy Spark with Cassandra or MongoDB, instead of
 using HDFS, for example.

 Spark will be faster if you put the data in memory, so if you need a lot
 of speed (interactive queries, for example), you should have enough memory.


 *2. About storage handlers.*

 We have developed the first tight integration between Cassandra and
 Spark, called Stratio Deep, announced in the first spark summit. You can
 check Stratio Deep out here: https://github.com/Stratio/stratio-deep (open,
 apache2 license).

 *Deep is a thin integration layer between Apache Spark and several NoSQL
 datastores. We actually support Apache Cassandra and MongoDB, but in the
 near future we will add support for sever other datastores.*

 Datastax announced its own driver for Spark at the last Spark Summit,
 but we have been working on our solution for almost a year.

 Furthermore, we are working to extend this solution in order to
 work also with other databases... MongoDB integration is completed right
 now and ElasticSearch will be ready in a few weeks.

 And that is not all, we have also developed an integration with
 Cassandra and Lucene for indexing data (open source, apache2).

 *Stratio Cassandra is a fork of Apache Cassandra
 (http://cassandra.apache.org/) where the index functionality has been
 extended to provide near-real-time search, as in ElasticSearch or Solr,
 including full-text search (http://en.wikipedia.org/wiki/Full_text_search)
 capabilities and free multivariable search. It is achieved through an
 Apache Lucene (http://lucene.apache.org/) based implementation of Cassandra
 secondary indexes, where each node of the cluster indexes its own data.*


 We will publish some benchmarks in two weeks, so I will share our
 results here if you are interested.


 If you are more interested in distributed file systems, you should take
 a look at Tachyon: http://tachyon-project.org/index.html
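 Spark addresses Tachyon through its Hadoop-compatible filesystem URIs, so
 using it is mostly a matter of the path scheme (a sketch; assumes an
 existing SparkContext sc, and the host, port and paths are invented):

     // Tachyon exposes a Hadoop-compatible filesystem; Spark reads and
     // writes it by URI, with the data served from Tachyon's memory tier.
     val lines = sc.textFile("tachyon://tachyon-master:19998/datasets/input.txt")
     lines.saveAsTextFile("tachyon://tachyon-master:19998/datasets/output")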


 *3. Spark - Hive compatibility*

 Spark will support anything that implements the Hadoop InputFormat interface.
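 Concretely (a minimal sketch of Spark's Hadoop-interop API; assumes an
 existing SparkContext sc and a hypothetical HDFS path), anything with an
 InputFormat loads as an RDD of key/value pairs:

     import org.apache.hadoop.io.{LongWritable, Text}
     import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

     // Any Hadoop InputFormat works here; TextInputFormat is the simplest example.
     val records = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
       "hdfs:///data/logs")
     val lines = records.map { case (_, text) => text.toString }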


 *4. Performance*

 We are working a lot with Cassandra and MongoDB, and the performance is
 quite nice. We are finishing right now some benchmarks comparing Hadoop +
 HDFS vs Spark + HDFS vs Spark + Cassandra (using Stratio Deep and even our
 fork of Cassandra).

 Let me share these results with you when they are ready, ok?




 Regards.



 2014-08-22 7:53 GMT+02:00 Niranda Perera nira...@wso2.com:

 Hi Srinath,
 Yes, I am working on deploying it on a multi-node cluster with the debs
 dataset. I will keep architecture@ posted on the progress.


 Hi David,
 Thank you very much for the detailed insight you've provided.
 Few quick questions,
 1. Do you have experience using storage handlers in Spark?
 2. Would a storage handler used in Hive be directly compatible with
 Spark?
 3. How would you grade the performance of Spark with other databases such
 as Cassandra, HBase, H2, etc.?

 Thank you very much again for your interest. I look forward to hearing
 from you.

 Regards


 On Thu, Aug 21, 2014 at 7:02 PM, Srinath Perera srin...@wso2.com
 wrote:

 Niranda, we need to test Spark in multi-node mode before making a
 decision. Spark is very fast, I think there is no doubt about that. We need
 to make sure it is stable.

 David, thanks

[Architecture] POODLE Vulnerability (SSL 3.0) in WSO2 Carbon 3.0 Products

2014-10-30 Thread Niranda Perera
Hi all,

This follows Prabath's blogpost on POODLE Attack and Disabling SSL V3 in
WSO2 Carbon 4.2.0 Based Products [1].

I was trying to disable SSL v3 in Carbon 3.0 as per the blogpost, but ran
into the following problems.

1. I could not find a catalina-server.xml file in Carbon 3.0. Is there a
different configuration file which governs the SSL protocol version in
Carbon 3.0?
AFAIK Carbon 3.0 uses Tomcat 5.5 (pls correct me if I am wrong!), which
supports SSLv1, v2 and v3.

2. In the axis2.xml file of Carbon 3.0 products (ESB), I could not find a
transportReceiver configuration element for PassThroughHttpSSLListener. Was
this introduced after Carbon 3.0?

Would be very grateful if you could help me out on this.

Cheers

[1]
http://blog.facilelogin.com/2014/10/poodle-attack-and-disabling-ssl-v3-in_18.html

-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 https://twitter.com/N1R44


[Architecture] [Dev] [Suggestion] List of ports used by WSO2 products

2014-10-22 Thread Niranda Perera
Hi,

Is there a list of ports used by WSO2 products by default?

IMO it would be better if we could have one, because it would make it
easier to set up Security Group Rules while spawning support cloud instances.

Would like to know your opinion about this!

Cheers

-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 https://twitter.com/N1R44


Re: [Architecture] [Dev] [Suggestion] List of ports used by WSO2 products

2014-10-22 Thread Niranda Perera
Thanks Nuwan. Yes, I was looking for this, my bad!

On Wed, Oct 22, 2014 at 12:00 PM, Nuwan Silva nuw...@wso2.com wrote:

 Yes, looking for something like [1]?

 [1] https://docs.wso2.com/display/shared/Default+Ports+of+WSO2+Products

 On Wed, Oct 22, 2014 at 11:48 AM, Niranda Perera nira...@wso2.com wrote:

 Hi,

 Is there a list of ports used by WSO2 products by default?

 IMO it would be better if we could have one, because it would make it
 easier to set up Security Group Rules while spawning support cloud instances.

 Would like to know your opinion about this!

 Cheers

 --
 *Niranda Perera*
 Software Engineer, WSO2 Inc.
 Mobile: +94-71-554-8430
 Twitter: @n1r44 https://twitter.com/N1R44




 --


 *Nuwan Silva*
 *Senior Software Engineer - QA*
 Mobile: +94779804543

 WSO2 Inc.
 lean . enterprise . middleware.
 http://www.wso2.com




-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 https://twitter.com/N1R44


Re: [Architecture] BAM Performance tests

2014-09-02 Thread Niranda Perera


Re: [Architecture] [POC] Performance evaluation of Hive vs Shark

2014-08-21 Thread Niranda Perera
Hi Srinath,
Yes, I am working on deploying it on a multi-node cluster with the debs
dataset. I will keep architecture@ posted on the progress.


Hi David,
Thank you very much for the detailed insight you've provided.
Few quick questions,
1. Do you have experience using storage handlers in Spark?
2. Would a storage handler used in Hive be directly compatible with Spark?
3. How would you grade the performance of Spark with other databases such as
Cassandra, HBase, H2, etc.?

Thank you very much again for your interest. I look forward to hearing from
you.

Regards


On Thu, Aug 21, 2014 at 7:02 PM, Srinath Perera srin...@wso2.com wrote:

 Niranda, we need to test Spark in multi-node mode before making a decision.
 Spark is very fast, I think there is no doubt about that. We need to make
 sure it is stable.

 David, thanks for a detailed email! How big (nodes) is the Spark setup you
 guys are running?

 --Srinath



 On Thu, Aug 21, 2014 at 1:34 PM, David Morales dmora...@stratio.com
 wrote:

 Sorry for disturbing this thread, but I think I can help clarify
 a few things (we attended the last Spark Summit, where we were also
 speakers, and we work very closely with Spark).

 *Hive/Shark and other benchmarks*

 You can find a nice comparison and benchmark on this site:
 https://amplab.cs.berkeley.edu/benchmark/


 *Shark and SparkSQL*

 SparkSQL is the natural replacement for Shark, but SparkSQL is still
 young at this moment. If you are looking for Hive compatibility, you have
 to execute SparkSQL with a specific context.

 Quoted from the Spark website:

 *Note that Spark SQL currently uses a very basic SQL parser. Users that
 want a more complete dialect of SQL should look at the HiveQL support
 provided by HiveContext.*

 So, just note that SparkSQL is a work in progress. If you want SparkSQL
 you run a SQLContext; if you want Hive, you use a
 different context (HiveContext)...
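 In code, the two contexts look like this (a sketch against the Spark
 1.0-era API; assumes an existing SparkContext sc and a 'people' table that
 is already registered or defined in the Hive metastore):

     import org.apache.spark.sql.SQLContext
     import org.apache.spark.sql.hive.HiveContext

     // Basic SQL parser only:
     val sqlContext = new SQLContext(sc)
     val simple = sqlContext.sql("SELECT name FROM people WHERE age > 21")

     // Full HiveQL dialect, Hive metastore and UDFs:
     val hiveContext = new HiveContext(sc)
     val richer = hiveContext.hql("SELECT name, COUNT(*) FROM people GROUP BY name")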


 *Spark - Hadoop: the future*

 Most Hadoop distributions now include Spark: Cloudera, Hortonworks,
 MapR... and they are contributing to migrating the whole Hadoop ecosystem
 to Spark.

 Spark is a bit more than Map/Reduce... as you can read here:
 http://gigaom.com/2014/06/28/4-reasons-why-spark-could-jolt-hadoop-into-hyperdrive/


 *Spark Streaming / Spark SQL*

 Spark Streaming is built on Spark and provides stream processing
 through an abstraction called DStreams (a collection of RDDs in
 a window of time).
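 As a concrete picture of that abstraction (a minimal sketch using Spark
 Streaming's standard API; assumes an existing SparkContext sc, and the
 host, port and window sizes are invented):

     import org.apache.spark.streaming.{Seconds, StreamingContext}
     import org.apache.spark.streaming.StreamingContext._ // pair-DStream operations

     // A DStream is a sequence of RDDs, one per batch interval (here 1 second).
     val ssc = new StreamingContext(sc, Seconds(1))
     val lines = ssc.socketTextStream("localhost", 9999)

     // Windowed word count over the last 30s of micro-batches, sliding every 10s.
     val counts = lines
       .flatMap(_.split(" "))
       .map(word => (word, 1))
       .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

     counts.print()
     ssc.start()
     ssc.awaitTermination()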

 There are some efforts to make SparkSQL compatible with Spark
 Streaming (something similar to Trident for Storm), as you can see here:

 *StreamSQL (https://github.com/thunderain-project/StreamSQL) is a POC
 project based on Spark to combine the power of Catalyst and Spark
 Streaming, to offer people the ability to manipulate SQL on top of DStream
 as you wanted; this keeps the same semantics as SparkSQL and offers a
 SchemaDStream on top of DStream. You don't need to do tricky things like
 extracting the rdd to register it as a table. Besides that, other parts are
 the same as Spark.*

 So, you can apply SQL to a data stream, but it is very simple at the
 moment... you can expect a bunch of improvements in this area in the coming
 months (I guess that SparkSQL will work on Spark Streaming streams before
 the end of this year).



 *Spark Streaming / Spark SQL and CEP*

 There is no relationship at this moment between (your absolutely amazing)
 Siddhi CEP and Spark. As far as I know, you are working on doing
 distributed CEP with Storm and Siddhi.

 We are currently working on an interactive CEP built with Kafka +
 Spark Streaming + Siddhi, with some features such as an API, an interactive
 shell, built-in statistics and auditing, and built-in functions
 (save2cassandra, save2mongo, save2elasticsearch...).

 If you are interested we can talk about this project; I think it
 would be a nice idea!
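 The plumbing for the first two pieces of that pipeline looks roughly like
 this (a sketch using the spark-streaming-kafka helper; the ZooKeeper
 address, topic name and the Siddhi hand-off are placeholders; assumes an
 existing SparkContext sc):

     import org.apache.spark.streaming.{Seconds, StreamingContext}
     import org.apache.spark.streaming.kafka.KafkaUtils

     val ssc = new StreamingContext(sc, Seconds(1))

     // Consume a hypothetical "events" topic; records arrive as (key, message).
     val stream = KafkaUtils.createStream(
       ssc, "zk-host:2181", "cep-group", Map("events" -> 1))

     stream.map(_._2).foreachRDD { rdd =>
       rdd.foreach { event =>
         // placeholder: hand each raw event to the CEP engine here,
         // e.g. a Siddhi input handler, for pattern matching
       }
     }

     ssc.start()
     ssc.awaitTermination()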


 Anyway, I don't think that SparkSQL will evolve into something like a CEP.
 Patterns and sequences, for example, would be very complex to do with Spark
 Streaming (at least for now).



 Thanks.


 2014-08-21 6:18 GMT+02:00 Sriskandarajah Suhothayan s...@wso2.com:




 On Wed, Aug 20, 2014 at 1:36 PM, Niranda Perera nira...@wso2.com
 wrote:

 @Maninda,

 +1 for suggesting Spark SQL.

 Quoting Databricks:
 Spark SQL provides state-of-the-art SQL performance and maintains
 compatibility with Shark/Hive. In particular, like Shark, Spark SQL
 supports all existing Hive data formats, user-defined functions (UDF), and
 the Hive metastore. [1]

 But I am not entirely sure if Spark SQL and Siddhi are comparable,
 because SparkSQL (like Hive) is designed for batch processing, whereas
 Siddhi does real-time processing. But if there are implementations where
 Siddhi is run on top of Spark, it would be very interesting.

 Yes, Siddhi's current way of operation does not support this. But with
 partitions we can achieve this to some extent.

 Suho


 Spark supports

Re: [Architecture] [POC] Performance evaluation of Hive vs Shark

2014-08-20 Thread Niranda Perera
Hi Anjana and Srinath,

After the discussion I had with Anjana, I researched more on the
continuation of the Shark project by Databricks.

Here's what I found out:
- Shark was built on the Hive codebase and achieved performance
improvements by swapping out the physical execution engine part of Hive.
While this approach enabled Shark users to speed up their Hive queries,
Shark inherited a large, complicated code base from Hive that made it hard
to optimize and maintain.
Hence, Databricks has announced that they are halting the development of
Shark from July, 2014 (Shark 0.9 will be the last release). [1]
- Shark will be replaced by Spark SQL, which beats Shark in TPC-DS
performance by almost an order of magnitude [2]. It also supports all
existing Hive data formats, user-defined functions (UDF), and the Hive
metastore. [2]
- The Shark to Spark SQL migration plan is here:
http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf

- For legacy Hive and MapReduce users, they have proposed a new 'Hive
on Spark' project. [3], [4]
But, given the performance enhancement, it is quite certain that Hive and
MR will be replaced by engines built on top of Spark (e.g., Spark SQL).



In my opinion there are a few matters to figure out if we are migrating
from Hive:

1. Are we changing the query engine only? (Then we can replace
Hive with Shark.)
2. Are we changing the existing Hadoop/MapReduce framework to
Spark? (Then we can replace Hive and Hadoop with Spark and Spark SQL.)


In my opinion, considering the long-term impact and the availability of
support, it is best to migrate from Hive/Hadoop to Spark.
It is open for discussion!

In the meantime, I have already tried Spark SQL, and Databricks' claims of
improved performance seem to be true. I will work more on this.
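For reference, the kind of quick trial described above looks like this with
the Spark 1.0-era API (a sketch; the schema and data are invented, and
registerAsTable was later renamed registerTempTable):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sc = new SparkContext(
      new SparkConf().setAppName("sql-trial").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD conversion

    val people = sc.parallelize(Seq(Person("ann", 34), Person("bob", 19)))
    people.registerAsTable("people")

    val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 21")
    adults.collect().foreach(println)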

Cheers

[1]
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
[2]
http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
[3] https://issues.apache.org/jira/browse/HIVE-7292
[4] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark



On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando anj...@wso2.com wrote:

 Hi Srinath,

 No, this has not been tested on multiple nodes. I told Niranda here in my
 last mail to test a cluster with the same set of hardware that we
 are using to test our large data set with Hive. As for the effort to make
 the change, we still have to figure out the MT aspects of Shark here.
 Sinthuja was working on making the latest Hive version MT ready, and most
 probably we can make the same changes to the Hive version Shark is using. So
 after we do that, the integration should be seamless. And also, as I
 mentioned earlier here, we are also going to test this with the APIM Hive
 script, to check if there are any unforeseen incompatibilities.

 Cheers,
 Anjana.


 On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera srin...@wso2.com wrote:

 This looks great.

 We need to test Spark with multiple nodes. Did we do that? Please create a
 few VMs in the performance cloud (talk to Lakmal) and test with at least 5
 nodes. We need to make sure it works OK in a distributed setup as well.

 What does it take to change to Spark? Anjana... how much work is it?

 --Srinath


 On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera nira...@wso2.com wrote:

 Thank you Anjana.

 Yes, I am working on it.

 In the meantime, I found this in the Hive documentation [1]. It talks about
 Hive on Spark, and compares Hive, Shark and Spark SQL at a higher
 architectural level.

 Additionally, it is said that the in-memory performance of Shark can be
 improved by introducing Tachyon [2]. I guess we can consider this later on.

 Cheers.

 [1]
 https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL
 [2] http://tachyon-project.org/Running-Tachyon-Locally.html



 On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando anj...@wso2.com
 wrote:

 Hi Niranda,

 Excellent analysis of Hive vs Shark! .. This gives a lot of insight
 into how both operate in different scenarios. As the next step, we will
 need to run this on an actual cluster of computers. Since you've used a
 subset of the dataset of the 2014 DEBS challenge, we should use the full
 data set in a clustered environment and check this. Gokul is already
 working on the Hive-based setup for this; after that is done, you can
 create a Shark cluster on the same hardware and run the tests there, to get
 a clear comparison of how these two match up in a cluster. Until the setup
 is ready, do continue with your next steps on checking the RDD support and
 Spark SQL use.

 After these are done, we should also do a trial run of our own APIM
 Hive scripts, migrated to Shark.

 Cheers,
 Anjana.


 On Mon, Aug 11, 2014 at 12:21 PM, Niranda

Re: [Architecture] [POC] Performance evaluation of Hive vs Shark

2014-08-20 Thread Niranda Perera
@Maninda,

+1 for suggesting Spark SQL.

Quoting Databricks:
Spark SQL provides state-of-the-art SQL performance and maintains
compatibility with Shark/Hive. In particular, like Shark, Spark SQL
supports all existing Hive data formats, user-defined functions (UDF), and
the Hive metastore. [1]

But I am not entirely sure if Spark SQL and Siddhi are comparable, because
SparkSQL (like Hive) is designed for batch processing, whereas Siddhi does
real-time processing. But if there are implementations where Siddhi is run
on top of Spark, it would be very interesting.

Spark supports either Hadoop 1 or 2. But I think we should see what is
best: MR1 or YARN+MR2.

[image: Hadoop Architecture] [2]

[1]
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
[2] http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html


On Wed, Aug 20, 2014 at 1:13 PM, Lasantha Fernando lasan...@wso2.com
wrote:

 Hi Maninda,

 On 20 August 2014 12:02, Maninda Edirisooriya mani...@wso2.com wrote:

 In the case of the discontinuation of the Shark project, IMO we should not
 move to Shark at all.
 And it seems better to go with Spark SQL as we are already using Spark
 for CEP. But I am not sure of the difference between Spark SQL and Siddhi
 queries on the Spark engine.


 Currently, we are doing the CEP integration using Apache Storm, not
 Spark... :-). Spark Streaming is a possible candidate for integrating with
 CEP, but we have opted for Storm. I think there has been some independent
 work on integrating Kafka + Spark Streaming + Siddhi. Please refer to the
 thread on arch@: [Architecture] A few questions about WSO2 CEP/Siddhi


 And we have to figure out how Spark SQL is used for historical data, and
 whether it can execute incremental processing by default, which would
 implement all our existing BAM use cases.
 On the other hand, Hadoop 2 [1] uses a completely different platform for
 resource allocation, known as YARN. Sometimes this may be more suitable for
 batch jobs.

 [1] https://www.youtube.com/watch?v=RncoVN0l6dc


 Thanks,
 Lasantha


 *Maninda Edirisooriya*
 Senior Software Engineer

 *WSO2, Inc. *lean.enterprise.middleware.

 *Blog* : http://maninda.blogspot.com/
 *E-mail* : mani...@wso2.com
 *Skype* : @manindae
 *Twitter* : @maninda


 On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera nira...@wso2.com
 wrote:

 Hi Anjana and Srinath,

 After the discussion I had with Anjana, I researched more on the
 continuation of Shark project by Databricks.

 Here's what I found out,
 - Shark was built on the Hive codebase and achieved performance
 improvements by swapping out the physical execution engine part of Hive.
 While this approach enabled Shark users to speed up their Hive queries,
 Shark inherited a large, complicated code base from Hive that made it hard
 to optimize and maintain.
 Hence, Databricks has announced that they are halting the development of
 Shark from July, 2014. (Shark 0.9 would be the last release) [1]
 - Shark will be replaced by Spark SQL, which beats Shark in TPC-DS
 performance by almost an order of magnitude [2]. It also supports all
 existing Hive data formats, user-defined functions (UDF), and the Hive
 metastore. [2]
 - The Shark to Spark SQL migration plan is here:
 http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf

 - For legacy Hive and MapReduce users, they have proposed a new
 'Hive on Spark' project. [3], [4]
 But, given the performance enhancement, it is quite certain that Hive
 and MR will be replaced by engines built on top of Spark (e.g., Spark SQL)



 In my opinion there are a few matters to figure out if we are migrating
 from Hive,

 1. Are we changing the query engine only? (Then we can replace
 Hive with Shark.)
 2. Are we changing the existing Hadoop/MapReduce framework to
 Spark? (Then we can replace Hive and Hadoop with Spark and Spark SQL.)


 In my opinion, considering the long-term impact and the availability of
 support, it is best to migrate from Hive/Hadoop to Spark.
 It is open for discussion!

 In the meantime, I have already tried Spark SQL, and Databricks' claims of
 improved performance seem to be true. I will work more on this.

 Cheers

 [1]
 http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
 [2]
 http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
 [3] https://issues.apache.org/jira/browse/HIVE-7292
 [4] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark



 On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando anj...@wso2.com
 wrote:

 Hi Srinath,

 No, this has not been tested on multiple nodes. I told Niranda here in
 my last mail to test a cluster with the same set of hardware that
 we are using to test our large data set with Hive. As for the effort