Sorry for disturbing this thread, but i think that i can help clarifying a few things (we were attending the last Spark Summit, we were also speakers there and we are working very close to spark)
*> Hive/Shark and others benchmark* You can find a nice comparison and benchmark in this web: https://amplab.cs.berkeley.edu/benchmark/ *> Shark and SparkSQL* SparkSQL is the natural replacement for Shark, but SparkSQL is still young at this moment. If you are looking for Hive compatibility, you have to execute SparkSQL with an specific context. Quoted from spark website: *> Note that Spark SQL currently uses a very basic SQL parser. Users that want a more complete dialect of SQL should look at the HiveSQL support provided by HiveContext.* So, only note that SparkSQL is a work in progress. If you want SparkSQL you have to run a SparkSQLContext, if you want Hive, you will have a different context... *> Spark - Hadoop: the future* Most Hadoop distributions are including Spark: cloudera, hortonworks, mapR... and contributing to migrate all the Hadoop ecosystem to Spark. Spark is a bit more than Map/Reduce... as you can read here: http://gigaom.com/2014/06/28/4-reasons-why-spark-could-jolt-hadoop-into-hyperdrive/ *> Spark Streaming / Spark SQL* Spark Streaming is built on Spark and it provides streaming processing through an information abstraction called DStreams (a collection of RDDs in a window of time). There is some efforts in order to make SparkSQL compatible with Spark Streaming (something similar to trident for storm), as you can see here: *StreamSQL (https://github.com/thunderain-project/StreamSQL <https://github.com/thunderain-project/StreamSQL>) is a POC project based on Spark to combine the power of Catalyst and Spark Streaming, to offer people the ability to manipulate SQL on top of DStream as you wanted, this keep the same semantics with SparkSQL as offer a SchemaDStream on top of DStream. You don't need to do tricky thing like extracting rdd to register as a table. Besides other parts are the same as Spark.* So, you can apply a SQL in a data stream, but it is very simple at the moment... you can expect a bunch of improvements in this matter in the next months (i guess that sparkSQL will work on Spark streaming streams before the end of this year). *> Spark Streaming / Spark SQL and CEP* There is no relationship at this moment between (your absolutely amazing) Siddhi CEP and Spark. As fas as i know, you are working in doing distributed CEP with Storm and Siddhi. We are currently working on doing an interactive cep built with kafka + spark streaming + siddhi, with some features such as an API, an interactive shell, built-in statistics and auditing, built-in functions (save2cassandra, save2mongo, save2elasticsearch...). If you are interested we can talk about this project, i think that it would be a nice idea¡ Anyway, i don't think that SparkSQL will evolve in something like a CEP. Patterns, sequences, for example would be very complex to do with spark streaming (at least now). Thanks. 2014-08-21 6:18 GMT+02:00 Sriskandarajah Suhothayan <[email protected]>: > > > > On Wed, Aug 20, 2014 at 1:36 PM, Niranda Perera <[email protected]> wrote: > >> @Maninda, >> >> +1 for suggesting Spark SQL. >> >> Quote Databricks, >> "Spark SQL provides state-of-the-art SQL performance and maintains >> compatibility with Shark/Hive. In particular, like Shark, Spark SQL >> supports all existing Hive data formats, user-defined functions (UDF), and >> the Hive metastore." [1] >> >> But I am not entirely sure if Spark SQL and Siddhi is comparable, because >> SparkSQL (like Hive) is designed for batch processing, where as Siddhi is >> real-time processing. But if there are implementations where Siddhi is run >> on top of Spark, it would be very interesting. >> > Yes Siddhi's current way of operation does not support this. But with > partitions and we can achieve this to some extent. > > Suho > >> >> Spark supports either Hadoop1 or 2. But I think we should see, what is >> best, MR1 or YARN+MR2 >> >> [image: Hadoop Architecture] >> [2] >> >> [1] >> http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html >> [2] http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html >> >> >> On Wed, Aug 20, 2014 at 1:13 PM, Lasantha Fernando <[email protected]> >> wrote: >> >>> Hi Maninda, >>> >>> On 20 August 2014 12:02, Maninda Edirisooriya <[email protected]> wrote: >>> >>>> In the case of discontinuity of Shark project, IMO we should not move >>>> to Shark at all. >>>> And it seems better to go with Spark SQL as we are already using Spark >>>> for CEP. But I am not sure the difference between Spark SQL and the Siddhi >>>> queries on the Spark engine. >>>> >>> >>> Currently, we are doing integration with CEP using Apache Storm, not >>> Spark... :-). Spark Streaming is a possible candidate for integrating with >>> CEP, but we have opted with Storm. I think there has been some independent >>> work on integrating Kafka + Spark Streaming + Siddhi. Please refer to >>> thread on arch@ "[Architecture] A few questions about WSO2 CEP/Siddhi" >>> >>> >>> And we have to figure out how Spark SQL is used for historical data, >>>> whether it can execute incremental processing by default which will >>>> implement all out existing BAM use cases. >>>> On the other hand in Hadoop 2 [1] they are using a completely different >>>> platform for resource allocation known as Yarn. Sometimes this may be more >>>> suitable for batch jobs. >>>> >>>> [1] https://www.youtube.com/watch?v=RncoVN0l6dc >>>> >>>> >>> Thanks, >>> Lasantha >>> >>>> >>>> *Maninda Edirisooriya* >>>> Senior Software Engineer >>>> >>>> *WSO2, Inc. *lean.enterprise.middleware. >>>> >>>> *Blog* : http://maninda.blogspot.com/ >>>> *E-mail* : [email protected] >>>> *Skype* : @manindae >>>> *Twitter* : @maninda >>>> >>>> >>>> On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera <[email protected]> >>>> wrote: >>>> >>>>> Hi Anjana and Srinath, >>>>> >>>>> After the discussion I had with Anjana, I researched more on the >>>>> continuation of Shark project by Databricks. >>>>> >>>>> Here's what I found out, >>>>> - Shark was built on the Hive codebase and achieved performance >>>>> improvements by swapping out the physical execution engine part of Hive. >>>>> While this approach enabled Shark users to speed up their Hive queries, >>>>> Shark inherited a large, complicated code base from Hive that made it hard >>>>> to optimize and maintain. >>>>> Hence, Databricks has announced that they are halting the development >>>>> of Shark from July, 2014. (Shark 0.9 would be the last release) [1] >>>>> - Shark will be replaced by Spark SQL. It beats Shark in TPC-DS >>>>> performance >>>>> <http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html> >>>>> by almost an order of magnitude. It also supports all existing Hive data >>>>> formats, user-defined functions (UDF), and the Hive metastore. [2] >>>>> - Following is the Shark, Spark SQL migration plan >>>>> http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf >>>>> >>>>> - For the legacy Hive and MapReduce users, they have proposed a new >>>>> 'Hive on Spark Project' [3], [4] >>>>> But, given the performance enhancement, it is quite certain that Hive >>>>> and MR would be replaced by engines build on top of Spark (ex: Spark SQL) >>>>> >>>>> >>>>> >>>>> In my opinion there are a few matters to figure out if we are >>>>> migrating from Hive, >>>>> >>>>> 1. whether we are changing the query engine only? (Then, we can >>>>> replace Hive by Shark) >>>>> 2. whether we are changing the existing Hadoop/ MapReduce framework to >>>>> Spark? (Then we can replace Hive and Hadoop with Spark and Spark SQL) >>>>> >>>>> >>>>> In my opinion, considering the longterm impact and the availability of >>>>> support, it is best to migrate the Hive/Hadoop to Spark. >>>>> It is open for discussion! >>>>> >>>>> In the mean time, I've already tried Spark SQL, and Databricks claims >>>>> on improved performance seems to be true. I will work more on this. >>>>> >>>>> Cheers >>>>> >>>>> [1] >>>>> http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html >>>>> [2] >>>>> http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html >>>>> [3] https://issues.apache.org/jira/browse/HIVE-7292 >>>>> [4] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark >>>>> >>>>> >>>>> >>>>> On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando <[email protected]> >>>>> wrote: >>>>> >>>>>> Hi Srinath, >>>>>> >>>>>> No, this has not been tested in multiple nodes. I told Niranda here >>>>>> in my last mail, to test a cluster with the same set of hardware we have, >>>>>> that we are using to test our large data set with Hive. As for the effort >>>>>> to make the change, we still have to figure out the MT aspects of Shark >>>>>> here. Sinthuja was working on making the latest Hive version MT ready, >>>>>> and >>>>>> most probably, we can do the same changes to the Hive version Shark is >>>>>> using. So after we do that, the integration should be seamless. And also, >>>>>> as I mentioned earlier here, we are also going to test this with the APIM >>>>>> Hive script, to check if there are any unforeseen incompatibilities. >>>>>> >>>>>> Cheers, >>>>>> Anjana. >>>>>> >>>>>> >>>>>> On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> This look great. >>>>>>> >>>>>>> We need to test Spark with multiple nodes? Did we do that. Please >>>>>>> create few VMs in performance could (talk to Lakmal) and test with at >>>>>>> least >>>>>>> 5 nodes. We need to make sure it works OK with distributed setup as >>>>>>> well. >>>>>>> >>>>>>> What does it take to change to spark? Anjana .. how much work is it? >>>>>>> >>>>>>> --Srinath >>>>>>> >>>>>>> >>>>>>> On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> Thank you Anjana. >>>>>>>> >>>>>>>> Yes, I am working on it. >>>>>>>> >>>>>>>> In the mean time, I found this in Hive documentation [1]. It talks >>>>>>>> about Hive on Spark, and compares Hive, Shark and Spark SQL at an >>>>>>>> higher >>>>>>>> architectural level. >>>>>>>> >>>>>>>> Additionally, it is said that the in-memory performance of Shark >>>>>>>> can be improved by introducing Tachyon [2]. I guess we can consider >>>>>>>> this >>>>>>>> later on. >>>>>>>> >>>>>>>> Cheers. >>>>>>>> >>>>>>>> [1] >>>>>>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL >>>>>>>> [2] http://tachyon-project.org/Running-Tachyon-Locally.html >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi Niranda, >>>>>>>>> >>>>>>>>> Excellent analysis of Hive vs Shark! .. This gives a lot of >>>>>>>>> insight into how both operates in different scenarios. As the next >>>>>>>>> step, we >>>>>>>>> will need to run this in an actual cluster of computers. Since you've >>>>>>>>> used >>>>>>>>> a subset of the dataset of 2014 DEBS challenge, we should use the >>>>>>>>> full data >>>>>>>>> set in a clustered environment and check this. Gokul is already >>>>>>>>> working on >>>>>>>>> the Hive based setup for this, after that is done, you can create a >>>>>>>>> Shark >>>>>>>>> cluster in the same hardware and run the tests there, to get a clear >>>>>>>>> comparison on how these two match up in a cluster. Until the setup is >>>>>>>>> ready, do continue with your next steps on checking the RDD support >>>>>>>>> and >>>>>>>>> Spark SQL use. >>>>>>>>> >>>>>>>>> After these are done, we should also do a trial run of our own >>>>>>>>> APIM Hive scripts, migrated to Shark. >>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> Anjana. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera < >>>>>>>>> [email protected]> wrote: >>>>>>>>> >>>>>>>>>> Hi all, >>>>>>>>>> >>>>>>>>>> I have been evaluating the performance of Shark (distributed SQL >>>>>>>>>> query engine for Hadoop) against Hive. This is with the objective of >>>>>>>>>> seeing >>>>>>>>>> the possibility to move the WSO2 BAM data processing (which >>>>>>>>>> currently uses >>>>>>>>>> Hive) to Shark (and Apache Spark) for improved performance. >>>>>>>>>> >>>>>>>>>> I am sharing my findings herewith. >>>>>>>>>> >>>>>>>>>> *AMP Lab Shark* >>>>>>>>>> Shark can execute Hive QL queries up to 100 times faster than >>>>>>>>>> Hive without any modification to the existing data or queries. It >>>>>>>>>> supports >>>>>>>>>> Hive's QL, metastore, serialization formats, and user-defined >>>>>>>>>> functions, >>>>>>>>>> providing seamless integration with existing Hive deployments and a >>>>>>>>>> familiar, more powerful option for new ones. [1] >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> *Apache Spark*Apache Spark is an open-source data analytics >>>>>>>>>> cluster computing framework. It fits into the Hadoop open-source >>>>>>>>>> community, >>>>>>>>>> building on top of the HDFS and promises performance up to 100 times >>>>>>>>>> faster >>>>>>>>>> than Hadoop MapReduce for certain applications. [2] >>>>>>>>>> Official documentation: [3] >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I carried out the comparison between the following Hive and Shark >>>>>>>>>> releases with input files ranging from 100 to 1 billion entries. >>>>>>>>>> >>>>>>>>>> QL Engine >>>>>>>>>> >>>>>>>>>> Apache Hive 0.11 >>>>>>>>>> >>>>>>>>>> Shark Shark 0.9.1 (Latest release) which uses, >>>>>>>>>> >>>>>>>>>> - >>>>>>>>>> >>>>>>>>>> Scala 2.10.3 >>>>>>>>>> - >>>>>>>>>> >>>>>>>>>> Spark 0.9.1 >>>>>>>>>> - >>>>>>>>>> >>>>>>>>>> AMPLab’s Hive 0.9.0 >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Framework >>>>>>>>>> >>>>>>>>>> Hadoop 1.0.4 >>>>>>>>>> Spark 0.9.1 >>>>>>>>>> >>>>>>>>>> File system >>>>>>>>>> >>>>>>>>>> HDFS >>>>>>>>>> HDFS >>>>>>>>>> >>>>>>>>>> Attached herewith is a report which describes in detail about the >>>>>>>>>> performance comparison between Shark and Hive. >>>>>>>>>> >>>>>>>>>> hive_vs_shark >>>>>>>>>> <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web> >>>>>>>>>> >>>>>>>>>> hive_vs_shark_report.odt >>>>>>>>>> <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> In summary, >>>>>>>>>> >>>>>>>>>> From the evaluation, following conclusions can be derived. >>>>>>>>>> >>>>>>>>>> - Shark is indifferent to Hive in DDL operations (CREATE, >>>>>>>>>> DROP .. TABLE, DATABASE). Both engines show a fairly constant >>>>>>>>>> performance >>>>>>>>>> as the input size increases. >>>>>>>>>> - Shark is indifferent to Hive in DML operations (LOAD, >>>>>>>>>> INSERT) but when a DML operation is called in conjuncture of a >>>>>>>>>> data >>>>>>>>>> retrieval operation (ex. INSERT <TBL> SELECT <PROP> FROM <TBL>), >>>>>>>>>> Shark >>>>>>>>>> significantly over-performs Hive with a performance factor of >>>>>>>>>> 10x+ (Ranging >>>>>>>>>> from 10x to 80x in some instances). Shark performance factor >>>>>>>>>> reduces with >>>>>>>>>> the input size increases, while HIVE performance is fairly >>>>>>>>>> indifferent. >>>>>>>>>> - Shark clearly over-performs Hive in Data Retrieval >>>>>>>>>> operations (FILTER, ORDER BY, JOIN). Hive performance is fairly >>>>>>>>>> indifferent >>>>>>>>>> in the data retrieval operations while Shark performance reduces >>>>>>>>>> as the >>>>>>>>>> input size increases. But at every instance Shark over-performed >>>>>>>>>> Hive with >>>>>>>>>> a minimum performance factor of 5x+ (Ranging from 5x to 80x in >>>>>>>>>> some >>>>>>>>>> instances). >>>>>>>>>> >>>>>>>>>> Please refer the 'hive_vs_shark_report', it has all the >>>>>>>>>> information about the queries and timings pictographically. >>>>>>>>>> >>>>>>>>>> The code repository can also be found in >>>>>>>>>> >>>>>>>>>> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Moving forward, I am currently working on the following. >>>>>>>>>> >>>>>>>>>> - Apache Spark's resilient distributed dataset (RDD) >>>>>>>>>> abstraction (which is a collection of elements partitioned across >>>>>>>>>> the nodes >>>>>>>>>> of the cluster that can be operated on in parallel). The use of >>>>>>>>>> RDDs and >>>>>>>>>> its impact to the performance. >>>>>>>>>> - Spark SQL - Use of this Spark SQL over Shark on Spark >>>>>>>>>> framework >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> [1] https://github.com/amplab/shark/wiki >>>>>>>>>> [2] http://en.wikipedia.org/wiki/Apache_Spark >>>>>>>>>> [3] http://spark.apache.org/docs/latest/ >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Would love to have your feedback on this. >>>>>>>>>> >>>>>>>>>> Best regards >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> *Niranda Perera* >>>>>>>>>> Software Engineer, WSO2 Inc. >>>>>>>>>> Mobile: +94-71-554-8430 >>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44> >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> *Anjana Fernando* >>>>>>>>> Senior Technical Lead >>>>>>>>> WSO2 Inc. | http://wso2.com >>>>>>>>> lean . enterprise . middleware >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> *Niranda Perera* >>>>>>>> Software Engineer, WSO2 Inc. >>>>>>>> Mobile: +94-71-554-8430 >>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> ============================ >>>>>>> Srinath Perera, Ph.D. >>>>>>> http://people.apache.org/~hemapani/ >>>>>>> http://srinathsview.blogspot.com/ >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> *Anjana Fernando* >>>>>> Senior Technical Lead >>>>>> WSO2 Inc. | http://wso2.com >>>>>> lean . enterprise . middleware >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> *Niranda Perera* >>>>> Software Engineer, WSO2 Inc. >>>>> Mobile: +94-71-554-8430 >>>>> Twitter: @n1r44 <https://twitter.com/N1R44> >>>>> >>>> >>>> >>>> _______________________________________________ >>>> Architecture mailing list >>>> [email protected] >>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>> >>>> >>> >>> >>> -- >>> *Lasantha Fernando* >>> Software Engineer - Data Technologies Team >>> WSO2 Inc. http://wso2.com >>> >>> email: [email protected] >>> mobile: (+94) 71 5247551 >>> >> >> >> >> -- >> *Niranda Perera* >> Software Engineer, WSO2 Inc. >> Mobile: +94-71-554-8430 >> Twitter: @n1r44 <https://twitter.com/N1R44> >> >> _______________________________________________ >> Architecture mailing list >> [email protected] >> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >> >> > > > -- > > *S. Suhothayan* > Technical Lead & Team Lead of WSO2 Complex Event Processor > *WSO2 Inc. *http://wso2.com > * <http://wso2.com/>* > lean . enterprise . middleware > > > > *cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/ > <http://suhothayan.blogspot.com/>twitter: http://twitter.com/suhothayan > <http://twitter.com/suhothayan> | linked-in: > http://lk.linkedin.com/in/suhothayan <http://lk.linkedin.com/in/suhothayan>* > > _______________________________________________ > Architecture mailing list > [email protected] > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture > >
_______________________________________________ Architecture mailing list [email protected] https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
