Hi David,
I have been going through the deep-spark examples, and they look very promising.
As a follow-up query: does deep-spark/deep-cassandra support SQL-like
operations on RDDs (as Spark SQL does)?
Example (from Datastax Cassandra connector demos):
object SQLDemo extends DemoApp {
  val cc = new CassandraSQLContext(sc)

  CassandraConnector(conf).withSessionDo { session =>
    session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH REPLICATION = " +
      "{'class': 'SimpleStrategy', 'replication_factor': 1 }")
    session.execute("DROP TABLE IF EXISTS test.sql_demo")
    session.execute("CREATE TABLE test.sql_demo (key INT PRIMARY KEY, grp INT, value DOUBLE)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (1, 1, 1.0)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (2, 1, 2.5)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (3, 1, 10.0)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (4, 2, 4.0)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (5, 2, 2.2)")
    session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (6, 2, 2.8)")
  }

  val rdd = cc.cassandraSql("SELECT grp, max(value) AS mv FROM test.sql_demo GROUP BY grp ORDER BY mv")
  rdd.collect().foreach(println) // [2, 4.0] [1, 10.0]
  sc.stop()
}
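For reference, the aggregation in that query maps directly onto the RDD operator style. Here is a plain-Scala sketch (ordinary collections standing in for RDDs, which expose the same groupBy/map/sortBy operators) of what the SELECT ... GROUP BY ... ORDER BY above computes; the rows are the same toy data as in the demo:

```scala
// Toy rows mirroring test.sql_demo: (key, grp, value)
val rows = Seq((1, 1, 1.0), (2, 1, 2.5), (3, 1, 10.0),
               (4, 2, 4.0), (5, 2, 2.2), (6, 2, 2.8))

// SELECT grp, max(value) AS mv ... GROUP BY grp ORDER BY mv
val result = rows
  .groupBy(_._2)                                      // GROUP BY grp
  .map { case (grp, rs) => (grp, rs.map(_._3).max) }  // max(value) per group
  .toSeq
  .sortBy(_._2)                                       // ORDER BY mv

result.foreach(println) // (2,4.0) then (1,10.0), matching the demo output
```

This is only an illustration of the semantics; the connector's cassandraSql call pushes the same logic through Spark's SQL layer instead.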
I also read about Stratio Crossdata. Does Crossdata serve this purpose?
Rgds
On Tue, Dec 2, 2014 at 11:14 PM, David Morales <[email protected]> wrote:
>
> Hi!
>
> Please check the develop branch if you want to see a more realistic view
> of our development path. The last commit was about two hours ago :)
>
> Stratio Deep is one of our core modules, so there is a core team at Stratio
> fully devoted to Spark + NoSQL integration. In the last few months, for
> example, we have added MongoDB, ElasticSearch and Aerospike to Stratio
> Deep, so you can talk to these databases from Spark just like you do with
> HDFS.
>
> Furthermore, we are working on more backends, such as Neo4j and CouchBase.
>
>
> About our benchmarks, you can check out some results in this link:
> http://www.stratio.com/deep-vs-datastax/
>
> Please keep in mind that Spark integration with a datastore can be done
> in two ways: HCI or native. We are now working on improving the native
> integration because it is considerably more performant. Along these lines,
> we are running some further tests with even more impressive results.
>
>
> Here you can find a technical overview of our whole platform.
>
>
> http://www.slideshare.net/Stratio/stratio-platform-overview-v41
>
> Regards
>
> 2014-12-02 11:14 GMT+01:00 Niranda Perera <[email protected]>:
>
>> Hi David,
>>
>> Sorry to revive this thread, but may I know whether you have done any
>> benchmarking of the Datastax Spark Cassandra connector against the Stratio
>> deep-spark Cassandra integration? I would love to take a look at it.
>>
>> I recently checked the deep-spark GitHub repo and noticed that there has
>> been no activity since Oct 29th. May I know your future plans for this
>> particular project?
>>
>> Cheers
>>
>> On Tue, Aug 26, 2014 at 9:12 PM, David Morales <[email protected]>
>> wrote:
>>
>>> Yes, it is already included in our benchmarks.
>>>
>>> Sharing our findings could be a nice idea; let me discuss it here.
>>> Meanwhile, you can ask us any questions via my mail or this
>>> thread; we are glad to help you.
>>>
>>>
>>> Best regards.
>>>
>>>
>>> 2014-08-24 15:49 GMT+02:00 Niranda Perera <[email protected]>:
>>>
>>>> Hi David,
>>>>
>>>> Thank you for your detailed reply.
>>>>
>>>> It was great to hear about Stratio-Deep and I must say, it looks very
>>>> interesting. Storage handlers for databases such as Cassandra, MongoDB,
>>>> etc., would be very helpful. We will definitely look into Stratio-Deep.
>>>>
>>>> I came across the Datastax Spark-Cassandra connector (
>>>> https://github.com/datastax/spark-cassandra-connector ). Have you done
>>>> any comparison between your implementation and Datastax's connector?
>>>>
>>>> And, yes, please do share the performance results with us once it's
>>>> ready.
>>>>
>>>> On a different note, is there any way for us to interact with the Stratio
>>>> dev community, e.g. through dev mailing lists, so that we could mutually
>>>> share our findings?
>>>>
>>>> Best regards
>>>>
>>>>
>>>>
>>>> On Fri, Aug 22, 2014 at 2:07 PM, David Morales <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> *1. About the size of deployments.*
>>>>>
>>>>> It depends on your use case... especially when you combine Spark with a
>>>>> datastore. We usually deploy Spark with Cassandra or MongoDB, instead of
>>>>> using HDFS, for example.
>>>>>
>>>>> Spark will be faster if you put the data in memory, so if you need a
>>>>> lot of speed (interactive queries, for example), you should have enough
>>>>> memory.
>>>>>
>>>>>
>>>>> *2. About storage handlers.*
>>>>>
>>>>> We have developed the first tight integration between Cassandra and
>>>>> Spark, called Stratio Deep, announced at the first Spark Summit. You can
>>>>> check Stratio Deep out here: https://github.com/Stratio/stratio-deep
>>>>> (open source,
>>>>> Apache 2 license).
>>>>>
>>>>> *Deep is a thin integration layer between Apache Spark and several
>>>>> NoSQL datastores. We currently support Apache Cassandra and MongoDB, but
>>>>> in the near future we will add support for several other datastores.*
>>>>>
>>>>> Datastax announced its own driver for Spark at the last Spark
>>>>> Summit, but we have been working on our solution for almost a year.
>>>>>
>>>>> Furthermore, we are working to extend this solution to other
>>>>> databases as well... MongoDB integration is complete right
>>>>> now and ElasticSearch will be ready in a few weeks.
>>>>>
>>>>> And that is not all: we have also developed an integration between
>>>>> Cassandra and Lucene for indexing data (open source, Apache 2).
>>>>>
>>>>> *Stratio Cassandra is a fork of Apache Cassandra
>>>>> <http://cassandra.apache.org/> where the index functionality has been
>>>>> extended to provide near real time search, as in ElasticSearch or Solr,
>>>>> including full text search
>>>>> <http://en.wikipedia.org/wiki/Full_text_search> capabilities and free
>>>>> multivariable search. This is achieved through an Apache Lucene
>>>>> <http://lucene.apache.org/> based implementation of Cassandra secondary
>>>>> indexes, where each node of the cluster indexes its own data.*
>>>>>
>>>>>
>>>>> We will publish some benchmarks in two weeks, so I will share our
>>>>> results here if you are interested.
>>>>>
>>>>>
>>>>> If you are more interested in distributed file systems, you should
>>>>> take a look at Tachyon: http://tachyon-project.org/index.html
>>>>>
>>>>>
>>>>> *3. Spark - Hive compatibility*
>>>>>
>>>>> Spark will support anything with the Hadoop InputFormat interface.
>>>>>
>>>>>
>>>>> *4. Performance*
>>>>>
>>>>> We are working a lot with Cassandra and MongoDB, and the performance is
>>>>> quite nice. We are right now finishing some benchmarks comparing Hadoop +
>>>>> HDFS vs Spark + HDFS vs Spark + Cassandra (using Stratio Deep and even
>>>>> our fork of Cassandra).
>>>>>
>>>>> Let me share these results with you when they are ready, OK?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Regards.
>>>>>
>>>>> 2014-08-22 7:53 GMT+02:00 Niranda Perera <[email protected]>:
>>>>>
>>>>>> Hi Srinath,
>>>>>> Yes, I am working on deploying it on a multi-node cluster with the
>>>>>> debs dataset. I will keep architecture@ posted on the progress.
>>>>>>
>>>>>>
>>>>>> Hi David,
>>>>>> Thank you very much for the detailed insight you've provided.
>>>>>> A few quick questions:
>>>>>> 1. Do you have experience using storage handlers in Spark?
>>>>>> 2. Would a storage handler used in Hive be directly compatible with
>>>>>> Spark?
>>>>>> 3. How would you rate the performance of Spark with other databases
>>>>>> such as Cassandra, HBase, H2, etc.?
>>>>>>
>>>>>> Thank you very much again for your interest. Look forward to hearing
>>>>>> from you.
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 21, 2014 at 7:02 PM, Srinath Perera <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Niranda, we need to test Spark in multi-node mode before making a
>>>>>>> decision. Spark is very fast, I think there is no doubt about that. We
>>>>>>> need
>>>>>>> to make sure it is stable.
>>>>>>>
>>>>>>> David, thanks for a detailed email! How big (nodes) is the Spark
>>>>>>> setup you guys are running?
>>>>>>>
>>>>>>> --Srinath
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 21, 2014 at 1:34 PM, David Morales <[email protected]
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> Sorry for intruding on this thread, but I think I can help by
>>>>>>>> clarifying a few things (we attended the last Spark Summit, where we
>>>>>>>> were
>>>>>>>> also speakers, and we work very closely with Spark).
>>>>>>>>
>>>>>>>> *> Hive/Shark and others benchmark*
>>>>>>>>
>>>>>>>> You can find a nice comparison and benchmark on this site:
>>>>>>>> https://amplab.cs.berkeley.edu/benchmark/
>>>>>>>>
>>>>>>>>
>>>>>>>> *> Shark and SparkSQL*
>>>>>>>>
>>>>>>>> SparkSQL is the natural replacement for Shark, but SparkSQL is
>>>>>>>> still young at this moment. If you are looking for Hive compatibility,
>>>>>>>> you
>>>>>>>> have to execute SparkSQL with a specific context.
>>>>>>>>
>>>>>>>> Quoted from spark website:
>>>>>>>>
>>>>>>>> *> Note that Spark SQL currently uses a very basic SQL
>>>>>>>> parser. Users that want a more complete dialect of SQL should look at
>>>>>>>> the
>>>>>>>> HiveQL support provided by HiveContext.*
>>>>>>>>
>>>>>>>> So, just note that SparkSQL is a work in progress. If you want
>>>>>>>> SparkSQL you have to run a SQLContext; if you want Hive, you will
>>>>>>>> use
>>>>>>>> a different context (HiveContext)...
>>>>>>>>
>>>>>>>>
>>>>>>>> *> Spark - Hadoop: the future*
>>>>>>>>
>>>>>>>> Most Hadoop distributions are including Spark: Cloudera,
>>>>>>>> Hortonworks, MapR... and are contributing to migrating the whole
>>>>>>>> Hadoop ecosystem
>>>>>>>> to Spark.
>>>>>>>>
>>>>>>>> Spark is a bit more than Map/Reduce... as you can read here:
>>>>>>>> http://gigaom.com/2014/06/28/4-reasons-why-spark-could-jolt-hadoop-into-hyperdrive/
>>>>>>>>
>>>>>>>>
>>>>>>>> *> Spark Streaming / Spark SQL*
>>>>>>>>
>>>>>>>> Spark Streaming is built on Spark and provides stream
>>>>>>>> processing through an abstraction called DStreams (a
>>>>>>>> collection
>>>>>>>> of RDDs over a window of time).
>>>>>>>>
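The DStream idea above can be illustrated with plain Scala collections. This is only a toy sketch of the micro-batch/window semantics, not the Spark Streaming API, and the batch contents are made up:

```scala
// A stream chopped into micro-batches (each inner Seq is one batch interval)
val batches = Seq(Seq(1, 2), Seq(3), Seq(4, 5), Seq(6))

// A window covering 2 batch intervals, sliding forward 1 interval at a time,
// analogous to a DStream window of 2 * batchInterval
val windows = batches.sliding(2).map(_.flatten).toList

windows.foreach(println)
// List(1, 2, 3)
// List(3, 4, 5)
// List(4, 5, 6)
```

Each window is just the union of the RDDs (here, plain Seqs) that fall inside it, which is why DStream operations reduce to ordinary RDD operations per window.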
>>>>>>>> There are some efforts to make SparkSQL compatible with
>>>>>>>> Spark Streaming (something similar to Trident for Storm), as you can
>>>>>>>> see
>>>>>>>> here:
>>>>>>>>
>>>>>>>> *StreamSQL (https://github.com/thunderain-project/StreamSQL
>>>>>>>> <https://github.com/thunderain-project/StreamSQL>) is a POC project
>>>>>>>> based
>>>>>>>> on Spark that combines the power of Catalyst and Spark Streaming, to
>>>>>>>> offer people the ability to manipulate SQL on top of DStreams as you
>>>>>>>> wanted; this
>>>>>>>> keeps the same semantics as SparkSQL and offers a SchemaDStream on top
>>>>>>>> of
>>>>>>>> DStream. You don't need to do tricky things like extracting the RDD to
>>>>>>>> register
>>>>>>>> it as a table. Besides that, other parts are the same as Spark.*
>>>>>>>>
>>>>>>>> So, you can apply SQL to a data stream, but it is very basic at
>>>>>>>> the moment... you can expect a bunch of improvements in this area in
>>>>>>>> the
>>>>>>>> coming months (I guess that SparkSQL will work on Spark Streaming
>>>>>>>> streams before the end of this year).
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *> Spark Streaming / Spark SQL and CEP*
>>>>>>>>
>>>>>>>> There is no relationship at this moment between (your absolutely
>>>>>>>> amazing) Siddhi CEP and Spark. As far as I know, you are working on
>>>>>>>> doing
>>>>>>>> distributed CEP with Storm and Siddhi.
>>>>>>>>
>>>>>>>> We are currently working on an interactive CEP built with
>>>>>>>> Kafka + Spark Streaming + Siddhi, with some features such as an API, an
>>>>>>>> interactive shell, built-in statistics and auditing, and built-in
>>>>>>>> functions (save2cassandra, save2mongo, save2elasticsearch...).
>>>>>>>>
>>>>>>>> If you are interested, we can talk about this project; I think
>>>>>>>> it would be a nice idea!
>>>>>>>>
>>>>>>>>
>>>>>>>> Anyway, I don't think that SparkSQL will evolve into something like a
>>>>>>>> CEP. Patterns and sequences, for example, would be very complex to do
>>>>>>>> with Spark Streaming (at least for now).
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-08-21 6:18 GMT+02:00 Sriskandarajah Suhothayan <[email protected]>
>>>>>>>> :
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Aug 20, 2014 at 1:36 PM, Niranda Perera <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> @Maninda,
>>>>>>>>>>
>>>>>>>>>> +1 for suggesting Spark SQL.
>>>>>>>>>>
>>>>>>>>>> Quote Databricks,
>>>>>>>>>> "Spark SQL provides state-of-the-art SQL performance and
>>>>>>>>>> maintains compatibility with Shark/Hive. In particular, like Shark,
>>>>>>>>>> Spark
>>>>>>>>>> SQL supports all existing Hive data formats, user-defined functions
>>>>>>>>>> (UDF),
>>>>>>>>>> and the Hive metastore." [1]
>>>>>>>>>>
>>>>>>>>>> But I am not entirely sure if Spark SQL and Siddhi are comparable,
>>>>>>>>>> because SparkSQL (like Hive) is designed for batch processing,
>>>>>>>>>> whereas
>>>>>>>>>> Siddhi does real-time processing. But if there are implementations
>>>>>>>>>> where
>>>>>>>>>> Siddhi is run on top of Spark, it would be very interesting.
>>>>>>>>>>
>>>>>>>>> Yes, Siddhi's current way of operation does not support this. But
>>>>>>>>> with partitions we can achieve this to some extent.
>>>>>>>>>
>>>>>>>>> Suho
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Spark supports either Hadoop 1 or 2. But I think we should see
>>>>>>>>>> which is best: MR1 or YARN+MR2.
>>>>>>>>>>
>>>>>>>>>> [image: Hadoop Architecture]
>>>>>>>>>> [2]
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
>>>>>>>>>> [2] http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 20, 2014 at 1:13 PM, Lasantha Fernando <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Maninda,
>>>>>>>>>>>
>>>>>>>>>>> On 20 August 2014 12:02, Maninda Edirisooriya <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Given the discontinuation of the Shark project, IMO we should
>>>>>>>>>>>> not move to Shark at all.
>>>>>>>>>>>> And it seems better to go with Spark SQL, as we are already
>>>>>>>>>>>> using Spark for CEP. But I am not sure of the difference between
>>>>>>>>>>>> Spark SQL and
>>>>>>>>>>>> Siddhi queries on the Spark engine.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Currently, we are doing the integration with CEP using Apache
>>>>>>>>>>> Storm, not Spark... :-). Spark Streaming is a possible candidate
>>>>>>>>>>> for integrating
>>>>>>>>>>> with CEP, but we have opted for Storm. I think there has been some
>>>>>>>>>>> independent work on integrating Kafka + Spark Streaming + Siddhi.
>>>>>>>>>>> Please
>>>>>>>>>>> refer to the thread on arch@ "[Architecture] A few questions about
>>>>>>>>>>> WSO2 CEP/Siddhi".
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> And we have to figure out how Spark SQL can be used for
>>>>>>>>>>>> historical data, and whether it can execute incremental
>>>>>>>>>>>> processing by default, which would
>>>>>>>>>>>> cover all our existing BAM use cases.
>>>>>>>>>>>> On the other hand, Hadoop 2 [1] uses a completely
>>>>>>>>>>>> different platform for resource allocation known as YARN.
>>>>>>>>>>>> Sometimes this
>>>>>>>>>>>> may be more suitable for batch jobs.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://www.youtube.com/watch?v=RncoVN0l6dc
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Lasantha
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *Maninda Edirisooriya*
>>>>>>>>>>>> Senior Software Engineer
>>>>>>>>>>>>
>>>>>>>>>>>> *WSO2, Inc. *lean.enterprise.middleware.
>>>>>>>>>>>>
>>>>>>>>>>>> *Blog* : http://maninda.blogspot.com/
>>>>>>>>>>>> *E-mail* : [email protected]
>>>>>>>>>>>> *Skype* : @manindae
>>>>>>>>>>>> *Twitter* : @maninda
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Anjana and Srinath,
>>>>>>>>>>>>>
>>>>>>>>>>>>> After the discussion I had with Anjana, I researched more on
>>>>>>>>>>>>> the continuation of Shark project by Databricks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's what I found out,
>>>>>>>>>>>>> - Shark was built on the Hive codebase and achieved
>>>>>>>>>>>>> performance improvements by swapping out the physical execution
>>>>>>>>>>>>> engine part
>>>>>>>>>>>>> of Hive. While this approach enabled Shark users to speed up
>>>>>>>>>>>>> their Hive
>>>>>>>>>>>>> queries, Shark inherited a large, complicated code base from Hive
>>>>>>>>>>>>> that made
>>>>>>>>>>>>> it hard to optimize and maintain.
>>>>>>>>>>>>> Hence, Databricks has announced that they are halting the
>>>>>>>>>>>>> development of Shark as of July 2014 (Shark 0.9 will be the
>>>>>>>>>>>>> last release).
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> - Shark will be replaced by Spark SQL. It beats Shark in TPC-DS
>>>>>>>>>>>>> performance
>>>>>>>>>>>>> <http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html>
>>>>>>>>>>>>> by almost an order of magnitude. It also supports all existing
>>>>>>>>>>>>> Hive data
>>>>>>>>>>>>> formats, user-defined functions (UDF), and the Hive metastore.
>>>>>>>>>>>>> [2]
>>>>>>>>>>>>> - Following is the Shark, Spark SQL migration plan
>>>>>>>>>>>>> http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf
>>>>>>>>>>>>>
>>>>>>>>>>>>> - For legacy Hive and MapReduce users, they have proposed
>>>>>>>>>>>>> a new 'Hive on Spark' project [3], [4].
>>>>>>>>>>>>> But, given the performance enhancement, it is quite certain
>>>>>>>>>>>>> that Hive and MR will be replaced by engines built on top of
>>>>>>>>>>>>> Spark (e.g.
>>>>>>>>>>>>> Spark SQL).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> In my opinion, there are a few matters to figure out if we are
>>>>>>>>>>>>> migrating from Hive:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Are we changing the query engine only? (Then we
>>>>>>>>>>>>> can replace Hive with Shark.)
>>>>>>>>>>>>> 2. Are we changing the existing Hadoop/MapReduce
>>>>>>>>>>>>> framework to Spark? (Then we can replace Hive and Hadoop with
>>>>>>>>>>>>> Spark and
>>>>>>>>>>>>> Spark SQL.)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> In my opinion, considering the long-term impact and the
>>>>>>>>>>>>> availability of support, it is best to migrate from Hive/Hadoop
>>>>>>>>>>>>> to Spark.
>>>>>>>>>>>>> This is open for discussion!
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the meantime, I've already tried Spark SQL, and Databricks'
>>>>>>>>>>>>> claims of improved performance seem to be true. I will work more
>>>>>>>>>>>>> on this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
>>>>>>>>>>>>> [2]
>>>>>>>>>>>>> http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
>>>>>>>>>>>>> [3] https://issues.apache.org/jira/browse/HIVE-7292
>>>>>>>>>>>>> [4]
>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Srinath,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> No, this has not been tested on multiple nodes. I asked
>>>>>>>>>>>>>> Niranda in my last mail to test a cluster with the same
>>>>>>>>>>>>>> hardware we are using to test our large data set
>>>>>>>>>>>>>> with Hive.
>>>>>>>>>>>>>> As for the effort to make the change, we still have to figure
>>>>>>>>>>>>>> out the MT
>>>>>>>>>>>>>> aspects of Shark here. Sinthuja was working on making the latest
>>>>>>>>>>>>>> Hive
>>>>>>>>>>>>>> version MT-ready, and most probably we can make the same changes
>>>>>>>>>>>>>> to the Hive
>>>>>>>>>>>>>> version Shark is using. So after we do that, the integration
>>>>>>>>>>>>>> should be
>>>>>>>>>>>>>> seamless. And also, as I mentioned earlier here, we are also
>>>>>>>>>>>>>> going to test
>>>>>>>>>>>>>> this with the APIM Hive script, to check whether there are any
>>>>>>>>>>>>>> unforeseen
>>>>>>>>>>>>>> incompatibilities.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Anjana.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This look great.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We need to test Spark with multiple nodes. Did we do that?
>>>>>>>>>>>>>>> Please create a few VMs in the performance cloud (talk to
>>>>>>>>>>>>>>> Lakmal) and test with
>>>>>>>>>>>>>>> at least 5 nodes. We need to make sure it works OK in a
>>>>>>>>>>>>>>> distributed setup
>>>>>>>>>>>>>>> as well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What does it take to change to Spark? Anjana, how much
>>>>>>>>>>>>>>> work is it?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --Srinath
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you Anjana.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, I am working on it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In the meantime, I found this in the Hive documentation [1].
>>>>>>>>>>>>>>>> It talks about Hive on Spark, and compares Hive, Shark and
>>>>>>>>>>>>>>>> Spark SQL at a
>>>>>>>>>>>>>>>> higher architectural level.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Additionally, it is said that the in-memory performance of
>>>>>>>>>>>>>>>> Shark can be improved by introducing Tachyon [2]. I guess we
>>>>>>>>>>>>>>>> can consider
>>>>>>>>>>>>>>>> this later on.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL
>>>>>>>>>>>>>>>> [2] http://tachyon-project.org/Running-Tachyon-Locally.html
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Niranda,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Excellent analysis of Hive vs Shark! This gives a lot
>>>>>>>>>>>>>>>>> of insight into how both operate in different scenarios. As
>>>>>>>>>>>>>>>>> the next step,
>>>>>>>>>>>>>>>>> we will need to run this in an actual cluster of computers.
>>>>>>>>>>>>>>>>> we will need to run this in an actual cluster of computers.
>>>>>>>>>>>>>>>>> Since you've
>>>>>>>>>>>>>>>>> used a subset of the dataset of 2014 DEBS challenge, we
>>>>>>>>>>>>>>>>> should use the full
>>>>>>>>>>>>>>>>> data set in a clustered environment and check this. Gokul is
>>>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>>> working on the Hive based setup for this, after that is done,
>>>>>>>>>>>>>>>>> you can
>>>>>>>>>>>>>>>>> create a Shark cluster in the same hardware and run the tests
>>>>>>>>>>>>>>>>> there, to get
>>>>>>>>>>>>>>>>> a clear comparison on how these two match up in a cluster.
>>>>>>>>>>>>>>>>> Until the setup
>>>>>>>>>>>>>>>>> is ready, do continue with your next steps on checking the
>>>>>>>>>>>>>>>>> RDD support and
>>>>>>>>>>>>>>>>> Spark SQL use.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> After these are done, we should also do a trial run of our
>>>>>>>>>>>>>>>>> own APIM Hive scripts, migrated to Shark.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Anjana.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have been evaluating the performance of
>>>>>>>>>>>>>>>>>> Shark (distributed SQL query engine for Hadoop) against
>>>>>>>>>>>>>>>>>> Hive. This is with
>>>>>>>>>>>>>>>>>> the objective of seeing the possibility to move the WSO2 BAM
>>>>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>> processing (which currently uses Hive) to Shark (and Apache
>>>>>>>>>>>>>>>>>> Spark) for
>>>>>>>>>>>>>>>>>> improved performance.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I am sharing my findings herewith.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *AMP Lab Shark*
>>>>>>>>>>>>>>>>>> Shark can execute Hive QL queries up to 100 times faster
>>>>>>>>>>>>>>>>>> than Hive without any modification to the existing data or
>>>>>>>>>>>>>>>>>> queries. It
>>>>>>>>>>>>>>>>>> supports Hive's QL, metastore, serialization formats, and
>>>>>>>>>>>>>>>>>> user-defined
>>>>>>>>>>>>>>>>>> functions, providing seamless integration with existing Hive
>>>>>>>>>>>>>>>>>> deployments
>>>>>>>>>>>>>>>>>> and a familiar, more powerful option for new ones. [1]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *Apache Spark*
>>>>>>>>>>>>>>>>>> Apache Spark is an open-source data
>>>>>>>>>>>>>>>>>> analytics cluster computing framework. It fits into the
>>>>>>>>>>>>>>>>>> Hadoop open-source
>>>>>>>>>>>>>>>>>> community, building on top of HDFS, and promises
>>>>>>>>>>>>>>>>>> performance up to 100
>>>>>>>>>>>>>>>>>> times faster than Hadoop MapReduce for certain applications.
>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>> Official documentation: [3]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I carried out the comparison between the following Hive
>>>>>>>>>>>>>>>>>> and Shark releases with input files ranging from 100 to 1
>>>>>>>>>>>>>>>>>> billion entries.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>               Hive setup          Shark setup
>>>>>>>>>>>>>>>>>> QL engine     Apache Hive 0.11    Shark 0.9.1 (latest release),
>>>>>>>>>>>>>>>>>>                                   which uses Scala 2.10.3,
>>>>>>>>>>>>>>>>>>                                   Spark 0.9.1 and AMPLab's
>>>>>>>>>>>>>>>>>>                                   Hive 0.9.0
>>>>>>>>>>>>>>>>>> Framework     Hadoop 1.0.4        Spark 0.9.1
>>>>>>>>>>>>>>>>>> File system   HDFS                HDFS
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Attached herewith is a report which describes in detail
>>>>>>>>>>>>>>>>>> about the performance comparison between Shark and Hive.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> hive_vs_shark
>>>>>>>>>>>>>>>>>> <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> hive_vs_shark_report.odt
>>>>>>>>>>>>>>>>>> <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In summary,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> From the evaluation, following conclusions can be derived.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Shark is on par with Hive in DDL operations (CREATE,
>>>>>>>>>>>>>>>>>> DROP .. TABLE, DATABASE). Both engines show fairly constant
>>>>>>>>>>>>>>>>>> performance as the input size increases.
>>>>>>>>>>>>>>>>>> - Shark is on par with Hive in plain DML operations (LOAD,
>>>>>>>>>>>>>>>>>> INSERT), but when a DML operation is called in conjunction
>>>>>>>>>>>>>>>>>> with a data retrieval operation (e.g. INSERT <TBL> SELECT
>>>>>>>>>>>>>>>>>> <PROP> FROM <TBL>), Shark significantly outperforms Hive
>>>>>>>>>>>>>>>>>> with a performance factor of 10x+ (ranging from 10x to 80x
>>>>>>>>>>>>>>>>>> in some instances). Shark's performance factor decreases as
>>>>>>>>>>>>>>>>>> the input size increases, while Hive's performance is fairly
>>>>>>>>>>>>>>>>>> constant.
>>>>>>>>>>>>>>>>>> - Shark clearly outperforms Hive in data retrieval
>>>>>>>>>>>>>>>>>> operations (FILTER, ORDER BY, JOIN). Hive's performance is
>>>>>>>>>>>>>>>>>> fairly constant across the data retrieval operations, while
>>>>>>>>>>>>>>>>>> Shark's performance degrades as the input size increases.
>>>>>>>>>>>>>>>>>> But in every instance Shark outperformed Hive with a minimum
>>>>>>>>>>>>>>>>>> performance factor of 5x+ (ranging from 5x to 80x in some
>>>>>>>>>>>>>>>>>> instances).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please refer to the 'hive_vs_shark_report'; it presents all
>>>>>>>>>>>>>>>>>> the information about the queries and timings graphically.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The code repository can be found at
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Moving forward, I am currently working on the following.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Apache Spark's resilient distributed dataset (RDD)
>>>>>>>>>>>>>>>>>> abstraction (a collection of elements partitioned across
>>>>>>>>>>>>>>>>>> the nodes of the cluster that can be operated on in
>>>>>>>>>>>>>>>>>> parallel): the use of RDDs and their impact on performance.
>>>>>>>>>>>>>>>>>> - Spark SQL: the use of Spark SQL instead of Shark on the
>>>>>>>>>>>>>>>>>> Spark framework.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [1] https://github.com/amplab/shark/wiki
>>>>>>>>>>>>>>>>>> [2] http://en.wikipedia.org/wiki/Apache_Spark
>>>>>>>>>>>>>>>>>> [3] http://spark.apache.org/docs/latest/
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Would love to have your feedback on this.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best regards
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> *Niranda Perera*
>>>>>>>>>>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>>>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>>>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> *Anjana Fernando*
>>>>>>>>>>>>>>>>> Senior Technical Lead
>>>>>>>>>>>>>>>>> WSO2 Inc. | http://wso2.com
>>>>>>>>>>>>>>>>> lean . enterprise . middleware
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> *Niranda Perera*
>>>>>>>>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> ============================
>>>>>>>>>>>>>>> Srinath Perera, Ph.D.
>>>>>>>>>>>>>>> http://people.apache.org/~hemapani/
>>>>>>>>>>>>>>> http://srinathsview.blogspot.com/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> *Anjana Fernando*
>>>>>>>>>>>>>> Senior Technical Lead
>>>>>>>>>>>>>> WSO2 Inc. | http://wso2.com
>>>>>>>>>>>>>> lean . enterprise . middleware
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> *Niranda Perera*
>>>>>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Architecture mailing list
>>>>>>>>>>>> [email protected]
>>>>>>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> *Lasantha Fernando*
>>>>>>>>>>> Software Engineer - Data Technologies Team
>>>>>>>>>>> WSO2 Inc. http://wso2.com
>>>>>>>>>>>
>>>>>>>>>>> email: [email protected]
>>>>>>>>>>> mobile: (+94) 71 5247551
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> *Niranda Perera*
>>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> *S. Suhothayan*
>>>>>>>>> Technical Lead & Team Lead of WSO2 Complex Event Processor
>>>>>>>>> *WSO2 Inc. *http://wso2.com
>>>>>>>>> lean . enterprise . middleware
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/
>>>>>>>>> twitter: http://twitter.com/suhothayan | linked-in:
>>>>>>>>> http://lk.linkedin.com/in/suhothayan*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> ============================
>>>>>>> Srinath Perera, Ph.D.
>>>>>>> http://people.apache.org/~hemapani/
>>>>>>> http://srinathsview.blogspot.com/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Niranda Perera*
>>>>>> Software Engineer, WSO2 Inc.
>>>>>> Mobile: +94-71-554-8430
>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> *Niranda Perera*
>>>> Software Engineer, WSO2 Inc.
>>>> Mobile: +94-71-554-8430
>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>
>>>
>>>
>>
>>
>> --
>> *Niranda Perera*
>> Software Engineer, WSO2 Inc.
>> Mobile: +94-71-554-8430
>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>
>
>
>
> --
>
> David Morales de Frías :: +34 607 010 411 :: @dmoralesdf
> <https://twitter.com/dmoralesdf>
>
>
> <http://www.stratio.com/>
> Avenida de Europa, 26. Ática 5. 2ª Planta
> 28224 Pozuelo de Alarcón, Madrid
> Tel: +34 91 352 59 42 // *@stratiobd <https://twitter.com/StratioBD>*
>
--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture