On Aug 13, 2014 3:18 PM, "Anjana Fernando" <[email protected]> wrote:
>
> Hi Niranda,
>
> Excellent analysis of Hive vs. Shark! This gives a lot of insight into
how both operate in different scenarios. As the next step, we will need to
run this on an actual cluster of computers. Since you've used a subset of
the 2014 DEBS challenge dataset, we should use the full dataset in a
clustered environment and check this. Gokul is already working on the
Hive-based setup for this. Once that is done, you can create a Shark
cluster on the same hardware and run the tests there, to get a clear
comparison of how the two match up in a cluster. Until the setup is ready,
do continue with your next steps on checking the RDD support and Spark SQL
use.
>
> After these are done, we should also do a trial run of our own APIM Hive
scripts, migrated to Shark.

Do we need to migrate? I thought existing Hive scripts can run as they are.
First of all, we need to create a large data set of API stats.

>
> Cheers,
> Anjana.
>
>
> On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <[email protected]> wrote:
>>
>> Hi all,
>>
>> I have been evaluating the performance of Shark (a distributed SQL query
engine for Hadoop) against Hive. The objective is to assess whether the
WSO2 BAM data processing (which currently uses Hive) can be moved to Shark
(and Apache Spark) for improved performance.
>>
>> I am sharing my findings herewith.
>>
>> AMP Lab Shark
>> Shark can execute Hive QL queries up to 100 times faster than Hive
without any modification to the existing data or queries. It supports
Hive's QL, metastore, serialization formats, and user-defined functions,
providing seamless integration with existing Hive deployments and a
familiar, more powerful option for new ones. [1]
>>
>> Apache Spark
>> Apache Spark is an open-source data analytics cluster computing
framework. It fits into the Hadoop open-source community, building on top
of the HDFS and promises performance up to 100 times faster than Hadoop
MapReduce for certain applications. [2]
>> Official documentation: [3]
>>
>>
>> I carried out the comparison between the following Hive and Shark
releases with input files ranging from 100 to 1 billion entries.
>>
>>               Hive setup          Shark setup
>> QL engine     Apache Hive 0.11    Shark 0.9.1 (latest release), which uses
>>                                   Scala 2.10.3, Spark 0.9.1, and AMPLab's
>>                                   Hive 0.9.0
>> Framework     Hadoop 1.0.4        Spark 0.9.1
>> File system   HDFS                HDFS
>>
>>
>> Attached herewith is a report that describes the performance comparison
between Shark and Hive in detail.
>>
>>  hive_vs_shark
>>  hive_vs_shark_report.odt
>>
>> In summary,
>>
>> The following conclusions can be drawn from the evaluation.
>> Shark performs on par with Hive in DDL operations (CREATE, DROP .. TABLE,
DATABASE). Both engines show fairly constant performance as the input
size increases.
>> Shark also performs on par with Hive in DML operations (LOAD, INSERT).
However, when a DML operation is combined with a data retrieval operation
(e.g. INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark significantly
outperforms Hive, by a factor of 10x or more (ranging from 10x to 80x in
some instances). Shark's performance factor decreases as the input size
increases, while Hive's performance is largely unaffected.
>> Shark clearly outperforms Hive in data retrieval operations (FILTER,
ORDER BY, JOIN). Hive's performance is largely unaffected in these
operations, while Shark's performance decreases as the input size
increases. Even so, Shark outperformed Hive in every instance, by a factor
of at least 5x (ranging from 5x to 80x in some instances).
>> Please refer to the 'hive_vs_shark_report'; it presents all the
information about the queries and timings graphically.
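
A minimal sketch of how a performance factor like the ones quoted above could be computed. It assumes the factor means Hive's elapsed time divided by Shark's for the same query and input size; the thread does not define it, so that reading is an assumption. The function name and the timings are hypothetical placeholders, not measurements from the report.

```python
def performance_factor(hive_seconds: float, shark_seconds: float) -> float:
    """Return how many times faster Shark ran than Hive (assumed definition)."""
    if hive_seconds <= 0 or shark_seconds <= 0:
        raise ValueError("elapsed times must be positive")
    return hive_seconds / shark_seconds


# Hypothetical timings in seconds, keyed by input size (number of entries),
# as (hive_seconds, shark_seconds). Placeholders only, not report results.
hypothetical_timings = {
    1_000: (20.0, 0.5),
    1_000_000: (40.0, 4.0),
}

factors = {
    size: performance_factor(hive, shark)
    for size, (hive, shark) in hypothetical_timings.items()
}

for size, factor in factors.items():
    print(f"{size:>12,} entries: {factor:.1f}x")
```

With placeholder numbers of this shape, the factor shrinks as the input grows even though Shark stays ahead, which is the pattern the summary describes.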
>>
>> The code repository can also be found at
>> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>>
>> Moving forward, I am currently working on the following.
>> Apache Spark's resilient distributed dataset (RDD) abstraction (a
collection of elements partitioned across the nodes of the cluster that
can be operated on in parallel): the use of RDDs and their impact on
performance.
>> Spark SQL: the use of Spark SQL in place of Shark on the Spark framework.
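
To make the RDD description above concrete, here is a rough plain-Python analogy of "a collection of elements partitioned across nodes and operated on in parallel". It does not use the Spark API: local threads stand in for cluster nodes, and the helper names `partition` and `parallel_sum_of_squares` are hypothetical, chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(elements, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(elements), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(elements[start:end])
        start = end
    return chunks


def parallel_sum_of_squares(elements, num_partitions=4):
    """Square and sum each partition on its own worker, then combine."""
    chunks = partition(elements, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # Each worker sees only its own partition, like a task on one node.
        partial_sums = list(
            pool.map(lambda chunk: sum(x * x for x in chunk), chunks)
        )
    # Combining the per-partition results corresponds to the final reduce step.
    return sum(partial_sums)


print(parallel_sum_of_squares(list(range(10))))  # prints 285
```

The sketch covers only the partition, per-partition work, and combine steps; it leaves out what makes RDDs resilient (recomputing lost partitions) and distributed.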
>>
>> [1] https://github.com/amplab/shark/wiki
>> [2] http://en.wikipedia.org/wiki/Apache_Spark
>> [3] http://spark.apache.org/docs/latest/
>>
>>
>>
>> Would love to have your feedback on this.
>>
>> Best regards
>>
>> --
>> Niranda Perera
>> Software Engineer, WSO2 Inc.
>> Mobile: +94-71-554-8430
>> Twitter: @n1r44
>
>
>
>
> --
> Anjana Fernando
> Senior Technical Lead
> WSO2 Inc. | http://wso2.com
> lean . enterprise . middleware
>
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
