[jira] [Commented] (HAMA-990) GSoC'16: Apache Hama benchmark against Spark and Flink

Behroz Sikander (JIRA) Thu, 19 May 2016 16:55:24 -0700

    [ https://issues.apache.org/jira/browse/HAMA-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15292377#comment-15292377 ]


Behroz Sikander commented on HAMA-990:
--------------------------------------

>> I personally recommend you don't spend much time for other trivial bug fixes.
Okay. I am very close to understanding it completely, but I will shift my 
focus to the main goal, as you suggested.

Regarding the main goal, I think we should benchmark Hama on the following 
types of workloads:
1- Batch
2- Iterative
3- Graph
4- Query Processing

The proposed algorithm for each type is:
1- Batch - Word Count
2- Iterative/ML - K-Means
3- Graph - PageRank
4- Query Processing - We can use MRQL for this and run a scan/join on a 
dataset [2].

According to [1] and [3], Apache Flink is faster than Spark on K-Means, 
PageRank and Query Processing, whereas Spark is faster on Word Count. We can 
first reproduce these results on our cluster and then run the same benchmarks 
on Hama. Once we have all the results, we can compare the three systems.
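
To make the comparison step concrete, here is a minimal sketch of how the 
collected runtimes could be summarized. The numbers are hypothetical 
placeholders, not measurements, and the name summarize and the choice of Hama 
as the baseline are my assumptions, not something we have agreed on.

```python
from statistics import median

# Hypothetical wall-clock runtimes in seconds (placeholders, NOT measurements).
# Each (algorithm, system) pair holds the times of repeated runs.
runs = {
    ("Word Count", "Hama"): [12.0, 11.0, 13.0],
    ("Word Count", "Spark"): [10.0, 9.0, 11.0],
    ("Word Count", "Flink"): [14.0, 15.0, 13.0],
}


def summarize(runs, baseline="Hama"):
    """Return {(algorithm, system): (median_seconds, ratio_to_baseline)}."""
    medians = {key: median(times) for key, times in runs.items()}
    summary = {}
    for (algorithm, system), med in medians.items():
        # Ratio below 1.0 means the system was faster than the baseline.
        base = medians[(algorithm, baseline)]
        summary[(algorithm, system)] = (med, med / base)
    return summary


summary = summarize(runs)
for (algorithm, system), (med, ratio) in sorted(summary.items()):
    print(f"{algorithm:12s} {system:6s} median={med:6.1f}s  vs Hama={ratio:.2f}x")
```

Using the median of several runs rather than a single run should make the 
comparison less sensitive to one-off noise on the cluster.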

Further,
1- For monitoring memory, CPU, disk and network usage, we can use [4]. 
What do you think about this?
2- Karamel can be used for easy installation of Spark and Flink [5]. I am also 
okay with manual installation. Any suggestions?
3- Spark and Flink also have a TeraSort benchmark in which Flink is apparently 
faster [6]. Should we also run a TeraSort benchmark?
4- Should we run all the systems (Flink/Spark/Hama) with their default 
configurations, or should we tune them for best performance on each algorithm?
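
To give a feel for how many runs questions 3 and 4 imply, here is a small 
sketch that enumerates the benchmark matrix. It assumes we run both the 
default and the tuned configuration and treats TeraSort as optional; the name 
run_matrix and the mode labels are hypothetical.

```python
from itertools import product

systems = ["Hama", "Spark", "Flink"]
algorithms = ["Word Count", "K-Means", "PageRank", "MRQL scan/join"]
# Assumption: every benchmark is run once per configuration mode.
config_modes = ["default", "tuned"]


def run_matrix(include_terasort=False):
    """List every (system, algorithm, config_mode) combination to run."""
    algos = algorithms + (["TeraSort"] if include_terasort else [])
    return list(product(systems, algos, config_modes))


print(len(run_matrix()))                       # without TeraSort
print(len(run_matrix(include_terasort=True)))  # with TeraSort
```

Each combination would also need several repetitions, so the total number of 
cluster runs grows quickly if we decide to cover both configuration modes.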


[1] - http://www.slideshare.net/sbaltagi/overview-of-apacheflinkbyslimbaltagi (see slide 63)
[2] - http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
[3] - http://link.springer.com/chapter/10.1007/978-3-319-19027-3_3
[4] - https://github.com/shelan/collectl-monitoring
[5] - http://karamel.readthedocs.io/en/latest/text/overview.html
[6] - http://shelan.org/blog/2016/01/31/reproducible-experiment-to-compare-apache-spark-and-apache-flink-batch-processing/

> GSoC'16: Apache Hama benchmark against Spark and Flink
> ------------------------------------------------------
>
>                 Key: HAMA-990
>                 URL: https://issues.apache.org/jira/browse/HAMA-990
>             Project: Hama
>          Issue Type: Documentation
>            Reporter: Behroz Sikander
>            Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
