[ 
https://issues.apache.org/jira/browse/BEAM-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054519#comment-17054519
 ] 

Ismaël Mejía commented on BEAM-9440:
------------------------------------

I read the paper this weekend. It is a nice read, but I found two odd points 
in it: (1) The case that is 58 times slower makes a convenient worst case to 
sell the paper's arguments, but it is probably of low value because it is a 
clear outlier. I suppose that Apache Apex was included to have a third runner 
to compare with, but the Apex runner is almost unmaintained and probably not 
optimized at all, because the Apache Apex project was abandoned after the 
runner was donated. (2) There is a case where the Beam translation performs 
better than the native system, which should be impossible given the arguments 
made above, so the methodology of that case is worth reviewing. I suppose that 
a more suitable third candidate to evaluate today would be the Samza runner, 
which is better maintained, also open source, and even has a portable version.

I am mentoring a Google Summer of Code project to implement the Nexmark queries 
in Python and compare their performance with their Java counterparts 
(BEAM-8258). If you are interested, [~GuenterHe], you may pass this information 
on to some of your students. I would be interested in a collaboration, possibly 
for a follow-up to the publication.

Back to Luke’s comment: I agree that we should probably find more specific 
tasks. My proposal is to leave this ticket open and start gathering those as 
sub-tasks.

> Performance Issues with Beam Runners compared with Native Systems
> -----------------------------------------------------------------
>
>                 Key: BEAM-9440
>                 URL: https://issues.apache.org/jira/browse/BEAM-9440
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-apex, runner-flink, runner-spark
>            Reporter: Soumabrata Chakraborty
>            Priority: Major
>
> While doing a performance evaluation of Apache Beam with the Spark runner, I 
> found that even for a simple word count problem on a text file, Beam with 
> the Spark runner was slower than native Spark by a factor of 5 for a 
> dataset as small as 14 GB.
> You will find more details on this evaluation here - 
> [https://github.com/soumabrata-chakraborty/spark-vs-beam/blob/master/README.md]
> I also came across this analysis called _Quantitative Impact Evaluation of 
> an Abstraction Layer for Data Stream Processing Systems_ 
> ([https://arxiv.org/pdf/1907.08302.pdf] / 
> [https://ieeexplore.ieee.org/document/8884832])
> According to it, for most scenarios the slowdown was at least a factor of 3, 
> with the worst case being a factor of 58!
> While it is understood that an abstraction layer comes with some 
> performance cost, the current performance cost seems to be very high.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
