[ https://issues.apache.org/jira/browse/BEAM-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052236#comment-17052236 ]

Ismaël Mejía commented on BEAM-9440:
------------------------------------

Thanks a lot for the reference, I was not aware of that publication. There is 
always some performance overhead in Beam translations (in all runners), for 
multiple reasons:

(1) Beam adds extra data that has to be serialized with each element, for 
example timestamps and windowing/pane information (see the sketch after this 
list).

(2) Translations may add extra steps to cope with that data and to respect 
Beam's windowing/triggering semantics, which in some cases produces double 
shuffles, e.g. if the runner groups by key and groups by window independently.

(3) Portable runners have the extra overhead of communication between the 
target system and the SDK harness via gRPC; in theory the impact matters less 
when the data is big, but it is still additional overhead.

(4) Some native systems can optimize pipelines based on data schemas; this was 
not possible in Beam until recently because PCollections did not have an 
associated schema. BEAM-9451 tracks how to optimize this for the new Spark 
Structured Streaming runner.

(5) Runners can also be doing extra work or using non-optimized translations. 
We have been optimizing the Spark runner to improve this; if you are 
interested, there is a really nice talk presented last year at Berlin 
Buzzwords: https://www.youtube.com/watch?v=rJIpva0tD0g
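
To make (1) a bit more concrete, here is a small DoFn sketch (the class name 
is made up, and this is not the runner's internal code) that only prints the 
metadata Beam attaches to every element; a native Spark word count shuffles 
plain strings without any of this:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
    import org.joda.time.Instant;

    // Sketch only: logs the per-element metadata that a runner has to
    // serialize and shuffle together with the user value.
    class InspectMetadataFn extends DoFn<String, String> {
      @ProcessElement
      public void processElement(ProcessContext c, BoundedWindow window) {
        Instant timestamp = c.timestamp();  // event-time timestamp of the element
        // the window assignment and pane info travel with the element too
        System.out.println(c.element() + " @ " + timestamp
            + " window=" + window + " pane=" + c.pane());
        c.output(c.element());
      }
    }

Every element of every PCollection carries this extra information, so even a 
"bare" String costs more to encode and move around than it would in a native 
Spark RDD.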

So in principle it is good to have awareness of the issue you report, and it 
definitely deserves some investigation, in particular to find out whether 
anything related to (5) is involved. The other overheads are unavoidable 
because they are hard requirements of the Beam model (1)(2) and of portability 
(3) (support for multiple languages).

> Performance Issue with Spark Runner compared with Native Spark
> --------------------------------------------------------------
>
>                 Key: BEAM-9440
>                 URL: https://issues.apache.org/jira/browse/BEAM-9440
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Soumabrata Chakraborty
>            Priority: Major
>
> While doing a performance evaluation of Apache Beam with the Spark runner, I 
> found that even for a simple word count on a text file, Beam with the Spark 
> runner was slower than native Spark by a factor of 5 for a dataset as small 
> as 14 GB.
> You will find more details on this evaluation here - 
> [https://github.com/soumabrata-chakraborty/spark-vs-beam/blob/master/README.md]
> I also came across this analysis called _Quantitative Impact Evaluation of 
> an Abstraction Layer for Data Stream Processing Systems_ 
> ([https://arxiv.org/pdf/1907.08302.pdf])
> According to it, the observation was that for most scenarios the slowdown was 
> at least a factor of 3, with the worst case being a factor of 58!
> While it is understood that an abstraction layer comes with some performance 
> cost, the current performance cost seems to be very high.
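
For reference, a minimal Beam word count of the kind benchmarked above looks 
roughly like the following. This is only an illustrative sketch, not the code 
from the linked repository; the class name, file paths and options are 
placeholders, and the Spark runner would be selected with --runner=SparkRunner 
on the command line:

    import java.util.Arrays;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class SimpleWordCount {
      public static void main(String[] args) {
        // e.g. pass --runner=SparkRunner to run on the Spark runner
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply(TextIO.read().from("/path/to/input.txt"))            // placeholder path
         .apply(FlatMapElements.into(TypeDescriptors.strings())
             .via((String line) -> Arrays.asList(line.split("\\s+"))))
         .apply(Count.perElement())
         .apply(MapElements.into(TypeDescriptors.strings())
             .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
         .apply(TextIO.write().to("/path/to/output"));               // placeholder path

        p.run().waitUntilFinish();
      }
    }

Each String flowing through such a pipeline is wrapped with the timestamp, 
window and pane metadata discussed in (1) before it is shuffled, which is part 
of the gap compared to a word count written directly against the Spark API.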



