[jira] [Comment Edited] (BEAM-12135) Batch optimized translation for Spark Runner

Tao Li (Jira) Fri, 20 Aug 2021 14:19:05 -0700


    [ 
https://issues.apache.org/jira/browse/BEAM-12135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17402423#comment-17402423
 ]


Tao Li edited comment on BEAM-12135 at 8/20/21, 9:18 PM:
---------------------------------------------------------

Thanks [~iemejia] for the MR. Do you think this will fix the perf issue I was 
facing? As I mentioned in the thread on the user mailing list, I was simply 
using ParquetIO to read some Parquet files (~20GB) and write them back to 
Parquet. The processing time was about 6 min with splittable IO, which is much 
longer than the 2 min processing time of a native Spark app. We were seeing a 
lot of GC cost in the call stack (see attached).




was (Author: sekiforever):
Thanks [~iemejia] for the MR. Do you think this will fix the perf issue I was 
facing? As I mentioned on the email thread in the user mail list, I was simply 
using ParquetIO to read some parquet files (~20GB) and write it back to 
parquet. The processing time was about 6 min with splittable IO, which is much 
longer than 2 min processing time using a native spark app. We were seeing a 
lot of GC cost from below call stack. See attached.



> Batch optimized translation for Spark Runner
> --------------------------------------------
>
>                 Key: BEAM-12135
>                 URL: https://issues.apache.org/jira/browse/BEAM-12135
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-spark
>            Reporter: Ismaël Mejía
>            Priority: P3
>         Attachments: image001.png
>
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> The Spark Runner, and maybe all other runners that deal with batch-only data, 
> might benefit from a batch-optimized translation where details of the full 
> Beam model matter less because we are in the Global window, no pane info is 
> needed, and all records use the same (min) timestamp. With this premise the 
> records can be encoded as 'value only' WindowValues, and transforms like 
> GroupByKey may ignore windowing (GABW) to improve performance.
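
The 'value only' idea in the description can be illustrated with a small 
sketch. This is not Beam's actual coder or wire format: the names 
WindowedRecord, encode_full and encode_value_only, the header layout, and the 
placeholder constants are all hypothetical. It only shows that when every 
record is known to sit in the Global window with the min timestamp and no pane 
info, that metadata can be dropped on encode and restored on decode.

```python
import struct
from dataclasses import dataclass

# Hypothetical placeholders, NOT Beam's real constants or classes.
MIN_TIMESTAMP = -(2**63)        # stand-in for "the minimum timestamp"
GLOBAL_WINDOW = "GlobalWindow"  # stand-in for the single global window
NO_PANE = 0                     # stand-in for "no pane info"

_HEADER = ">qBH"  # timestamp (8 bytes), pane (1 byte), window name length (2 bytes)


@dataclass(frozen=True)
class WindowedRecord:
    value: bytes
    timestamp: int = MIN_TIMESTAMP
    window: str = GLOBAL_WINDOW
    pane: int = NO_PANE


def encode_full(rec: WindowedRecord) -> bytes:
    """Encode the value together with its timestamp, pane and window."""
    window = rec.window.encode("utf-8")
    header = struct.pack(_HEADER, rec.timestamp, rec.pane, len(window))
    return header + window + rec.value


def decode_full(data: bytes) -> WindowedRecord:
    timestamp, pane, window_len = struct.unpack_from(_HEADER, data, 0)
    offset = struct.calcsize(_HEADER)
    window = data[offset:offset + window_len].decode("utf-8")
    return WindowedRecord(data[offset + window_len:], timestamp, window, pane)


def encode_value_only(rec: WindowedRecord) -> bytes:
    """Encode only the value; valid only under the batch premise."""
    if (rec.timestamp, rec.window, rec.pane) != (MIN_TIMESTAMP, GLOBAL_WINDOW, NO_PANE):
        raise ValueError("value-only encoding requires default window metadata")
    return rec.value


def decode_value_only(data: bytes) -> WindowedRecord:
    # The metadata is implied, so it is reconstructed from the defaults.
    return WindowedRecord(data)


if __name__ == "__main__":
    rec = WindowedRecord(b"row-1")
    print(len(encode_full(rec)), len(encode_value_only(rec)))
```

In this toy layout the per-record overhead is a fixed header plus the window 
name; whether the real runner sees a comparable saving depends on its actual 
coders and is not something this sketch measures.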



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
