[
https://issues.apache.org/jira/browse/BEAM-12135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17402423#comment-17402423
]
Tao Li edited comment on BEAM-12135 at 8/20/21, 9:18 PM:
---------------------------------------------------------
Thanks [~iemejia] for the MR. Do you think this will fix the perf issue I was
facing? As I mentioned in the thread on the user mailing list, I was simply
using ParquetIO to read some Parquet files (~20GB) and write them back as
Parquet. The processing time was about 6 min with splittable IO, which is much
longer than the 2 min it takes with a native Spark app. The call stack showed
a lot of GC overhead (see attached).
> Batch optimized translation for Spark Runner
> --------------------------------------------
>
> Key: BEAM-12135
> URL: https://issues.apache.org/jira/browse/BEAM-12135
> Project: Beam
> Issue Type: Improvement
> Components: runner-spark
> Reporter: Ismaël Mejía
> Priority: P3
> Attachments: image001.png
>
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> The Spark Runner, and maybe all other runners that deal with batch-only
> data, might benefit from a batch-optimized translation where details of the
> full Beam model matter less: everything is in the Global window, no pane info
> is needed, and all records use the same (min) timestamp. With this premise
> the records can be encoded as 'value only' WindowValues and transforms like
> GroupByKey may ignore windowing (GABW) to improve performance.
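
The 'value only' idea in the description can be sketched in plain Java. This
is only an illustration under stated assumptions, not Beam's coders or the
Spark Runner's translation: the class, the method names, and the byte layout
are all hypothetical. It shows that when timestamp, window and pane are the
same for every record, they can be dropped from the encoded form and
re-attached as constants on decode.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch: none of these names or layouts come from Beam.
final class ValueOnlyEncodingSketch {
    // Stand-ins for the metadata that is identical for every batch record.
    static final long MIN_TIMESTAMP = Long.MIN_VALUE; // placeholder, not Beam's constant
    static final String GLOBAL_WINDOW = "GlobalWindow";
    static final byte NO_PANE_INFO = 0;

    // Full encoding: timestamp + window + pane are written with every value.
    static byte[] encodeFull(String value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeLong(MIN_TIMESTAMP);
            out.writeUTF(GLOBAL_WINDOW);
            out.writeByte(NO_PANE_INFO);
            out.writeUTF(value);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Value-only encoding: the constant metadata is implied, so only the
    // value itself is written.
    static byte[] encodeValueOnly(String value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(value);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decoding a value-only record; a caller that needs the metadata can
    // use the constants above instead of reading them from the bytes.
    static String decodeValueOnly(byte[] encoded) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded));
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String record = "record-1";
        byte[] full = encodeFull(record);
        byte[] valueOnly = encodeValueOnly(record);
        System.out.println("full encoding bytes:       " + full.length);
        System.out.println("value-only encoding bytes: " + valueOnly.length);
        System.out.println("round trip: " + decodeValueOnly(valueOnly));
    }
}
```

In this toy layout the per-record saving is the fixed size of the metadata,
so it matters most for small values and large record counts. Whether it
addresses the GC cost reported in the comment is not something this sketch
can show.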