[
https://issues.apache.org/jira/browse/BEAM-14534?focusedWorklogId=777014&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-777014
]
ASF GitHub Bot logged work on BEAM-14534:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 01/Jun/22 17:01
Start Date: 01/Jun/22 17:01
Worklog Time Spent: 10m
Work Description: steveniemitz commented on PR #17783:
URL: https://github.com/apache/beam/pull/17783#issuecomment-1143881332
> The other part of the change makes sense to reduce byte[] copies by using
> ByteString.
>
> CC: @tudorm
Maybe I'll pull the ByteString refactoring stuff out into another review
just to make this easier? Do you have any particular issues with it using
ByteString there?
The downsides of using OutputStream/InputStream are really too big to ignore
here: the performance differences are orders of magnitude in our tests. The
main problem is that most "stream" compressor implementations are designed to
compress a large amount of data, but in this case we're usually only
compressing a few hundred bytes to 1 KB. That makes the overhead of creating
and destroying the compressor streams very high (comparatively, at least). We
ran into this problem with both deflate and zstd, and it's one of the reasons
we ended up with an interface like this. If putting this in OSS with a similar
interface is really a non-starter, that's fine; we can continue maintaining
this in our own fork for the time being.
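To make the overhead argument concrete, here is a minimal sketch using only
java.util.zip. It is not the interface from the PR; the class and method names
(SmallValueCompression, compressWithStream, ReusableCompressor) are
hypothetical. The stream variant builds and tears down a compressor for every
value, while the byte-array variant resets and reuses one Deflater. The timing
loop is a rough illustration, not a benchmark, and results will vary by JVM and
payload.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.Inflater;

public class SmallValueCompression {

  // Stream style: every call creates a new DeflaterOutputStream, which
  // allocates its own Deflater and buffer and releases them on close().
  static byte[] compressWithStream(byte[] value) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DeflaterOutputStream out = new DeflaterOutputStream(bytes)) {
      out.write(value);
    }
    return bytes.toByteArray();
  }

  // Byte-array style (hypothetical helper): one Deflater and one scratch
  // buffer are kept and reset between values. Not thread-safe.
  static final class ReusableCompressor {
    private final Deflater deflater = new Deflater();
    private final byte[] scratch = new byte[4096];

    byte[] compress(byte[] value) {
      deflater.reset();
      deflater.setInput(value);
      deflater.finish();
      ByteArrayOutputStream bytes = new ByteArrayOutputStream(Math.max(32, value.length));
      while (!deflater.finished()) {
        int n = deflater.deflate(scratch);
        bytes.write(scratch, 0, n);
      }
      return bytes.toByteArray();
    }

    void end() {
      deflater.end(); // release the compressor's native resources
    }
  }

  // Inflates zlib-wrapped data produced by either variant above.
  static byte[] decompress(byte[] compressed) throws DataFormatException {
    Inflater inflater = new Inflater();
    try {
      inflater.setInput(compressed);
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      byte[] scratch = new byte[4096];
      while (!inflater.finished()) {
        int n = inflater.inflate(scratch);
        if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
          throw new DataFormatException("truncated or unsupported input");
        }
        bytes.write(scratch, 0, n);
      }
      return bytes.toByteArray();
    } finally {
      inflater.end();
    }
  }

  public static void main(String[] args) throws Exception {
    // A small, compressible 500-byte payload, similar in size to a shuffled value.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 50; i++) {
      sb.append("key=value;");
    }
    byte[] value = sb.toString().getBytes(StandardCharsets.UTF_8);

    ReusableCompressor reusable = new ReusableCompressor();
    System.out.println("stream round trip ok: "
        + Arrays.equals(value, decompress(compressWithStream(value))));
    System.out.println("reused round trip ok: "
        + Arrays.equals(value, decompress(reusable.compress(value))));

    // Rough timing only; use a harness such as JMH for real measurements.
    int iterations = 20_000;
    long t0 = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      compressWithStream(value);
    }
    long t1 = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      reusable.compress(value);
    }
    long t2 = System.nanoTime();
    System.out.printf("stream per value: %d ns, reused per value: %d ns%n",
        (t1 - t0) / iterations, (t2 - t1) / iterations);
    reusable.end();
  }
}
```

The same pattern applies to any compressor that exposes a resettable context;
the point is only that per-value setup cost dominates when values are this
small.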
The PipelineVisitor idea is interesting, although I'm skeptical of how well
it'd work in practice. For example, with a Combine, the coder for the data
being shuffled is the accumulator coder, not the value coder of the KV. I bet
you'd need a bunch of special cases to pick the "right" coder to wrap for
various transforms.
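As an illustration of the Combine case, here is a small standalone sketch. It
does not use the Beam API; MeanFn and the encode/decode helpers are
hypothetical stand-ins. It shows that when inputs are pre-combined before the
shuffle, the bytes crossing the shuffle are encoded accumulators (here a sum
and a count), not encoded input values, so wrapping the value coder would miss
them.

```java
import java.nio.ByteBuffer;

public class AccumulatorCoderSketch {

  // Hypothetical combine function: int inputs, {sum, count} accumulator, double output.
  static final class MeanFn {
    long[] createAccumulator() {
      return new long[] {0L, 0L};
    }

    long[] addInput(long[] acc, int input) {
      acc[0] += input;
      acc[1] += 1;
      return acc;
    }

    long[] mergeAccumulators(long[] a, long[] b) {
      return new long[] {a[0] + b[0], a[1] + b[1]};
    }

    double extractOutput(long[] acc) {
      return acc[1] == 0 ? Double.NaN : (double) acc[0] / acc[1];
    }
  }

  // "Value coder": what a visitor would find on the input collection.
  static byte[] encodeValue(int value) {
    return ByteBuffer.allocate(4).putInt(value).array();
  }

  // "Accumulator coder": what is actually written to the shuffle after pre-combining.
  static byte[] encodeAccumulator(long[] acc) {
    return ByteBuffer.allocate(16).putLong(acc[0]).putLong(acc[1]).array();
  }

  static long[] decodeAccumulator(byte[] bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    return new long[] {buffer.getLong(), buffer.getLong()};
  }

  public static void main(String[] args) {
    MeanFn fn = new MeanFn();

    // Two workers each pre-combine their local inputs.
    long[] worker1 = fn.addInput(fn.addInput(fn.createAccumulator(), 1), 2);
    long[] worker2 = fn.addInput(fn.addInput(fn.createAccumulator(), 3), 4);

    // Only the encoded accumulators are shuffled.
    byte[] shuffled1 = encodeAccumulator(worker1);
    byte[] shuffled2 = encodeAccumulator(worker2);

    long[] merged = fn.mergeAccumulators(
        decodeAccumulator(shuffled1), decodeAccumulator(shuffled2));

    System.out.println("value bytes: " + encodeValue(1).length
        + ", accumulator bytes: " + shuffled1.length);
    System.out.println("mean: " + fn.extractOutput(merged));
  }
}
```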
Issue Time Tracking
-------------------
Worklog Id: (was: 777014)
Time Spent: 1h 20m (was: 1h 10m)
> Add an interface to allow users to compress values being written to shuffle
> ---------------------------------------------------------------------------
>
> Key: BEAM-14534
> URL: https://issues.apache.org/jira/browse/BEAM-14534
> Project: Beam
> Issue Type: Improvement
> Components: runner-dataflow
> Reporter: Steve Niemitz
> Assignee: Steve Niemitz
> Priority: P2
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> Values being shuffled are frequently large and compressible. While users can
> compress them on their own by using a coder that compresses the data, it
> would be nice to be able to do so globally for all values.