[jira] [Work logged] (BEAM-14534) Add an interface to allow users to compress values being written to shuffle

ASF GitHub Bot (Jira) Fri, 03 Jun 2022 13:53:04 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-14534?focusedWorklogId=778310&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778310
 ]


ASF GitHub Bot logged work on BEAM-14534:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 03/Jun/22 20:52
            Start Date: 03/Jun/22 20:52
    Worklog Time Spent: 10m 
      Work Description: lukecwik commented on PR #17783:
URL: https://github.com/apache/beam/pull/17783#issuecomment-1146352894

   > > The other part of the change makes sense to reduce byte[] copies by 
using ByteString.
   > > CC: @tudorm
   > 
   > Maybe I'll pull the ByteString refactoring stuff out into another review 
just to make this easier? Do you have any particular issues with it using 
ByteString there?
   > 
   > The downsides with using Output/Input stream are really too big to ignore 
here, the performance differences are orders of magnitude in our tests. The 
main problem is that most "stream" compressor implementations are designed to 
compress a large amount of data, but in this case we're usually only 
compressing a few 100-1KB. It makes the overhead from creating/destroying the 
compressor streams very high (comparatively at least). We ran into this problem 
both with deflate and zstd, and its one of the reasons we ended up with an 
interface like this. If its really a non-starter putting this on OSS with a 
similar interface that's fine though, we can continue maintaining this in our 
own fork for the time being.
   > 
   > The PipelineVisitor idea is interesting, although I'm skeptical how well 
it'd work in practice. For example with a Combine the coder for the data being 
shuffled is the accumulator coder, not the value coder of the KV. I bet you'd 
need a bunch of special cases to pick the "right" coder to wrap for various 
transforms.
   
   I'm happy with using ByteStrings as it reduces copies and will review and 
merge that and then we can revisit this.
   
   I was unaware of the additional overhead for using a stream 
compressor/decompressor vs one that worked on fixed length byte strings. I'm 
curious to learn more about it though as I might see something that wasn't 
considered before and also with the migration to Dataflow runner v2 you'll lose 
the ability to have this option have an effect unless you go the coder route 
which will have additional benefits since you'll reduce the amount of data 
being sent over gRPC between the SDK harness and the runner and also how much 
the runner materializes in shuffle in addition to those other use cases such as 
side inputs/user state/...
   
   Using the PipelineVisitor should work quite well as since you really only 
need to handle GBK, CoGBK and Combine and that would give you everything that 
is optimized across runners today to my knowledge.




Issue Time Tracking
-------------------

    Worklog Id:     (was: 778310)
    Time Spent: 1h 40m  (was: 1.5h)

> Add an interface to allow users to compress values being written to shuffle
> ---------------------------------------------------------------------------
>
>                 Key: BEAM-14534
>                 URL: https://issues.apache.org/jira/browse/BEAM-14534
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-dataflow
>            Reporter: Steve Niemitz
>            Assignee: Steve Niemitz
>            Priority: P2
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Frequently values being shuffled are large and compressible, while users can 
> compress them on their own by using a coder that compresses the data, it 
> would be nice to be able to do so globally for all values.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Work logged] (BEAM-14534) Add an interface to allow users to compress values being written to shuffle

Reply via email to