[ 
https://issues.apache.org/jira/browse/BEAM-12506?focusedWorklogId=628798&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-628798
 ]

ASF GitHub Bot logged work on BEAM-12506:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 28/Jul/21 00:03
            Start Date: 28/Jul/21 00:03
    Worklog Time Spent: 10m 
      Work Description: KevinGG commented on pull request #15217:
URL: https://github.com/apache/beam/pull/15217#issuecomment-887911629


   > Looks good at a glance, just wondering about the performance. By turning 
the WindowedValueHolder into a Row coder, how does this impact performance? I'm 
assuming that the performance increases. Subjectively, do you see a difference?
   
   For performance, the main difference is that I've added the `urn` field to 
differentiate a decoded `WindowedValueHolder` from a `Row` created by the user 
happening to contain a `windowed_value` field (which is a scenario too specific 
that shouldn't happen that often). So it should take more space in encoded 
bytes and cost more when encoding and decoding. The extra cost is 
`constant_for_urn_field * #_of_elements`. Since it's used in test_stream, it 
shouldn't affect real production jobs. If the performance regression is not 
acceptable, we can always remove the `urn` field.
   
   As for `RowCoder` and pickled python coder, I don't think there is much 
difference. The `RowCoder` seems to fall back to `FastPrimitiveCoder` when the 
Row is very simple. The advantage is that the `WindowedValueHolder` encoded in 
a language can be decoded in another without issues. Then we can use the cache 
as a medium for external transforms.
   
   One thing I think we should re-visit is to figure out how to encode/decode 
in the streaming cache. 
   
   1. The `AddElement` should only accept encoded `bytes` as the elements, see 
[here](https://github.com/apache/beam/blob/7d98ad2b6554258eeccf9f7c1b017f9a001bfd87/model/pipeline/src/main/proto/beam_runner_api.proto#L662).
 We should encode the elements when creating the `TestStreamFileRecord` protos. 
The missing encoding is 
[here](https://github.com/apache/beam/blob/205fbb10998c9e2a1c7842ab7efd88aef80828a6/sdks/python/apache_beam/testing/test_stream.py#L611).
  
   2. Our cache uses 
[SerializeToString](https://github.com/apache/beam/blob/205fbb10998c9e2a1c7842ab7efd88aef80828a6/sdks/python/apache_beam/testing/test_stream.py#L477)
 to convert `TestStreamFileRecord` to bytes and then encode them with 
`SafeFastPrimitvesCoder` before writing to text files. It's basically a `byte` 
coder. Thus the coder we register for `labels` in the cache manager has nothing 
to do with the underlying values we cache. We need to register 2 coders for 
`labels`: one for the `TestStreamFileRecord`, one for the underlying value. Or 
simpler, always use a default coder (ProtoCoder? 2 variants for 
`TestStreamFileHeader` and `TestStreamFileRecord`) to encode the proto bytes. 
The registered coder is only used when encoding/decoding the underlying values.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 628798)
    Time Spent: 2h  (was: 1h 50m)

> Change WindowedValueHolder into a Row Schema
> --------------------------------------------
>
>                 Key: BEAM-12506
>                 URL: https://issues.apache.org/jira/browse/BEAM-12506
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-py-interactive
>            Reporter: Ning
>            Assignee: Ning
>            Priority: P2
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> The WindowedValueHolder is a Python type that requires a special 
> `SafeFastPrimitivesCoder` instead of the native `FastPrimitivesCoder` in 
> cache_manager to encode and decode.
> When reading cache of it and applying an external transform such as a 
> SqlTransform, it introduces a pickled Python coder that is not xLang friendly.
> We could build a Row schema to hold the WindowedValueHolder to make the cache 
> reading xLang friendly and also get rid of the additional layer of 
> `SafeFastPrimitivesCoder`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to