[ 
https://issues.apache.org/jira/browse/BEAM-6120?focusedWorklogId=169910&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-169910
 ]

ASF GitHub Bot logged work on BEAM-6120:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 27/Nov/18 17:45
            Start Date: 27/Nov/18 17:45
    Worklog Time Spent: 10m 
      Work Description: lukecwik commented on a change in pull request #7127: 
[BEAM-6120] Support retrieval of large gbk iterables over the state API.
URL: https://github.com/apache/beam/pull/7127#discussion_r236769635
 
 

 ##########
 File path: model/pipeline/src/main/proto/beam_runner_api.proto
 ##########
 @@ -588,6 +588,10 @@ message StandardCoders {
     // of the element
     // Components: The element coder and the window coder, in that order
     WINDOWED_VALUE = 8 [(beam_urn) = "beam:coder:windowed_value:v1"];
+
+    // Encodes an iterable of elements, some of which may be stored in state.
+    // Components: Coder for a single element.
+    STATE_BACKED_ITERABLE = 9 [(beam_urn) = "beam:coder:state_backed_iterable:v1"];
 
 Review comment:
   I do believe the composition is possible. The state-backed length-prefix
coder would read all the data from the Data API up to the end, including the
optional state token if it exists. If it hits the end and there is no state
token, or the component coder doesn't support lazy decoding, then it can use
the component coder to decode the data and return that result. If there is a
state token and the component coder supports a lazy view, then it calls a new
method on the component coder that creates a lazy view over a reiterable
bytestring, and it's up to the component coder implementation to decode/seek
over the data as needed.
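
   A minimal sketch of that decision, assuming hypothetical names throughout
(`supports_lazy_view`, `decode_all`, `lazy_view`, and `fetch_remaining` are
illustrative and are not existing Beam APIs). The toy coders treat each byte as
one element. For a state token combined with a non-lazy coder, the sketch
assumes the remaining bytes are fetched eagerly, which is only one possible
reading of the description above:

```python
from typing import Callable, Iterable, Iterator, Optional


class ReiterableChunks:
    """Re-iterable byte chunks: the inline data, then whatever the state token yields."""

    def __init__(self, inline: bytes, token: bytes,
                 fetch_remaining: Callable[[bytes], Iterable[bytes]]):
        self._inline = inline
        self._token = token
        self._fetch_remaining = fetch_remaining

    def __iter__(self) -> Iterator[bytes]:
        yield self._inline
        # Fetched again on every iteration, so nothing has to stay in memory.
        yield from self._fetch_remaining(self._token)


class EagerCoder:
    """Hypothetical component coder without lazy decoding (one byte per element)."""
    supports_lazy_view = False

    def decode_all(self, data: bytes) -> list:
        return list(data)


class LazyView:
    """Decodes elements on demand while walking the underlying chunks."""

    def __init__(self, chunks: Iterable[bytes]):
        self._chunks = chunks

    def __iter__(self) -> Iterator[int]:
        for chunk in self._chunks:
            yield from chunk


class LazyCoder(EagerCoder):
    """Hypothetical component coder that also offers the new lazy-view method."""
    supports_lazy_view = True

    def lazy_view(self, chunks: Iterable[bytes]) -> LazyView:
        return LazyView(chunks)


def decode_state_backed(inline: bytes, token: Optional[bytes], coder,
                        fetch_remaining: Callable[[bytes], Iterable[bytes]]):
    if token is not None and coder.supports_lazy_view:
        # State token present and the coder can decode lazily: hand it a view.
        return coder.lazy_view(ReiterableChunks(inline, token, fetch_remaining))
    data = inline
    if token is not None:
        # Assumption: a non-lazy coder forces an eager fetch of the remainder.
        data += b"".join(fetch_remaining(token))
    return coder.decode_all(data)
```

   With a lazy coder, nothing is fetched until the returned view is iterated,
and each iteration fetches again rather than caching.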
   
   I have a partial implementation of the reiterable bytestring in
StateFetchingIterators.java; I just need to create the lazy decoding iterable
view on top of it.
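
   To illustrate the shape of those two pieces, here is a rough Python sketch.
It is not the code in StateFetchingIterators.java. The page fetcher signature
and the one-byte length prefix are assumptions made for the example only:

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

# Hypothetical page fetcher: takes a continuation token (None for the first
# page) and returns (data, next_token); next_token is None on the last page.
PageFetcher = Callable[[Optional[bytes]], Tuple[bytes, Optional[bytes]]]


class ReiterableByteString:
    """Byte chunks that can be iterated repeatedly by re-fetching pages."""

    def __init__(self, get_page: PageFetcher):
        self._get_page = get_page

    def __iter__(self) -> Iterator[bytes]:
        token = None
        while True:
            data, token = self._get_page(token)
            yield data
            if token is None:
                return


class LazyDecodingIterable:
    """Lazy view that decodes elements as chunks arrive.

    Toy encoding (an assumption): one length byte, then that many payload bytes.
    Elements may straddle chunk boundaries.
    """

    def __init__(self, chunks: Iterable[bytes]):
        self._chunks = chunks

    def __iter__(self) -> Iterator[bytes]:
        buf = bytearray()
        for chunk in self._chunks:
            buf.extend(chunk)
            # Emit every element that is fully buffered so far.
            while buf and len(buf) >= 1 + buf[0]:
                n = buf[0]
                yield bytes(buf[1:1 + n])
                del buf[:1 + n]
        if buf:
            raise ValueError("truncated element at end of stream")
```

   Only the bytes of a partially received element are buffered, so memory use
is bounded by the page size plus one element rather than by the whole iterable.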
   
   Writing the data should allow the remote reference to be emitted as is, so
that the runner then internally forwards the data itself. This would require
building context as to whether the receiving party of the encoded bytes
supports remote references (for example, using coders within an IO
implementation would not support remote references). Without forwarding a
remote reference, how would you expect a runner to receive and decode an
enormous value iterable? All the runners assume they can store the whole
element (or a partial view of an element) within memory.
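
   One way the write side could look, as a sketch only: `RemoteReference`,
`EncodingContext`, the `materialize` callback, and the `R`/`I` tag bytes are all
hypothetical and do not describe Beam's wire format. The point is just that the
encoder branches on what the receiver can handle:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Union


@dataclass(frozen=True)
class RemoteReference:
    """Hypothetical handle to bytes the runner already holds in state."""
    state_token: bytes


@dataclass(frozen=True)
class EncodingContext:
    """Hypothetical description of the party receiving the encoded bytes."""
    supports_remote_references: bool


def encode_iterable(value: Union[RemoteReference, Iterable[bytes]],
                    context: EncodingContext,
                    materialize: Callable[[RemoteReference], Iterable[bytes]]) -> bytes:
    if isinstance(value, RemoteReference):
        if context.supports_remote_references:
            # Emit the reference as is; the runner forwards the data itself.
            return b"R" + value.state_token
        # Receiver (for example an IO) cannot resolve references: inline the data.
        value = materialize(value)
    return b"I" + b"".join(value)
```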

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 169910)
    Time Spent: 1.5h  (was: 1h 20m)

> Support retrieval of large gbk iterables over the state API.
> ------------------------------------------------------------
>
>                 Key: BEAM-6120
>                 URL: https://issues.apache.org/jira/browse/BEAM-6120
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-harness
>            Reporter: Robert Bradshaw
>            Assignee: Robert Bradshaw
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
