[ https://issues.apache.org/jira/browse/BEAM-6120?focusedWorklogId=169910&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-169910 ]
ASF GitHub Bot logged work on BEAM-6120:
----------------------------------------
Author: ASF GitHub Bot
Created on: 27/Nov/18 17:45
Start Date: 27/Nov/18 17:45
Worklog Time Spent: 10m
Work Description: lukecwik commented on a change in pull request #7127:
[BEAM-6120] Support retrieval of large gbk iterables over the state API.
URL: https://github.com/apache/beam/pull/7127#discussion_r236769635
##########
File path: model/pipeline/src/main/proto/beam_runner_api.proto
##########
@@ -588,6 +588,10 @@ message StandardCoders {
// of the element
// Components: The element coder and the window coder, in that order
WINDOWED_VALUE = 8 [(beam_urn) = "beam:coder:windowed_value:v1"];
+
+ // Encodes an iterable of elements, some of which may be stored in state.
+ // Components: Coder for a single element.
+ STATE_BACKED_ITERABLE = 9 [(beam_urn) = "beam:coder:state_backed_iterable:v1"];
Review comment:
I do believe the composition is possible. The state-backed length-prefix
coder would read all the data from the Data API up to the end, including the
optional state token if it exists. If it hits the end and there is no state
token, or the component coder doesn't support lazy decoding, then it can use the
component coder to decode the data and return that result. If there is a state
token and the component coder supports a lazy view, then it calls a new method
on the component coder which creates a lazy view over a reiterable bytestring,
and it's up to the component coder implementation to decode/seek over the data
as needed.
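To make the decode path concrete, here is a rough Python sketch of the branching described above. It is only an illustration under stated assumptions, not the PR's implementation: `ByteElementCoder`, `supports_lazy_view`, `decode_all`, `lazy_view`, `decode_state_backed` and `fetch_chunks` are all hypothetical names, the one-byte-per-element format is a toy, and the handling of "state token present but no lazy support" (fetch the rest eagerly) is my guess at the intent.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class ByteElementCoder:
    """Hypothetical component coder: every element is a single byte (toy format)."""

    supports_lazy_view = True

    def decode_all(self, data: bytes) -> List[int]:
        # Eager path: decode everything that was read from the Data API.
        return list(data)

    def lazy_view(self, chunks: Callable[[], Iterator[bytes]]) -> Iterable[int]:
        # Lazy path: decode on demand from a re-iterable source of byte chunks.
        class _View:
            def __iter__(self) -> Iterator[int]:
                for chunk in chunks():
                    yield from chunk

        return _View()


def decode_state_backed(
    coder,
    data: bytes,
    state_token: Optional[bytes],
    fetch_chunks: Callable[[bytes], Iterator[bytes]],
):
    """Pick the eager or lazy decode path (names and signature are assumptions)."""
    if state_token is None:
        # No state token: everything is already in `data`.
        return coder.decode_all(data)
    if not getattr(coder, "supports_lazy_view", False):
        # Assumption: without lazy support, pull the remainder eagerly.
        return coder.decode_all(data + b"".join(fetch_chunks(state_token)))

    def chunks() -> Iterator[bytes]:
        # Re-iterable byte source: inline prefix first, then state-backed data.
        yield data
        yield from fetch_chunks(state_token)

    return coder.lazy_view(chunks)
```

In this sketch the state fetch only happens when the view is iterated, and it is repeated on each iteration, which is the reiterable behaviour the lazy view relies on.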
I have a partial implementation of the reiterable bytestring in
StateFetchingIterators.java; I just need to create the lazy decoding iterable
view on top of it.
Writing the data should allow the remote reference to be emitted as is so
that the runner then internally forwards the data itself. This would require
building context as to whether the receiving party of the encoded bytes supports
remote references (for example, using coders within an IO implementation would
not support remote references). Without forwarding a remote reference,
how would you expect a runner to receive and decode an enormous value iterable?
All the runners assume they can store the whole element (or a partial view of
an element) within memory.
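The write side could look roughly like the Python sketch below. Again, this is a sketch under assumptions rather than a proposal for the actual API: `RemoteReference`, `encode_iterable` and the `receiver_supports_remote_refs` flag are hypothetical, the tagged tuple stands in for whatever wire framing is chosen, and raising an error when the receiver cannot take a remote reference is just a placeholder for a decision that still has to be made.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple, Union


@dataclass(frozen=True)
class RemoteReference:
    """Hypothetical handle for an iterable whose data lives behind the state API."""

    token: bytes


def encode_iterable(
    value: Union[RemoteReference, Iterable[int]],
    receiver_supports_remote_refs: bool,
    encode_elements: Callable[[Iterable[int]], bytes],
) -> Tuple[str, bytes]:
    """Return a tagged payload; the tag stands in for real wire framing."""
    if isinstance(value, RemoteReference):
        if receiver_supports_remote_refs:
            # Emit the reference as is; the runner forwards the data itself.
            return ("remote", value.token)
        # Placeholder choice: e.g. a coder used inside an IO implementation.
        raise ValueError("receiver does not support remote references")
    # Ordinary in-memory iterable: encode the elements inline.
    return ("inline", encode_elements(value))
```

For example, `encode_iterable(RemoteReference(b"t"), True, bytes)` returns the reference untouched, while a plain list is encoded inline regardless of the flag.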
----------------------------------------------------------------
Issue Time Tracking
-------------------
Worklog Id: (was: 169910)
Time Spent: 1.5h (was: 1h 20m)
> Support retrieval of large gbk iterables over the state API.
> ------------------------------------------------------------
>
> Key: BEAM-6120
> URL: https://issues.apache.org/jira/browse/BEAM-6120
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-harness
> Reporter: Robert Bradshaw
> Assignee: Robert Bradshaw
> Priority: Major
> Time Spent: 1.5h
> Remaining Estimate: 0h
>