[
https://issues.apache.org/jira/browse/BEAM-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Luke Cwik updated BEAM-10942:
-----------------------------
Summary: beam_PostCommit_Java_Nexmark_Flink failing due to loss of state
when executing splittable DoFn (was: beam_PostCommit_Java_Nexmark_Flink
failing)
> beam_PostCommit_Java_Nexmark_Flink failing due to loss of state when
> executing splittable DoFn
> ----------------------------------------------------------------------------------------------
>
> Key: BEAM-10942
> URL: https://issues.apache.org/jira/browse/BEAM-10942
> Project: Beam
> Issue Type: Bug
> Components: runner-flink
> Affects Versions: 2.25.0
> Reporter: Luke Cwik
> Assignee: Luke Cwik
> Priority: P0
>
> Nexmark fails with an NPE caused by a failure to load the
> element/restriction state for the splittable DoFn.
> Additional logging in
> https://github.com/lukecwik/incubator-beam/tree/beam10670.2 shows that the
> element/restriction pair is null/null:
> {noformat}
> Sep 21, 2020 5:26:23 PM
> org.apache.beam.runners.core.SplittableParDoViaKeyedWorkItems$ProcessFn
> processElement
> INFO: before: KV{null, null}
> {noformat}
> Earlier log output shows that the element and restriction were written to
> state and read back successfully, then a timer was set, and after that the
> same reads returned null:
> {noformat}
> INFO: Setting:
> //:713916102:element:ValueInGlobalWindow{value=UnboundedEventSource(0,
> 100000), pane=PaneInfo.NO_FIRING}
> INFO: Setting:
> //:713916102:restriction:UnboundedSourceRestriction{source=UnboundedEventSource(33000,
> 34000), checkpoint=GeneratorCheckpoint{numEvents=1000,
> wallclockBaseTime=1600734382646}, watermark=294247-01-10T04:00:54.775Z}
> INFO: Reading:
> //:713916102:element:ValueInGlobalWindow{value=UnboundedEventSource(0,
> 100000), pane=PaneInfo.NO_FIRING}
> INFO: Reading:
> //:713916102:restriction:UnboundedSourceRestriction{source=UnboundedEventSource(33000,
> 34000), checkpoint=GeneratorCheckpoint{numEvents=1000,
> wallclockBaseTime=1600734382646}, watermark=294247-01-10T04:00:54.775Z}
> INFO: Setting: //:713916102:watermarkEstimatorState:294247-01-10T04:00:54.775Z
> INFO: Setting timer: 713916102 1:1600734383019// at 1600734383019 with output
> time 1600734383019
> INFO: Reading: //:713916102:element:null
> INFO: Reading: //:713916102:restriction:null
> INFO: Reading: //:713916102:watermarkEstimatorState:null
> {noformat}
> These entries are for the byte[] key with hash 713916102 in the global
> window namespace.
> Note that in a rerun the keys will be different since they are UUIDs, but
> the pattern is the same.
> To rerun:
> Check out https://github.com/lukecwik/incubator-beam/tree/beam10670.2 for
> the additional log output, then set up a Gradle run
> with task:
> {noformat}
> :sdks:java:testing:nexmark:run
> {noformat}
> with args:
> {noformat}
> -Pnexmark.runner=":runners:flink:1.11"
> -Pnexmark.args=" --runner=FlinkRunner --suite=SMOKE
> --streamTimeout=60 --streaming=true --manageResources=false
> --monitorJobs=true --flinkMaster=[local] --numEvents=100000 --query=4
> --checkpointingInterval=3000 --shutdownSourcesAfterIdleMs=60000"
> {noformat}
> Note that you may need to increase your console size
> (Editor>General>Console) to capture enough output before it fails. I
> usually add a breakpoint on the UserCodeException constructor to pause the
> program and inspect the state at that point in time.
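
The set-then-lost pattern in the log excerpt above can be checked for mechanically when the console output is long. The following is a minimal sketch, not part of Beam: it assumes each log entry sits on a single line shaped like `INFO: Setting: <namespace>:<key hash>:<state cell>:<value>` (the shape produced by the extra logging on the branch above), and the names `find_lost_state` and `SAMPLE` are hypothetical.

```python
import re

# Assumed entry shape, inferred from the excerpt in the issue description:
#   INFO: Setting: <namespace>:<key hash>:<state cell>:<value>
#   INFO: Reading: <namespace>:<key hash>:<state cell>:<value>
# "Setting timer: ..." lines do not match and are ignored.
ENTRY = re.compile(r"INFO: (Setting|Reading): (.*?):(-?\d+):(\w+):(.*)$")


def find_lost_state(lines):
    """Return (key_hash, cell) for every read that yields null after a non-null write."""
    last_written = {}  # (namespace, key_hash, cell) -> last value logged as written
    lost = []
    for line in lines:
        match = ENTRY.search(line.strip())
        if match is None:
            continue
        op, namespace, key_hash, cell, value = match.groups()
        slot = (namespace, key_hash, cell)
        if op == "Setting":
            last_written[slot] = value
        elif value == "null" and last_written.get(slot, "null") != "null":
            # The cell held a value the last time it was written, yet reads back as null.
            lost.append((key_hash, cell))
    return lost


# Lines adapted from the excerpt above, joined onto single lines.
SAMPLE = [
    "INFO: Setting: //:713916102:element:ValueInGlobalWindow{value=UnboundedEventSource(0, 100000), pane=PaneInfo.NO_FIRING}",
    "INFO: Setting: //:713916102:watermarkEstimatorState:294247-01-10T04:00:54.775Z",
    "INFO: Setting timer: 713916102 1:1600734383019// at 1600734383019 with output time 1600734383019",
    "INFO: Reading: //:713916102:element:null",
    "INFO: Reading: //:713916102:watermarkEstimatorState:null",
]

if __name__ == "__main__":
    for key_hash, cell in find_lost_state(SAMPLE):
        print(f"key {key_hash}: state cell '{cell}' was set but read back as null")
```

A state cell that is legitimately cleared would also be reported unless the clear is logged as a `Setting` entry with value `null`, so any hit is only a pointer to where to look in the full log.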