[ 
https://issues.apache.org/jira/browse/BEAM-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Cwik updated BEAM-10942:
-----------------------------
    Summary: beam_PostCommit_Java_Nexmark_Flink failing due to loss of state 
when executing splittable DoFn  (was: beam_PostCommit_Java_Nexmark_Flink 
failing)

> beam_PostCommit_Java_Nexmark_Flink failing due to loss of state when 
> executing splittable DoFn
> ----------------------------------------------------------------------------------------------
>
>                 Key: BEAM-10942
>                 URL: https://issues.apache.org/jira/browse/BEAM-10942
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-flink
>    Affects Versions: 2.25.0
>            Reporter: Luke Cwik
>            Assignee: Luke Cwik
>            Priority: P0
>
> Nexmark fails with an NPE caused by a failure to load the element/restriction 
> state for a splittable DoFn.
> Additional logging in 
> https://github.com/lukecwik/incubator-beam/tree/beam10670.2 shows that the 
> element/restriction pair is null/null:
> {noformat}
> Sep 21, 2020 5:26:23 PM 
> org.apache.beam.runners.core.SplittableParDoViaKeyedWorkItems$ProcessFn 
> processElement
> INFO: before: KV{null, null}
> {noformat}
> Earlier logging shows that we set the element and restriction state, read 
> both back successfully, set a timer, and then failed to load the data:
> {noformat}
> INFO: Setting: 
> //:713916102:element:ValueInGlobalWindow{value=UnboundedEventSource(0, 
> 100000), pane=PaneInfo.NO_FIRING}
> INFO: Setting: 
> //:713916102:restriction:UnboundedSourceRestriction{source=UnboundedEventSource(33000,
>  34000), checkpoint=GeneratorCheckpoint{numEvents=1000, 
> wallclockBaseTime=1600734382646}, watermark=294247-01-10T04:00:54.775Z}
> INFO: Reading: 
> //:713916102:element:ValueInGlobalWindow{value=UnboundedEventSource(0, 
> 100000), pane=PaneInfo.NO_FIRING}
> INFO: Reading: 
> //:713916102:restriction:UnboundedSourceRestriction{source=UnboundedEventSource(33000,
>  34000), checkpoint=GeneratorCheckpoint{numEvents=1000, 
> wallclockBaseTime=1600734382646}, watermark=294247-01-10T04:00:54.775Z}
> INFO: Setting: //:713916102:watermarkEstimatorState:294247-01-10T04:00:54.775Z
> INFO: Setting timer: 713916102 1:1600734383019// at 1600734383019 with output 
> time 1600734383019
> INFO: Reading: //:713916102:element:null
> INFO: Reading: //:713916102:restriction:null
> INFO: Reading: //:713916102:watermarkEstimatorState:null
> {noformat}
> This is for the byte[] key with hash 713916102 in the global window namespace.
> Note that in a rerun the keys will be different, since they are UUIDs, but 
> the pattern is the same.
> To rerun:
> Check out https://github.com/lukecwik/incubator-beam/tree/beam10670.2 for 
> the additional log output.
> Set up a Gradle task with task:
> {noformat}
> :sdks:java:testing:nexmark:run
> {noformat}
> with args:
> {noformat}
> -Pnexmark.runner=":runners:flink:1.11"
> -Pnexmark.args="     --runner=FlinkRunner     --suite=SMOKE     
> --streamTimeout=60     --streaming=true     --manageResources=false     
> --monitorJobs=true     --flinkMaster=[local] --numEvents=100000 --query=4 
> --checkpointingInterval=3000 --shutdownSourcesAfterIdleMs=60000"
> {noformat}
> Note: you may need to increase your console size (Editor>General>Console) to 
> capture enough output before it fails. I usually add a breakpoint on the 
> UserCodeException constructor to pause the program and inspect the state at 
> that point in time.
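
The lost-state pattern described above (a non-null "Setting" followed later by
a null "Reading" for the same key hash and state cell) can be picked out of a
long console capture mechanically. The following is only a sketch and is not
part of Beam: the class LostStateScanner and the method findLostState are
hypothetical names. It assumes the log format printed by the debugging branch,
with each record on a single unwrapped line, and it assumes that a deliberate
clear of state would be logged as a "Setting" of null. If that is not the
case, it will report false positives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Beam: scans the extra log output for
// state cells that were written non-null and later read back as null.
class LostStateScanner {
  // Matches e.g. "INFO: Reading: //:713916102:element:null".
  // "INFO: Setting timer: ..." lines do not match because of the " timer" part.
  private static final Pattern STATE_LINE =
      Pattern.compile("^INFO: (Setting|Reading): //:(-?\\d+):(\\w+):(.*)$");

  /** Returns "keyHash:cell" for every null read that follows a non-null write. */
  static List<String> findLostState(List<String> logLines) {
    Map<String, Boolean> hasValue = new HashMap<>();
    List<String> lost = new ArrayList<>();
    for (String line : logLines) {
      Matcher m = STATE_LINE.matcher(line.trim());
      if (!m.matches()) {
        continue; // not a state read/write record
      }
      String cell = m.group(2) + ":" + m.group(3);
      boolean isNull = m.group(4).equals("null");
      if (m.group(1).equals("Setting")) {
        // Remember whether the most recent write stored a value.
        hasValue.put(cell, !isNull);
      } else if (isNull && hasValue.getOrDefault(cell, false)) {
        // Read returned null although the last write was non-null.
        lost.add(cell);
      }
    }
    return lost;
  }

  public static void main(String[] args) {
    // Records taken from the issue description, joined onto single lines.
    List<String> sample = Arrays.asList(
        "INFO: Setting: //:713916102:element:ValueInGlobalWindow{value=UnboundedEventSource(0, 100000), pane=PaneInfo.NO_FIRING}",
        "INFO: Reading: //:713916102:element:ValueInGlobalWindow{value=UnboundedEventSource(0, 100000), pane=PaneInfo.NO_FIRING}",
        "INFO: Setting: //:713916102:watermarkEstimatorState:294247-01-10T04:00:54.775Z",
        "INFO: Setting timer: 713916102 1:1600734383019// at 1600734383019 with output time 1600734383019",
        "INFO: Reading: //:713916102:element:null",
        "INFO: Reading: //:713916102:watermarkEstimatorState:null");
    System.out.println(findLostState(sample));
  }
}
```

Keying on the hash plus the cell name keeps the check independent of the
UUID-based keys, so the same scan should apply to a rerun where the hashes
differ.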



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
