[jira] [Work logged] (BEAM-5110) Reconile Flink JVM singleton management with deployment

ASF GitHub Bot (JIRA) Wed, 08 Aug 2018 16:17:23 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-5110?focusedWorklogId=132764&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-132764
 ]


ASF GitHub Bot logged work on BEAM-5110:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/Aug/18 23:16
            Start Date: 08/Aug/18 23:16
    Worklog Time Spent: 10m 
      Work Description: bsidhom commented on a change in pull request #6189: 
[BEAM-5110] Explicitly count the references for 
BatchFlinkExecutableStageContext …
URL: https://github.com/apache/beam/pull/6189#discussion_r208763438
 
 

 ##########
 File path: 
runners/flink/src/main/java/org/apache/beam/runners/flink/translation/functions/FlinkExecutableStageContext.java
 ##########
 @@ -32,11 +32,23 @@
    * defined at translation time and distributed to TaskManagers.
    */
   interface Factory extends Serializable {
+
+    /**
+     * Get or create {@link FlinkExecutableStageContext} for given {@link 
JobInfo}. {@link
+     * Factory#release(JobInfo)} should be called for each get request for the 
jobInfo.
+     */
     FlinkExecutableStageContext get(JobInfo jobInfo);
+
+    /**
+     * Release the context for the jobInfo. This method should be called for 
each call to {@link
+     * Factory#get(JobInfo)}. This method should always be called after the 
call to {@link
+     * Factory#get(JobInfo)} is completed.
+     */
+    void release(JobInfo jobInfo);
 
 Review comment:
   I prefer `close()` as that allows us to idiomatically us a 
try-with-resources block. However, this should _not_ live on the factory 
itself, but on the Context that is returned. Doing so would be less error prone 
for clients.
   
   The main difficulty is that instead of returning a direct reference to a 
cached context, the refcounted cache instead needs to return a refcounting 
wrapper around the underlying implementation. Doing so would have the added 
benefit of allowing us to prevent double-decrements and could allow us to make 
close() idempotent.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 132764)
    Time Spent: 0.5h  (was: 20m)

> Reconile Flink JVM singleton management with deployment
> -------------------------------------------------------
>
>                 Key: BEAM-5110
>                 URL: https://issues.apache.org/jira/browse/BEAM-5110
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-flink
>            Reporter: Ben Sidhom
>            Assignee: Ben Sidhom
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [~angoenka] noticed through debugging that multiple instances of 
> BatchFlinkExecutableStageContext.BatchFactory are loaded for a given job when 
> executing in standalone cluster mode. This context factory is responsible for 
> maintaining singleton state across a TaskManager (JVM) in order to share SDK 
> Environments across workers in a given job. The multiple-loading breaks 
> singleton semantics and results in an indeterminate number of Environments 
> being created.
> It turns out that the [Flink classloading 
> mechanism|https://ci.apache.org/projects/flink/flink-docs-release-1.5/monitoring/debugging_classloading.html]
>  is determined by deployment mode. Note that "user code" as referenced by 
> this link is actually the Flink job server jar. Actual end-user code lives 
> inside of the SDK Environment and uploaded artifacts.
> In order to maintain singletons without resorting to IPC (for example, using 
> file locks and/or additional gRPC servers), we need to force non-dynamic 
> classloading. For example, this happens when jobs are submitted to YARN for 
> one-off deployments via `flink run`. However, connecting to an existing 
> (Flink standalone) deployment results in dynamic classloading.
> We should investigate this behavior and either document (and attempt to 
> enforce) deployment modes that are consistent with our requirements, or (if 
> possible) create a custom classloader that enforces singleton loading.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Work logged] (BEAM-5110) Reconile Flink JVM singleton management with deployment

Reply via email to