[ 
https://issues.apache.org/jira/browse/FLINK-8431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330244#comment-16330244
 ] 

ASF GitHub Bot commented on FLINK-8431:
---------------------------------------

GitHub user eastcirclek opened a pull request:

    https://github.com/apache/flink/pull/5307

    [FLINK-8431] [mesos] Allow to specify # GPUs for TaskManager in Mesos

    ## What is the purpose of the change
    
    This PR introduces a new configuration property, 
"mesos.resourcemanager.tasks.gpus", that lets users specify the number of GPUs 
for each TaskManager process in Mesos. The property is necessary because, when 
Mesos agents are configured to isolate GPUs as shown in [1], TaskManagers that 
do not explicitly request GPUs cannot see any GPUs at all.
    
    [1] http://mesos.apache.org/documentation/latest/gpu-support/#agent-flags
    
    ## Brief change log
    
    * Modify MesosTaskManagerParameters rather than 
ContaineredTaskManagerParameters to confine the change to Mesos
    * Extend two data types with GPU information: (1) offers from Mesos and 
(2) task requests of Mesos frameworks
    * Add GPU_RESOURCES to the list of framework capabilities if 
"mesos.resourcemanager.tasks.gpus" > 0. Otherwise, LaunchCoordinator gets no 
offers from Mesos masters that are configured to withhold resource offers for 
GPU-equipped agents from frameworks without GPU_RESOURCES.
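
The capability rule in the last bullet can be sketched as follows. This is a 
minimal Python model, not the Java/Scala code in the PR: the helper names 
(parse_flat_config, requested_gpus, framework_capabilities) are hypothetical, 
and treating an unset key as 0.0 is an assumption based on the test scenarios 
in this description.

```python
from typing import Dict, List

GPUS_KEY = "mesos.resourcemanager.tasks.gpus"  # property introduced by this PR


def parse_flat_config(text: str) -> Dict[str, str]:
    """Parse simple 'key: value' lines, skipping blanks and '#' comments."""
    conf: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        conf[key.strip()] = value.strip()
    return conf


def requested_gpus(conf: Dict[str, str]) -> float:
    """GPUs per TaskManager; an unset key is treated as 0.0 (assumption)."""
    return float(conf.get(GPUS_KEY, "0.0"))


def framework_capabilities(conf: Dict[str, str]) -> List[str]:
    """Advertise GPU_RESOURCES only when a positive GPU count is requested."""
    return ["GPU_RESOURCES"] if requested_gpus(conf) > 0 else []
```

With "mesos.resourcemanager.tasks.gpus: 2" in the configuration text the 
sketch returns a list containing GPU_RESOURCES; with the key absent or set to 
0.0 it returns an empty list.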
    
    ## Verifying this change
    
    I tested it by launching a standalone Flink cluster using 
./bin/mesos-appmaster.sh. I tested the following scenarios with Mesos 
configured with --filter_gpu_resources.
    
    * When mesos.resourcemanager.tasks.gpus is not specified or is set to 0.0:
    LaunchCoordinator is not given any offers because MesosFlinkResourceManager 
does not enable the GPU_RESOURCES capability in either case.
    * When mesos.resourcemanager.tasks.gpus is smaller than or equal to the 
number of GPUs available on a node:
    Given offers, LaunchCoordinator aggregates the offers of different roles 
from the same node and passes the aggregated offers to Fenzo, which schedules 
resources across nodes. When Fenzo reports that scheduling succeeded, 
LaunchCoordinator allocates resources of different roles to tasks and 
populates Protos.TaskInfo with the allocated resources, which is then sent to 
the Mesos master.
    * When mesos.resourcemanager.tasks.gpus is larger than the number of GPUs 
available on a node:
    Given offers, LaunchCoordinator aggregates the offers of different roles 
from the same node and passes the aggregated offers to Fenzo. However, Fenzo 
notifies LaunchCoordinator that scheduling failed, with the following message:
        AssignmentFailure {resource=Other, asking=3.0, used=0.0, available=2.0, 
message=gpus}.
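
The last two scenarios can be illustrated with a small Python sketch. It only 
models the arithmetic described above (summing GPU offers per node across 
roles, then comparing the request with the total); it does not use the Fenzo or 
Mesos APIs. The Offer tuple shape, the role names, and both function names are 
hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical offer shape: (hostname, role, gpus offered)
Offer = Tuple[str, str, float]


def aggregate_gpus(offers: List[Offer]) -> Dict[str, float]:
    """Sum the GPUs offered per host across all roles."""
    total: Dict[str, float] = defaultdict(float)
    for host, _role, gpus in offers:
        total[host] += gpus
    return dict(total)


def check_assignment(asking: float, available: float) -> Optional[Dict[str, float]]:
    """Return None if the request fits, else the asking/available pair."""
    if asking <= available:
        return None
    return {"asking": asking, "available": available}
```

For a node that offers one GPU under each of two roles, the aggregate is 2.0: 
a request for 2.0 GPUs fits, while a request for 3.0 yields the same asking and 
available values as in the failure message quoted above.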
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): yes, it includes an 
upgrade (Fenzo)
      - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
      - The serializers: no
      - The runtime per-record code paths (performance sensitive): no
      - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: yes (JobManager and 
TaskManager on Mesos)
      - The S3 file system connector: no
    
    ## Documentation
    
      - Does this pull request introduce a new feature? no


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/eastcirclek/flink FLINK-8431

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5307.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5307
    
----
commit 1dc73302eb6e0c02ad785fa3711a9b8d0f12c9a9
Author: eastcirclek <eastcirclek@...>
Date:   2018-01-17T10:47:15Z

    [FLINK-8431] Allow to specify # GPUs for TaskManager in Mesos

----


> Allow to specify # GPUs for TaskManager in Mesos
> ------------------------------------------------
>
>                 Key: FLINK-8431
>                 URL: https://issues.apache.org/jira/browse/FLINK-8431
>             Project: Flink
>          Issue Type: Improvement
>          Components: Cluster Management, Mesos
>            Reporter: Dongwon Kim
>            Assignee: Dongwon Kim
>            Priority: Minor
>
> Mesos provides first-class support for Nvidia GPUs [1], but Flink does not 
> exploit it when scheduling TaskManagers. If Mesos agents are configured to 
> isolate GPUs as shown in [2], TaskManagers that do not specify to use GPUs 
> cannot see GPUs at all.
> We, therefore, need to introduce a new configuration property named 
> "mesos.resourcemanager.tasks.gpus" to allow users to specify # of GPUs for 
> each TaskManager process in Mesos.
> [1] http://mesos.apache.org/documentation/latest/gpu-support/
> [2] http://mesos.apache.org/documentation/latest/gpu-support/#agent-flags



