cdmikechen created SUBMARINE-1376:
-------------------------------------

             Summary: XGBoost experiment pods will be deleted so that submarine can not get logs
                 Key: SUBMARINE-1376
                 URL: https://issues.apache.org/jira/browse/SUBMARINE-1376
             Project: Apache Submarine
          Issue Type: Bug
            Reporter: cdmikechen


After I submitted an XGBoost task with the following JSON, Submarine was able 
to monitor the status of the task correctly. 
POST http://127.0.0.1:32080/api/v1/experiment
{code:json}
{
    "meta": {
        "name": "xgboost-example",
        "tags": [],
        "framework": "Xgboost",
        "cmd": "python /opt/mlkube/main.py --job_type=Train --xgboost_parameter=objective:multi:softprob,num_class:3 --n_estimators=10 --learning_rate=0.1 --model_path=/tmp/xgboost-model --model_storage_type=local",
        "envVars": {}
    },
    "environment": {
        "image": "docker.io/merlintang/xgboost-dist-iris:1.1"
    },
    "spec": {
        "Worker": {
            "replicas": 2,
            "resources": "cpu=0.5,nvidia.com/gpu=0,memory=512M"
        },
        "Master": {
            "replicas": 1,
            "resources": "cpu=0.5,nvidia.com/gpu=0,memory=512M"
        }
    }
}
{code}
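
For anyone trying to reproduce this, the request above could be sent with a short script along these lines. This is only a sketch: the URL and payload are copied from this report, while the function names and the `Content-Type` header are my own assumptions and not part of any Submarine client.

```python
import json
import urllib.request

# Endpoint copied from the report; adjust host and port for your cluster.
EXPERIMENT_URL = "http://127.0.0.1:32080/api/v1/experiment"


def build_experiment_spec():
    """Return the experiment spec from the report as a Python dict."""
    resources = "cpu=0.5,nvidia.com/gpu=0,memory=512M"
    cmd = (
        "python /opt/mlkube/main.py --job_type=Train "
        "--xgboost_parameter=objective:multi:softprob,num_class:3 "
        "--n_estimators=10 --learning_rate=0.1 "
        "--model_path=/tmp/xgboost-model --model_storage_type=local"
    )
    return {
        "meta": {
            "name": "xgboost-example",
            "tags": [],
            "framework": "Xgboost",
            "cmd": cmd,
            "envVars": {},
        },
        "environment": {"image": "docker.io/merlintang/xgboost-dist-iris:1.1"},
        "spec": {
            "Worker": {"replicas": 2, "resources": resources},
            "Master": {"replicas": 1, "resources": resources},
        },
    }


def build_request(spec, url=EXPERIMENT_URL):
    """Build (but do not send) the POST request for the given spec."""
    body = json.dumps(spec).encode("utf-8")
    # The Content-Type header is an assumption; the report does not show headers.
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def submit_experiment(request):
    """Send the request and return (HTTP status, response body as text)."""
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")
```

Calling `submit_experiment(build_request(build_experiment_spec()))` against a running Submarine server should create the experiment described in this issue.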

However, after the task finished, I found that the training-operator had 
deleted the pods. As a result, Submarine could no longer determine the names of 
the pods that had run or retrieve the logs of each pod.

I checked the training-operator (1.4.0) logs and found the following:

{code}
time="2023-04-01T09:26:31Z" level=info msg="xgboostJob 
experiment-1680334381873-0006 is created."
time="2023-04-01T09:26:31Z" level=info msg="Reconciling for job 
experiment-1680334381873-0006"
time="2023-04-01T09:26:31Z" level=info msg="Need to create new pod: worker-0" 
job=submarine.experiment-1680334381873-0006 replica-type=worker 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
time="2023-04-01T09:26:31Z" level=info msg="Controller 
experiment-1680334381873-0006 created pod 
experiment-1680334381873-0006-worker-0" job=.experiment-1680334381873-0006 
pod=.experiment-1680334381873-0006-worker-0 uid=
time="2023-04-01T09:26:31Z" level=info msg="Need to create new pod: worker-1" 
job=submarine.experiment-1680334381873-0006 replica-type=worker 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
2023-04-01T09:26:31.270Z        DEBUG   controller-runtime.manager.events       
Normal  {"object": 
{"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"},
 "reason": "SuccessfulCreatePod", "message": "Created pod: 
experiment-1680334381873-0006-worker-0"}
time="2023-04-01T09:26:31Z" level=info msg="Controller 
experiment-1680334381873-0006 created pod 
experiment-1680334381873-0006-worker-1" job=.experiment-1680334381873-0006 
pod=.experiment-1680334381873-0006-worker-1 uid=
time="2023-04-01T09:26:31Z" level=info msg="need to create new service: 
Worker-0" job=submarine.experiment-1680334381873-0006 replica-type=worker 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
2023-04-01T09:26:31.307Z        DEBUG   controller-runtime.manager.events       
Normal  {"object": 
{"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"},
 "reason": "SuccessfulCreatePod", "message": "Created pod: 
experiment-1680334381873-0006-worker-1"}
time="2023-04-01T09:26:31Z" level=info msg="Controller 
experiment-1680334381873-0006 created service 
experiment-1680334381873-0006-worker-0"
time="2023-04-01T09:26:31Z" level=info msg="need to create new service: 
Worker-1" job=submarine.experiment-1680334381873-0006 replica-type=worker 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
2023-04-01T09:26:31.344Z        DEBUG   controller-runtime.manager.events       
Normal  {"object": 
{"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"},
 "reason": "SuccessfulCreateService", "message": "Created service: 
experiment-1680334381873-0006-worker-0"}
time="2023-04-01T09:26:31Z" level=info msg="Controller 
experiment-1680334381873-0006 created service 
experiment-1680334381873-0006-worker-1"
time="2023-04-01T09:26:31Z" level=info msg="Need to create new pod: master-0" 
job=submarine.experiment-1680334381873-0006 replica-type=master 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
2023-04-01T09:26:31.410Z        DEBUG   controller-runtime.manager.events       
Normal  {"object": 
{"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"},
 "reason": "SuccessfulCreateService", "message": "Created service: 
experiment-1680334381873-0006-worker-1"}
time="2023-04-01T09:26:31Z" level=info msg="Controller 
experiment-1680334381873-0006 created pod 
experiment-1680334381873-0006-master-0" job=.experiment-1680334381873-0006 
pod=.experiment-1680334381873-0006-master-0 uid=
time="2023-04-01T09:26:31Z" level=info msg="need to create new service: 
Master-0" job=submarine.experiment-1680334381873-0006 replica-type=master 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
2023-04-01T09:26:31.462Z        DEBUG   controller-runtime.manager.events       
Normal  {"object": 
{"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"},
 "reason": "SuccessfulCreatePod", "message": "Created pod: 
experiment-1680334381873-0006-master-0"}
time="2023-04-01T09:26:31Z" level=info msg="Controller 
experiment-1680334381873-0006 created service 
experiment-1680334381873-0006-master-0"
time="2023-04-01T09:26:31Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, 
running=0, succeeded=0 , failed=0"
time="2023-04-01T09:26:31Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, 
running=0, succeeded=0 , failed=0"
time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob 
experiment-1680334381873-0006 is running." 
job=submarine.experiment-1680334381873-0006 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
2023-04-01T09:26:31.487Z        DEBUG   controller-runtime.manager.events       
Normal  {"object": 
{"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"},
 "reason": "SuccessfulCreateService", "message": "Created service: 
experiment-1680334381873-0006-master-0"}
time="2023-04-01T09:26:31Z" level=info msg="Reconciling for job 
experiment-1680334381873-0006"
time="2023-04-01T09:26:31Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, 
running=0, succeeded=0 , failed=0"
time="2023-04-01T09:26:31Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, 
running=0, succeeded=0 , failed=0"
time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob 
experiment-1680334381873-0006 is running." 
job=submarine.experiment-1680334381873-0006 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
time="2023-04-01T09:26:31Z" level=error msg="Operation cannot be fulfilled on 
xgboostjobs.kubeflow.org \"experiment-1680334381873-0006\": the object has been 
modified; please apply your changes to the latest version and try againfailed 
to update XGBoost Job conditions in the API server" 
job=submarine.experiment-1680334381873-0006 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
2023-04-01T09:26:31.538Z        ERROR   controllers.XGBoostJob  Reconcile 
XGBoost Job error     {"xgboostjob": "submarine/experiment-1680334381873-0006", 
"error": "Operation cannot be fulfilled on xgboostjobs.kubeflow.org 
\"experiment-1680334381873-0006\": the object has been modified; please apply 
your changes to the latest version and try again"}
2023-04-01T09:26:31.538Z        ERROR   
controller-runtime.manager.controller.xgboostjob-controller     Reconciler 
error        {"name": "experiment-1680334381873-0006", "namespace": 
"submarine", "error": "Operation cannot be fulfilled on 
xgboostjobs.kubeflow.org \"experiment-1680334381873-0006\": the object has been 
modified; please apply your changes to the latest version and try again"}
time="2023-04-01T09:26:31Z" level=info msg="Reconciling for job 
experiment-1680334381873-0006"
time="2023-04-01T09:26:31Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, 
running=0, succeeded=0 , failed=0"
time="2023-04-01T09:26:31Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, 
running=0, succeeded=0 , failed=0"
time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob 
experiment-1680334381873-0006 is running." 
job=submarine.experiment-1680334381873-0006 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
time="2023-04-01T09:26:31Z" level=info msg="Reconciling for job 
experiment-1680334381873-0006"
time="2023-04-01T09:26:31Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, 
running=0, succeeded=0 , failed=0"
time="2023-04-01T09:26:31Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, 
running=0, succeeded=0 , failed=0"
time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob 
experiment-1680334381873-0006 is running." 
job=submarine.experiment-1680334381873-0006 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job 
experiment-1680334381873-0006"
time="2023-04-01T09:26:33Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, 
running=0, succeeded=0 , failed=0"
time="2023-04-01T09:26:33Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, 
running=1, succeeded=0 , failed=0"
time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob 
experiment-1680334381873-0006 is running." 
job=submarine.experiment-1680334381873-0006 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job 
experiment-1680334381873-0006"
time="2023-04-01T09:26:33Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, 
running=0, succeeded=0 , failed=0"
time="2023-04-01T09:26:33Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, 
running=1, succeeded=0 , failed=0"
time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob 
experiment-1680334381873-0006 is running." 
job=submarine.experiment-1680334381873-0006 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job 
experiment-1680334381873-0006"
time="2023-04-01T09:26:33Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, 
running=1, succeeded=0 , failed=0"
time="2023-04-01T09:26:33Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, 
running=1, succeeded=0 , failed=0"
time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob 
experiment-1680334381873-0006 is running." 
job=submarine.experiment-1680334381873-0006 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job 
experiment-1680334381873-0006"
time="2023-04-01T09:26:33Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, 
running=1, succeeded=0 , failed=0"
time="2023-04-01T09:26:33Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, 
running=1, succeeded=0 , failed=0"
time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob 
experiment-1680334381873-0006 is running." 
job=submarine.experiment-1680334381873-0006 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job 
experiment-1680334381873-0006"
time="2023-04-01T09:26:33Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, 
running=2, succeeded=0 , failed=0"
time="2023-04-01T09:26:33Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, 
running=1, succeeded=0 , failed=0"
time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob 
experiment-1680334381873-0006 is running." 
job=submarine.experiment-1680334381873-0006 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job 
experiment-1680334381873-0006"
time="2023-04-01T09:26:33Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, 
running=1, succeeded=0 , failed=0"
time="2023-04-01T09:26:33Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, 
running=2, succeeded=0 , failed=0"
time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob 
experiment-1680334381873-0006 is running." 
job=submarine.experiment-1680334381873-0006 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
time="2023-04-01T09:27:04Z" level=info msg="Reconciling for job 
experiment-1680334381873-0006"
time="2023-04-01T09:27:04Z" level=info msg="Ignoring inactive pod 
submarine/experiment-1680334381873-0006-master-0 in state Succeeded, deletion 
time <nil>"
time="2023-04-01T09:27:04Z" level=info msg="Pod: 
submarine.experiment-1680334381873-0006-master-0 exited with code 0" 
job=submarine.experiment-1680334381873-0006 replica-type=master 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
time="2023-04-01T09:27:04Z" level=info 
msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=0, 
running=0, succeeded=1 , failed=0"
time="2023-04-01T09:27:04Z" level=info msg="XGBoostJob 
experiment-1680334381873-0006 is successfully completed."
2023-04-01T09:27:04.010Z        DEBUG   controller-runtime.manager.events       
Normal  {"object": 
{"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40623"},
 "reason": "ExitedWithCode", "message": "Pod: 
submarine.experiment-1680334381873-0006-master-0 exited with code 0"}
2023-04-01T09:27:04.010Z        DEBUG   controller-runtime.manager.events       
Normal  {"object": 
{"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40623"},
 "reason": "XGBoostJobSucceeded", "message": "XGBoostJob 
experiment-1680334381873-0006 is successfully completed."}
time="2023-04-01T09:27:04Z" level=info msg="Reconciling for job 
experiment-1680334381873-0006"
time="2023-04-01T09:27:04Z" level=info msg="Controller 
experiment-1680334381873-0006 deleting pod 
submarine/experiment-1680334381873-0006-worker-1" 
job=submarine.experiment-1680334381873-0006 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
2023-04-01T09:27:04.067Z        DEBUG   controller-runtime.manager.events       
Normal  {"object": 
{"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"},
 "reason": "SuccessfulDeletePod", "message": "Deleted pod: 
experiment-1680334381873-0006-worker-1"}
time="2023-04-01T09:27:04Z" level=info msg="Controller 
experiment-1680334381873-0006 deleting service 
submarine/experiment-1680334381873-0006-worker-1"
2023-04-01T09:27:04.113Z        DEBUG   controller-runtime.manager.events       
Normal  {"object": 
{"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"},
 "reason": "SuccessfulDeleteService", "message": "Deleted service: 
experiment-1680334381873-0006-worker-1"}
time="2023-04-01T09:27:04Z" level=info msg="Controller 
experiment-1680334381873-0006 deleting pod 
submarine/experiment-1680334381873-0006-worker-0" 
job=submarine.experiment-1680334381873-0006 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
2023-04-01T09:27:04.145Z        DEBUG   controller-runtime.manager.events       
Normal  {"object": 
{"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"},
 "reason": "SuccessfulDeletePod", "message": "Deleted pod: 
experiment-1680334381873-0006-worker-0"}
time="2023-04-01T09:27:04Z" level=info msg="Controller 
experiment-1680334381873-0006 deleting service 
submarine/experiment-1680334381873-0006-worker-0"
2023-04-01T09:27:04.162Z        DEBUG   controller-runtime.manager.events       
Normal  {"object": 
{"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"},
 "reason": "SuccessfulDeleteService", "message": "Deleted service: 
experiment-1680334381873-0006-worker-0"}
time="2023-04-01T09:27:04Z" level=info msg="Controller 
experiment-1680334381873-0006 deleting pod 
submarine/experiment-1680334381873-0006-master-0" 
job=submarine.experiment-1680334381873-0006 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
2023-04-01T09:27:04.175Z        DEBUG   controller-runtime.manager.events       
Normal  {"object": 
{"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"},
 "reason": "SuccessfulDeletePod", "message": "Deleted pod: 
experiment-1680334381873-0006-master-0"}
time="2023-04-01T09:27:04Z" level=info msg="Controller 
experiment-1680334381873-0006 deleting service 
submarine/experiment-1680334381873-0006-master-0"
2023-04-01T09:27:04.185Z        DEBUG   controller-runtime.manager.events       
Normal  {"object": 
{"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"},
 "reason": "SuccessfulDeleteService", "message": "Deleted service: 
experiment-1680334381873-0006-master-0"}
time="2023-04-01T09:27:04Z" level=info msg="Reconciling for job 
experiment-1680334381873-0006"
time="2023-04-01T09:27:04Z" level=info msg="pod 
submarine/experiment-1680334381873-0006-worker-1 is terminating, skip deleting" 
job=submarine.experiment-1680334381873-0006 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
time="2023-04-01T09:27:13Z" level=info msg="Reconciling for job 
experiment-1680334381873-0006"
time="2023-04-01T09:27:13Z" level=info msg="pod 
submarine/experiment-1680334381873-0006-worker-1 is terminating, skip deleting" 
job=submarine.experiment-1680334381873-0006 
uid=20673c7b-e336-4ab0-b584-7453bc6b3234
time="2023-04-01T09:27:13Z" level=info msg="Reconciling for job 
experiment-1680334381873-0006"
{code}
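
To make the deletion sequence in the operator log above easier to see, the event records can be filtered with a small script. This is a rough sketch that assumes the events keep the `"reason": ..., "message": ...` layout shown in this log; the function names are hypothetical helpers, not part of the training-operator.

```python
import re

# Matches the "reason"/"message" pair of a controller-runtime event record.
# \s* and [^"]* also span newlines, so mail-wrapped log text still matches.
EVENT_RE = re.compile(
    r'"reason":\s*"(?P<reason>\w+)",\s*"message":\s*"(?P<message>[^"]*)"'
)


def parse_events(log_text):
    """Return (reason, message) tuples in the order they appear in the log."""
    events = []
    for match in EVENT_RE.finditer(log_text):
        # Collapse whitespace introduced by line wrapping inside the message.
        message = " ".join(match.group("message").split())
        events.append((match.group("reason"), message))
    return events


def deleted_pods(log_text):
    """Return the names of pods reported in SuccessfulDeletePod events."""
    prefix = "Deleted pod: "
    return [
        message[len(prefix):]
        for reason, message in parse_events(log_text)
        if reason == "SuccessfulDeletePod" and message.startswith(prefix)
    ]
```

Run against the log in this report, `deleted_pods` should list worker-1, worker-0 and master-0, which matches the observation that every pod of the experiment, including the succeeded master, is removed once the job completes.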



