[ https://issues.apache.org/jira/browse/SUBMARINE-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on SUBMARINE-1376 stopped by cdmikechen.
---------------------------------------------

> XGBoost experiment pods will be deleted so that submarine can not get logs
> --------------------------------------------------------------------------
>
> Key: SUBMARINE-1376
> URL: https://issues.apache.org/jira/browse/SUBMARINE-1376
> Project: Apache Submarine
> Issue Type: Bug
> Components: experiment
> Reporter: cdmikechen
> Assignee: cdmikechen
> Priority: Blocker
> Attachments: submarine-xgboost-pods.jpg
>
>
> After I submitted an XGBoost task with the following JSON, Submarine was
> able to monitor the status of the task correctly.
> POST http://127.0.0.1:32080/api/v1/experiment
> {code:json}
> {
>   "meta": {
>     "name": "xgboost-example",
>     "tags": [],
>     "framework": "Xgboost",
>     "cmd": "python /opt/mlkube/main.py --job_type=Train --xgboost_parameter=objective:multi:softprob,num_class:3 --n_estimators=10 --learning_rate=0.1 --model_path=/tmp/xgboost-model --model_storage_type=local",
>     "envVars": {}
>   },
>   "environment": {
>     "image": "docker.io/merlintang/xgboost-dist-iris:1.1"
>   },
>   "spec": {
>     "Worker": {
>       "replicas": 2,
>       "resources": "cpu=0.5,nvidia.com/gpu=0,memory=512M"
>     },
>     "Master": {
>       "replicas": 1,
>       "resources": "cpu=0.5,nvidia.com/gpu=0,memory=512M"
>     }
>   }
> }
> {code}
> However, once the task finished, the training-operator deleted the pods.
> As a result, Submarine could no longer confirm the names of the pods that
> had run or the logging status of each pod.
> I checked the training-operator (1.4.0) and found these logs:
> {code}
> time="2023-04-01T09:26:31Z" level=info msg="xgboostJob experiment-1680334381873-0006 is created."
> time="2023-04-01T09:26:31Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
> time="2023-04-01T09:26:31Z" level=info msg="Need to create new pod: worker-0" job=submarine.experiment-1680334381873-0006 replica-type=worker uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:31Z" level=info msg="Controller experiment-1680334381873-0006 created pod experiment-1680334381873-0006-worker-0" job=.experiment-1680334381873-0006 pod=.experiment-1680334381873-0006-worker-0 uid=
> time="2023-04-01T09:26:31Z" level=info msg="Need to create new pod: worker-1" job=submarine.experiment-1680334381873-0006 replica-type=worker uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:26:31.270Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"}, "reason": "SuccessfulCreatePod", "message": "Created pod: experiment-1680334381873-0006-worker-0"}
> time="2023-04-01T09:26:31Z" level=info msg="Controller experiment-1680334381873-0006 created pod experiment-1680334381873-0006-worker-1" job=.experiment-1680334381873-0006 pod=.experiment-1680334381873-0006-worker-1 uid=
> time="2023-04-01T09:26:31Z" level=info msg="need to create new service: Worker-0" job=submarine.experiment-1680334381873-0006 replica-type=worker uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:26:31.307Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"}, "reason": "SuccessfulCreatePod", "message": "Created pod: experiment-1680334381873-0006-worker-1"}
> time="2023-04-01T09:26:31Z" level=info msg="Controller experiment-1680334381873-0006 created service experiment-1680334381873-0006-worker-0"
> time="2023-04-01T09:26:31Z" level=info msg="need to create new service: Worker-1" job=submarine.experiment-1680334381873-0006 replica-type=worker uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:26:31.344Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"}, "reason": "SuccessfulCreateService", "message": "Created service: experiment-1680334381873-0006-worker-0"}
> time="2023-04-01T09:26:31Z" level=info msg="Controller experiment-1680334381873-0006 created service experiment-1680334381873-0006-worker-1"
> time="2023-04-01T09:26:31Z" level=info msg="Need to create new pod: master-0" job=submarine.experiment-1680334381873-0006 replica-type=master uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:26:31.410Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"}, "reason": "SuccessfulCreateService", "message": "Created service: experiment-1680334381873-0006-worker-1"}
> time="2023-04-01T09:26:31Z" level=info msg="Controller experiment-1680334381873-0006 created pod experiment-1680334381873-0006-master-0" job=.experiment-1680334381873-0006 pod=.experiment-1680334381873-0006-master-0 uid=
> time="2023-04-01T09:26:31Z" level=info msg="need to create new service: Master-0" job=submarine.experiment-1680334381873-0006 replica-type=master uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:26:31.462Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"}, "reason": "SuccessfulCreatePod", "message": "Created pod: experiment-1680334381873-0006-master-0"}
> time="2023-04-01T09:26:31Z" level=info msg="Controller experiment-1680334381873-0006 created service experiment-1680334381873-0006-master-0"
> time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:26:31.487Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"}, "reason": "SuccessfulCreateService", "message": "Created service: experiment-1680334381873-0006-master-0"}
> time="2023-04-01T09:26:31Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
> time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:31Z" level=error msg="Operation cannot be fulfilled on xgboostjobs.kubeflow.org \"experiment-1680334381873-0006\": the object has been modified; please apply your changes to the latest version and try againfailed to update XGBoost Job conditions in the API server" job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:26:31.538Z ERROR controllers.XGBoostJob Reconcile XGBoost Job error {"xgboostjob": "submarine/experiment-1680334381873-0006", "error": "Operation cannot be fulfilled on xgboostjobs.kubeflow.org \"experiment-1680334381873-0006\": the object has been modified; please apply your changes to the latest version and try again"}
> 2023-04-01T09:26:31.538Z ERROR controller-runtime.manager.controller.xgboostjob-controller Reconciler error {"name": "experiment-1680334381873-0006", "namespace": "submarine", "error": "Operation cannot be fulfilled on xgboostjobs.kubeflow.org \"experiment-1680334381873-0006\": the object has been modified; please apply your changes to the latest version and try again"}
> time="2023-04-01T09:26:31Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
> time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:31Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
> time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=1, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=1, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=1, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=1, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=1, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=1, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=1, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=1, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:27:04Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
> time="2023-04-01T09:27:04Z" level=info msg="Ignoring inactive pod submarine/experiment-1680334381873-0006-master-0 in state Succeeded, deletion time <nil>"
> time="2023-04-01T09:27:04Z" level=info msg="Pod: submarine.experiment-1680334381873-0006-master-0 exited with code 0" job=submarine.experiment-1680334381873-0006 replica-type=master uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:27:04Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=0, running=0, succeeded=1 , failed=0"
> time="2023-04-01T09:27:04Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is successfully completed."
> 2023-04-01T09:27:04.010Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40623"}, "reason": "ExitedWithCode", "message": "Pod: submarine.experiment-1680334381873-0006-master-0 exited with code 0"}
> 2023-04-01T09:27:04.010Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40623"}, "reason": "XGBoostJobSucceeded", "message": "XGBoostJob experiment-1680334381873-0006 is successfully completed."}
> time="2023-04-01T09:27:04Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
> time="2023-04-01T09:27:04Z" level=info msg="Controller experiment-1680334381873-0006 deleting pod submarine/experiment-1680334381873-0006-worker-1" job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:27:04.067Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"}, "reason": "SuccessfulDeletePod", "message": "Deleted pod: experiment-1680334381873-0006-worker-1"}
> time="2023-04-01T09:27:04Z" level=info msg="Controller experiment-1680334381873-0006 deleting service submarine/experiment-1680334381873-0006-worker-1"
> 2023-04-01T09:27:04.113Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"}, "reason": "SuccessfulDeleteService", "message": "Deleted service: experiment-1680334381873-0006-worker-1"}
> time="2023-04-01T09:27:04Z" level=info msg="Controller experiment-1680334381873-0006 deleting pod submarine/experiment-1680334381873-0006-worker-0" job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:27:04.145Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"}, "reason": "SuccessfulDeletePod", "message": "Deleted pod: experiment-1680334381873-0006-worker-0"}
> time="2023-04-01T09:27:04Z" level=info msg="Controller experiment-1680334381873-0006 deleting service submarine/experiment-1680334381873-0006-worker-0"
> 2023-04-01T09:27:04.162Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"}, "reason": "SuccessfulDeleteService", "message": "Deleted service: experiment-1680334381873-0006-worker-0"}
> time="2023-04-01T09:27:04Z" level=info msg="Controller experiment-1680334381873-0006 deleting pod submarine/experiment-1680334381873-0006-master-0" job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:27:04.175Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"}, "reason": "SuccessfulDeletePod", "message": "Deleted pod: experiment-1680334381873-0006-master-0"}
> time="2023-04-01T09:27:04Z" level=info msg="Controller experiment-1680334381873-0006 deleting service submarine/experiment-1680334381873-0006-master-0"
> 2023-04-01T09:27:04.185Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"}, "reason": "SuccessfulDeleteService", "message": "Deleted service: experiment-1680334381873-0006-master-0"}
> time="2023-04-01T09:27:04Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
> time="2023-04-01T09:27:04Z" level=info msg="pod submarine/experiment-1680334381873-0006-worker-1 is terminating, skip deleting" job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:27:13Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
> time="2023-04-01T09:27:13Z" level=info msg="pod submarine/experiment-1680334381873-0006-worker-1 is terminating, skip deleting" job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:27:13Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
> {code}
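To make the pod lifecycle in the operator log easier to see, here is a minimal Python sketch that lists which pods the controller created and which it later deleted. It is not part of Submarine or the training-operator. The function name `summarize_pod_lifecycle` is hypothetical, and the regular expressions are assumptions based only on the message wording in the log excerpt above; other operator versions may word these messages differently.

```python
import re

# Assumed patterns, taken from the wording of the log excerpt in this issue:
#   msg="Controller <job> created pod <pod>"
#   msg="Controller <job> deleting pod <namespace>/<pod>"
CREATED_POD = re.compile(r'msg="Controller \S+ created pod ([^\s"]+)"')
DELETING_POD = re.compile(r'msg="Controller \S+ deleting pod (?:[^\s"/]+/)?([^\s"]+)"')


def summarize_pod_lifecycle(log_text):
    """Return (created, deleted) lists of pod names in first-seen order."""
    created, deleted = [], []
    for line in log_text.splitlines():
        match = CREATED_POD.search(line)
        if match and match.group(1) not in created:
            created.append(match.group(1))
        match = DELETING_POD.search(line)
        if match and match.group(1) not in deleted:
            deleted.append(match.group(1))
    return created, deleted


def pods_without_logs(log_text):
    """Pods that were created and then deleted, so their logs are gone."""
    created, deleted = summarize_pod_lifecycle(log_text)
    return [pod for pod in created if pod in deleted]
```

Passing the saved operator log text to `pods_without_logs` returns the names of pods whose logs can no longer be fetched. It expects one log entry per line, so hard-wrapped e-mail text has to be unwrapped first.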