*Update:* After restarting K8s, the problem (cannot connect to the notebook) is gone. I'm not sure whether anybody else has hit this issue; it would be better to add documentation to the user doc to help users troubleshoot it.
Verified the following:

- Ran the example code inside the notebook: deepfm_example runs without any issue. "submarine-experiment_sdk" also runs fine; the only issue is that "submarine_client.list_experiments(status=status)" takes some time to execute. I checked the pods' status:

  NAME                  READY   STATUS            RESTARTS   AGE
  mnist-dist-ps-0       0/1     PodInitializing   0          103s
  mnist-dist-worker-0   0/1     PodInitializing   0          103s

  (QUESTION) I quickly clicked the "run" button on the notebook page for all paragraphs (excluding the last one), and the experiment showed up in the UI only after a while (1-2 mins). I'm not sure why it takes that long, and it is hard to understand from either the notebook or the UI.

- (NEED HELP) After that, I tried to restart the notebook. I clicked the "Delete" button on the UI, and it showed this message: "Http failure response for http://127.0.0.1:32080/api/v1/notebook/null: 404 Not Found"

- (NEED HELP) Also, I cannot launch a new notebook session; clicking "+ New Notebook" has no effect. The Chrome console showed:

  ERROR TypeError: Cannot read property 'environment' of null
      at Object.eval [as updateRenderer] (ng:///NotebookModule/NotebookComponent.ngfactory.js:92)
      at Object.debugUpdateRenderer [as updateRenderer] (vendor.js:88356)
      at checkAndUpdateView (vendor.js:87731)

- (QUESTION) Also, on the Notebook List UI, the Environment, Docker Image, Resources, and Status fields are all empty for the running notebook "notebook1111". I don't know whether that is normal.

- (QUESTION) On the environment page, "my-submarine-env" shows up. Is there any example that uses my-submarine-env? If it was introduced by some previous test, I think we should remove it. (We should only ship usable examples/configs in a release.)

- (NEED HELP) I tried to follow the notebook guide (https://github.com/apache/submarine/blob/master/docs/userdocs/k8s/notebook.md) and ran the Experiment example (see "Experiment with your notebook"). It failed with the error message below:

---------------------------------------------------------------------------
ApiException                              Traceback (most recent call last)
<ipython-input-1-cc077e98d460> in <module>
     30
     31 # Create experiment
---> 32 experiment = submarine_client.create_experiment(experiment_spec=experiment_spec)

/opt/conda/lib/python3.7/site-packages/submarine/experiment/api/experiment_client.py in create_experiment(self, experiment_spec)
     57 :return: submarine experiment
     58 """
---> 59 response = self.experiment_api.create_experiment(experiment_spec=experiment_spec)
     60 return response.result
     61

/opt/conda/lib/python3.7/site-packages/submarine/experiment/api/experiment_api.py in create_experiment(self, **kwargs)
     75 """
     76 kwargs['_return_http_data_only'] = True
---> 77 return self.create_experiment_with_http_info(**kwargs) # noqa: E501
     78
     79 def create_experiment_with_http_info(self, **kwargs): # noqa: E501

/opt/conda/lib/python3.7/site-packages/submarine/experiment/api/experiment_api.py in create_experiment_with_http_info(self, **kwargs)
    163 _preload_content=local_var_params.get('_preload_content', True),
    164 _request_timeout=local_var_params.get('_request_timeout'),
--> 165 collection_formats=collection_formats)
    166
    167 def delete_experiment(self, id, **kwargs): # noqa: E501

/opt/conda/lib/python3.7/site-packages/submarine/experiment/api_client.py in call_api(self, resource_path, method, path_params, query_params, header_params, body, post_params, files, response_type, auth_settings, async_req, _return_http_data_only, collection_formats, _preload_content, _request_timeout, _host)
    417 auth_settings, _return_http_data_only,
    418 collection_formats, _preload_content,
--> 419 _request_timeout, _host)
    420
    421 return self.pool.apply_async(

/opt/conda/lib/python3.7/site-packages/submarine/experiment/api_client.py in __call_api(self, resource_path, method, path_params, query_params, header_params, body, post_params, files, response_type, auth_settings, _return_http_data_only, collection_formats, _preload_content, _request_timeout, _host)
    218 except ApiException as e:
    219 e.body = e.body.decode('utf-8') if six.PY3 else e.body
--> 220 raise e
    221
    222 content_type = response_data.getheader('content-type')

/opt/conda/lib/python3.7/site-packages/submarine/experiment/api_client.py in __call_api(self, resource_path, method, path_params, query_params, header_params, body, post_params, files, response_type, auth_settings, _return_http_data_only, collection_formats, _preload_content, _request_timeout, _host)
    215 body=body,
    216 _preload_content=_preload_content,
--> 217 _request_timeout=_request_timeout)
    218 except ApiException as e:
    219 e.body = e.body.decode('utf-8') if six.PY3 else e.body

/opt/conda/lib/python3.7/site-packages/submarine/experiment/api_client.py in request(self, method, url, query_params, headers, post_params, body, _preload_content, _request_timeout)
    461 _preload_content=_preload_content,
    462 _request_timeout=_request_timeout,
--> 463 body=body)
    464 elif method == "PUT":
    465 return self.rest_client.PUT(url,

/opt/conda/lib/python3.7/site-packages/submarine/experiment/rest.py in POST(self, url, headers, query_params, post_params, body, _preload_content, _request_timeout)
    324 _preload_content=_preload_content,
    325 _request_timeout=_request_timeout,
--> 326 body=body)
    327
    328 def PUT(self,

/opt/conda/lib/python3.7/site-packages/submarine/experiment/rest.py in request(self, method, url, query_params, headers, body, post_params, _preload_content, _request_timeout)
    247
    248 if not 200 <= r.status <= 299:
--> 249 raise ApiException(http_resp=r)
    250
    251 return r

ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Date': 'Tue, 03 Nov 2020 00:08:40 GMT', 'Content-Type':
'application/json;charset=utf-8', 'Content-Length': '140', 'Server': 'Jetty(9.4.18.v20190429)'})
HTTP response body: {"status":"CONFLICT","code":409,"success":null,"message":"K8s submitter: parse Job object failed by Conflict","result":null,"attributes":{}}

- Also, I submitted a PR with documentation-related improvements: https://github.com/apache/submarine/pull/444. Please help review it; I think it is important to get these issues fixed.

On Mon, Nov 2, 2020 at 9:54 AM Wangda Tan <[email protected]> wrote:

> Nevermind, please ignore the previous message. I think it is caused by
> previous installation (0.4.0), I reset K8s cluster and now everything can
> be installed, trying to follow other steps now.
>
> 1) Why by default Submarine server login has maria_dev as default user
> name?
>
> 2) There's no doc about using Submarine UI, we need to add one, I will
> file a PR later (TODO)
>
> 3) Trying to create notebook, the notebook name initially I gave is
> "nb_123", but I got the error, "K8s submitter: parse Job object failed by
> Unprocessable Entity, please try again"
> If we have limitation of how naming of notebook should be, we should add
> it to the UI (like only character or numbers are supported).
> From the doc it mentioned: "Name of the notebook server. It should be
> unique and include no spaces." We need to update both of doc and UI.
>
> 4) When I choose the environment for notebook, I saw both my-submarine-env
> and notebook-env. What are the differences between the two?
> - I found only notebook-env works, "my-submarine-env" failed to start. We
> should remove the my-submarine-env from the default helm installation.
> > 5) Also, I found even if notebook is not fully start and running, the UI > indicate it is created: > > [image: image.png] > > Clicking notebook name will show you an error page, we should improve this > part. > > 6) After waiting for ~5 mins, the notebook started, but I still cannot > access the notebook UI, clicking the link on the Submarine UI tells me: > > HTTP ERROR 404 > > Problem accessing /notebook/default/notebook1111/. Reason: > > Not Found > > > And logs for the notebook pod tells me; > > > kubectl logs notebook1111-0 > Conda current version is currentVersion=4.8.3;. Moving forward with env > creation and activation. > [I 17:45:50.056 NotebookApp] Writing notebook server cookie secret to > /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret > [W 17:45:52.108 NotebookApp] All authentication is disabled. Anyone who can > connect to this server will be able to run code. > [I 17:45:52.144 NotebookApp] Serving notebooks from local directory: > /home/jovyan > [I 17:45:52.145 NotebookApp] Jupyter Notebook 6.1.3 is running at: > [I 17:45:52.146 NotebookApp] > http://notebook1111-0:8888/notebook/default/notebook1111/ > [I 17:45:52.147 NotebookApp] Use Control-C to stop this server and shut down > all kernels (twice to skip confirmation). > > > Is it bind to a wrong port? > > > On Mon, Nov 2, 2020 at 9:18 AM Wangda Tan <[email protected]> wrote: > >> Hi Kevin, >> >> Thank you so much for running this release. >> >> Trying to follow the helm install stage, but notebook controller is >> failed to start. 
>> >> I downloaded RC1 source code, and follow the guidance: >> https://github.com/apache/submarine/blob/master/docs/userdocs/k8s/helm.md >> >> The pods in my system: >> >> NAMESPACE NAME READY >> STATUS RESTARTS AGE >> default notebook-controller-deployment-58797bdd75-9fstx 0/1 >> CrashLoopBackOff 7 12m >> default pytorch-operator-75fd845678-dpc52 1/1 >> Running 0 12m >> default submarine-database-54776644c6-x2zrm 1/1 >> Running 0 12m >> default submarine-server-5d846f7b4f-m2fw8 1/1 >> Running 0 12m >> default submarine-traefik-d55c689b5-cq6db 1/1 >> Running 0 12m >> default tf-job-operator-598686fd84-2fwt9 1/1 >> Running 0 12m >> >> And notebook-controller pod has the following information: (kubectl >> describe pods notebook-controller-deployment-58797bdd75-9fstx) >> >> Events: >> Type Reason Age From >> Message >> ---- ------ ---- ---- >> ------- >> Normal Scheduled 12m default-scheduler >> Successfully assigned >> default/notebook-controller-deployment-58797bdd75-9fstx to docker-desktop >> Normal Pulling 12m kubelet, docker-desktop >> Pulling image "apache/submarine:notebook-controller-v1.1.0-g253890cb" >> Normal Pulled 12m kubelet, docker-desktop >> Successfully pulled image >> "apache/submarine:notebook-controller-v1.1.0-g253890cb" >> Normal Pulled 11m (x4 over 12m) kubelet, docker-desktop >> Container image "apache/submarine:notebook-controller-v1.1.0-g253890cb" >> already present on machine >> Normal Created 11m (x5 over 12m) kubelet, docker-desktop >> Created container manager >> Normal Started 11m (x5 over 12m) kubelet, docker-desktop >> Started container manager >> Warning BackOff 2m45s (x51 over 12m) kubelet, docker-desktop >> Back-off restarting failed container >> >> Logs: (kubectl log notebook-controller-deployment-58797bdd75-9fstx) >> >> 2020-11-02T17:11:53.159Z ERROR setup unable to create controller >> {"controller": "Notebook", "error": "no matches for kind \"Notebook\" in >> version \"kubeflow.org/v1beta1\ <http://kubeflow.org/v1beta1%5C>""} >> 
github.com/go-logr/zapr.(*zapLogger).Error >> /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128 >> main.main >> /workspace/notebook-controller/main.go:76 >> runtime.main >> /usr/local/go/src/runtime/proc.go:200 >> >> I guess it might be version of K8s, I'm using DockerDesktop: >> >> kubectl version >> Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.6", >> GitCommit:"7015f71e75f670eb9e7ebd4b5749639d42e20079", GitTreeState:"clean", >> BuildDate:"2019-11-13T11:20:18Z", GoVersion:"go1.12.12", Compiler:"gc", >> Platform:"darwin/amd64"} >> Server Version: version.Info{Major:"1", Minor:"16+", >> GitVersion:"v1.16.6-beta.0", >> GitCommit:"e7f962ba86f4ce7033828210ca3556393c377bcc", GitTreeState:"clean", >> BuildDate:"2020-01-15T08:18:29Z", GoVersion:"go1.13.5", Compiler:"gc", >> Platform:"linux/amd64"} >> >> I'm not sure what I should do now? Last time when we release 0.4.0, we >> verified 1.14, 1.15, 1.16 should all work. >> >> >> https://github.com/apache/submarine/blob/master/docs/userdocs/k8s/README.md >> >> What is our recommendation for now? >> >> Thanks, >> Wangda >> >> >> >> >> >> On Fri, Oct 30, 2020 at 8:47 PM Wanqiang Ji <[email protected]> wrote: >> >>> +1 for this RC1. Thanks Kevin drive this release and everyone for the >>> great >>> works. >>> I've done the following tests: >>> 1. Install the helm charts against the minikube >>> 2. Install the helm charts against the kind >>> 3. Check the UI's features, create the notebook/experiment and run work >>> >>> BR, >>> Wanqiang Ji >>> >>> >>> On Wed, Oct 28, 2020 at 4:52 PM Zhankun Tang <[email protected]> wrote: >>> >>> > Thanks for the great efforts! Kevin! >>> > I've done the below testing >>> > 1. Verify the signatures (I signed Kevin's key and updated the KEYS >>> file) >>> > 2. Build from source >>> > 3. Install the helm charts against Docker desktop k8s (v1.14.8) >>> > 4. 
Check basic UI's experiment and notebook features
>>> >
>>> > And I like the built-in examples in our notebook image. It's
>>> > straightforward for beginners to get familiar with our SDK.
>>> > One minor suggestion is that we can add a link in parent readme
>>> > <https://github.com/apache/submarine/blob/master/docs/userdocs/k8s/README.md>
>>> > to our notebook.md doc. But this is not a blocker, we can improve docs
>>> > later since it's in GitHub for now.
>>> >
>>> > I'll give my *+1(binding)* to this RC1.
>>> >
>>> > @Wei-Chiu Chuang <[email protected]> BTW, I can download the
>>> > "apache/submarine:mini-0.5.0-RC1" image on my laptop.
>>> >
>>> > BR,
>>> > Zhankun
>>> >
>>> > On Mon, 26 Oct 2020 at 10:21, Wei-Chiu Chuang <[email protected]>
>>> > wrote:
>>> >
>>> > > Curious -- I see mini-0.5.0-RC1 and RC0 docker images, however, the
>>> > > operator, server, database and jupyter-notebook docker images are all
>>> > > tagged version 0.5.0. Was this intentional?
>>> > >
>>> > > I kept getting disk full error trying to pull the RC1 minisubmarine
>>> > > image: docker pull apache/submarine:mini-0.5.0-RC1
>>> > >
>>> > > failed to register layer: Error processing tar file(exit status 1): write
>>> > > /home/yarn/submarine/tf2-venv/lib/python3.6/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so:
>>> > > no space left on device
>>> > >
>>> > > I've already run docker system prune to remove unnecessary volumes and
>>> > > images, and my local disk has more than 200GB of space.
>>> > >
>>> > > After the release, we should update the release information in the Apache
>>> > > system: https://reporter.apache.org/addrelease.html?submarine (I believe
>>> > > the access is restricted to PMCs)
>>> > >
>>> > > The docker images should be regarded as convenience binary, and our vote is
>>> > > for the source code. So if you need to update the images, no need to
>>> > > vote again.
>>> > > >>> > > On Sun, Oct 25, 2020 at 12:46 PM Wei-Chiu Chuang <[email protected] >>> > >>> > > wrote: >>> > > >>> > > > Not a blocker, but I do notice our doc has lots of "TODO" and >>> "FIXME" >>> > and >>> > > > then realized our doc is a WIP from SUBMARINE-518. >>> > > > I'd like to find a time to contribute to the docs later. >>> > > > >>> > > > On Thu, Oct 22, 2020 at 4:16 AM Kevin Su <[email protected]> >>> wrote: >>> > > > >>> > > >> Hi folks, >>> > > >> >>> > > >> >>> > > >> Thanks to everyone's help on this release. Special thanks to >>> Wangda, >>> > > >> >>> > > >> Zhankun, Xun, Wei-Chiu, Wanqiang, Ryan, Manikandan, and JohnTing! >>> > > >> >>> > > >> I've created a release candidate (RC1) for submarine 0.5.0. The >>> > > >> highlighted >>> > > >> >>> > > >> features are as follows: >>> > > >> >>> > > >> 1. Submarine Experiments: Redefined the experiment spec, sync up >>> code >>> > > from >>> > > >> Git, it could be https and ssh >>> > > >> >>> > > >> 2. Predefined experiment template: Register A experiment template >>> and >>> > > >> submit related parameter to run an experiment using Rest API >>> > > >> >>> > > >> 3. Environment profile: Users could easily manage their docker >>> image >>> > and >>> > > >> conda environment >>> > > >> >>> > > >> 4. Jupyter Notebook: Spawn a jupyter notebook using Rest API, and >>> > > execute >>> > > >> ML code on K8s, or submit an experiment to submarine server >>> > > >> >>> > > >> 5. 
Submarine Workbench UI: CRUD Experiment, Environment, Notebook >>> > > through >>> > > >> the UI >>> > > >> >>> > > >> The RC tag in git is here: >>> > > >> >>> https://github.com/apache/submarine/releases/tag/release-0.5.0-RC1 >>> > > >> The RC release artifacts are available at: >>> > > >> http://home.apache.org/~pingsutw/submarine-0.5.0-RC1 >>> > > >> >>> > > >> The mini-submarine image is here: >>> > > >> >>> > > >> >>> > > >>> > >>> https://hub.docker.com/layers/apache/submarine/mini-0.5.0-RC1/images/sha256-b8bc864a9a6409361de96d93e467c6458c96f6d2d85c74639a201b1c1b9af3a0?context=explore >>> > > >> >>> > > >> The server image is here: >>> > > >> >>> > > >> >>> > > >>> > >>> https://hub.docker.com/layers/apache/submarine/server-0.5.0/images/sha256-3c197696a773cebf3409acc5ed89504e9f56240a1748b107031bf32a4ba79e40?context=explore >>> > > >> >>> > > >> The database image is here: >>> > > >> >>> > > >> >>> > > >>> > >>> https://hub.docker.com/layers/apache/submarine/database-0.5.0/images/sha256-fcf72289e0aa46e83fc8e65c8aca79be4bba96ec9813d54568e5679925cdc94f?context=explore >>> > > >> >>> > > >> The Jupyter Notebook image is here: >>> > > >> >>> > > >> >>> > > >>> > >>> https://hub.docker.com/layers/apache/submarine/jupyter-notebook-0.5.0/images/sha256-1e05cdd3c814063b3cac9de12bdecd70475d38d708c06794c2d3b55ef97de82a?context=explore >>> > > >> >>> > > >> The Maven staging repository is here: >>> > > >> >>> > > >>> > >>> https://repository.apache.org/content/repositories/orgapachesubmarine-1015 >>> > > >> >>> > > >> My public key is here: >>> > > >> https://dist.apache.org/repos/dist/release/submarine/KEYS >>> > > >> >>> > > >> *This vote will run for 7 days, ending on Oct 29, 2020, at 11:59 >>> pm >>> > > PST.* >>> > > >> >>> > > >> For the testing, I have verified the >>> > > >> >>> > > >> 1. Build from source, Run the mnist on Hadoop >>> > > >> >>> > > >> 2. Example with mini-submarine(both local and remote mode) >>> > > >> >>> > > >> 3. 
Verified the experiment operations to K8s by Submarine Server >>> REST >>> > > and >>> > > >> PySubmarine. >>> > > >> >>> > > >> 4. Workbench UI (experiment, environment, notebook) >>> > > >> >>> > > >> 5. HTTP sync code in the experiment >>> > > >> >>> > > >> 6. Environment profile REST API >>> > > >> >>> > > >> 7. Notebook REST API >>> > > >> >>> > > >> >>> > > >> Please follow the document to test these features. >>> > > >> * >>> > > >> >>> > > >>> > >>> https://github.com/apache/submarine/tree/master/dev-support/mini-submarine >>> > > >> * >>> > > >>> https://github.com/apache/submarine/blob/master/docs/user-guide-home.md >>> > > >> >>> > > >> My +1 to start. Thanks! >>> > > >> >>> > > >> BR, >>> > > >> Kevin Su >>> > > >> >>> > > > >>> > > >>> > >>> >>
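
A note on item 3 in the quoted message above (the notebook name "nb_123" being rejected with "Unprocessable Entity"): Kubernetes requires many object names to be valid DNS labels (lowercase letters, digits, and "-", starting and ending with a letter or digit, at most 63 characters), so the underscore is a plausible cause. The sketch below only illustrates that rule. `check_notebook_name` is a hypothetical helper and not part of Submarine, and the rule actually enforced by the Submarine server or the notebook controller may be stricter.

```python
import re
from typing import List

# RFC 1123 DNS label pattern, which Kubernetes applies to many object names.
# Assumption: notebook names are subject to (at least) this rule.
_DNS1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
_MAX_LABEL_LENGTH = 63


def check_notebook_name(name: str) -> List[str]:
    """Return a list of human-readable problems; empty means the name looks valid.

    Hypothetical helper for illustration only; not a Submarine API.
    """
    problems = []
    if not name:
        return ["name must not be empty"]
    if len(name) > _MAX_LABEL_LENGTH:
        problems.append("name must be at most 63 characters")
    if name != name.lower():
        problems.append("name must be lowercase")
    if not _DNS1123_LABEL.fullmatch(name.lower()):
        problems.append(
            "name may contain only letters, digits, and '-', "
            "and must start and end with a letter or digit"
        )
    return problems


if __name__ == "__main__":
    for candidate in ("nb_123", "nb-123", "notebook1111"):
        issues = check_notebook_name(candidate)
        print(candidate, "->", "ok" if not issues else "; ".join(issues))
```

A check like this could back the UI hint suggested in item 3, so users see the naming limitation before the request reaches K8s.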
