[ 
https://ovirt-jira.atlassian.net/browse/OVIRT-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=39797#comment-39797
 ] 

Barak Korren commented on OVIRT-2794:
-------------------------------------

This was a bit puzzling. We've seen issues between {{docker_cleanup.py}} and 
Docker appear sporadically in the past, and have therefore made the job 
code generally not fail when {{docker_cleanup.py}} fails, and instead send an 
email to the infra list. It turns out that was only true for the V2 code; in 
the V1 code (which is still used in the manual job and the nightly jobs) those 
failures could still arise.
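
To illustrate the non-fatal behaviour described above, here is a minimal Python 
sketch. It is not the actual job code: the function name 
{{run_cleanup_tolerantly}}, the {{notify}} callback and the example command are 
all hypothetical, and actually sending the email is left to the caller.

```python
import subprocess
import sys


def run_cleanup_tolerantly(cmd, notify):
    """Run a cleanup command and report a failure instead of raising.

    `cmd` is an argument list and `notify` is any callable that accepts a
    message string. Both names are illustrative assumptions, not part of
    the real CI code.
    Returns True if the command exited with status 0, False otherwise.
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    if result.returncode == 0:
        return True
    # Hand the failure details to the caller-supplied reporter so the
    # surrounding job can carry on.
    notify(
        "cleanup command {!r} exited with status {}:\n{}".format(
            cmd, result.returncode, result.stdout
        )
    )
    return False


if __name__ == "__main__":
    # Stand-in command that always fails; a real job would pass the
    # cleanup script here and a reporter that mails the infra list.
    ok = run_cleanup_tolerantly(
        [sys.executable, "-c", "import sys; sys.exit(3)"],
        notify=lambda msg: print(msg, file=sys.stderr),
    )
    print("cleanup ok:", ok)
```

The point of the design is that the wrapper's return value is informational 
only, so a caller can log it without failing the build.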

We did verify that {{docker_cleanup.py}} works on CentOS 7 with the Python 3 
Docker API client before merging the patch, so it's strange we did not see the 
issue then.

[~accountid:557058:5ca52a09-2675-4285-a044-12ad20f6166a] some of your 
statements above seem to include some wrong assumptions about how the system is 
built. We're not actually exposing the host's Docker daemon to the CI code; 
instead we use our own Docker instance running inside the container that is used 
to run the CI code. That way we can ensure there can be no cross-talk when 
running multiple CI containers on the same hosts.

[~accountid:557058:cc1e0e66-9881-45e2-b0b7-ccaa3e60f26e] as far as using 
podman, I think doing that at this point will be quite a challenge for a number 
of reasons:
# We're currently using OpenShift 3.7 to manage our containers. This implies 
that we must run Docker on our hosts, since AFAIK OpenShift only started 
supporting CRI-O in 4.0 or 4.1.
# To allow CI scripts and test suites to use Docker we run nested Docker 
instances inside the CI containers. We know that Docker in Docker works well for 
our use cases. Running Podman in Docker will probably be more challenging.
# Since we're still using {{mock}} to encapsulate the CI script inside the CI 
container, we're bind-mounting the docker socket from the container into mock. 
We know there are issues when running Podman in mock, so solving those will 
take some work.
# People who write CI scripts and suites tend to expect things to "just work" 
in CI as they do on their laptops, and hence tend to use Docker commands. 
Removing Docker will force everyone to learn Podman, and we'll need to make 
changes everywhere.

Our current suspicion is that this issue may have to do with the particular 
version of Docker that is installed inside the CI container. While our 
{{global_setup.sh}} script generally keeps Docker up to date on the CI slaves, 
we've intentionally skipped that update code when running in a container. I 
suspect that the version of Docker that is in the CI containers is older than 
the one running on the CI slaves. That would explain why we did not see this 
issue when working on the {{docker_cleanup.py}} patch, since that was tested on 
the normal slaves and not the containers.

Here is what I think we should do now:
# Verify again that {{docker_cleanup.py}} works well on CentOS with the Python 
3 Docker client API.
# If so, inspect the version of Docker we have in the containers.
# Finally, build an updated container image with a newer version of Docker as 
needed.
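
The version inspection in the second step could be scripted roughly as follows. 
This is only a sketch: {{parse_version}} and {{daemon_older_than}} are 
hypothetical helpers, the minimum version passed in is a placeholder rather 
than a known-good value, and the only Docker SDK call assumed is 
{{client.version()}}, which returns a dict that includes a {{Version}} key.

```python
import re


def parse_version(text):
    """Turn a version string such as '1.13.1' or '18.09.7-ce' into a tuple
    of its leading dot-separated integers, e.g. (1, 13, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError("unrecognised version string: {!r}".format(text))
    return tuple(int(part) for part in match.group(0).split("."))


def daemon_older_than(client, minimum):
    """Return True if the daemon behind `client` reports a version lower
    than `minimum`.

    `client` is anything with a version() method returning a dict with a
    'Version' key, which is what the Docker SDK client provides.
    """
    reported = parse_version(client.version()["Version"])
    wanted = parse_version(minimum)
    # Pad the shorter tuple with zeros so '18.09' and '18.09.0' compare equal.
    width = max(len(reported), len(wanted))
    reported += (0,) * (width - len(reported))
    wanted += (0,) * (width - len(wanted))
    return reported < wanted
```

With the real SDK the client would come from {{docker.from_env()}}, run once 
inside a CI container and once on a normal slave so the two reported versions 
can be compared.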

Note that updating the container image will require us to test it thoroughly 
and ensure it can properly run both OST and {{kubevirt-ci}}. 



> OST is broken since this morning - looks like infra issue
> ---------------------------------------------------------
>
>                 Key: OVIRT-2794
>                 URL: https://ovirt-jira.atlassian.net/browse/OVIRT-2794
>             Project: oVirt - virtualization made easy
>          Issue Type: By-EMAIL
>            Reporter: Nir Soffer
>            Assignee: infra
>
> The last successful build was today at 08:10:
> Since then all builds fail very early with the error below - which is not
> related to oVirt.
> {code}
> Removing image: sha256:f8e5aa8e979155e074411bfef9adade6cdcdf3a5a2eb1d5ad2dbf0288d585ffa, force=True
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/docker/api/client.py", line 222, in _raise_for_status
>     response.raise_for_status()
>   File "/usr/lib/python3.6/site-packages/requests/models.py", line 893, in raise_for_status
>     raise HTTPError(http_error_msg, response=self)
> requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localunixsocket/v1.30/images/sha256:f8e5aa8e979155e074411bfef9adade6cdcdf3a5a2eb1d5ad2dbf0288d585ffa?force=True&noprune=False
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/home/jenkins/workspace/ovirt-system-tests_manual/jenkins/scripts/docker_cleanup.py", line 349, in <module>
>     main()
>   File "/home/jenkins/workspace/ovirt-system-tests_manual/jenkins/scripts/docker_cleanup.py", line 37, in main
>     safe_image_cleanup(client, whitelisted_repos)
>   File "/home/jenkins/workspace/ovirt-system-tests_manual/jenkins/scripts/docker_cleanup.py", line 107, in safe_image_cleanup
>     _safe_rm(client, parent)
>   File "/home/jenkins/workspace/ovirt-system-tests_manual/jenkins/scripts/docker_cleanup.py", line 329, in _safe_rm
>     client.images.remove(image_id, force=force)
>   File "/usr/lib/python3.6/site-packages/docker/models/images.py", line 288, in remove
>     self.client.api.remove_image(*args, **kwargs)
>   File "/usr/lib/python3.6/site-packages/docker/utils/decorators.py", line 19, in wrapped
>     return f(self, resource_id, *args, **kwargs)
>   File "/usr/lib/python3.6/site-packages/docker/api/image.py", line 481, in remove_image
>     return self._result(res, True)
>   File "/usr/lib/python3.6/site-packages/docker/api/client.py", line 228, in _result
>     self._raise_for_status(response)
>   File "/usr/lib/python3.6/site-packages/docker/api/client.py", line 224, in _raise_for_status
>     raise create_api_error_from_http_exception(e)
>   File "/usr/lib/python3.6/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
>     raise cls(e, response=response, explanation=explanation)
> docker.errors.NotFound: 404 Client Error: Not Found ("reference does not exist")
> Aborting.
> Build step 'Execute shell' marked build as failure
> {code}
> #5542 Failed, Sep 5, 2019 3:02 PM
> <https://jenkins.ovirt.org/job/ovirt-system-tests_manual/5542/console>
> #5541 Failed, Sep 5, 2019 3:02 PM
> <https://jenkins.ovirt.org/job/ovirt-system-tests_manual/5541/console>
> #5540 Failed, Sep 5, 2019 3:01 PM
> <https://jenkins.ovirt.org/job/ovirt-system-tests_manual/5540/console>
> #5539 Failed, Sep 5, 2019 2:13 PM
> <https://jenkins.ovirt.org/job/ovirt-system-tests_manual/5539/console>
> #5538 Failed, Sep 5, 2019 1:58 PM
> <https://jenkins.ovirt.org/job/ovirt-system-tests_manual/5538/console>
> #5537 Failed, Sep 5, 2019 1:50 PM
> <https://jenkins.ovirt.org/job/ovirt-system-tests_manual/5537/console>
> #5536 Failed, Sep 5, 2019 10:21 AM
> <https://jenkins.ovirt.org/job/ovirt-system-tests_manual/5536/console>
> Job config history diff:
> <http://jenkins.ovirt.org/job/ovirt-system-tests_manual/jobConfigHistory/showDiffFiles?timestamp1=2019-08-27_12-38-35&timestamp2=2019-09-05_08-22-23>
> #5535 Success, Sep 5, 2019 8:10 AM
> <https://jenkins.ovirt.org/job/ovirt-system-tests_manual/5535/console>



--
This message was sent by Atlassian Jira
(v1001.0.0-SNAPSHOT#100109)
_______________________________________________
Infra mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/[email protected]/message/QFY57AKCEAGBOPOTTTRQS37LBLNLFLKW/
