On Wed, Nov 27, 2019 at 5:44 PM Nir Soffer <[email protected]> wrote:
> On Wed, Nov 27, 2019 at 5:54 PM Tal Nisan <[email protected]> wrote:
>>
>> On Wed, Nov 27, 2019 at 1:27 PM Marcin Sobczyk <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I ran OST on my physical server.
>>> I'm experiencing probably the same issues as described in the thread below.
>>>
>>> On one of the hosts:
>>>
>>> [root@lago-basic-suite-master-host-0 tmp]# ls -l /rhev/data-center/mnt/
>>> ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1': Operation not permitted
>>> ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2': Operation not permitted
>>> ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_exported': Operation not permitted
>>> total 0
>>> d?????????? ? ?    ?   ?  ? 192.168.200.4:_exports_nfs_exported
>>> d?????????? ? ?    ?   ?  ? 192.168.200.4:_exports_nfs_share1
>>> d?????????? ? ?    ?   ?  ? 192.168.200.4:_exports_nfs_share2
>>> drwxr-xr-x. 3 vdsm kvm 50 Nov 27 04:22 blockSD
>>>
>>> I think there's some problem with the NFS shares on the engine.
>>
>> We saw it recently with the move to RHEL 8. Nir, isn't that the same issue with the NFS squashing?
>
> root not being able to access NFS is expected if the NFS server is not configured
> with anonuid=36,anongid=36.
>
> This is not new and did not change in RHEL 8. The change is probably in libvirt,
> which tries to access a disk it should not touch, since we disable DAC in the XML
> for disks.
>
> When this happens VMs do not start; here the issue seems to be that VMs get
> paused after some time because storage becomes inaccessible.
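For reference, a root-squashing export that still lets root on the hosts work with the domain as vdsm:kvm usually looks something like this (illustrative only; the export path and client network are taken from the logs above, and the exact option set used by OST may differ):

    # /etc/exports on the NFS server (192.168.200.4)
    # anonuid/anongid map squashed root to vdsm (uid 36) / kvm (gid 36),
    # so root on a hypervisor is not rejected with "Operation not permitted"
    /exports/nfs/share1  192.168.200.0/24(rw,sync,no_subtree_check,anonuid=36,anongid=36)

    exportfs -ra   # re-export
    exportfs -v    # verify the effective options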
>>> I can mount the engine's NFS shares directly from the server's native OS:
>>>
>>> ➜  /tmp mkdir -p /tmp/aaa && mount "192.168.200.4:/exports/nfs/share1" /tmp/aaa
>>> ➜  /tmp ls -l /tmp/aaa
>>> total 4
>>> drwxr-xr-x. 5 36 kvm 4096 Nov 27 10:18 3332759c-a943-4fbd-80aa-a5f72cd87c7c
>>> ➜  /tmp
>>>
>>> But trying to do that from one of the hosts fails:
>>>
>>> [root@lago-basic-suite-master-host-0 tmp]# mkdir -p /tmp/aaa && mount -v "192.168.200.4:/exports/nfs/share1" /tmp/aaa
>>> mount.nfs: timeout set for Wed Nov 27 06:26:19 2019
>>> mount.nfs: trying text-based options 'vers=4.2,addr=192.168.200.4,clientaddr=192.168.201.2'
>>> mount.nfs: mount(2): Operation not permitted
>>> mount.nfs: trying text-based options 'addr=192.168.200.4'
>>> mount.nfs: prog 100003, trying vers=3, prot=6
>>> mount.nfs: portmap query failed: RPC: Remote system error - No route to host
>
> Smells like broken network.

When I reproduced this scenario, ping was working while NFS was not.

>>> On the engine side, '/var/log/messages' seems to be flooded with NFS issues, for example:
>>>
>>> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: __find_in_sessionid_hashtbl: 1574853151:3430996717:11:0
>>> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd4_sequence: slotid 0
>>> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: check_slot_seqid enter. seqid 405 slot_seqid 404
>>> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #1: 53: status 0
>>> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
>>> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: fh_verify(28: 00070001 00340001 00000000 e50ae88b 5c44c45a 2b7c3991)
>>> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: request from insecure port 192.168.200.1, port=51529!
>>> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #2: 22: status 1
>>> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound returned 1
>>> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: --> nfsd4_store_cache_entry slot ffff9042c4d97000
>>> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: renewing client (clientid 5dde5a1f/cc80daed)
>>> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd_dispatch: vers 4 proc 1
>>> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op #1/3: 53 (OP_SEQUENCE)
>>> [the same compound sequence repeats with seqid 406, again logging "nfsd: request from insecure port 192.168.200.1, port=51529!" and "nfsv4 compound returned 1"]
>>>
>>> Regards, Marcin
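The "request from insecure port" lines above are worth a second look: by default nfsd only accepts requests originating from privileged source ports (below 1024), and here the host's traffic reaches the engine from 192.168.200.1 with a high port, after which the compound returns an error. If the hosts' connections are being NATed through that address, adding the insecure export option would be one way to rule this out (a sketch, not necessarily the configuration OST should end up with):

    # /etc/exports on the NFS server - illustrative;
    # 'insecure' permits client source ports >= 1024,
    # which is exactly what nfsd is complaining about above
    /exports/nfs/share1  192.168.200.0/24(rw,sync,insecure,anonuid=36,anongid=36)

    exportfs -ra   # re-export without restarting nfsd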
>>> On 11/26/19 8:40 PM, Martin Perina wrote:
>>>
>>>> I've just merged https://gerrit.ovirt.org/105111, which only silences the
>>>> issue, but we really need to unblock OST, as it has been suffering from
>>>> this for more than 2 weeks now.
>>>>
>>>> Tal/Nir, could someone really investigate why the storage becomes
>>>> unavailable after some time? It may be caused by the recent switch of the
>>>> hosts to CentOS 8, but it may also be unrelated.
>>>>
>>>> Thanks,
>>>> Martin
>>>>
>>>> On Tue, Nov 26, 2019 at 9:17 AM Dominik Holler <[email protected]> wrote:
>>>>> On Mon, Nov 25, 2019 at 7:12 PM Nir Soffer <[email protected]> wrote:
>>>>>> On Mon, Nov 25, 2019 at 7:15 PM Dominik Holler <[email protected]> wrote:
>>>>>>> On Mon, Nov 25, 2019 at 6:03 PM Nir Soffer <[email protected]> wrote:
>>>>>>>> On Mon, Nov 25, 2019 at 6:48 PM Dominik Holler <[email protected]> wrote:
>>>>>>>>> On Mon, Nov 25, 2019 at 5:16 PM Nir Soffer <[email protected]> wrote:
>>>>>>>>>> On Mon, Nov 25, 2019 at 6:05 PM Dominik Holler <[email protected]> wrote:
>>>>>>>>>>> On Mon, Nov 25, 2019 at 4:50 PM Nir Soffer <[email protected]> wrote:
>>>>>>>>>>>> On Mon, Nov 25, 2019 at 11:00 AM Dominik Holler <[email protected]> wrote:
>>>>>>>>>>>>> On Fri, Nov 22, 2019 at 8:57 PM Dominik Holler <[email protected]> wrote:
>>>>>>>>>>>>>> On Fri, Nov 22, 2019 at 5:54 PM Dominik Holler <[email protected]> wrote:
>>>>>>>>>>>>>>> On Fri, Nov 22, 2019 at 5:48 PM Nir Soffer <[email protected]> wrote:
>>>>>>>>>>>>>>>> On Fri, Nov 22, 2019, 18:18 Marcin Sobczyk <[email protected]> wrote:
>>>>>>>>>>>>>>>>> On 11/22/19 4:54 PM, Martin Perina wrote:
>>>>>>>>>>>>>>>>>> On Fri, Nov 22, 2019 at 4:43 PM Dominik Holler <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>> On Fri, Nov 22, 2019 at 12:17 PM Dominik Holler <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>> On Fri, Nov 22, 2019 at 12:00 PM Miguel Duarte de Mora Barroso <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 22, 2019 at 11:54 AM Vojtech Juranek <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>> On Friday, 22 November 2019 at 9:56:56 CET, Miguel Duarte de Mora Barroso wrote:
>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 22, 2019 at 9:49 AM Vojtech Juranek <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On Friday, 22 November 2019 at 9:41:26 CET, Dominik Holler wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 22, 2019 at 8:40 AM Dominik Holler <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 21, 2019 at 10:54 PM Nir Soffer <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 21, 2019 at 11:24 PM Vojtech Juranek <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> OST fails (see e.g. [1]) in 002_bootstrap.check_update_host. It fails with:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> FAILED! => {"changed": false, "failures": [], "msg": "Depsolve Error occured:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> \n Problem 1: cannot install the best update candidate for package
>>>>>>>>>>>>>>>>>>>>>>>>>>>> vdsm-network-4.40.0-1236.git63ea8cb8b.el8.x86_64\n - nothing provides nmstate
>>>>>>>>>>>>>>>>>>>>>>>>>>>> needed by vdsm-network-4.40.0-1271.git524e08c8a.el8.x86_64\n Problem 2: package
>>>>>>>>>>>>>>>>>>>>>>>>>>>> vdsm-python-4.40.0-1271.git524e08c8a.el8.noarch requires vdsm-network =
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 4.40.0-1271.git524e08c8a.el8, but none of the providers can be installed\n
>>>>>>>>>>>>>>>>>>>>>>>>>>>> - cannot install the best update candidate for package
>>>>>>>>>>>>>>>>>>>>>>>>>>>> vdsm-python-4.40.0-1236.git63ea8cb8b.el8.noarch\n - nothing provides nmstate
>>>>>>>>>>>>>>>>>>>>>>>>>>>> needed by vdsm-network-4.40.0-1271.git524e08c8a.el8.x86_64"
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> nmstate should be provided by the copr repo enabled by ovirt-release-master.
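If any affected host is reachable in this state, standard dnf queries are a quick way to confirm whether an enabled repo actually provides nmstate, and to reproduce the depsolve failure outside of OST (plain dnf commands; the copr project itself is whatever ovirt-release-master configures):

    # which repos are enabled, and does any of them carry nmstate?
    dnf repolist enabled
    dnf repoquery --whatprovides nmstate

    # dry-run the failing update to see the same depsolve error locally
    dnf update --assumeno vdsm-network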
>>>>>>>>>>>>>>>>>>>>>>>>>> I re-triggered as
>>>>>>>>>>>>>>>>>>>>>>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6131
>>>>>>>>>>>>>>>>>>>>>>>>>> maybe https://gerrit.ovirt.org/#/c/104825/ was missing.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Looks like https://gerrit.ovirt.org/#/c/104825/ is ignored by OST.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Maybe not. You re-triggered with [1], which really missed this patch.
>>>>>>>>>>>>>>>>>>>>>>>> I did a rebase and am now running with this patch in build #6132 [2].
>>>>>>>>>>>>>>>>>>>>>>>> Let's wait for it to see if gerrit #104825 helps.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> [1] https://jenkins.ovirt.org/job/standard-manual-runner/909/
>>>>>>>>>>>>>>>>>>>>>>>> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6132/
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Miguel, do you think merging
>>>>>>>>>>>>>>>>>>>>>>>>> https://gerrit.ovirt.org/#/c/104495/15/common/yum-repos/ovirt-master-host-cq.repo.in
>>>>>>>>>>>>>>>>>>>>>>>>> would solve this?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I've split the patch Dominik mentions above in two, one of them adding
>>>>>>>>>>>>>>>>>>>>>>> the nmstate / NetworkManager copr repos - [3].
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Let's see if it fixes it.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> It fixes the original issue, but OST still fails in
>>>>>>>>>>>>>>>>>>>>>> 098_ovirt_provider_ovn.use_ovn_provider:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I think Dominik was looking into this issue; +Dominik Holler please confirm.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Let me know if you need any help, Dominik.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>>> The problem is that the hosts lost connection to storage:
>>>>>>>>>>>>>>>>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/exported-artifacts/test_logs/basic-suite-master/post-098_ovirt_provider_ovn.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 2019-11-22 05:39:12,326-0500 DEBUG (jsonrpc/5) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/36001405107ea8b4e3ac4ddeb3e19890f$|^/dev/mapper/360014054924c91df75e41178e4b8a80c$|^/dev/mapper/3600140561c0d02829924b77ab7323f17$|^/dev/mapper/3600140582feebc04ca5409a99660dbbc$|^/dev/mapper/36001405c3c53755c13c474dada6be354$|", "r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None) (commands:153)
>>>>>>>>>>>>>>>>>>>> 2019-11-22 05:39:12,415-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata (monitor:501)
>>>>>>>>>>>>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>>>>>>>>>>>>   File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
>>>>>>>>>>>>>>>>>>>>     delay = result.delay()
>>>>>>>>>>>>>>>>>>>>   File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
>>>>>>>>>>>>>>>>>>>>     raise exception.MiscFileReadException(self.path, self.rc, self.err)
>>>>>>>>>>>>>>>>>>>> vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata', 1, 'Read timeout')
>>>>>>>>>>>>>>>>>>>> 2019-11-22 05:39:12,416-0500 INFO (check/loop) [storage.Monitor] Domain d10879c6-8de1-40ba-87fa-f447844eed2a became INVALID (monitor:472)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I failed to reproduce this locally for analysis; I will try again. Any hints welcome.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://gerrit.ovirt.org/#/c/104925/1/ shows that 008_basic_ui_sanity.py triggers the problem.
>>>>>>>>>>>>>>>>>>> Is there someone with knowledge about the basic_ui_sanity around?
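Side note: vdsm's path checker is essentially a tiny direct-I/O read of the domain metadata file, repeated every few seconds, so the failing check can be approximated by hand on the affected host (a sketch; the path is the one from the log above):

    # roughly what the vdsm checker does; iflag=direct bypasses the page
    # cache, so a hang here means the NFS mount itself is stuck rather
    # than a caching artifact
    time dd if=/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata \
            of=/dev/null bs=4096 count=1 iflag=direct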
>>>>>>>>>>>>>>>>> How do you think it's related? By commenting out the UI sanity tests and
>>>>>>>>>>>>>>>>> seeing OST finish successfully?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Looking at the 6134 run you were discussing:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - timing of the UI sanity set-up [1]:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 11:40:20 @ Run test: 008_basic_ui_sanity.py:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - timing of the first encountered storage error [2]:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2019-11-22 05:39:12,415-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata (monitor:501)
>>>>>>>>>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>>>>>>>>>   File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
>>>>>>>>>>>>>>>>>     delay = result.delay()
>>>>>>>>>>>>>>>>>   File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
>>>>>>>>>>>>>>>>>     raise exception.MiscFileReadException(self.path, self.rc, self.err)
>>>>>>>>>>>>>>>>> vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata', 1, 'Read timeout')
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Timezone difference aside, it seems to me that these storage errors occurred
>>>>>>>>>>>>>>>>> before doing anything UI-related.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You are right, a time.sleep(8*60) in
>>>>>>>>>>>>>> https://gerrit.ovirt.org/#/c/104925/2
>>>>>>>>>>>>>> triggers the issue the same way.
>>>>>>>>>>>>
>>>>>>>>>>>> So is this a test issue, assuming that the UI tests can complete in less
>>>>>>>>>>>> than 8 minutes?
>>>>>>>>>>>
>>>>>>>>>>> To my eyes this looks like storage just stops working after some time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nir or Steve, can you please confirm that this is a storage problem?
>>>>>>>>>>>>
>>>>>>>>>>>> Why do you think we have a storage problem?
>>>>>>>>>>>
>>>>>>>>>>> I understand from the posted log snippets that they say that the storage is
>>>>>>>>>>> not accessible anymore,
>>>>>>>>>>
>>>>>>>>>> No, so far one read timeout was reported; this does not mean storage is not
>>>>>>>>>> available anymore. It can be a temporary issue that does not harm anything.
>>>>>>>>>>>
>>>>>>>>>>> while the host is still responsive.
>>>>>>>>>>> This might be triggered by something outside storage, e.g. the network
>>>>>>>>>>> providing the storage stopped working.
>>>>>>>>>>> But I think a possible next step in analysing this issue would be to find
>>>>>>>>>>> the reason why storage is not happy.
>>>>>>>>>
>>>>>>>>> Sounds like there was a miscommunication in this thread.
>>>>>>>>> I'll try to address all of your points; please let me know if something is
>>>>>>>>> missing or not clearly expressed.
>>>>>>>>>>
>>>>>>>>>> First step is to understand which test fails,
>>>>>>>>>
>>>>>>>>> 098_ovirt_provider_ovn.use_ovn_provider
>>>>>>>>>>
>>>>>>>>>> and why. This can be done by the owner of the test,
>>>>>>>>>
>>>>>>>>> The test was added by the network team.
>>>>>>>>>>
>>>>>>>>>> understanding what the test does
>>>>>>>>>
>>>>>>>>> The test tries to add a vNIC.
>>>>>>>>>>
>>>>>>>>>> and what is the expected system behavior.
>>>>>>>>>
>>>>>>>>> It is expected that adding a vNIC works, because the VM should be up.
>>>>>>>>
>>>>>>>> What was the actual behavior?
>>>>>>>>>>
>>>>>>>>>> If the owner of the test thinks that the test failed because of a storage issue
>>>>>>>>>
>>>>>>>>> I am not sure who is the owner, but I do.
>>>>>>>>
>>>>>>>> Can you explain how a vNIC failed because of a storage issue?
>>>>>>>
>>>>>>> The test fails with:
>>>>>>>
>>>>>>> Cannot add a Network Interface when VM is not Down, Up or Image-Locked.
>>>>>>>
>>>>>>> engine.log says:
>>>>>>> {"jsonrpc": "2.0", "method": "|virt|VM_status|308bd254-9af9-4570-98ea-822609550acf", "params": {"308bd254-9af9-4570-98ea-822609550acf": {"status": "Paused", "pauseCode": "EOTHER", "ioerror": {"alias": "ua-953dd722-5e8b-4b24-bccd-a2a5d5befeb6", "name": "vda", "path": "/rhev/data-center/38c691d4-8556-4882-8f04-a88dff5d0973/bcd1622c-876b-460c-95a7-d09536c42ffe/images/953dd722-5e8b-4b24-bccd-a2a5d5befeb6/dcb5fec4-f219-4d3f-986c-628b0d00b349"}}, "notify_time": 4298388570}}
>>>>>>
>>>>>> So you think adding the vNIC failed because the VM was paused?
>>>>>
>>>>> Yes, because of the error message "Cannot add a Network Interface when VM
>>>>> is not Down, Up or Image-Locked."
>>>>>>>
>>>>>>> vdsm.log says:
>>>>>>>
>>>>>>> 2019-11-20 10:51:06,026-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata (monitor:501)
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
>>>>>>>     delay = result.delay()
>>>>>>>   File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
>>>>>>>     raise exception.MiscFileReadException(self.path, self.rc, self.err)
>>>>>>> vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata', 1, 'Read timeout')
>>>>>>
>>>>>> Is this related to the paused VM?
>>>>>
>>>>> The log entry '{"status": "Paused", "pauseCode": "EOTHER", "ioerror"' makes
>>>>> me think so.
>>>>>>
>>>>>> You did not provide a timestamp for the engine event above.
>>>>>
>>>>> I can't find last week's logs; maybe they are already gone.
>>>>> Please find more recent logs in
>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/6492
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>> 2019-11-20 10:51:56,249-0500 WARN (check/loop) [storage.check] Checker '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/64daa060-1d83-46b9-b7e8-72a902e1134b/dom_md/metadata' is blocked for 60.00 seconds (check:282)
>>>>>>> 2019-11-20 10:51:56,885-0500 ERROR (monitor/775b710) [storage.Monitor] Error checking domain 775b7102-7f2c-4eee-a4d0-a41b55451f7e (monitor:427)
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 408, in _checkDomainStatus
>>>>>>>     self.domain.selftest()
>>>>>>>   File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 710, in selftest
>>>>>>>     self.oop.os.statvfs(self.domaindir)
>>>>>>>   File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 242, in statvfs
>>>>>>>     return self._iop.statvfs(path)
>>>>>>>   File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 479, in statvfs
>>>>>>>     resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
>>>>>>>   File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 442, in _sendCommand
>>>>>>>     raise Timeout(os.strerror(errno.ETIMEDOUT))
>>>>>>> ioprocess.Timeout: Connection timed out
>>>>>>
>>>>>> This shows that storage was not accessible for 60 seconds (ioprocess uses a
>>>>>> 60-second timeout).
>>>>>>
>>>>>> A 60-second timeout is bad. If we have leases on this storage domain (e.g.
>>>>>> the SPM lease), they will expire 20 seconds after this event, and vdsm on
>>>>>> the SPM host will be killed.
>>>>>>
>>>>>> Do we have network tests changing the network used by the NFS storage
>>>>>> domain before this event?
>>>>>
>>>>> No.
>>>>>>
>>>>>> What changed in the network tests or code since OST was last successful?
>>>>>
>>>>> I am not aware of a change which might be relevant.
>>>>> Maybe the fact that the hosts are on CentOS 8, while the engine (the
>>>>> storage) is on CentOS 7, is relevant.
>>>>> Also, the occurrence of this issue seems not to be 100% deterministic; I
>>>>> guess because it is timing related.
>>>>>
>>>>> The error is reproducible locally by running OST and just keeping the
>>>>> environment alive after basic-suite-master succeeds.
>>>>> After some time, the storage will become inaccessible.
>>>>>>>>
>>>>>>>> Can you explain how adding an 8-minute sleep instead of the UI tests
>>>>>>>> reproduced the issue?
>>>>>>>
>>>>>>> This shows that the issue is not triggered by the UI test, but maybe by
>>>>>>> passing time.
>>>>>>
>>>>>> Do we run the OVN tests after the UI tests?
>>>>>>>>>>
>>>>>>>>>> someone from storage can look at this.
>>>>>>>>>
>>>>>>>>> Thanks, I would appreciate this.
>>>>>>>>>>
>>>>>>>>>> But the fact that adding a long sleep reproduces the issue means it is not
>>>>>>>>>> related in any way to storage.
>>>>>>>>>>
>>>>>>>>>> Nir
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I remember talking with Steven Rosenberg on IRC a couple of days ago about
>>>>>>>>>>>>>>>>> some storage metadata issues and he said he got a response from Nir, that
>>>>>>>>>>>>>>>>> "it's a known issue".
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Nir, Amit, can you comment on this?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The error mentioned here is not a vdsm error but a warning about storage
>>>>>>>>>>>>>>>> accessibility. We should convert the tracebacks to warnings.
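On the lease-expiry estimate above: with sanlock's default 10-second io_timeout, a lease is lost after roughly 80 seconds without renewal, which matches the 60 + 20 seconds mentioned. While such a blockage is happening, lease renewals can be watched directly on the host (standard sanlock tooling; a sketch):

    # lockspaces and resources sanlock currently holds for vdsm
    sanlock client status

    # renewal errors and timeouts end up here
    grep -E 'renewal|timeout|fail' /var/log/sanlock.log | tail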
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The reason for such an issue can be a misconfigured network (maybe the
>>>>>>>>>>>>>>>> network team is testing negative flows?),
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> No.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> or some issue in the NFS server.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The only hint I found is
>>>>>>>>>>>>>>> "Exiting Time2Retain handler because session_reinstatement=1",
>>>>>>>>>>>>>>> but I have no idea what this means or whether it is relevant at all.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One read timeout is not an issue. We have a real issue only if we see
>>>>>>>>>>>>>>>> consistent read timeouts or errors for a couple of minutes; after that the
>>>>>>>>>>>>>>>> engine can deactivate the storage domain, or some hosts, if only these
>>>>>>>>>>>>>>>> hosts have trouble accessing storage.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In OST we never expect such conditions, since we don't test negative flows
>>>>>>>>>>>>>>>> and we should have good connectivity with the VMs running on the same host.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ack, this seems to be the problem.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Nir
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/console
>>>>>>>>>>>>>>>>> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/exported-artifacts/test_logs/basic-suite-master/post-098_ovirt_provider_ovn.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Marcin, could you please take a look?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> [3] - https://gerrit.ovirt.org/#/c/104897/
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Who installs this rpm in OST?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I do not understand the question.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> [...]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> See [2] for full error.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Can someone please take a look?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Vojta
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6128/
>>>>>>>>>>>>>>>>>>>>>>>>>>>> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6128/artifact/exported-artifacts/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Martin Perina
>>>>>>>>>>>>>>>>>> Manager, Software Engineering
>>>>>>>>>>>>>>>>>> Red Hat Czech s.r.o.
>>>>
>>>> --
>>>> Martin Perina
>>>> Manager, Software Engineering
>>>> Red Hat Czech s.r.o.
_______________________________________________
Devel mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/D2GXHKZICATF4C6Z52ZKZ544WNVZQOA5/
