On Wed, Nov 27, 2019 at 1:27 PM Marcin Sobczyk <[email protected]> wrote:

> Hi,
>
> I ran OST on my physical server.
> I'm experiencing probably the same issues as described in the thread below.
>
> On one of the hosts:
>
> [root@lago-basic-suite-master-host-0 tmp]# ls -l /rhev/data-center/mnt/
> ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1':
> Operation not permitted
> ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2':
> Operation not permitted
> ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_exported':
> Operation not permitted
> total 0
> d?????????? ? ?    ?    ?            ? 192.168.200.4:_exports_nfs_exported
> d?????????? ? ?    ?    ?            ? 192.168.200.4:_exports_nfs_share1
> d?????????? ? ?    ?    ?            ? 192.168.200.4:_exports_nfs_share2
> drwxr-xr-x. 3 vdsm kvm 50 Nov 27 04:22 blockSD
>
> I think there's some problem with the nfs shares on engine.
>
We saw it recently with the move to RHEL 8. Nir, isn't that the same issue
with the NFS squashing?

>
> I can mount engine's nfs shares directly from server's native OS:
>
> ➜  /tmp mkdir -p /tmp/aaa && mount "192.168.200.4:/exports/nfs/share1"
> /tmp/aaa
> ➜  /tmp ls -l /tmp/aaa
> total 4
> drwxr-xr-x. 5 36 kvm 4096 Nov 27 10:18 3332759c-a943-4fbd-80aa-a5f72cd87c7c
> ➜  /tmp
>
> But trying to do that from one of the hosts fails:
>
> [root@lago-basic-suite-master-host-0 tmp]# mkdir -p /tmp/aaa && mount -v
> "192.168.200.4:/exports/nfs/share1" /tmp/aaa
> mount.nfs: timeout set for Wed Nov 27 06:26:19 2019
> mount.nfs: trying text-based options
> 'vers=4.2,addr=192.168.200.4,clientaddr=192.168.201.2'
> mount.nfs: mount(2): Operation not permitted
> mount.nfs: trying text-based options 'addr=192.168.200.4'
> mount.nfs: prog 100003, trying vers=3, prot=6
> mount.nfs: portmap query failed: RPC: Remote system error - No route to
> host
>
> On the engine side, '/var/log/messages' seems to be flooded with nfs
> issues, example failures:
>
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel:
> __find_in_sessionid_hashtbl: 1574853151:3430996717:11:0
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd4_sequence:
> slotid 0
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: check_slot_seqid
> enter. seqid 405 slot_seqid 404
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op
> ffff9042fc202080 opcnt 3 #1: 53: status 0
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op
> #2/3: 22 (OP_PUTFH)
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: fh_verify(28:
> 00070001 00340001 00000000 e50ae88b 5c44c45a 2b7c3991)
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: request from
> insecure port 192.168.200.1, port=51529!
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op
> ffff9042fc202080 opcnt 3 #2: 22: status 1
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound
> returned 1
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: -->
> nfsd4_store_cache_entry slot ffff9042c4d97000
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: renewing client
> (clientid 5dde5a1f/cc80daed)
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd_dispatch: vers
> 4 proc 1
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op
> #1/3: 53 (OP_SEQUENCE)
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel:
> __find_in_sessionid_hashtbl: 1574853151:3430996717:11:0
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd4_sequence:
> slotid 0
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: check_slot_seqid
> enter. seqid 406 slot_seqid 405
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op
> ffff9042fc202080 opcnt 3 #1: 53: status 0
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op
> #2/3: 22 (OP_PUTFH)
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: fh_verify(28:
> 00070001 00340001 00000000 e50ae88b 5c44c45a 2b7c3991)
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: request from
> insecure port 192.168.200.1, port=51529!
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op
> ffff9042fc202080 opcnt 3 #2: 22: status 1
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound
> returned 1
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: -->
> nfsd4_store_cache_entry slot ffff9042c4d97000
> Nov 27 06:25:25 lago-basic-suite-master-engine kernel: renewing client
> (clientid 5dde5a1f/cc80daed)
>
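One detail in the engine log above: every "request from insecure port" line is followed by OP_PUTFH returning status 1, and the request arrives from 192.168.200.1 with source port 51529, while the host's own mount attempt reported clientaddr=192.168.201.2. An NFS export with the default "secure" option rejects requests whose source port is 1024 or higher, so address translation between the two networks might be rewriting the port. That is a guess, not a confirmed diagnosis. Below is a minimal Python sketch for counting these rejections per client address; the function name is made up for illustration and the regex is based only on the log lines quoted above.

```python
import re
import sys
from collections import Counter

# Message format as it appears in the quoted /var/log/messages excerpt.
INSECURE_RE = re.compile(
    r"nfsd: request from insecure port "
    r"(?P<addr>\d+(?:\.\d+){3}), port=(?P<port>\d+)!"
)


def insecure_port_clients(lines):
    """Count 'insecure port' rejections per client address.

    `lines` is any iterable of log lines (hypothetical helper, not part
    of any oVirt or NFS tooling).
    """
    counts = Counter()
    for line in lines:
        match = INSECURE_RE.search(line)
        if match:
            counts[match.group("addr")] += 1
    return counts


if __name__ == "__main__":
    # Usage: python3 insecure_ports.py /var/log/messages
    with open(sys.argv[1], errors="replace") as log:
        for addr, count in insecure_port_clients(log).most_common():
            print(f"{addr}: {count} rejected requests")
```

If a single address such as the bridge gateway dominates the output, that would be consistent with the port-rewriting guess, but it would not prove it.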
> Regards, Marcin
>
> On 11/26/19 8:40 PM, Martin Perina wrote:
>
> I've just merged https://gerrit.ovirt.org/105111 which only silences the
> issue, but we really need to unblock OST, as it has been suffering from this
> for more than 2 weeks now.
>
> Tal/Nir, could someone really investigate why the storage becomes
> unavailable after some time? It may be caused by the recent switch of hosts to
> CentOS 8, but it may also be unrelated.
>
> Thanks,
> Martin
>
>
> On Tue, Nov 26, 2019 at 9:17 AM Dominik Holler <[email protected]> wrote:
>
>>
>>
>> On Mon, Nov 25, 2019 at 7:12 PM Nir Soffer <[email protected]> wrote:
>>
>>> On Mon, Nov 25, 2019 at 7:15 PM Dominik Holler <[email protected]>
>>> wrote:
>>> >
>>> >
>>> >
>>> > On Mon, Nov 25, 2019 at 6:03 PM Nir Soffer <[email protected]> wrote:
>>> >>
>>> >> On Mon, Nov 25, 2019 at 6:48 PM Dominik Holler <[email protected]>
>>> wrote:
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Mon, Nov 25, 2019 at 5:16 PM Nir Soffer <[email protected]>
>>> wrote:
>>> >> >>
>>> >> >> On Mon, Nov 25, 2019 at 6:05 PM Dominik Holler <[email protected]>
>>> wrote:
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > On Mon, Nov 25, 2019 at 4:50 PM Nir Soffer <[email protected]>
>>> wrote:
>>> >> >> >>
>>> >> >> >> On Mon, Nov 25, 2019 at 11:00 AM Dominik Holler <
>>> [email protected]> wrote:
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> > On Fri, Nov 22, 2019 at 8:57 PM Dominik Holler <
>>> [email protected]> wrote:
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> On Fri, Nov 22, 2019 at 5:54 PM Dominik Holler <
>>> [email protected]> wrote:
>>> >> >> >> >>>
>>> >> >> >> >>>
>>> >> >> >> >>>
>>> >> >> >> >>> On Fri, Nov 22, 2019 at 5:48 PM Nir Soffer <
>>> [email protected]> wrote:
>>> >> >> >> >>>>
>>> >> >> >> >>>>
>>> >> >> >> >>>>
>>> >> >> >> >>>> On Fri, Nov 22, 2019, 18:18 Marcin Sobczyk <
>>> [email protected]> wrote:
>>> >> >> >> >>>>>
>>> >> >> >> >>>>>
>>> >> >> >> >>>>>
>>> >> >> >> >>>>> On 11/22/19 4:54 PM, Martin Perina wrote:
>>> >> >> >> >>>>>
>>> >> >> >> >>>>>
>>> >> >> >> >>>>>
>>> >> >> >> >>>>> On Fri, Nov 22, 2019 at 4:43 PM Dominik Holler <
>>> [email protected]> wrote:
>>> >> >> >> >>>>>>
>>> >> >> >> >>>>>>
>>> >> >> >> >>>>>> On Fri, Nov 22, 2019 at 12:17 PM Dominik Holler <
>>> [email protected]> wrote:
>>> >> >> >> >>>>>>>
>>> >> >> >> >>>>>>>
>>> >> >> >> >>>>>>>
>>> >> >> >> >>>>>>> On Fri, Nov 22, 2019 at 12:00 PM Miguel Duarte de Mora
>>> Barroso <[email protected]> wrote:
>>> >> >> >> >>>>>>>>
>>> >> >> >> >>>>>>>> On Fri, Nov 22, 2019 at 11:54 AM Vojtech Juranek <
>>> [email protected]> wrote:
>>> >> >> >> >>>>>>>> >
>>> >> >> >> >>>>>>>> > On Friday, 22 November 2019 9:56:56 CET Miguel
>>> Duarte de Mora Barroso wrote:
>>> >> >> >> >>>>>>>> > > On Fri, Nov 22, 2019 at 9:49 AM Vojtech Juranek <
>>> [email protected]>
>>> >> >> >> >>>>>>>> > > wrote:
>>> >> >> >> >>>>>>>> > > >
>>> >> >> >> >>>>>>>> > > >
>>> >> >> >> >>>>>>>> > > > On Friday, 22 November 2019 9:41:26 CET Dominik
>>> Holler wrote:
>>> >> >> >> >>>>>>>> > > >
>>> >> >> >> >>>>>>>> > > > > On Fri, Nov 22, 2019 at 8:40 AM Dominik Holler
>>> <[email protected]>
>>> >> >> >> >>>>>>>> > > > > wrote:
>>> >> >> >> >>>>>>>> > > > >
>>> >> >> >> >>>>>>>> > > > > > On Thu, Nov 21, 2019 at 10:54 PM Nir Soffer <
>>> [email protected]>
>>> >> >> >> >>>>>>>> > > > > > wrote:
>>> >> >> >> >>>>>>>> > > > > >
>>> >> >> >> >>>>>>>> > > > > >> On Thu, Nov 21, 2019 at 11:24 PM Vojtech
>>> Juranek
>>> >> >> >> >>>>>>>> > > > > >> <[email protected]>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> wrote:
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> > Hi,
>>> >> >> >> >>>>>>>> > > > > >> > OST fails (see e.g. [1]) in
>>> 002_bootstrap.check_update_host. It
>>> >> >> >> >>>>>>>> > > > > >> > fails
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> with
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> >  FAILED! => {"changed": false,
>>> "failures": [], "msg": "Depsolve
>>> >> >> >> >>>>>>>> > > > > >> >  Error
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> occured:
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> > \n Problem 1: cannot install the best
>>> update candidate for package
>>> >> >> >> >>>>>>>> > > > > >> > vdsm-
>>> >> >> >> >>>>>>>> > > > > >> >
>>> network-4.40.0-1236.git63ea8cb8b.el8.x86_64\n  - nothing provides
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> nmstate
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> > needed by
>>> vdsm-network-4.40.0-1271.git524e08c8a.el8.x86_64\n
>>> >> >> >> >>>>>>>> > > > > >> > Problem 2:
>>> >> >> >> >>>>>>>> > > > > >> > package
>>> vdsm-python-4.40.0-1271.git524e08c8a.el8.noarch requires
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> vdsm-network
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> > = 4.40.0-1271.git524e08c8a.el8, but none
>>> of the providers can be
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> installed\n
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> > - cannot install the best update
>>> candidate for package vdsm-
>>> >> >> >> >>>>>>>> > > > > >> >
>>> python-4.40.0-1236.git63ea8cb8b.el8.noarch\n  - nothing provides
>>> >> >> >> >>>>>>>> > > > > >> > nmstate
>>> >> >> >> >>>>>>>> > > > > >> > needed by
>>> vdsm-network-4.40.0-1271.git524e08c8a.el8.x86_64\n
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> nmstate should be provided by copr repo
>>> enabled by
>>> >> >> >> >>>>>>>> > > > > >> ovirt-release-master.
>>> >> >> >> >>>>>>>> > > > > >
>>> >> >> >> >>>>>>>> > > > > >
>>> >> >> >> >>>>>>>> > > > > >
>>> >> >> >> >>>>>>>> > > > > > I re-triggered as
>>> >> >> >> >>>>>>>> > > > > >
>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6131
>>> >> >> >> >>>>>>>> > > > > > maybe
>>> >> >> >> >>>>>>>> > > > > > https://gerrit.ovirt.org/#/c/104825/
>>> >> >> >> >>>>>>>> > > > > > was missing
>>> >> >> >> >>>>>>>> > > > >
>>> >> >> >> >>>>>>>> > > > >
>>> >> >> >> >>>>>>>> > > > >
>>> >> >> >> >>>>>>>> > > > > Looks like
>>> >> >> >> >>>>>>>> > > > > https://gerrit.ovirt.org/#/c/104825/ is
>>> ignored by OST.
>>> >> >> >> >>>>>>>> > > >
>>> >> >> >> >>>>>>>> > > >
>>> >> >> >> >>>>>>>> > > >
>>> >> >> >> >>>>>>>> > > > maybe not. You re-triggered with [1], which
>>> really missed this patch.
>>> >> >> >> >>>>>>>> > > > I did a rebase and now running with this patch
>>> in build #6132 [2]. Let's
>>> >> >> >> >>>>>>>> > > > wait
>>> >> >> >> >>>>>>>> >  for it to see if gerrit #104825 helps.
>>> >> >> >> >>>>>>>> > > >
>>> >> >> >> >>>>>>>> > > >
>>> >> >> >> >>>>>>>> > > >
>>> >> >> >> >>>>>>>> > > > [1]
>>> https://jenkins.ovirt.org/job/standard-manual-runner/909/
>>> >> >> >> >>>>>>>> > > > [2]
>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6132/
>>> >> >> >> >>>>>>>> > > >
>>> >> >> >> >>>>>>>> > > >
>>> >> >> >> >>>>>>>> > > >
>>> >> >> >> >>>>>>>> > > > > Miguel, do you think merging
>>> >> >> >> >>>>>>>> > > > >
>>> >> >> >> >>>>>>>> > > > >
>>> >> >> >> >>>>>>>> > > > >
>>> >> >> >> >>>>>>>> > > > >
>>> https://gerrit.ovirt.org/#/c/104495/15/common/yum-repos/ovirt-master-hos
>>> >> >> >> >>>>>>>> > > > > t-cq
>>> >> >> >> >>>>>>>> >  .repo.in
>>> >> >> >> >>>>>>>> > > > >
>>> >> >> >> >>>>>>>> > > > >
>>> >> >> >> >>>>>>>> > > > >
>>> >> >> >> >>>>>>>> > > > > would solve this?
>>> >> >> >> >>>>>>>> > >
>>> >> >> >> >>>>>>>> > >
>>> >> >> >> >>>>>>>> > > I've split the patch Dominik mentions above in
>>> two, one of them adding
>>> >> >> >> >>>>>>>> > > the nmstate / networkmanager copr repos - [3].
>>> >> >> >> >>>>>>>> > >
>>> >> >> >> >>>>>>>> > > Let's see if it fixes it.
>>> >> >> >> >>>>>>>> >
>>> >> >> >> >>>>>>>> > it fixes original issue, but OST still fails in
>>> >> >> >> >>>>>>>> > 098_ovirt_provider_ovn.use_ovn_provider:
>>> >> >> >> >>>>>>>> >
>>> >> >> >> >>>>>>>> >
>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134
>>> >> >> >> >>>>>>>>
>>> >> >> >> >>>>>>>> I think Dominik was looking into this issue; +Dominik
>>> Holler please confirm.
>>> >> >> >> >>>>>>>>
>>> >> >> >> >>>>>>>> Let me know if you need any help Dominik.
>>> >> >> >> >>>>>>>
>>> >> >> >> >>>>>>>
>>> >> >> >> >>>>>>>
>>> >> >> >> >>>>>>> Thanks.
>>> >> >> >> >>>>>>> The problem is that the hosts lost connection to
>>> storage:
>>> >> >> >> >>>>>>>
>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/exported-artifacts/test_logs/basic-suite-master/post-098_ovirt_provider_ovn.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log
>>> :
>>> >> >> >> >>>>>>>
>>> >> >> >> >>>>>>> 2019-11-22 05:39:12,326-0500 DEBUG (jsonrpc/5)
>>> [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n
>>> /sbin/lvm vgs --config 'devices {  preferred_names=["^/dev/mapper/"]
>>> ignore_suspended_devices=1  write_cache_state=0
>>> disable_after_error_count=3
>>> filter=["a|^/dev/mapper/36001405107ea8b4e3ac4ddeb3e19890f$|^/dev/mapper/360014054924c91df75e41178e4b8a80c$|^/dev/mapper/3600140561c0d02829924b77ab7323f17$|^/dev/mapper/3600140582feebc04ca5409a99660dbbc$|^/dev/mapper/36001405c3c53755c13c474dada6be354$|",
>>> "r|.*|"] } global {  locking_type=1  prioritise_write_locks=1
>>> wait_for_locks=1  use_lvmetad=0 } backup {  retain_min=50  retain_days=0 }'
>>> --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o
>>> uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
>>> (cwd None) (commands:153)
>>> >> >> >> >>>>>>> 2019-11-22 05:39:12,415-0500 ERROR (check/loop)
>>> [storage.Monitor] Error checking path 
>>> /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata
>>> (monitor:501)
>>> >> >> >> >>>>>>> Traceback (most recent call last):
>>> >> >> >> >>>>>>>   File
>>> "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in
>>> _pathChecked
>>> >> >> >> >>>>>>>     delay = result.delay()
>>> >> >> >> >>>>>>>   File
>>> "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
>>> >> >> >> >>>>>>>     raise exception.MiscFileReadException(self.path,
>>> self.rc, self.err)
>>> >> >> >> >>>>>>> vdsm.storage.exception.MiscFileReadException: Internal
>>> file read failure: 
>>> ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata',
>>> 1, 'Read timeout')
>>> >> >> >> >>>>>>> 2019-11-22 05:39:12,416-0500 INFO  (check/loop)
>>> [storage.Monitor] Domain d10879c6-8de1-40ba-87fa-f447844eed2a became
>>> INVALID (monitor:472)
>>> >> >> >> >>>>>>>
>>> >> >> >> >>>>>>>
>>> >> >> >> >>>>>>> I failed to reproduce this locally to analyze it. I will try
>>> again; any hints are welcome.
>>> >> >> >> >>>>>>>
>>> >> >> >> >>>>>>
>>> >> >> >> >>>>>>
>>> >> >> >> >>>>>>
>>> >> >> >> >>>>>> https://gerrit.ovirt.org/#/c/104925/1/ shows that
>>> 008_basic_ui_sanity.py triggers the problem.
>>> >> >> >> >>>>>> Is there someone with knowledge about the
>>> basic_ui_sanity around?
>>> >> >> >> >>>>>
>>> >> >> >> >>>>> How do you think it's related? By commenting out the ui
>>> sanity tests and seeing OST with successful finish?
>>> >> >> >> >>>>>
>>> >> >> >> >>>>> Looking at 6134 run you were discussing:
>>> >> >> >> >>>>>
>>> >> >> >> >>>>>  - timing of the ui sanity set-up [1]:
>>> >> >> >> >>>>>
>>> >> >> >> >>>>> 11:40:20 @ Run test: 008_basic_ui_sanity.py:
>>> >> >> >> >>>>>
>>> >> >> >> >>>>> - timing of first encountered storage error [2]:
>>> >> >> >> >>>>>
>>> >> >> >> >>>>> 2019-11-22 05:39:12,415-0500 ERROR (check/loop)
>>> [storage.Monitor] Error checking path 
>>> /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata
>>> (monitor:501)
>>> >> >> >> >>>>> Traceback (most recent call last):
>>> >> >> >> >>>>>   File
>>> "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in
>>> _pathChecked
>>> >> >> >> >>>>>     delay = result.delay()
>>> >> >> >> >>>>>   File
>>> "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
>>> >> >> >> >>>>>     raise exception.MiscFileReadException(self.path,
>>> self.rc, self.err)
>>> >> >> >> >>>>> vdsm.storage.exception.MiscFileReadException: Internal
>>> file read failure: 
>>> ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata',
>>> 1, 'Read timeout')
>>> >> >> >> >>>>>
>>> >> >> >> >>>>> Timezone difference aside, it seems to me that these
>>> storage errors occurred before doing anything UI-related.
>>> >> >> >> >>
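The comparison above depends on which UTC offset the Jenkins console clock uses, and the console excerpt does not show it. A small Python sketch, using only the two timestamps quoted above, makes the dependency explicit. The console offset is an assumption passed in as a parameter, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# First storage error, copied from vdsm.log; the offset is part of the log line.
STORAGE_ERROR = datetime.strptime(
    "2019-11-22 05:39:12,415-0500", "%Y-%m-%d %H:%M:%S,%f%z"
)


def lead_seconds(console_utc_offset_hours):
    """Seconds by which the storage error precedes the UI test start.

    The Jenkins console shows "11:40:20" without an offset, so the
    offset is an ASSUMPTION supplied by the caller. A negative result
    means the error happened after the UI test had started.
    """
    console_tz = timezone(timedelta(hours=console_utc_offset_hours))
    ui_start = datetime(2019, 11, 22, 11, 40, 20, tzinfo=console_tz)
    return (ui_start - STORAGE_ERROR).total_seconds()


if __name__ == "__main__":
    for offset in (0, 1, 2):
        print(f"console at UTC{offset:+d}: {lead_seconds(offset):.3f} s")
```

With the console at UTC or UTC+1 the result is positive, meaning the error precedes the UI test. At UTC+2 or further east it turns negative, so the ordering is only settled once the console's offset is known.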
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> You are right, a time.sleep(8*60) in
>>> >> >> >> >> https://gerrit.ovirt.org/#/c/104925/2
>>> >> >> >> >> triggers the issue the same way.
>>> >> >> >>
>>> >> >> >> So this is a test issue, assuming that the UI tests can
>>> complete in
>>> >> >> >> less than 8 minutes?
>>> >> >> >>
>>> >> >> >
>>> >> >> > To my eyes this looks like storage just stops working after
>>> some time.
>>> >> >> >
>>> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >
>>> >> >> >> > Nir or Steve, can you please confirm that this is a storage
>>> problem?
>>> >> >> >>
>>> >> >> >> Why do you think we have a storage problem?
>>> >> >> >>
>>> >> >> >
>>> >> >> > I understand from the posted log snippets that they say that the
>>> storage is not accessible anymore,
>>> >> >>
>>> >> >> No, so far one read timeout was reported, this does not mean
>>> storage
>>> >> >> is not available anymore.
>>> >> >> It can be a temporary issue that does not harm anything.
>>> >> >>
>>> >> >> > while the host is still responsive.
>>> >> >> > This might be triggered by something outside storage, e.g. the
>>> network providing the storage stopped working,
>>> >> >> > But I think a possible next step in analysing this issue would
>>> be to find the reason why storage is not happy.
>>> >> >>
>>> >> >
>>> >> > Sounds like there was a miscommunication in this thread.
>>> >> > I will try to address all of your points; please let me know if
>>> something is missing or not clearly expressed.
>>> >> >
>>> >> >>
>>> >> >> First step is to understand which test fails,
>>> >> >
>>> >> >
>>> >> > 098_ovirt_provider_ovn.use_ovn_provider
>>> >> >
>>> >> >>
>>> >> >> and why. This can be done by the owner of the test,
>>> >> >
>>> >> >
>>> >> > The test was added by the network team.
>>> >> >
>>> >> >>
>>> >> >> understanding what the test does
>>> >> >
>>> >> >
>>> >> > The test tries to add a vNIC.
>>> >> >
>>> >> >>
>>> >> >> and what is the expected system behavior.
>>> >> >>
>>> >> >
>>> >> > It is expected that adding a vNIC works, because the VM should be
>>> up.
>>> >>
>>> >> What was the actual behavior?
>>> >>
>>> >> >> If the owner of the test thinks that the test failed because of a
>>> storage issue
>>> >> >
>>> >> >
>>> >> > I am not sure who is the owner, but I do.
>>> >>
>>> >> Can you explain how adding a vNIC failed because of a storage issue?
>>> >>
>>> >
>>> >
>>> > Test fails with:
>>> >
>>> > Cannot add a Network Interface when VM is not Down, Up or Image-Locked.
>>> >
>>> > engine.log says:
>>> > {"jsonrpc": "2.0", "method":
>>> "|virt|VM_status|308bd254-9af9-4570-98ea-822609550acf", "params":
>>> {"308bd254-9af9-4570-98ea-822609550acf": {"status": "Paused", "pauseCode":
>>> "EOTHER", "ioerror": {"alias": "ua-953dd722-5e8b-4b24-bccd-a2a5d5befeb6",
>>> "name": "vda", "path":
>>> "/rhev/data-center/38c691d4-8556-4882-8f04-a88dff5d0973/bcd1622c-876b-460c-95a7-d09536c42ffe/images/953dd722-5e8b-4b24-bccd-a2a5d5befeb6/dcb5fec4-f219-4d3f-986c-628b0d00b349"}},
>>> "notify_time": 4298388570}}
>>>
>>> So you think adding vNIC failed because the VM was paused?
>>>
>>>
>> Yes, because of the error message "Cannot add a Network Interface when VM
>> is not Down, Up or Image-Locked."
>>
>>
>>> > vdsm.log says:
>>> >
>>> > 2019-11-20 10:51:06,026-0500 ERROR (check/loop) [storage.Monitor]
>>> Error checking path 
>>> /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata
>>> (monitor:501)
>>> > Traceback (most recent call last):
>>> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py",
>>> line 499, in _pathChecked
>>> >     delay = result.delay()
>>> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line
>>> 391, in delay
>>> >     raise exception.MiscFileReadException(self.path, self.rc, self.err)
>>> > vdsm.storage.exception.MiscFileReadException: Internal file read
>>> failure: 
>>> ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata',
>>> 1, 'Read timeout')
>>>
>>> Is this related to the paused vm?
>>>
>>>
>> The log entry '{"status": "Paused", "pauseCode": "EOTHER", "ioerror"'
>> makes me think so.
>>
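One way to check the link between the paused VM and the monitor error is to compare storage domain UUIDs. Here is a minimal Python sketch that takes the domain UUID from the "ioerror" path in the engine.log event and from the monitored metadata path in vdsm.log. Both path layouts are assumptions read off the quoted paths, and the helper names are hypothetical. The engine event carries no timestamp, so this links the two by domain only, not by time.

```python
import json

# Event and path copied from the engine.log and vdsm.log excerpts above.
EVENT = (
    '{"jsonrpc": "2.0", "method": '
    '"|virt|VM_status|308bd254-9af9-4570-98ea-822609550acf", "params": '
    '{"308bd254-9af9-4570-98ea-822609550acf": {"status": "Paused", '
    '"pauseCode": "EOTHER", "ioerror": {"alias": '
    '"ua-953dd722-5e8b-4b24-bccd-a2a5d5befeb6", "name": "vda", "path": '
    '"/rhev/data-center/38c691d4-8556-4882-8f04-a88dff5d0973/'
    'bcd1622c-876b-460c-95a7-d09536c42ffe/images/'
    '953dd722-5e8b-4b24-bccd-a2a5d5befeb6/'
    'dcb5fec4-f219-4d3f-986c-628b0d00b349"}}, "notify_time": 4298388570}}'
)
MONITOR_PATH = (
    "/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/"
    "bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata"
)


def domain_of_image_path(path):
    # ASSUMED layout: /rhev/data-center/<pool>/<domain>/images/<image>/<volume>
    parts = path.split("/")
    return parts[parts.index("images") - 1]


def domain_of_monitor_path(path):
    # ASSUMED layout: .../mnt/<server:export>/<domain>/dom_md/metadata
    parts = path.split("/")
    return parts[parts.index("dom_md") - 1]


def paused_vm_domains(event_text):
    """Map each VM paused with an I/O error to the domain of the failing disk."""
    params = json.loads(event_text)["params"]
    return {
        vm_id: domain_of_image_path(state["ioerror"]["path"])
        for vm_id, state in params.items()
        if isinstance(state, dict)
        and state.get("status") == "Paused"
        and "ioerror" in state
    }


if __name__ == "__main__":
    monitored = domain_of_monitor_path(MONITOR_PATH)
    for vm_id, domain in paused_vm_domains(EVENT).items():
        print(vm_id, domain, "same domain" if domain == monitored else "different")
```

For the excerpts quoted here, both paths point at domain bcd1622c-876b-460c-95a7-d09536c42ffe, which supports the connection without proving the timing.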
>>
>>> You did not provide a timestamp for the engine event above.
>>>
>>>
>>
>> I can't find last week's logs; maybe they have been rotated out already.
>> Please find more recent logs in
>> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/6492
>>
>>
>>
>>> > ...
>>> >
>>> > 2019-11-20 10:51:56,249-0500 WARN  (check/loop) [storage.check]
>>> Checker 
>>> '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/64daa060-1d83-46b9-b7e8-72a902e1134b/dom_md/metadata'
>>> is blocked for 60.00 seconds (check:282)
>>> > 2019-11-20 10:51:56,885-0500 ERROR (monitor/775b710) [storage.Monitor]
>>> Error checking domain 775b7102-7f2c-4eee-a4d0-a41b55451f7e (monitor:427)
>>> > Traceback (most recent call last):
>>> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py",
>>> line 408, in _checkDomainStatus
>>> >     self.domain.selftest()
>>> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line
>>> 710, in selftest
>>> >     self.oop.os.statvfs(self.domaindir)
>>> >   File
>>> "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 242,
>>> in statvfs
>>> >     return self._iop.statvfs(path)
>>> >   File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line
>>> 479, in statvfs
>>> >     resdict = self._sendCommand("statvfs", {"path": path},
>>> self.timeout)
>>> >   File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line
>>> 442, in _sendCommand
>>> >     raise Timeout(os.strerror(errno.ETIMEDOUT))
>>> > ioprocess.Timeout: Connection timed out
>>>
>>> This shows that storage was not accessible for 60 seconds (ioprocess
>>> uses a 60-second timeout).
>>>
>>> A 60-second timeout is bad. If we have leases on this storage domain
>>> (e.g. SPM lease) they will
>>> expire 20 seconds after this event and the vdsm on the SPM host
>>> will be killed.
>>>
>>> Do we have network tests changing the network used by the NFS storage
>>> domain before this event?
>>>
>>>
>> No.
>>
>>
>>> What changed in the network tests or code since OST was last successful?
>>>
>>>
>> I am not aware of any change that might be relevant.
>> Maybe the fact that the hosts are on CentOS 8, while the Engine (storage)
>> is on CentOS 7, is relevant.
>> Also, this issue does not seem to occur 100% deterministically, I
>> guess because it is timing related.
>>
>> The error is reproducible locally by running OST and keeping the
>> environment alive after basic-suite-master has succeeded.
>> After some time, the storage becomes inaccessible.
>>
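To pin down when the storage goes away on an environment that is kept alive, a small probe can be left running on one of the hosts. Below is a minimal Python sketch, not vdsm's checker: it times a buffered read of the first 4 KiB of a file and gives up after a timeout. The function name and defaults are made up for illustration. Buffered reads can be served from the page cache, so this may miss an outage that a direct-I/O read would catch.

```python
import sys
import threading
import time


def read_delay(path, timeout=10.0):
    """Return the seconds needed to read 4 KiB from `path`.

    Returns None if the read fails or does not finish within `timeout`.
    The read runs in a daemon thread so that a read stuck on a dead
    mount cannot block the caller or interpreter exit.
    """
    result = {}

    def worker():
        start = time.monotonic()
        try:
            with open(path, "rb") as handle:
                handle.read(4096)
        except OSError:
            return  # leave `result` empty; the caller sees None
        result["delay"] = time.monotonic() - start

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    return result.get("delay")


if __name__ == "__main__":
    # Usage: python3 probe.py <path to a domain's dom_md/metadata file>
    while True:
        delay = read_delay(sys.argv[1])
        stamp = time.strftime("%Y-%m-%d %H:%M:%S%z")
        status = "FAILED or timed out" if delay is None else f"{delay:.3f}s"
        print(f"{stamp} {status}", flush=True)
        time.sleep(10)
```

The first "FAILED or timed out" line gives a timestamp to line up against /var/log/messages on the engine.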
>>
>>> >> Can you explain how adding 8 minutes sleep instead of the UI tests
>>> >> reproduced the issue?
>>> >>
>>> >
>>> >
>>> > This shows that the issue is not triggered by the UI test, but maybe
>>> by passing time.
>>>
>>> Do we run the ovn tests after the UI tests?
>>>
>>> >> >> someone from storage can look at this.
>>> >> >>
>>> >> >
>>> >> > Thanks, I would appreciate this.
>>> >> >
>>> >> >>
>>> >> >> But the fact that adding a long sleep reproduces the issue means it
>>> is not related
>>> >> >> in any way to storage.
>>> >> >>
>>> >> >> Nir
>>> >> >>
>>> >> >> >
>>> >> >> >>
>>> >> >> >> >
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >>>>>
>>> >> >> >> >>>>> I remember talking with Steven Rosenberg on IRC a couple
>>> of days ago about some storage metadata issues and he said he got a
>>> response from Nir, that "it's a known issue".
>>> >> >> >> >>>>>
>>> >> >> >> >>>>> Nir, Amit, can you comment on this?
>>> >> >> >> >>>>
>>> >> >> >> >>>>
>>> >> >> >> >>>> The error mentioned here is not a vdsm error but a warning
>>> about storage accessibility. We should convert the tracebacks to warnings.
>>> >> >> >> >>>>
>>> >> >> >> >>>> The reason for such issue can be misconfigured network
>>> (maybe network team is testing negative flows?),
>>> >> >> >> >>>
>>> >> >> >> >>>
>>> >> >> >> >>> No.
>>> >> >> >> >>>
>>> >> >> >> >>>>
>>> >> >> >> >>>> or some issue in the NFS server.
>>> >> >> >> >>>>
>>> >> >> >> >>>
>>> >> >> >> >>> Only hint I found is
>>> >> >> >> >>> "Exiting Time2Retain handler because
>>> session_reinstatement=1"
>>> >> >> >> >>> but I have no idea what this means or if this is relevant
>>> at all.
>>> >> >> >> >>>
>>> >> >> >> >>>>
>>> >> >> >> >>>> One read timeout is not an issue. We have a real issue
>>> only if we have consistent read timeouts or errors for a couple of minutes;
>>> after that engine can deactivate the storage domain or some hosts if only
>>> these hosts are having trouble accessing storage.
>>> >> >> >> >>>>
>>> >> >> >> >>>> In OST we never expect such conditions since we don't test
>>> negative flows, and we should have good connectivity with the vms running
>>> on the same host.
>>> >> >> >> >>>>
>>> >> >> >> >>>
>>> >> >> >> >>> Ack, this seems to be the problem.
>>> >> >> >> >>>
>>> >> >> >> >>>>
>>> >> >> >> >>>> Nir
>>> >> >> >> >>>>
>>> >> >> >> >>>>
>>> >> >> >> >>>>> [1]
>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/console
>>> >> >> >> >>>>> [2]
>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/exported-artifacts/test_logs/basic-suite-master/post-098_ovirt_provider_ovn.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log
>>> >> >> >> >>>>>>
>>> >> >> >> >>>>>>
>>> >> >> >> >>>>>
>>> >> >> >> >>>>> Marcin, could you please take a look?
>>> >> >> >> >>>>>>
>>> >> >> >> >>>>>>
>>> >> >> >> >>>>>>
>>> >> >> >> >>>>>>>>
>>> >> >> >> >>>>>>>> >
>>> >> >> >> >>>>>>>> > > [3] - https://gerrit.ovirt.org/#/c/104897/
>>> >> >> >> >>>>>>>> > >
>>> >> >> >> >>>>>>>> > >
>>> >> >> >> >>>>>>>> > > > >
>>> >> >> >> >>>>>>>> > > > >
>>> >> >> >> >>>>>>>> > > > > >> Who installs this rpm in OST?
>>> >> >> >> >>>>>>>> > > > > >
>>> >> >> >> >>>>>>>> > > > > >
>>> >> >> >> >>>>>>>> > > > > >
>>> >> >> >> >>>>>>>> > > > > > I do not understand the question.
>>> >> >> >> >>>>>>>> > > > > >
>>> >> >> >> >>>>>>>> > > > > >
>>> >> >> >> >>>>>>>> > > > > >
>>> >> >> >> >>>>>>>> > > > > >> > [...]
>>> >> >> >> >>>>>>>> > > > > >> >
>>> >> >> >> >>>>>>>> > > > > >> >
>>> >> >> >> >>>>>>>> > > > > >> >
>>> >> >> >> >>>>>>>> > > > > >> > See [2] for full error.
>>> >> >> >> >>>>>>>> > > > > >> >
>>> >> >> >> >>>>>>>> > > > > >> >
>>> >> >> >> >>>>>>>> > > > > >> >
>>> >> >> >> >>>>>>>> > > > > >> > Can someone please take a look?
>>> >> >> >> >>>>>>>> > > > > >> > Thanks
>>> >> >> >> >>>>>>>> > > > > >> > Vojta
>>> >> >> >> >>>>>>>> > > > > >> >
>>> >> >> >> >>>>>>>> > > > > >> >
>>> >> >> >> >>>>>>>> > > > > >> >
>>> >> >> >> >>>>>>>> > > > > >> > [1]
>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6128/
>>> >> >> >> >>>>>>>> > > > > >> > [2]
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6128/artifact
>>> >> >> >> >>>>>>>> > > > > >> /
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> >
>>> exported-artifacts/test_logs/basic-suite-master/
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> post-002_bootstrap.py/lago-
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> basic-suite-master-engine/_var_log/ovirt-engine/engine.log___________
>>> >> >> >> >>>>>>>> > > > > >> ____
>>> >> >> >> >>>>>>>> > > > > >> ________________________________>>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> > Devel mailing list -- [email protected]
>>> >> >> >> >>>>>>>> > > > > >> > To unsubscribe send an email to
>>> [email protected]
>>> >> >> >> >>>>>>>> > > > > >> > Privacy Statement:
>>> https://www.ovirt.org/site/privacy-policy/
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> > oVirt Code of Conduct:
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> https://www.ovirt.org/community/about/community-guidelines/
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >> > List Archives:
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > > > >>
>>> https://lists.ovirt.org/archives/list/[email protected]/message/4K5N3VQ
>>> >> >> >> >>>>>>>> > > > > >> N26B
>>> >> >> >> >>>>>>>> > > > > >> L73K7D45A2IR7R3UMMM23/
>>> >> >> >> >>>>>>>> > > > > >>
>>> >> >> >> >>>>>>>> > > >
>>> >> >> >> >>>>>>>> > > >
>>> >> >> >> >>>>>>>> > >
>>> >> >> >> >>>>>>>> >
>>> >> >> >> >>>>>>>>
>>> >> >> >> >>>>>
>>> >> >> >> >>>>>
>>> >> >> >> >>>>> --
>>> >> >> >> >>>>> Martin Perina
>>> >> >> >> >>>>> Manager, Software Engineering
>>> >> >> >> >>>>> Red Hat Czech s.r.o.
>>> >> >> >> >>>>>
>>> >> >> >> >>>>>
>>> >> >> >>
>>> >> >>
>>> >>
>>>
>>>
>
> --
> Martin Perina
> Manager, Software Engineering
> Red Hat Czech s.r.o.
>
>
>
_______________________________________________
Devel mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/[email protected]/message/CNWCAJG2NQPLGWSIINLWBQRHFROHN6F3/
