I'm having another crack at this; I think it will be worth it once it works.
Firstly, another documentation error:
https://www.linbit.com/drbd-user-guide/linstor-guide-1_0-en/#s-using_the_linstor_client
In case anything goes wrong with the storage pool’s VG/zPool, e.g. the
VG having been renamed or somehow became invalid you can delete the
storage pool in LINSTOR with the following command, given that only
resources with all their volumes in the so-called ‘lost’ storage pool
are attached. This feature is available since LINSTOR v0.9.13.
# linstor storage-pool lost alpha pool_ssd
linstor storage-pool lost castle vg_hdd
usage: linstor storage-pool [-h]
{create, delete, list, list-properties,
set-property} ...
linstor storage-pool: error: argument {create, delete, list,
list-properties, set-property}: invalid choice: 'lost' (choose from
'create', 'c', 'delete', 'd', 'list', 'l', 'list-properties', 'lp',
'set-property', 'sp')
Changing to use delete instead of lost:
castle:~# linstor storage-pool delete castle vg_hdd
ERROR:
Description:
Storage pool definition 'vg_hdd' not found.
Cause:
The specified storage pool definition 'vg_hdd' could not be found
in the database
Correction:
Create a storage pool definition 'vg_hdd' first.
Details:
Node: castle, Storage pool name: vg_hdd
Show reports:
linstor error-reports show 5F0D500C-00000-000000
castle:~# linstor storage-pool list
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ StoragePool          ┊ Node   ┊ Driver   ┊ PoolName ┊ FreeCapacity ┊ TotalCapacity ┊ CanSnapshots ┊ State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ DfltDisklessStorPool ┊ castle ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Ok    ┊
┊ DfltDisklessStorPool ┊ san5   ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Ok    ┊
┊ DfltDisklessStorPool ┊ san6   ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Ok    ┊
┊ pool                 ┊ castle ┊ LVM      ┊ vg_hdd   ┊     2.95 TiB ┊      3.44 TiB ┊ False        ┊ Ok    ┊
┊ pool                 ┊ san5   ┊ LVM      ┊ vg_hdd   ┊     3.87 TiB ┊      4.36 TiB ┊ False        ┊ Ok    ┊
┊ pool                 ┊ san6   ┊ LVM      ┊ vg_ssd   ┊     1.26 TiB ┊      1.75 TiB ┊ False        ┊ Ok    ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯
I was hoping I could just remove the storage pool from castle (since it
doesn't seem to be working properly), then destroy it, re-create it,
re-add it, and see if that solves the problem. However, while the pool
shows up in the list, the delete command reports that it cannot be found.
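One possible reading of the output above, offered as an assumption rather than a confirmed diagnosis: the error complains about a storage pool definition named 'vg_hdd', while the list shows 'vg_hdd' in the PoolName column (the backing VG) and 'pool' in the StoragePool column. If `storage-pool delete` expects the node name followed by the StoragePool name, the lookup could be sketched as below. The helper name `storage_pool_for_vg` is hypothetical, and the parsing assumes one table row per line with "┊" as the column separator.

```shell
#!/bin/sh
# Sketch only: map a backing VG to the LINSTOR storage pool name on one node.
# Reads the table printed by `linstor storage-pool list` on stdin and prints
# the StoragePool column for rows whose Node and PoolName columns match.
# Assumption: one table row per line, columns separated by "┊".
storage_pool_for_vg() {
    awk -F'┊' -v node="$1" -v vg="$2" '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        NF >= 5 && trim($3) == node && trim($5) == vg { print trim($2) }
    '
}

# Hypothetical usage on a live system (not run here):
#   name=$(linstor storage-pool list | storage_pool_for_vg castle vg_hdd)
#   linstor storage-pool delete castle "$name"
```

Even with the right name, the delete may still be refused while resources are deployed in that pool, so this only addresses the "not found" message.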
Possibly part of the cause of my original problem is that I have a
script that automatically creates a snapshot for each LV, and this
created a snapshot of testvm1_00000 named
backup_testvm1_00000_blahblah.... I've now manually deleted that, and
fixed my script to avoid messing with the VG allocated to linstor, but
so far, there is no change in the current status (as per below).
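For reference, the kind of exclusion described above can be sketched as follows. This is not the actual backup script: the function name `skip_linstor_vgs` and the list of excluded VGs are hypothetical, and it assumes the script enumerates volumes as "VG LV" pairs, for example from `lvs --noheadings -o vg_name,lv_name`.

```shell
#!/bin/sh
# Sketch only: drop LVs that live in VGs handed over to LINSTOR.
# $1 is a space-separated list of VG names to skip (an assumption for this
# example). stdin carries "vg_name lv_name" pairs, one per line.
# Prints "vg/lv" for every LV that is still eligible for a snapshot.
skip_linstor_vgs() {
    awk -v skip="$1" '
        BEGIN { n = split(skip, a, " "); for (i = 1; i <= n; i++) excluded[a[i]] = 1 }
        NF >= 2 && !($1 in excluded) { print $1 "/" $2 }
    '
}

# Hypothetical usage (not run here):
#   lvs --noheadings -o vg_name,lv_name | skip_linstor_vgs "vg_hdd vg_ssd"
```

Filtering by VG name rather than by LV name pattern keeps the script away from anything LINSTOR creates later, whatever it is called.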
Would appreciate any suggestions on what might be going wrong, and/or
how to fix it?
Regards,
Adam
On 24/6/20 11:46, Adam Goryachev wrote:
On 23/6/20 21:53, Gábor Hernádi wrote:
Hi,
apparently something is quite broken... maybe it's somehow your setup
or environment, I am not sure...
linstor resource list
╭────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node   ┊ Port ┊ Usage  ┊ Conns                   ┊ State    ┊
╞════════════════════════════════════════════════════════════════════════════╡
┊ testvm1      ┊ castle ┊ 7000 ┊        ┊                         ┊ Unknown  ┊
┊ testvm1      ┊ san5   ┊ 7000 ┊        ┊                         ┊ Unknown  ┊
┊ testvm1      ┊ san6   ┊ 7000 ┊ Unused ┊ Connecting(san5,castle) ┊ UpToDate ┊
╰────────────────────────────────────────────────────────────────────────────╯
This looks like some kind of network issue.
# linstor storage-pool list --groupby Size
However, the second command produces a usage error (documentation
bug perhaps).
Thanks for reporting, we will look into this.
WARNING:
Description:
No active connection to satellite 'san5'
Details:
The controller is trying to (re-) establish a connection to
the satellite. The controller stored the changes and as soon the
satellite is connected, it will receive this update.
So Linstor has obviously no connection to satellite 'san5'.
[95078.599813] drbd testvm1 castle: conn( Unconnected -> Connecting )
[95078.604454] drbd testvm1 san5: conn( Unconnected -> Connecting )
... and DRBD apparently also has trouble connecting...
linstor n l
╭───────────────────────────────────────────────────────────╮
┊ Node   ┊ NodeType  ┊ Addresses                  ┊ State   ┊
╞═══════════════════════════════════════════════════════════╡
┊ castle ┊ SATELLITE ┊ 192.168.5.204:3366 (PLAIN) ┊ Unknown ┊
┊ san5   ┊ SATELLITE ┊ 192.168.5.205:3366 (PLAIN) ┊ Unknown ┊
┊ san6   ┊ SATELLITE ┊ 192.168.5.206:3366 (PLAIN) ┊ Unknown ┊
╰───────────────────────────────────────────────────────────╯
Now this is really strange. I will spare you the details, but
I assume you have triggered some bad exception in Linstor which
somehow killed a necessary thread.
You should check
linstor err list
and see if you can find some related error reports.
Also, restarting the controller might help you here.
Thank you!
linstor err list showed a list of errors, but the contents didn't make
a lot of sense to me. Let me know if you are interested in them, and I
can send them.
I did a systemctl restart linstor-controller.service on san6, and
things started looking much better.
linstor n l
╭──────────────────────────────────────────────────────────╮
┊ Node ┊ NodeType ┊ Addresses ┊ State ┊
╞══════════════════════════════════════════════════════════╡
┊ castle ┊ SATELLITE ┊ 192.168.5.204:3366 (PLAIN) ┊ Online ┊
┊ san5 ┊ SATELLITE ┊ 192.168.5.205:3366 (PLAIN) ┊ Online ┊
┊ san6 ┊ SATELLITE ┊ 192.168.5.206:3366 (PLAIN) ┊ Online ┊
╰──────────────────────────────────────────────────────────╯
So, all nodes agree that they are now online and talking to each
other. I assume this proves there are no network issues.
linstor resource list
╭─────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node   ┊ Port ┊ Usage  ┊ Conns              ┊ State              ┊
╞═════════════════════════════════════════════════════════════════════════════════╡
┊ testvm1      ┊ castle ┊ 7000 ┊        ┊                    ┊ Unknown            ┊
┊ testvm1      ┊ san5   ┊ 7000 ┊ Unused ┊ Connecting(castle) ┊ SyncTarget(12.67%) ┊
┊ testvm1      ┊ san6   ┊ 7000 ┊ Unused ┊ Connecting(castle) ┊ UpToDate           ┊
╰─────────────────────────────────────────────────────────────────────────────────╯
From this, it looks like san6 (the controller) thinks it has the up to
date data, probably based on the fact it was created there first or
something. The data is syncing to san5 (in progress, and progressing
steadily), so that is good also. However, castle doesn't seem to be
syncing/connecting.
On castle, I see this:
Jun 24 11:01:55 castle Satellite[7499]: 11:01:55.177 [DeviceManager]
ERROR LINSTOR/Satellite - SYSTEM - Failed to create meta-data for DRBD
volume testvm1/0 [Report number 5EF2A316-31431-000002]
linstor err show gives this:
ERROR REPORT 5EF2A316-31431-000002
============================================================
Application: LINBIT® LINSTOR
Module: Satellite
Version: 1.7.1
Build ID: 6760637d6fae7a5862103ced4ea0ab0a758861f9
Build time: 2020-05-14T13:14:11+00:00
Error time: 2020-06-24 11:01:55
Node: castle
============================================================
Reported error:
===============
Description:
Failed to create meta-data for DRBD volume testvm1/0
Category:                           LinStorException
Class name:                         VolumeException
Class canonical name:               com.linbit.linstor.storage.layer.exceptions.VolumeException
Generated at:                       Method 'createMetaData', Source file 'DrbdLayer.java', Line #995
Error message:                      Failed to create meta-data for DRBD volume testvm1/0
Error context:
    An error occurred while processing resource 'Node: 'castle', Rsc: 'testvm1''
Call backtrace:
    Method                         Native  Class:Line number
    createMetaData                 N       com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:995
    adjustDrbd                     N       com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:575
    process                        N       com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:373
    process                        N       com.linbit.linstor.core.devmgr.DeviceHandlerImpl:731
    processResourcesAndSnapshots   N       com.linbit.linstor.core.devmgr.DeviceHandlerImpl:300
    dispatchResources              N       com.linbit.linstor.core.devmgr.DeviceHandlerImpl:138
    dispatchResources              N       com.linbit.linstor.core.devmgr.DeviceManagerImpl:258
    phaseDispatchDeviceHandlers    N       com.linbit.linstor.core.devmgr.DeviceManagerImpl:896
    devMgrLoop                     N       com.linbit.linstor.core.devmgr.DeviceManagerImpl:618
    run                            N       com.linbit.linstor.core.devmgr.DeviceManagerImpl:535
    run                            N       java.lang.Thread:834
Caused by:
==========
Description:
Execution of the external command 'drbdadm' failed.
Cause:
The external command exited with error code 1.
Correction:
- Check whether the external program is operating properly.
- Check whether the command line is correct.
  Contact a system administrator or a developer if the command line is no longer valid
  for the installed version of the external program.
Additional information:
The full command line executed was:
drbdadm -vvv --max-peers 7 -- --force create-md testvm1/0
The external command sent the following output data:
The external command sent the following error information:
no resources defined!
Category:                           LinStorException
Class name:                         ExtCmdFailedException
Class canonical name:               com.linbit.extproc.ExtCmdFailedException
Generated at:                       Method 'execute', Source file 'DrbdAdm.java', Line #550
Error message:                      The external command 'drbdadm' exited with error code 1
Call backtrace:
    Method                         Native  Class:Line number
    execute                        N       com.linbit.linstor.storage.layer.adapter.drbd.utils.DrbdAdm:550
    simpleAdmCommand               N       com.linbit.linstor.storage.layer.adapter.drbd.utils.DrbdAdm:495
    createMd                       N       com.linbit.linstor.storage.layer.adapter.drbd.utils.DrbdAdm:262
    createMetaData                 N       com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:923
    adjustDrbd                     N       com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:575
    process                        N       com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:373
    process                        N       com.linbit.linstor.core.devmgr.DeviceHandlerImpl:731
    processResourcesAndSnapshots   N       com.linbit.linstor.core.devmgr.DeviceHandlerImpl:300
    dispatchResources              N       com.linbit.linstor.core.devmgr.DeviceHandlerImpl:138
    dispatchResources              N       com.linbit.linstor.core.devmgr.DeviceManagerImpl:258
    phaseDispatchDeviceHandlers    N       com.linbit.linstor.core.devmgr.DeviceManagerImpl:896
    devMgrLoop                     N       com.linbit.linstor.core.devmgr.DeviceManagerImpl:618
    run                            N       com.linbit.linstor.core.devmgr.DeviceManagerImpl:535
    run                            N       java.lang.Thread:834
END OF ERROR REPORT.
Indeed, re-running the same command from the CLI produces the same
error message:
drbdadm -vvv --max-peers 7 -- --force create-md testvm1/0
no resources defined!
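"no resources defined!" suggests that drbdadm parsed its configuration without finding any resource section, which would happen if the files LINSTOR generates are never included. A rough check for that, under stated assumptions: the directory `/var/lib/linstor.d` (where the satellite is assumed to write its `.res` files) and the search locations `/etc/drbd.conf` and `/etc/drbd.d` are guesses about this setup, and `has_include_for` is a hypothetical helper, not part of any tool.

```shell
#!/bin/sh
# Sketch only: report whether any DRBD config file includes a given directory.
# $1 is the directory to look for; the remaining arguments are files or
# directories to search recursively.
# Returns 0 if a line of the form: include "<dir>/..." is found.
has_include_for() {
    dir=$1
    shift
    grep -rqE "^[[:space:]]*include[[:space:]]+\"?${dir}/" "$@" 2>/dev/null
}

# Hypothetical usage on castle (paths are assumptions, not run here):
#   has_include_for /var/lib/linstor.d /etc/drbd.conf /etc/drbd.d \
#       || echo "no include found for LINSTOR's generated resource files"
```

If the include turns out to be missing on castle but present on san5 and san6, that difference would be worth comparing before looking further into the satellite itself.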
Some other random status information which may or may not be relevant...
linstor storage-pool list
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ StoragePool          ┊ Node   ┊ Driver   ┊ PoolName ┊ FreeCapacity ┊ TotalCapacity ┊ CanSnapshots ┊ State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ DfltDisklessStorPool ┊ castle ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Ok    ┊
┊ DfltDisklessStorPool ┊ san5   ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Ok    ┊
┊ DfltDisklessStorPool ┊ san6   ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Ok    ┊
┊ pool                 ┊ castle ┊ LVM      ┊ vg_hdd   ┊     2.95 TiB ┊      3.44 TiB ┊ False        ┊ Ok    ┊
┊ pool                 ┊ san5   ┊ LVM      ┊ vg_hdd   ┊     3.87 TiB ┊      4.36 TiB ┊ False        ┊ Ok    ┊
┊ pool                 ┊ san6   ┊ LVM      ┊ vg_ssd   ┊     1.26 TiB ┊      1.75 TiB ┊ False        ┊ Ok    ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯
I've tried to restart linstor-satellite service on castle, but it
didn't make any difference.
After a reboot of castle, I now get this:
linstor resource list
╭────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊
╞════════════════════════════════════════════════════════════════════╡
┊ testvm1 ┊ castle ┊ 7000 ┊ Unused ┊ Ok ┊ Diskless ┊
┊ testvm1 ┊ san5 ┊ 7000 ┊ Unused ┊ Ok ┊ SyncTarget(55.99%) ┊
┊ testvm1 ┊ san6 ┊ 7000 ┊ Unused ┊ Ok ┊ UpToDate ┊
╰────────────────────────────────────────────────────────────────────╯
However, looking at the error reports, I see the exact same error
about creating the metadata on castle.
One interesting thing is that the LV seems to have been created:
lvs
/dev/drbd0: open failed: Wrong medium type
/dev/drbd1: open failed: Wrong medium type
LV                            VG      Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
backup_system_20200624_062513 storage swi-a-s---    4.00g      system 3.06
system                        storage owi-aos---    5.00g
testvm1_00000                 vg_hdd  -wi-a----- <500.11g
Any suggestions on where to look next? Or what I might have done wrong
now?
Regards,
Adam
_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
[email protected]
https://lists.linbit.com/mailman/listinfo/drbd-user