Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

Colin Faber Thu, 21 Feb 2019 09:52:42 -0800

Ah yes,

If you're adding to an existing OSS, then you will need to reconfigure the
file system, which requires a writeconf event.
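
For reference, a minimal sketch of what a writeconf pass could look like, written as a Python helper that only builds the command strings and runs nothing. It assumes a combined MGS/MDT and that all clients and targets are already unmounted; the function name and the device paths are hypothetical. As I understand the Lustre manual's procedure, the MDT is done first and the OSTs after it, and targets are then remounted MGS/MDT first, OSTs next, clients last.

```python
from typing import List


def writeconf_commands(mdt_device: str, ost_devices: List[str]) -> List[str]:
    """Build the tunefs.lustre --writeconf commands in the order they are run.

    Assumption: every client and target is already unmounted, and the MGS
    shares a device with the MDT. Nothing is executed here; the caller gets
    plain strings to review before running them by hand.
    """
    # The MDT's configuration log is regenerated first.
    commands = [f"tunefs.lustre --writeconf {mdt_device}"]
    # Then every OST, existing and new alike.
    commands += [f"tunefs.lustre --writeconf {dev}" for dev in ost_devices]
    return commands
```

Printing the result for a hypothetical `/dev/mapper/mdt0` and a list of OST devices gives a checklist that can be compared against the vendor's procedure before anything is taken offline.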


On Thu, Feb 21, 2019 at 10:00 AM Raj Ayyampalayam <[email protected]> wrote:

> The new OSTs will be added to the existing file system (the OSS nodes are
> already part of the filesystem), so I will have to reconfigure the current HA
> resource configuration to tell it about the 4 new OSTs.
> Our ExaScaler's HA monitors the individual OSTs, and I need to reconfigure
> the HA on the existing filesystem.
>
> Our vendor support has confirmed that we would have to restart the
> filesystem if we want to regenerate the HA configs to include the new OSTs.
>
> Thanks,
> -Raj
>
>
> On Thu, Feb 21, 2019 at 11:23 AM Colin Faber <[email protected]> wrote:
>
>> It seems to me that steps may still be missing?
>>
>> You're going to rack/stack and provision the OSS nodes with new OSTs.
>>
>> Then you're going to introduce failover options somewhere? For the new
>> OSTs? The existing system? etc.?
>>
>> If you're introducing failover with the new OSTs and leaving the
>> existing system in place, you should be able to accomplish this without
>> bringing the system offline.
>>
>> If you're going to be introducing failover to your existing system then
>> you will need to reconfigure the file system to accommodate the new
>> failover settings (failover nodes, etc.)
>>
>> -cf
>>
>>
>> On Thu, Feb 21, 2019 at 9:13 AM Raj Ayyampalayam <[email protected]>
>> wrote:
>>
>>> Our upgrade strategy is as follows:
>>>
>>> 1) Load all disks into the storage array.
>>> 2) Create RAID pools and virtual disks.
>>> 3) Create the Lustre file systems using the mkfs.lustre command. (I still
>>> have to figure out all the parameters used on the existing OSTs).
>>> 4) Create mount points on all OSSs.
>>> 5) Mount the lustre OSTs.
>>> 6) Maybe rebalance the filesystem.
>>> My understanding is that the above can be done without bringing the
>>> filesystem down. I want to create the HA configuration (corosync and
>>> pacemaker) for the new OSTs. This step requires the filesystem to be down.
>>> I want to know what would happen to the suspended processes across the
>>> cluster when I bring the filesystem down to re-generate the HA configs.
>>>
>>> Thanks,
>>> -Raj
>>>
>>> On Thu, Feb 21, 2019 at 12:59 AM Colin Faber <[email protected]> wrote:
>>>
>>>> Can you provide more details on your upgrade strategy? In some cases
>>>> expanding your storage shouldn't impact client / job activity at all.
>>>>
>>>> On Wed, Feb 20, 2019, 11:09 AM Raj Ayyampalayam <[email protected]>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are planning on expanding our storage by adding more OSTs to our
>>>>> lustre file system. It looks like it would be easier to expand if we bring
>>>>> the filesystem down and perform the necessary operations. We are planning
>>>>> to suspend all the jobs running on the cluster. We originally planned to
>>>>> add new OSTs to the live filesystem.
>>>>>
>>>>> We are trying to determine the potential impact to the suspended jobs
>>>>> if we bring down the filesystem for the upgrade.
>>>>> One of the questions we have is what would happen to the suspended
>>>>> processes that hold an open file handle in the lustre file system when the
>>>>> filesystem is brought down for the upgrade?
>>>>> Will they recover from the client eviction?
>>>>>
>>>>> We do have vendor support and have engaged them. I wanted to ask the
>>>>> community and get some feedback.
>>>>>
>>>>> Thanks,
>>>>> -Raj
>>>>>
>>>> _______________________________________________
>>>>> lustre-discuss mailing list
>>>>> [email protected]
>>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>>>
>>>>

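Steps 3 to 5 of the upgrade strategy quoted above (format, create mount points, mount) can be sketched as a small Python helper that assembles the command lines without executing them. This is an illustration under stated assumptions, not the poster's actual procedure: the file system name, index, NIDs, device path, and mount point in the usage note are all hypothetical, and the real values have to match the existing OSTs. Running `tunefs.lustre --dryrun` against an existing OST device is one way to read back the parameters it was formatted with.

```python
from typing import List, Sequence


def mkfs_ost_command(fsname: str, index: int, mgs_nids: Sequence[str],
                     service_nids: Sequence[str], device: str) -> List[str]:
    """Assemble a mkfs.lustre invocation for one new OST (argv list, not run)."""
    cmd = ["mkfs.lustre", "--ost", f"--fsname={fsname}", f"--index={index}"]
    # One --mgsnode per MGS NID so the target can find the MGS after a failover.
    cmd += [f"--mgsnode={nid}" for nid in mgs_nids]
    # One --servicenode per OSS that may serve this OST (the HA pair).
    cmd += [f"--servicenode={nid}" for nid in service_nids]
    cmd.append(device)
    return cmd


def mount_ost_command(device: str, mount_point: str) -> List[str]:
    """Assemble the mount command that brings a formatted OST online."""
    return ["mount", "-t", "lustre", device, mount_point]
```

For example, `" ".join(mkfs_ost_command("lfs01", 12, ["10.0.0.1@tcp"], ["10.0.0.11@tcp", "10.0.0.12@tcp"], "/dev/mapper/ost12"))` produces a single line that can be reviewed with vendor support before it is run on the OSS.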
