Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

Raj Ayyampalayam Thu, 21 Feb 2019 20:30:51 -0800

Hi Raj,

Thanks for the explanation. We will have to rethink our upgrade process.


Thanks again.
Raj

On Thu, Feb 21, 2019, 10:23 PM Raj <[email protected]> wrote:

> Hello Raj,
> It’s best and safest to unmount from all the clients and then do the
> upgrade. Your FS is getting more OSTs and the configuration of the
> existing ones is changing, so your clients need to remount to get the
> new layout.
> You also mentioned client eviction: during eviction the client has
> to drop its dirty pages, and all the open file descriptors in the FS will
> be gone.
>
> On Thu, Feb 21, 2019 at 12:25 PM Raj Ayyampalayam <[email protected]>
> wrote:
>
>> What can I expect to happen to the jobs that are suspended during the
>> file system restart?
>> Will the processes holding an open file handle die when I unsuspend them
>> after the filesystem restart?
>>
>> Thanks!
>> -Raj
>>
>>
>> On Thu, Feb 21, 2019 at 12:52 PM Colin Faber <[email protected]> wrote:
>>
>>> Ah yes,
>>>
>>> If you're adding to an existing OSS, then you will need to reconfigure
>>> the file system, which requires a writeconf event.
>>>
>>
>>> On Thu, Feb 21, 2019 at 10:00 AM Raj Ayyampalayam <[email protected]>
>>> wrote:
>>>
>>>> The new OSTs will be added to the existing file system (the OSS nodes
>>>> are already part of the filesystem). I will have to re-configure the
>>>> current HA resource configuration to tell it about the 4 new OSTs.
>>>> Our exascaler's HA monitors the individual OSTs, and I need to
>>>> re-configure the HA on the existing filesystem.
>>>>
>>>> Our vendor support has confirmed that we would have to restart the
>>>> filesystem if we want to regenerate the HA configs to include the new
>>>> OSTs.
>>>>
>>>> Thanks,
>>>> -Raj
>>>>
>>>>
>>>> On Thu, Feb 21, 2019 at 11:23 AM Colin Faber <[email protected]> wrote:
>>>>
>>>>> It seems to me that steps may still be missing?
>>>>>
>>>>> You're going to rack/stack and provision the OSS nodes with new OSTs.
>>>>>
>>>>> Then you're going to introduce failover options somewhere? For the new
>>>>> OSTs? For the existing system?
>>>>>
>>>>> If you're introducing failover with the new OSTs and leaving the
>>>>> existing system in place, you should be able to accomplish this without
>>>>> bringing the system offline.
>>>>>
>>>>> If you're going to be introducing failover to your existing system,
>>>>> then you will need to reconfigure the file system to accommodate the new
>>>>> failover settings (failover nodes, etc.).
>>>>>
>>>>> -cf
>>>>>
>>>>>
>>>>> On Thu, Feb 21, 2019 at 9:13 AM Raj Ayyampalayam <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Our upgrade strategy is as follows:
>>>>>>
>>>>>> 1) Load all disks into the storage array.
>>>>>> 2) Create RAID pools and virtual disks.
>>>>>> 3) Create the Lustre file system on the new virtual disks using the
>>>>>> mkfs.lustre command. (I still have to figure out all the parameters
>>>>>> used on the existing OSTs.)
>>>>>> 4) Create mount points on all OSSs.
>>>>>> 5) Mount the Lustre OSTs.
>>>>>> 6) Maybe rebalance the filesystem.
>>>>>>
>>>>>> My understanding is that the above can be done without bringing the
>>>>>> filesystem down. I also want to create the HA configuration (corosync
>>>>>> and pacemaker) for the new OSTs, and that step requires the filesystem
>>>>>> to be down.
>>>>>> I want to know what would happen to the suspended processes across the
>>>>>> cluster when I bring the filesystem down to regenerate the HA configs.
>>>>>> Thanks,
>>>>>> -Raj
>>>>>>
>>>>>> On Thu, Feb 21, 2019 at 12:59 AM Colin Faber <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Can you provide more details on your upgrade strategy? In some cases
>>>>>>> expanding your storage shouldn't impact client / job activity at all.
>>>>>>>
>>>>>>> On Wed, Feb 20, 2019, 11:09 AM Raj Ayyampalayam <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> We are planning on expanding our storage by adding more OSTs to our
>>>>>>>> Lustre file system. It looks like it would be easier to expand if we
>>>>>>>> bring the filesystem down and perform the necessary operations. We are
>>>>>>>> planning to suspend all the jobs running on the cluster. We originally
>>>>>>>> planned to add new OSTs to the live filesystem.
>>>>>>>>
>>>>>>>> We are trying to determine the potential impact to the suspended
>>>>>>>> jobs if we bring down the filesystem for the upgrade.
>>>>>>>> One of the questions we have is what would happen to the suspended
>>>>>>>> processes that hold an open file handle in the Lustre file system when
>>>>>>>> the filesystem is brought down for the upgrade?
>>>>>>>> Will they recover from the client eviction?
>>>>>>>>
>>>>>>>> We do have vendor support and have engaged them. I wanted to ask
>>>>>>>> the community and get some feedback.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Raj
>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>>> lustre-discuss mailing list
>>>>>>>> [email protected]
>>>>>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>>>>>>
>

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
