Re: [gpfsug-discuss] CES log files

2017-01-11 Thread Christof Schmitt
A winbindd process taking up 100% CPU could be caused by the problem 
documented in https://bugzilla.samba.org/show_bug.cgi?id=12105.

Capturing a brief strace of the affected process and reporting that 
through a PMR would be helpful to debug this problem and provide a fix.
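
As a sketch of the kind of capture meant here (the <PID> and the 30-second
window are placeholders; adjust to your environment):

# pidof winbindd                                  # identify the spinning winbindd PID
# strace -f -tt -p <PID> -o /tmp/winbindd.strace &
# sleep 30; kill %1                               # stop the trace after ~30 seconds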

To answer the wider question: Log files are kept in /var/adm/ras/. In case 
more detailed traces are required, use the mmprotocoltrace command.
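
For example (exact log file names vary by release and by which protocols are
enabled, so treat these as illustrative):

# ls -lrt /var/adm/ras/log.smbd /var/adm/ras/log.winbindd* 2>/dev/null
# mmprotocoltrace start smb    # reproduce the problem, then: mmprotocoltrace stop smb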

Regards,

Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ
christof.schm...@us.ibm.com  ||  +1-520-799-2469 (T/L: 321-2469)



From: "Sobey, Richard A"
To: gpfsug main discussion list
Date: 01/11/2017 07:00 AM
Subject: Re: [gpfsug-discuss] CES log files
Sent by: gpfsug-discuss-boun...@spectrumscale.org



Thanks. Some of the nodes would just say “failed” or “degraded” with the 
DCs offline. Of those that thought they were happy to host a CES IP 
address, they did not respond, and the winbindd process would sit at 100% 
CPU as seen through top, with no users on the system.
 
Interesting that even though all CES nodes had the same configuration, 
three of them never had a problem at all.
 
JF – I’ll look at the protocol tracing next time this happens. It’s a rare 
thing that three DCs go offline at once but even so there should have been 
enough resiliency to cope.
 
Thanks
Richard
 
From: gpfsug-discuss-boun...@spectrumscale.org [
mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Andrew 
Beattie
Sent: 11 January 2017 09:55
To: gpfsug-discuss@spectrumscale.org
Cc: gpfsug-discuss@spectrumscale.org
Subject: Re: [gpfsug-discuss] CES log files
 
mmhealth might be a good place to start
 
CES should probably throw a message along the lines of the following:
 
mmhealth shows something is wrong with AD server:
...
CES  DEGRADED ads_down 
...
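
For example (the options, output format and event names depend on the Scale
release, so take this as indicative only):

# mmhealth node show CES
# mmhealth node show CES --verbose    # per-service detail (SMB, NFS, AUTH, ...)
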
Andrew Beattie
Software Defined Storage  - IT Specialist
Phone: 614-2133-7927
E-mail: abeat...@au1.ibm.com
 
 
- Original message -
From: "Sobey, Richard A" 
Sent by: gpfsug-discuss-boun...@spectrumscale.org
To: "'gpfsug-discuss@spectrumscale.org'" 
Cc:
Subject: [gpfsug-discuss] CES log files
Date: Wed, Jan 11, 2017 7:27 PM
 
Which files do I need to look in to determine what’s happening with CES… 
supposing for example a load of domain controllers were shut down and CES 
had no clue how to handle this and stopped working until the DCs were 
switched back on again.
 
Mmfs.log.latest said everything was fine btw.
 
Thanks
Richard
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] nodes being ejected out of the cluster

2017-01-11 Thread Jan-Frode Myklebust
Don't think you can change it without restarting GPFS. Also, it should be
turned off for all nodes, so it's a big change, unfortunately.
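
A sketch of the change being discussed, assuming it is done in a maintenance
window and in agreement with support:

# mmchconfig verbsRdmaSend=no       # config change; per the above, only takes effect after a restart
# mmshutdown -a && mmstartup -a     # cluster-wide restart, i.e. an outage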


-jf
On Wed, 11 Jan 2017 at 20:22, Damir Krstic wrote:

> Can this be done live? Meaning can GPFS remain up when I turn this off?
>
> Thanks,
> Damir
>
> On Wed, Jan 11, 2017 at 12:38 PM Jan-Frode Myklebust 
> wrote:
>
> And there you have:
>
> [ems1-fdr,compute,gss_ppc64]
> verbsRdmaSend yes
>
> Try turning this off.
>
>
> -jf
> On Wed, 11 Jan 2017 at 18:54, Damir Krstic wrote:
>
> [Remainder of the quoted thread, including Damir Krstic's full mmlsconfig
> output, trimmed; it appears in his original message later in this digest.]

Re: [gpfsug-discuss] nodes being ejected out of the cluster

2017-01-11 Thread Damir Krstic
Can this be done live? Meaning can GPFS remain up when I turn this off?

Thanks,
Damir

On Wed, Jan 11, 2017 at 12:38 PM Jan-Frode Myklebust 
wrote:

> And there you have:
>
> [ems1-fdr,compute,gss_ppc64]
> verbsRdmaSend yes
>
> Try turning this off.
>
>
> -jf
> On Wed, 11 Jan 2017 at 18:54, Damir Krstic wrote:
>
> [Remainder of the quoted thread, including Damir Krstic's full mmlsconfig
> output, trimmed; it appears in his original message later in this digest.]

Re: [gpfsug-discuss] nodes being ejected out of the cluster

2017-01-11 Thread Jan-Frode Myklebust
And there you have:

[ems1-fdr,compute,gss_ppc64]
verbsRdmaSend yes

Try turning this off.


-jf
On Wed, 11 Jan 2017 at 18:54, Damir Krstic wrote:

> [Quoted message from Damir Krstic, including the full mmlsconfig output,
> trimmed; it appears in his original message later in this digest.]

Re: [gpfsug-discuss] nodes being ejected out of the cluster

2017-01-11 Thread Damir Krstic
Thanks for all the suggestions. Here is our mmlsconfig file. We just
purchased another GL6. During the installation of the new GL6, IBM will
upgrade our existing GL6 to the latest code levels. This will happen
during the week of the 23rd of January.

I am skeptical that the upgrade is going to fix the issue.

On our IO servers we are running in connected mode (please note that IB
interfaces are bonded)

[root@gssio1 ~]# cat /sys/class/net/ib0/mode

connected

[root@gssio1 ~]# cat /sys/class/net/ib1/mode

connected

[root@gssio1 ~]# cat /sys/class/net/ib2/mode

connected

[root@gssio1 ~]# cat /sys/class/net/ib3/mode

connected

[root@gssio2 ~]# cat /sys/class/net/ib0/mode

connected

[root@gssio2 ~]# cat /sys/class/net/ib1/mode

connected

[root@gssio2 ~]# cat /sys/class/net/ib2/mode

connected

[root@gssio2 ~]# cat /sys/class/net/ib3/mode

connected

Our login nodes are also running in connected mode.

However, all of our compute nodes are running in datagram:

[root@mgt ~]# psh compute cat /sys/class/net/ib0/mode

qnode0758: datagram

qnode0763: datagram

qnode0760: datagram

qnode0772: datagram

qnode0773: datagram
etc.

Here is our mmlsconfig:

[root@gssio1 ~]# mmlsconfig

Configuration data for cluster ess-qstorage.it.northwestern.edu:



clusterName ess-qstorage.it.northwestern.edu

clusterId 17746506346828356609

dmapiFileHandleSize 32

minReleaseLevel 4.2.0.1

ccrEnabled yes

cipherList AUTHONLY

[gss_ppc64]

nsdRAIDBufferPoolSizePct 80

maxBufferDescs 2m

prefetchPct 5

nsdRAIDTracks 128k

nsdRAIDSmallBufferSize 256k

nsdMaxWorkerThreads 3k

nsdMinWorkerThreads 3k

nsdRAIDSmallThreadRatio 2

nsdRAIDThreadsPerQueue 16

nsdRAIDEventLogToConsole all

nsdRAIDFastWriteFSDataLimit 256k

nsdRAIDFastWriteFSMetadataLimit 1M

nsdRAIDReconstructAggressiveness 1

nsdRAIDFlusherBuffersLowWatermarkPct 20

nsdRAIDFlusherBuffersLimitPct 80

nsdRAIDFlusherTracksLowWatermarkPct 20

nsdRAIDFlusherTracksLimitPct 80

nsdRAIDFlusherFWLogHighWatermarkMB 1000

nsdRAIDFlusherFWLogLimitMB 5000

nsdRAIDFlusherThreadsLowWatermark 1

nsdRAIDFlusherThreadsHighWatermark 512

nsdRAIDBlockDeviceMaxSectorsKB 8192

nsdRAIDBlockDeviceNrRequests 32

nsdRAIDBlockDeviceQueueDepth 16

nsdRAIDBlockDeviceScheduler deadline

nsdRAIDMaxTransientStale2FT 1

nsdRAIDMaxTransientStale3FT 1

nsdMultiQueue 512

syncWorkerThreads 256

nsdInlineWriteMax 32k

maxGeneralThreads 1280

maxReceiverThreads 128

nspdQueues 64

[common]

maxblocksize 16m

[ems1-fdr,compute,gss_ppc64]

numaMemoryInterleave yes

[gss_ppc64]

maxFilesToCache 12k

[ems1-fdr,compute]

maxFilesToCache 128k

[ems1-fdr,compute,gss_ppc64]

flushedDataTarget 1024

flushedInodeTarget 1024

maxFileCleaners 1024

maxBufferCleaners 1024

logBufferCount 20

logWrapAmountPct 2

logWrapThreads 128

maxAllocRegionsPerNode 32

maxBackgroundDeletionThreads 16

maxInodeDeallocPrefetch 128

[gss_ppc64]

maxMBpS 16000

[ems1-fdr,compute]

maxMBpS 1

[ems1-fdr,compute,gss_ppc64]

worker1Threads 1024

worker3Threads 32

[gss_ppc64]

ioHistorySize 64k

[ems1-fdr,compute]

ioHistorySize 4k

[gss_ppc64]

verbsRdmaMinBytes 16k

[ems1-fdr,compute]

verbsRdmaMinBytes 32k

[ems1-fdr,compute,gss_ppc64]

verbsRdmaSend yes

[gss_ppc64]

verbsRdmasPerConnection 16

[ems1-fdr,compute]

verbsRdmasPerConnection 256

[gss_ppc64]

verbsRdmasPerNode 3200

[ems1-fdr,compute]

verbsRdmasPerNode 1024

[ems1-fdr,compute,gss_ppc64]

verbsSendBufferMemoryMB 1024

verbsRdmasPerNodeOptimize yes

verbsRdmaUseMultiCqThreads yes

[ems1-fdr,compute]

ignorePrefetchLUNCount yes

[gss_ppc64]

scatterBufferSize 256K

[ems1-fdr,compute]

scatterBufferSize 256k

syncIntervalStrict yes

[ems1-fdr,compute,gss_ppc64]

nsdClientCksumTypeLocal ck64

nsdClientCksumTypeRemote ck64

[gss_ppc64]

pagepool 72856M

[ems1-fdr]

pagepool 17544M

[compute]

pagepool 4g

[ems1-fdr,qsched03-ib0,quser10-fdr,compute,gss_ppc64]

verbsRdma enable

[gss_ppc64]

verbsPorts mlx5_0/1 mlx5_0/2 mlx5_1/1 mlx5_1/2

[ems1-fdr]

verbsPorts mlx5_0/1 mlx5_0/2

[qsched03-ib0,quser10-fdr,compute]

verbsPorts mlx4_0/1

[common]

autoload no

[ems1-fdr,compute,gss_ppc64]

maxStatCache 0

[common]

envVar MLX4_USE_MUTEX=1 MLX5_SHUT_UP_BF=1 MLX5_USE_MUTEX=1

deadlockOverloadThreshold 0

deadlockDetectionThreshold 0

adminMode central


File systems in cluster ess-qstorage.it.northwestern.edu:

-

/dev/home

/dev/hpc

/dev/projects

/dev/tthome

On Wed, Jan 11, 2017 at 9:16 AM Luis Bolinches 
wrote:

> In addition to what Olaf has said
>
> ESS upgrades include Mellanox module upgrades on the ESS nodes. In fact,
> on those nodes you should not update those modules on their own (unless
> support says so in your PMR), so if that's been the recommendation, I
> suggest you look at it.
>
> Changelog on ESS 4.0.4 (no idea what ESS level you are running)
>
>
>   c) Support of MLNX_OFED_LINUX-3.2-2.0.0.1
>  - Updated from 

Re: [gpfsug-discuss] CES log files

2017-01-11 Thread Yi Sun




Sometimes increasing the CES debug level gives more information, e.g. "mmces log
level 3".
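
For example (check the mmces man page on your release for the exact syntax,
and remember to turn the level back down afterwards):

# mmces log level       # show the current CES log level
# mmces log level 3     # raise verbosity while reproducing the problem
# mmces log level 0     # lower it again when done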


Here are two public wiki links (which you probably already know):
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Protocol%20Node%20-%20Tuning%20and%20Analysis
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Protocols%20Problem%20Determination


Yi.


Re: [gpfsug-discuss] nodes being ejected out of the cluster

2017-01-11 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
The RDMA errors, I think, are secondary to whatever is going on with your IPoIB
or Ethernet fabric, which I assume is causing the IPoIB communication breakdowns
and expulsions. We've had entire IB fabrics go offline, and if the nodes weren't
depending on them for daemon communication nobody got expelled. Do you have a
subnet defined for your IPoIB network, or are your nodes' daemon interfaces
already set to their IPoIB interface? Have you checked your SM logs?
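
For reference, a couple of read-only checks that answer those questions (the
opensm log path is an assumption; your subnet manager may run on a switch or
under UFM instead):

# mmlsconfig subnets                  # is an IPoIB subnet defined for daemon traffic?
# mmlscluster                         # which addresses the daemon network actually uses
# grep -i error /var/log/opensm.log | tail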



From: Damir Krstic
Sent: 1/11/17, 9:39 AM
To: gpfsug main discussion list
Subject: [gpfsug-discuss] nodes being ejected out of the cluster
We are running GPFS 4.2 on our cluster (around 700 compute nodes). Our storage 
(ESS GL6) is also running GPFS 4.2. Compute nodes and storage are connected via 
Infiniband (FDR14). At the time of implementation of ESS, we were instructed to 
enable RDMA in addition to IPoIB. Previously we only ran IPoIB on our GPFS 3.5 
cluster.

Ever since the implementation (sometime back in July of 2016) we have seen a lot 
of compute nodes being ejected. What usually precedes the ejection are messages 
like the following:

Jan 11 02:03:15 quser13 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 
vendor_err 135
Jan 11 02:03:15 quser13 mmfs: [E] VERBS RDMA closed connection to 172.41.2.5 
(gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 2
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 
vendor_err 135
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA closed connection to 172.41.2.5 
(gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error IBV_WC_WR_FLUSH_ERR 
index 1
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 
vendor_err 135
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA closed connection to 172.41.2.5 
(gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 2
Jan 11 02:06:38 quser11 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 
vendor_err 135
Jan 11 02:06:38 quser11 mmfs: [E] VERBS RDMA closed connection to 172.41.2.5 
(gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error IBV_WC_WR_FLUSH_ERR 
index 400

Even our ESS IO server sometimes ends up being ejected (case in point - 
yesterday morning):

Jan 10 11:23:42 gssio2 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_1 port 1 fabnum 0 
vendor_err 135
Jan 10 11:23:42 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1 
(gssio1-fdr) on mlx5_1 port 1 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 3001
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_1 port 2 fabnum 0 
vendor_err 135
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1 
(gssio1-fdr) on mlx5_1 port 2 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 2671
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_0 port 2 fabnum 0 
vendor_err 135
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1 
(gssio1-fdr) on mlx5_0 port 2 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 2495
Jan 10 11:23:44 gssio2 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_0 port 1 fabnum 0 
vendor_err 135
Jan 10 11:23:44 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1 
(gssio1-fdr) on mlx5_0 port 1 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 3077
Jan 10 11:24:11 gssio2 mmfs: [N] Node 172.41.2.1 (gssio1-fdr) lease renewal is 
overdue. Pinging to check if it is alive
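
(For anyone correlating errors like these with fabric state, two read-only
checks often run alongside the GPFS log; availability depends on the OFED
install:)

# ibstat                  # local HCA port state and link rate
# mmdiag --network        # GPFS's own view of node-to-node connections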

I've had multiple PMRs open for this issue, and I am told that our ESS needs 
code level upgrades in order to fix this issue. Looking at the errors, I think 
the issue is Infiniband related, and I am wondering if anyone on this list has 
seen similar issues?

Thanks for your help in advance.

Damir
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] nodes being ejected out of the cluster

2017-01-11 Thread Olaf Weiser
Most likely there's something wrong with your IB fabric... you say you run
~700 nodes? Are you running with verbsRdmaSend enabled? If so, please consider
disabling it, and discuss this within the PMR. Another thing you may want to
check is whether you are running IPoIB in connected mode or datagram mode. But
as I said, please discuss this within the PMR; there are too many dependencies
to discuss here.

Cheers

Mit freundlichen Grüßen / Kind regards

Olaf Weiser
EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage Platform
IBM Deutschland, IBM Allee 1, 71139 Ehningen
Phone: +49-170-579-44-66
E-Mail: olaf.wei...@de.ibm.com


From:    Damir Krstic
To:      gpfsug main discussion list
Date:    01/11/2017 03:39 PM
Subject: [gpfsug-discuss] nodes being ejected out of the cluster
Sent by: gpfsug-discuss-boun...@spectrumscale.org

[Quoted message from Damir Krstic trimmed; it is reproduced in full in the
reply from Aaron Knister above.]

Re: [gpfsug-discuss] CES log files

2017-01-11 Thread Simon Thompson (Research Computing - IT Services)
What did the smb log claim on the nodes? It should be in /var/adm/ras. If SMB 
failed, for example, then I could see that CES would mark the node as degraded.
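
For instance (log.smbd is an assumption here; the Samba log file names under
/var/adm/ras differ between releases):

# grep -iE 'error|fail' /var/adm/ras/log.smbd | tail -50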

Simon

From: "Sobey, Richard A"
Reply-To: "gpfsug-discuss@spectrumscale.org"
Date: Wednesday, 11 January 2017 at 13:59
To: "gpfsug-discuss@spectrumscale.org"
Subject: Re: [gpfsug-discuss] CES log files

[Quoted reply from Sobey, Richard A and the earlier messages in the thread
trimmed; they appear in full later in this digest.]

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] CES log files

2017-01-11 Thread Sobey, Richard A
Thanks. Some of the nodes would just say “failed” or “degraded” with the DCs 
offline. Of those that thought they were happy to host a CES IP address, they 
did not respond, and the winbindd process would sit at 100% CPU as seen through 
top, with no users on the system.

Interesting that even though all CES nodes had the same configuration, three of 
them never had a problem at all.

JF – I’ll look at the protocol tracing next time this happens. It’s a rare 
thing that three DCs go offline at once but even so there should have been 
enough resiliency to cope.

Thanks
Richard

From: gpfsug-discuss-boun...@spectrumscale.org 
[mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Andrew Beattie
Sent: 11 January 2017 09:55
To: gpfsug-discuss@spectrumscale.org
Cc: gpfsug-discuss@spectrumscale.org
Subject: Re: [gpfsug-discuss] CES log files

mmhealth might be a good place to start

CES should probably throw a message along the lines of the following:

mmhealth shows something is wrong with AD server:
...
CES  DEGRADED ads_down
...
Andrew Beattie
Software Defined Storage  - IT Specialist
Phone: 614-2133-7927
E-mail: abeat...@au1.ibm.com


- Original message -
From: "Sobey, Richard A" >
Sent by: 
gpfsug-discuss-boun...@spectrumscale.org
To: "'gpfsug-discuss@spectrumscale.org'" 
>
Cc:
Subject: [gpfsug-discuss] CES log files
Date: Wed, Jan 11, 2017 7:27 PM


Which files do I need to look in to determine what’s happening with CES… 
supposing for example a load of domain controllers were shut down and CES had 
no clue how to handle this and stopped working until the DCs were switched back 
on again.



Mmfs.log.latest said everything was fine btw.



Thanks

Richard
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] CES log files

2017-01-11 Thread Jan-Frode Myklebust
I also struggle with where to look for CES log files, but maybe the new
"mmprotocoltrace" command can be useful?

# mmprotocoltrace start smb
### reproduce problem
# mmprotocoltrace stop smb

Check log files it has collected.
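
While it is running, the trace state can also be queried (subcommand per the
mmprotocoltrace man page on your release):

# mmprotocoltrace status smb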


  -jf


On Wed, Jan 11, 2017 at 10:27 AM, Sobey, Richard A 
wrote:

> Which files do I need to look in to determine what’s happening with CES…
> supposing for example a load of domain controllers were shut down and CES
> had no clue how to handle this and stopped working until the DCs were
> switched back on again.
>
>
>
> Mmfs.log.latest said everything was fine btw.
>
>
>
> Thanks
>
> Richard
>
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] CES log files

2017-01-11 Thread Andrew Beattie
mmhealth might be a good place to start
 
CES should probably throw a message along the lines of the following:
 
mmhealth shows something is wrong with AD server:
...
CES  DEGRADED ads_down
...
Andrew Beattie
Software Defined Storage  - IT Specialist
Phone: 614-2133-7927
E-mail: abeat...@au1.ibm.com
 
 
- Original message -
From: "Sobey, Richard A"
Sent by: gpfsug-discuss-boun...@spectrumscale.org
To: "'gpfsug-discuss@spectrumscale.org'"
Cc:
Subject: [gpfsug-discuss] CES log files
Date: Wed, Jan 11, 2017 7:27 PM
Which files do I need to look in to determine what’s happening with CES… supposing for example a load of domain controllers were shut down and CES had no clue how to handle this and stopped working until the DCs were switched back on again.
 
Mmfs.log.latest said everything was fine btw.
 
Thanks
Richard
 

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] CES log files

2017-01-11 Thread Sobey, Richard A
Which files do I need to look in to determine what's happening with CES... 
supposing for example a load of domain controllers were shut down and CES had 
no clue how to handle this and stopped working until the DCs were switched back 
on again.

Mmfs.log.latest said everything was fine btw.

Thanks
Richard
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss