On 2/4/2018 7:29 AM, Jose M Calhariz wrote:
> I am chasing the root problem in my infra-structure of afsdb and
> afs-fileservers.  Sometimes my afsdb loses quorum in the middle of a
> vos operation or the Linux clients time out talking to the
> file servers.  To help diagnose the problem I would like to know how
> long is the timeout and if I can change the time out connections in
> the Debian clients and for the vos operations.
> [...]
> The core of my infra-structure are 4 afsdb running Debian 9, and
> using OpenAFS from Debian, 1.6.20, on a shared virtualization
> platform.  The file-servers running Debian 9 and using OpenAFS from
> Debian, 1.6.20, are VMs in dedicated hosts for OpenAFS on top of
> libvirt/KVM.
Jose,

There is unlikely to be a single problem, but since I'm procrastinating
and curious I decided to perform some research on your cell.  This
research is the type of analysis that AuriStor performs on behalf of
our support customers.  Many of the problems you are experiencing with
OpenAFS are likely due to, or exacerbated by, architectural limitations
that are simply not present in AuriStorFS.

Your cell has four db servers, afs01 through afs04, whose IP addresses
rank them in that same order (afs01 lowest).  Therefore afs01 is the
preferred coordinator (sync site), and if it's not running, afs02 will
be elected.  Given there are four servers, it is not possible for
afs03 or afs04 to be elected.

There are of course multiple independent ubik database services (vl,
pt, and bu), and it is possible for quorum to exist for one and not
for the others.

The vl service stores volume location information as well as
fileserver/volserver location information.  vl entries are modified
when a fileserver restarts, or when a vos command locks and unlocks an
entry, or creates, updates, or deletes one.  Its primary consumer is
the afs client, which queries volume and fileserver location
information.

The pt service stores user and group entries.  pt entries are modified
by pts when user entries are created, modified, or deleted; when
groups are created, modified, or deleted; or when group membership
information is modified.  The primary consumer is the fileserver,
which queries the pt service for the user's and host's current
protection sets (CPS) each time a client establishes an rxkad
connection to the fileserver.

The vl and pt services are of course ubik services.  Therefore each
vlserver and ptserver process also offers the ubik disk and vote
services, which are critical.  The vote service is used to hold
elections, distribute current database version info, and maintain
quorum.  The disk service is used to distribute the database, update
the database, and maintain database consistency.  It should be noted
that the vote service is time sensitive: the packets used to request
votes from peers, and the responses to them, have only a limited
valid lifetime.

Some statistics regarding your vl service.  Each server is configured
with 16 LWP threads.  afs03 and afs04 have both failed to service
calls in a timely fashion since the last restart.  If those failures
were vote or disk calls, the coordinator would mark afs03 and afs04
as unreachable and force a recovery operation; if both were marked
down across an election, that could result in loss of quorum.

Since the last restart afs01 has processed 1894352 vl transactions,
afs02 1075698, afs03 2059186, and afs04 1403592.  That will give you
some idea of the load balancing across your cache managers.  The
coordinator of course is the only one to handle write transactions;
the rest are read transactions.

For the pt service the transaction counts are afs01 1818212, afs02
1619962, afs03 1554918, and afs04 1075620, roughly on par with the vl
service load.  Like the vl service, each server has 16 LWP threads.
However, unlike the vl service, the pt service is not keeping up with
the requests.  Since the last restart all four servers have failed to
service incoming calls in a timely manner thousands of times each.

The pt service failing to be responsive is a problem because it has
ripple effects on the file servers.  The longer it takes a fileserver
to query the CPS data, the longer it takes to accept a new connection
from a cache manager.
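To make the election point above concrete, here is a toy model of the
ubik election arithmetic as I understand it.  This is a sketch, not
the actual OpenAFS implementation.  The assumed rules: every reachable
server votes for the lowest-address server it can reach, the globally
lowest address is worth an extra half vote, and election requires
strictly more than half of the cell's total votes.

    /* toy model of ubik election arithmetic -- a sketch, not the
     * real code.  Votes are counted in half-vote units so the
     * lowest-address server's extra half vote stays an integer. */
    #include <stdio.h>

    #define NSERVERS 4          /* afs01..afs04, ranked by address */

    /* can server 'cand' (0-based) be elected given the up map? */
    static int wins(const int *up, int cand)
    {
        int i, halves = 0;

        if (!up[cand])
            return 0;
        for (i = 0; i < cand; i++)
            if (up[i])
                return 0;       /* a lower-address server outranks it */
        for (i = 0; i < NSERVERS; i++)
            if (up[i])
                halves += 2;    /* each up server votes for 'cand' */
        if (cand == 0)
            halves += 1;        /* lowest address: extra half vote */
        return halves > NSERVERS;   /* i.e. > NSERVERS/2 full votes */
    }

    int main(void)
    {
        int all[]  = {1, 1, 1, 1};  /* afs01 wins: 4.5 > 2 votes */
        int no1[]  = {0, 1, 1, 1};  /* afs02 wins: 3 > 2 votes   */
        int no12[] = {0, 0, 1, 1};  /* afs03 gets exactly 2, not
                                       more than 2: no quorum    */
        printf("afs01: %d  afs02: %d  afs03: %d\n",
               wins(all, 0), wins(no1, 1), wins(no12, 2));
        return 0;
    }

Under these assumed rules afs03 and afs04 can never collect more than
half the votes, which matches the behavior described above.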
The ubik services in all versions of OpenAFS prior to the 1.8 branch
have been built as LWP (cooperatively threaded) processes.  There is
only a single thread in the process, which swaps context state among
the cooperative contexts.  The rx contexts (listener, event, ...) and
the vote, disk, and application (vl, pt, bu, ...) contexts are swapped
in either upon a blocking event or a yield.  Failure of a context to
yield blocks all other activities, including reading packets,
processing requests, etc.

Like AuriStorFS, the OpenAFS 1.8 series converts the ubik services
(vl, pt, bu) to native threading.  This permits the vote and disk
services and the rx threads (listener, event, ...) to operate with
greater parallelism.  Unlike AuriStorFS, the OpenAFS implementation
still relies to a large extent on global locks for thread safety, so
much of the resource contention remains.  Still, 1.8 will be much
less vulnerable to vote packets being delayed beyond their validity.
The vote and disk timeouts cannot be adjusted because they are a core
part of the protocol.

Each of the fileserver bnodes is configured as:

    dafileserver -L -b 8192 -vc 32768 -s 65536 -l 16384 -p 256 \
                 -udpsize 16777216 -cb 1048576 -rxpck 4000
    davolserver -p 16 -udpsize 16777216

The fileserver and volserver use pthreads, not LWP.  The above states
that there should be a total of 256 fileserver threads (including rx
and other background threads) and 16 volserver threads.  The
fileservers are also configured with roughly one million callback
entries each (-cb 1048576).

It might be the case that some fileservers in the cell are firewalled
from the internet.  Of those that are visible, here are some
important details:

           Too few   Max GUCB   Max FStat   Wait-for
           CBs       seconds    seconds     threads
    afs11  yes       112        101         no
    afs12  yes       110        150         no
    afs13  yes       111         97         no
    afs14  yes       112        155         no
    afs15  yes        80        166         no
    afs16  yes       112        106         no
    afs17  yes       118        118         no
    afs18  yes       105        123         no
    afs19  no         56         53         yes
    afs20  no         32         85         no
    afs21  no         72         63         yes
    afs22  no         56         29         no
    afs23  yes        64         29         no
    afs24  yes       106         84         no
    afs25  no         32         60         no
    afs26  no         64        120         yes

The first column, "Too few CBs", indicates whether the number of
callbacks requested by clients exceeds the size of the callback table.
When the callback table is too small, thrashing occurs because the
fileserver is forced to invalidate existing callbacks in order to
allocate new ones.  Those invalidations require the fileserver to
notify the impacted cache manager whose callback promise is going to
be broken prior to expiration, even though there was no change to the
file or directory.

The second column is the maximum number of seconds it took the
fileserver to process a GiveUpCallBacks call.  GUCBs is issued when
the cache manager's vcache is too small and it attempts to notify the
fileserver that it no longer requires the callback promise, because
it is flushing the object from the cache to make room for something
else that is likely to require a callback promise.

The third column is the maximum number of seconds it took to acquire
the current metadata for the requested object and register a callback
promise.

The fourth column indicates whether or not the fileserver has
experienced the situation in which it received a call from a cache
manager that could not be processed because all of the available
worker threads were blocked waiting for other operations to complete.
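To illustrate why an undersized callback table thrashes, here is a
self-contained toy model.  It is my sketch, not the fileserver's real
data structures or eviction policy; it only shows the shape of the
problem: once the fixed-size table fills, every new promise first
costs a break notification to some other cache manager.

    /* toy model of callback-table exhaustion -- a sketch, not the
     * fileserver's actual callback package. */
    #include <stdio.h>

    #define CB_SLOTS 4          /* real servers use -cb, e.g. 1048576 */

    struct cb { int in_use, client, vnode; };
    static struct cb table[CB_SLOTS];
    static int next_victim, breaks;

    static int find_free(void)
    {
        int i;
        for (i = 0; i < CB_SLOTS; i++)
            if (!table[i].in_use)
                return i;
        return -1;
    }

    static void grant(int client, int vnode)
    {
        int i = find_free();
        if (i < 0) {
            /* Table full: evict a victim and notify its cache
             * manager.  In the real fileserver that notification
             * is an RPC that can block a worker thread. */
            i = next_victim++ % CB_SLOTS;
            printf("break CB for client %d vnode %d\n",
                   table[i].client, table[i].vnode);
            breaks++;
        }
        table[i].in_use = 1;
        table[i].client = client;
        table[i].vnode  = vnode;
    }

    int main(void)
    {
        int v;
        for (v = 0; v < 10; v++)    /* ten promises into four slots */
            grant(v % 3, v);
        printf("%d breaks for 10 grants\n", breaks);
        return 0;
    }

With ten grants into four slots, six still-valid promises get broken;
scale the same ratio up to a busy cell and the break traffic, and the
waiting it causes, becomes significant.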
Before I answer your question about timeouts, let me explain the
workflow necessary to process a call issued by a cache manager on a
new rx connection:

 1. The CM issues a call to the file server on a new connection.
 2. If the call is protected by rxkad, the FS issues an rxkad
    challenge to the client.
 3. If a challenge was received, the CM issues an rxkad response.
 4. The FS finds an available worker thread to process the call.
 5. The FS starts the statistics clock.
 6. The FS issues a TellMeAboutYourself call to the CM with the FS's
    capabilities.
 7. The CM replies to the TMAY call with its uuid and capabilities.
 8. If the UUID was previously associated with another endpoint (ipv4
    address and port), or if the endpoint was associated with another
    UUID, the FS issues ProbeUuid calls to confirm whether or not
    there is a conflict.
 9. If received, the CM responds with a yes or no answer depending
    upon whether or not the Uuid matches.
10. The FS obtains the necessary volume and vnode locks (if any).
11. For a StoreData call, the FS receives the rest of the data and
    writes it to disk.
12. The FS issues callback notifications to other cache managers
    impacted by the call.
13. The FS allocates or updates a callback promise if necessary for
    the caller.
14. The FS, if necessary, issues a callback invalidation to the
    affected CM, which might be the same CM that issued the call
    being processed.
15. The CM either replies or not to the callback notification.  If
    not, the notification is added to the delayed callback queue for
    that CM.
16. The FS releases any volume and vnode locks.
17. The FS updates the call statistics.
18. The FS completes the call to the CM with success or failure.

For any rx connection there are three timeout values that can be set
on both sides of the connection:

 1. Connection timeout: how long to wait if no packets at all have
    been received from the peer.
 2. Idle timeout: how long to wait if ping packets are received but
    no data packets.  This is usually set only on the server side of
    a connection.
 3. Hard timeout: how long the call is permitted to live before it is
    killed for taking too long, even if data is flowing slowly.

The defaults are a connection timeout of 12 seconds, an idle timeout
of 60 seconds on the server side, and no hard timeout.  A CM
typically sets a 50 second connection timeout and no idle or hard
timeout on calls to the FS.  The FS sets a 50 second connection
timeout and a 120 second hard timeout on calls to the CM callback
service, except for ProbeUuid calls, which are assigned a connection
timeout of 12 seconds.  The FS connections to the PT service use the
defaults.

I selected the GiveUpCallBacks call statistics because that call
doesn't require any volume or vnode locks, nor can it involve any
notifications to other CMs.  Long times for GUCBs mean one or more of
the following:

 a. this is the first call on a new connection and the CM's one and
    only callback service thread is not responding to the FS
    promptly;
 b. this is the first call on a new connection, the connection
    endpoint and the CM's UUID do not match, and there is a conflict
    to resolve;
 c. this is the first call on a new connection and the FS's two CPS
    queries to the protection service take a long time, or time out
    because the selected ptserver stops responding to ping ack
    packets;
 d. the FS's host table / callback table lock is held by other
    threads and this thread cannot make progress.

The FetchStatus call is similar except that it can also block waiting
for volume and vnode locks, which might not be released until
callback notifications have been issued.
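For reference, those three timeouts map onto per-connection setters in
the rx library.  A minimal sketch, assuming the setter names from the
OpenAFS rx headers (rx.h) and using illustrative CM-style values
matching the description above:

    /* sketch: setting the three rx timeouts on a connection.
     * The values mirror the cache manager defaults described
     * above; treat them as illustrative, not prescriptive. */
    #include <rx/rx.h>

    void set_cm_style_timeouts(struct rx_connection *conn)
    {
        rx_SetConnDeadTime(conn, 50);     /* connection timeout: give
                                           * up after 50s of silence */
        rx_SetConnIdleDeadTime(conn, 0);  /* no idle timeout: pings
                                           * alone keep the call alive */
        rx_SetConnHardDeadTime(conn, 0);  /* no hard kill, however
                                           * long the call runs */
    }

The vote and disk timeouts, as noted earlier, are protocol constants
and cannot be tuned this way.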
So what are the potential bottlenecks that can result in extended
delays totaling tens or hundreds of seconds?

 1. The single callback service thread in the cache manager, which is
    known to experience soft-deadlocks.
 2. The responsiveness of the ptservers to the file servers.
 3. Blocking on callback invalidations due to the callback table
    being too small.
 4. Network connectivity between the FS and both the PT servers and
    the CMs.
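Each of these can be observed with the stock debugging tools.  For
example (hostnames here are placeholders; 7002 is the ptserver port,
7003 the vlserver port, 7000 the fileserver port):

    udebug afs01.example.org 7003    # ubik state of the vlserver:
                                     # sync site, last yes votes
    udebug afs01.example.org 7002    # same for the ptserver
    rxdebug afs11.example.org 7000 -rxstats
                                     # rx statistics; the summary also
                                     # reports calls waiting for a
                                     # thread, the fourth column above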
It's time for the Super Bowl so I will send off this message as is.
Perhaps it will be useful.

Jeffrey Altman
AuriStor, Inc.