Re: [OpenAFS] AFS lag
>> I'm no ubik engineer, but as far as I understand it, the protocol
>> was not designed for even numbers of participating servers. For best
>> results, three or five servers seem to be optimum.
>
> I hear this frequently, and don't see why it should be true. The tie
> breaking mechanism during an election is simple.

The tie breaking mechanism isn't really the issue here. My point is that
you gain almost no benefit from an even number of servers. Specifically,
if you have four ubik servers, you have the same amount of redundancy as
if you have three servers(*); you can lose one and still maintain quorum.

(*) Okay, purists will point out that this is not exactly true. If you
have four servers and you happen to lose two, AND one of the two
remaining ones is the "best" server, quorum will still be able to be
established.

I could see other reasons for having an even number of servers, but
people should understand exactly what sort of redundancy they can expect
out of a given Ubik configuration.

--Ken
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info
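To make the redundancy point concrete, here is a minimal Python sketch of the majority rule as it is described in this thread. It is not taken from the Ubik source. The weighting is an assumption made for illustration: each YES vote counts two units and the "best" server's vote counts one unit more, so that it only ever acts as a tie-breaker. The function names has_quorum and failures_always_survivable are hypothetical.

```python
def has_quorum(total_servers, servers_up, best_is_up):
    """Return True if the servers that are up can win an election.

    Assumed weighting (illustration only): 2 units per YES vote, plus
    1 extra unit when the "best" server is among the voters. Winning
    requires strictly more units than there are configured servers,
    i.e. more than half of the 2 * total_servers ordinary units.
    """
    units = 2 * servers_up + (1 if best_is_up else 0)
    return units > total_servers


def failures_always_survivable(total_servers):
    """Largest number of servers you can lose, whichever ones they are.

    The worst case is that the "best" server is among the lost ones,
    so the tie-breaker unit is never counted here.
    """
    lost = 0
    while lost + 1 < total_servers and has_quorum(
        total_servers, total_servers - (lost + 1), best_is_up=False
    ):
        lost += 1
    return lost


for n in (2, 3, 4, 5):
    print(n, "servers: can always lose", failures_always_survivable(n))

# The footnoted special case: four servers, two lost, "best" still up.
print(has_quorum(4, 2, best_is_up=True))
```

Under these assumptions, three and four servers both tolerate the loss of any one server, and five are needed to tolerate the loss of any two, which is the point made above.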
Re: [OpenAFS] AFS lag
Derrick Brashear wrote:
> On Wed, Mar 18, 2009 at 9:56 PM, Ken Hornstein wrote:
>>> I'm no ubik engineer, but as far as I understand it, the protocol
>>> was not designed for even numbers of participating servers. For best
>>> results, three or five servers seem to be optimum.

I hear this frequently, and don't see why it should be true. The tie
breaking mechanism during an election is simple.

Kim

>> There is a lot of misinformation about Ubik out there; the voting
>> protocol is actually not complicated, it's just not documented well.
>
> it's actually well-documented, if you find Kazar's paper on Quorum
> Completion.
>
>> If your database servers are accessible via the Internet, we could
>> take a look at them via udebug. Really, there are only a few things
>> that can go wrong; of all of the pieces of AFS, I think Ubik is one
>> of the most bulletproof.
>
> There are a couple (unlikely) open issues; See RT.
Re: [OpenAFS] AFS lag
>> There is a lot of misinformation about Ubik out there; the voting
>> protocol is actually not complicated, it's just not documented well.
>
> it's actually well-documented, if you find Kazar's paper on Quorum
> Completion.

You know, we should try to find a copy of that and put it somewhere
useful. From what I remember (I think I saw a copy once), the paper gets
you about 80% of the way there; the source code gets you the rest of the
way.

Actually, I now realize that I _do_ have a copy of it. Can we put it on
the OpenAFS web site? I just have the PostScript; it's easy enough to
convert that to PDF.

>> If your database servers are accessible via the Internet, we could
>> take a look at them via udebug. Really, there are only a few things
>> that can go wrong; of all of the pieces of AFS, I think Ubik is one
>> of the most bulletproof.
>
> There are a couple (unlikely) open issues; See RT.

Didn't know about those. Still, I think we need more information to
diagnose the original problem.

--Ken
Re: [OpenAFS] AFS lag
On Wed, Mar 18, 2009 at 9:56 PM, Ken Hornstein wrote:
>> I'm no ubik engineer, but as far as I understand it, the protocol was
>> not designed for even numbers of participating servers. For best
>> results, three or five servers seem to be optimum.
>
> There is a lot of misinformation about Ubik out there; the voting
> protocol is actually not complicated, it's just not documented well.

it's actually well-documented, if you find Kazar's paper on Quorum
Completion.

> If your database servers are accessible via the Internet, we could take
> a look at them via udebug. Really, there are only a few things that can
> go wrong; of all of the pieces of AFS, I think Ubik is one of the most
> bulletproof.

There are a couple (unlikely) open issues; See RT.

--
Derrick
Re: [OpenAFS] AFS lag
> I'm no ubik engineer, but as far as I understand it, the protocol was
> not designed for even numbers of participating servers. For best
> results, three or five servers seem to be optimum.

There is a lot of misinformation about Ubik out there; the voting
protocol is actually not complicated, it's just not documented well.
Looking at the source code is even more confusing. So, let me clear up
some misconceptions:

- It's not really that an odd number is optimum; it's just that you're
  wasting a server with an even number. Why? Well, the Ubik voting
  requires a majority of the servers to win an election; if there are 4
  servers and only two are available, then that's not a majority. So
  with 4 servers, you can lose only one server and still maintain quorum
  (same as with three). You need 5 servers to be able to lose two of
  them.

  Now, there is an extra wrinkle here ... the "best" server (lowest
  numbered) gets an extra vote. So in a 4 server configuration, you can
  actually lose two and maintain quorum ... as long as neither of the
  two you lose is the "best" server. But with five servers, you can lose
  ANY two.

But the protocol works fine with two, or three, or four, or five. There
is NO magic here.

> What I definitely witnessed is that servers in a cell configured with
> two servers take more than a minute to elect a sync site after server
> restarts. Three servers are supposed to make it in an instant.

This is one of those mostly-not-true statements that has a bit of truth
in it. The exact details:

- When brought up, a database server will not vote YES for anyone for 75
  seconds. This is inviolate. It doesn't matter if there are two, three,
  or 100 database servers. If you bring up all your servers cold, at the
  same time, it will take at least 75 seconds for a quorum election.

- If you have two database servers and you only restart the "best"
  server (note: in a two database server cell, only the "best" server
  can ever be elected as master), a new election will take 75 seconds.
  Why? Because you have to wait for the best server to be able to vote
  for itself; without that vote, there is not a majority.

- If you have three (or more) database servers and you only restart the
  current master, a successful election will happen almost instantly.
  Why? Because all of the servers that are still up will still vote YES
  for the master; the master's own YES vote is not necessary. But note
  this only applies if all of the other servers are still running. If,
  for example, you rebooted the master and it took longer than 75
  seconds for the master to restart, then what will likely happen is
  that a new master will be elected.

Getting back to the original poster's question ... by far the most
common problem I have seen with Ubik is bad time synchronization. All of
your database servers must be synched up time-wise (the protocol depends
on timestamps). It doesn't need to be femtosecond accuracy; the protocol
defines MAXSKEW as 10 seconds.

If your database servers are accessible via the Internet, we could take
a look at them via udebug. Really, there are only a few things that can
go wrong; of all of the pieces of AFS, I think Ubik is one of the most
bulletproof.

--Ken
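A rough Python sketch of the restart cases described in the message above. It is a simplified model, not the Ubik implementation. The assumptions are: servers are numbered from 0, server 0 is the "best" one, every server is reachable, all restarted servers came up at the same moment, each YES vote counts two units, and the "best" server's vote counts one unit more as a tie-breaker. The name earliest_election and the constant NO_YES_VOTE_WINDOW are hypothetical; only the 75-second figure comes from the thread.

```python
# Seconds during which a freshly started database server will not vote
# YES for anyone (the 75-second figure quoted in the thread).
NO_YES_VOTE_WINDOW = 75


def earliest_election(total_servers, restarted, best=0):
    """Lower bound, in seconds, before a sync site can be elected.

    `restarted` is the set of server numbers that were just restarted.
    Servers that stayed up can vote YES immediately; restarted ones
    cannot until NO_YES_VOTE_WINDOW has passed. The vote weighting
    (2 units per vote, 3 for the "best" server) is an assumption.
    """
    able_to_vote_now = [s for s in range(total_servers) if s not in restarted]
    units = 2 * len(able_to_vote_now) + (1 if best in able_to_vote_now else 0)
    if units > total_servers:
        return 0  # the servers that stayed up already form a majority
    return NO_YES_VOTE_WINDOW  # must wait for a restarted server's vote


print(earliest_election(2, {0}))        # two servers, "best" one restarted
print(earliest_election(3, {0}))        # three servers, master restarted
print(earliest_election(3, {0, 1, 2}))  # whole cell started cold
```

Under these assumptions the three calls print 75, 0 and 75, matching the two-server, three-server and cold-start cases listed above. The zero is only a lower bound; the real election still takes a moment.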
Re: [OpenAFS] AFS lag
>> I agree with Abdelkader and would recommend having at least 3
>> database servers. You could be walking on very thin ice with just 2.
>
> What's the reason for this?

I'm no ubik engineer, but as far as I understand it, the protocol was
not designed for even numbers of participating servers. For best
results, three or five servers seem to be optimum.

What I definitely witnessed is that servers in a cell configured with
two servers take more than a minute to elect a sync site after server
restarts. Three servers are supposed to make it in an instant.

Apart from that, my test cell runs two servers and it works just fine,
so long as no DB server restarts are necessary. It's plain annoying when
I do development on a DB service.

There may be more pitfalls in 2-server setups that I'm unaware of.

Regards
Felix
Re: [OpenAFS] AFS lag
On Wed, Mar 18, 2009 at 5:30 PM, Pesce, Nicholas wrote:
> I agree with Abdelkader and would recommend having at least 3 database
> servers. You could be walking on very thin ice with just 2.

What's the reason for this?

--
Abdelkader El mastour
0620477723
Re: [OpenAFS] AFS lag
On Wed, Mar 18, 2009 at 2:54 PM, Derrick Brashear wrote:
> On Wed, Mar 18, 2009 at 5:35 AM, Abdelkader El mastour wrote:
> > Configuration
> > Netbsd4
> > heimdal1.1
> > arla
> > Openafs 1.4.5 via pkgsrc
> > replicated root.afs & root.cell RO
> > 1000 user per server
> >
> > 10 servers for fileserver.
>
> what's the configuration of the fileservers?
>
> bos status (any fileserver host) fs -long
>
> and share the information?

Filelog: http://perso.epitech.eu/~el-mas_a/Filelog/FileLog

> are there any multihomed machines involved, or NATs

What do you mean by multihomed machines?

--
Abdelkader El mastour
0620477723
RE: [OpenAFS] AFS lag
We just experienced significant lag at our AFS site with vos exam and
vos release operations. This seemed to be caused by a bug with Ubik
callbacks (version 1.4.7). One of our database servers was restarted,
and then none of the database servers synced properly with the sync site
(only the sync site was working). I got all but one of the vlservers to
run. But until I got all 6 servers functioning properly (after patching)
we still saw this issue.

Have you checked udebug to ensure that all of your database server
processes are current, up and giving a beacon?

I agree with Abdelkader and would recommend having at least 3 database
servers. You could be walking on very thin ice with just 2.

Sincerely,

--
Nicholas Pesce
npe...@qualcomm.com

-----Original Message-----
From: openafs-info-ad...@openafs.org [mailto:openafs-info-ad...@openafs.org] On Behalf Of Felix Frank
Sent: Wednesday, March 18, 2009 4:15 AM
To: Abdelkader El mastour
Cc: openafs-info@openafs.org
Subject: Re: [OpenAFS] AFS lag

On Wed, 18 Mar 2009, Abdelkader El mastour wrote:

> Configuration
> Netbsd4
> heimdal1.1
> arla

You have Arla clients?

> Openafs 1.4.5 via pkgsrc
> replicated root.afs & root.cell RO
> 1000 user per server
>
> 10 servers for fileserver.
>
> 2 servers for vlserver and ptserver

This is not good. I've recently run some tests with 2 DB-servers, and
operation is not optimal. It can take them longer than necessary to
determine the sync site. 3 servers is pretty much ideal, but even a
single server works smoother than 2 IMHO.

> Our users have been experiencing some major lag accessing afs.
> It all began when we had a hardware problem with one of our afs servers
> (afs-1); accessing afs was laggy for every user on the server,
> so we decided to move every one of them from this server to one of the
> nine others.
> We shut down the broken server, took it off the listaddrs list and
> restarted the vlserver instance.
> The slowdown continues.
>
> We turned on the afs-1 server again, but without launching any afs
> services, and then there were no more lags accessing afs.
> Since then we've had to shut down afs-1 and take it off the listaddrs
> list, and the lags are back.
> Note #1: the afs servers have been up for a year and we've never
> experienced any issue before.
> Note #2: bos status and sysstat don't reveal any issue.
> Any guess about the reasons for the lags?

I presume afs-1 was NOT one of your DB servers. If it is, CellServDB
would be the place to start.

There may be problems with replicated volumes. root.cell should be
cached at all times (are there frequent vos releases?) but who knows...

On afflicted clients, try fs checkvolumes.

HTH
Felix
Re: [OpenAFS] AFS lag
On Wed, Mar 18, 2009 at 10:01 AM, Abdelkader El mastour wrote:
> On Wed, Mar 18, 2009 at 2:54 PM, Derrick Brashear wrote:
>> On Wed, Mar 18, 2009 at 5:35 AM, Abdelkader El mastour wrote:
>> > Configuration
>> > Netbsd4
>> > heimdal1.1
>> > arla
>> > Openafs 1.4.5 via pkgsrc
>> > replicated root.afs & root.cell RO
>> > 1000 user per server
>> >
>> > 10 servers for fileserver.
>>
>> what's the configuration of the fileservers?
>>
>> bos status (any fileserver host) fs -long
>>
>> and share the information?
>
> Instance fs, (type is fs) currently running normally.
>     Auxiliary status is: file server running.
>     Process last started at Tue Mar 17 00:46:17 2009 (4 proc starts)
>     Last exit at Tue Mar 17 00:16:46 2009
>     Command 1 is '/usr/pkg/libexec/openafs/fileserver -L -p 128'

-L -p 128 is what i hoped to see (it means you've configured the
fileserver reasonably)

are there any multihomed machines involved, or NATs?

what's in the FileLog on (any of the servers)?

--
Derrick
Re: [OpenAFS] AFS lag
On Wed, Mar 18, 2009 at 5:35 AM, Abdelkader El mastour wrote:
> Configuration
> Netbsd4
> heimdal1.1
> arla
> Openafs 1.4.5 via pkgsrc
> replicated root.afs & root.cell RO
> 1000 user per server
>
> 10 servers for fileserver.

what's the configuration of the fileservers?

bos status (any fileserver host) fs -long

and share the information?
Re: [OpenAFS] AFS lag
On Wed, 18 Mar 2009, Abdelkader El mastour wrote:

> Configuration
> Netbsd4
> heimdal1.1
> arla

You have Arla clients?

> Openafs 1.4.5 via pkgsrc
> replicated root.afs & root.cell RO
> 1000 user per server
>
> 10 servers for fileserver.
>
> 2 servers for vlserver and ptserver

This is not good. I've recently run some tests with 2 DB-servers, and
operation is not optimal. It can take them longer than necessary to
determine the sync site. 3 servers is pretty much ideal, but even a
single server works smoother than 2 IMHO.

> Our users have been experiencing some major lag accessing afs.
> It all began when we had a hardware problem with one of our afs servers
> (afs-1); accessing afs was laggy for every user on the server,
> so we decided to move every one of them from this server to one of the
> nine others.
> We shut down the broken server, took it off the listaddrs list and
> restarted the vlserver instance.
> The slowdown continues.
>
> We turned on the afs-1 server again, but without launching any afs
> services, and then there were no more lags accessing afs.
> Since then we've had to shut down afs-1 and take it off the listaddrs
> list, and the lags are back.
> Note #1: the afs servers have been up for a year and we've never
> experienced any issue before.
> Note #2: bos status and sysstat don't reveal any issue.
> Any guess about the reasons for the lags?

I presume afs-1 was NOT one of your DB servers. If it is, CellServDB
would be the place to start.

There may be problems with replicated volumes. root.cell should be
cached at all times (are there frequent vos releases?) but who knows...

On afflicted clients, try fs checkvolumes.

HTH
Felix