Re: [OpenAFS] AFS lag

2009-05-27 Thread Ken Hornstein
>>I'm no ubik engineer, but as far as I understand it, the protocol
>>was not designed for even numbers of participating servers. For best
>>results, three or five servers seem to be optimum.
>
>I hear this frequently, and don't see why it should be true.  The tie
>breaking mechanism during an election is simple.

The tie breaking mechanism isn't really the issue here.

My point is that you gain almost no benefit from an even number of
servers.  Specifically, if you have four ubik servers, you have the
same amount of redundancy as if you have three servers(*); you can lose
one and still maintain quorum.

(*) Okay, purists will point out that this is not exactly true.  If you
have four servers and you happen to lose two, AND one of the two remaining
ones is the "best" server, quorum can still be established.
I could see other reasons for having an even number of servers, but people
should understand exactly what sort of redundancy they can expect out
of a given Ubik configuration.

--Ken
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] AFS lag

2009-05-26 Thread Kim Kimball




Derrick Brashear wrote:
> On Wed, Mar 18, 2009 at 9:56 PM, Ken Hornstein wrote:
>>> I'm no ubik engineer, but as far as I understand it, the protocol was not
>>> designed for even numbers of participating servers. For best results, three
>>> or five servers seem to be optimum.

I hear this frequently, and don't see why it should be true.  The tie
breaking mechanism during an election is simple.

Kim

>> There is a lot of misinformation about Ubik out there; the voting
>> protocol is actually not complicated, it's just not documented well.
>
> it's actually well-documented, if you find Kazar's paper on Quorum Completion.
>
>> If your database servers are accessible via the Internet, we could take
>> a look at them via udebug.  Really, there are only a few things that can
>> go wrong; of all of the pieces of AFS, I think Ubik is one of the most
>> bulletproof.
>
> There are a couple (unlikely) open issues; See RT.


Re: [OpenAFS] AFS lag

2009-03-19 Thread Ken Hornstein
>> There is a lot of misinformation about Ubik out there; the voting
>> protocol is actually not complicated, it's just not documented well.
>
>it's actually well-documented, if you find Kazar's paper on Quorum Completion.

You know, we should try to find a copy of that and put it somewhere useful.
From what I remember (I think I saw a copy once), the paper gets you about
80% of the way there; the source code gets you the rest of the way.

Actually, I now realize that I _do_ have a copy of it.  Can we put it on
the OpenAFS web site?  I just have the PostScript; it's easy enough to
convert that to PDF.

>> If your database servers are accessible via the Internet, we could take
>> a look at them via udebug.  Really, there are only a few things that can
>> go wrong; of all of the pieces of AFS, I think Ubik is one of the most
>> bulletproof.
>
>There are a couple (unlikely) open issues; See RT.

Didn't know about those.  Still, I think we need more information to diagnose
the original problem.

--Ken


Re: [OpenAFS] AFS lag

2009-03-18 Thread Derrick Brashear
On Wed, Mar 18, 2009 at 9:56 PM, Ken Hornstein  wrote:
>>I'm no ubik engineer, but as far as I understand it, the protocol was not
>>designed for even numbers of participating servers. For best results, three
>>or five servers seem to be optimum.
>
> There is a lot of misinformation about Ubik out there; the voting
> protocol is actually not complicated, it's just not documented well.

it's actually well-documented, if you find Kazar's paper on Quorum Completion.

> If your database servers are accessible via the Internet, we could take
> a look at them via udebug.  Really, there are only a few things that can
> go wrong; of all of the pieces of AFS, I think Ubik is one of the most
> bulletproof.

There are a couple (unlikely) open issues; See RT.


-- 
Derrick


Re: [OpenAFS] AFS lag

2009-03-18 Thread Ken Hornstein
>I'm no ubik engineer, but as far as I understand it, the protocol was not
>designed for even numbers of participating servers. For best results, three
>or five servers seem to be optimum.

There is a lot of misinformation about Ubik out there; the voting
protocol is actually not complicated, it's just not documented well.
Looking at the source code is even more confusing.

So, let me clear up some misconceptions:

- It's not really that an odd number is optimum; it's just that you're wasting
  a server with an even number.

  Why?  Well, the Ubik voting requires a majority number of servers to win an
  election; if there are 4 servers and only two are available, then that's
  not a majority.  So with 4 servers, you can lose only one server and still
  maintain quorum (same as with three).  You need 5 servers to be able to
  lose two of them.

  Now, there is an extra wrinkle here ... the "best" server (lowest
  numbered) gets an extra vote.  So in a 4 server configuration,
  you can actually lose two and maintain quorum ... as long as one
  of the two isn't the "best" server.  But with five servers, you
  can lose ANY two.  But the protocol works fine with two, or three, or
  four, or five.  There is NO magic here.
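
To make the arithmetic above concrete, here is a rough sketch in Python.
It is a simplified model, not the actual Ubik code: it assumes the "extra
vote" of the best server acts as a tie-breaker (counted here as half a
vote), which matches the 2-, 3-, 4- and 5-server cases described above.
The function names are made up for illustration.

```python
def has_quorum(total_servers, servers_up, best_is_up):
    """Simplified model: strict majority, with the best server breaking ties."""
    # Count in half-votes so the tie-breaker stays in integer arithmetic.
    half_votes = 2 * servers_up + (1 if best_is_up else 0)
    return half_votes > total_servers


def max_guaranteed_losses(total_servers):
    """Largest k such that losing ANY k servers still leaves a quorum."""
    k = 0
    while k + 1 < total_servers:
        remaining = total_servers - (k + 1)
        # Worst case: the best server is among the ones that were lost.
        if not has_quorum(total_servers, remaining, best_is_up=False):
            break
        k += 1
    return k


for n in (2, 3, 4, 5):
    print(n, "servers: can lose any", max_guaranteed_losses(n))
```

Under this model, has_quorum(4, 2, True) holds but has_quorum(4, 2, False)
does not, which is the "extra wrinkle" for the 4-server case.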

>What I definitely witnessed is that servers in a cell configured with two
>servers take more than a minute to elect a sync site after server restarts.
>Three servers are supposed to make it in an instant.

This is one of those mostly-not-true statements that has a bit of truth in
it.  The exact details:

- When brought up, a database server will not vote YES for anyone for 75
  seconds.  This is inviolate.  It doesn't matter if there are two,
  three, or 100 database servers.  If you bring up all your servers
  cold, at the same time, it will take at least 75 seconds for a
  quorum election.

- If you have two database servers and you only restart the "best" server
  (note: in a two database server cell, only the "best" server can
  ever be elected as master), a new election will take 75 seconds.
  Why?  Because you have to wait for the best server to be able to
  vote for itself; without that vote, there is not a majority.
 
- If you have three (or more) database servers and you only restart the
  current master, a successful election will happen almost instantly.
  Why?  Because all of the servers that are still up will still vote
  YES for the master; the master's own YES vote is not necessary.  But
  note this only applies if all of the other servers are still running.
  If, for example, you rebooted the master and if it took longer than
  75 seconds for the master to restart, then what will likely happen is
  a new master will be elected.
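
The three cases above can be sketched roughly as follows. This is a toy
model under stated assumptions, not how the Ubik code is structured:
servers are numbered from 0 with 0 as the "best" server, the best
server's vote is treated as a tie-breaker, restarted servers come back
right away, and "almost instantly" is represented as 0 seconds. The names
are hypothetical.

```python
NO_YES_VOTE_SECONDS = 75  # startup window described above


def has_quorum(total_servers, yes_voters, best=0):
    """Strict majority; the best server's vote breaks ties (assumed model)."""
    half_votes = 2 * len(yes_voters) + (1 if best in yes_voters else 0)
    return half_votes > total_servers


def seconds_until_quorum(total_servers, restarted, best=0):
    """Rough wait after restarting the servers in `restarted` together."""
    still_running = set(range(total_servers)) - set(restarted)
    if has_quorum(total_servers, still_running, best):
        # The servers that stayed up can already supply a majority of YES votes.
        return 0
    # Otherwise a restarted server's YES vote is needed, so wait out the window.
    return NO_YES_VOTE_SECONDS


# Cold start of 3 servers, best server of 2 restarted, master of 3 restarted.
for n, restarted in [(3, {0, 1, 2}), (2, {0}), (3, {0})]:
    print(n, sorted(restarted), seconds_until_quorum(n, restarted))
```

The sketch deliberately ignores the case where the restart itself takes
longer than 75 seconds and a different master gets elected.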

Getting back to the original poster's question ... by far the most common
problem I have seen with Ubik is bad time synchronization.  All of your
database servers must be synched up time-wise (the protocol depends on
timestamps).  It doesn't need to be femtosecond accuracy; the protocol
defines MAXSKEW as 10 seconds.
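
As a quick illustration of what "synched up" means here, this sketch
compares clock readings from each database server and reports any pair
that differs by more than MAXSKEW. The host names and readings are
invented, how you collect the readings is up to you, and treating the
limit as a strict "greater than" comparison is an assumption.

```python
from itertools import combinations

MAXSKEW = 10  # seconds, as quoted above


def skewed_pairs(clocks, max_skew=MAXSKEW):
    """Return (host_a, host_b, difference) for every pair beyond max_skew."""
    pairs = []
    for a, b in combinations(sorted(clocks), 2):
        diff = abs(clocks[a] - clocks[b])
        if diff > max_skew:
            pairs.append((a, b, diff))
    return pairs


# Hypothetical readings (seconds since the epoch) taken at the same moment.
readings = {"db1": 1000.0, "db2": 1003.5, "db3": 1012.0}
for a, b, diff in skewed_pairs(readings):
    print(f"{a} and {b} differ by {diff:.1f}s")
```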

If your database servers are accessible via the Internet, we could take
a look at them via udebug.  Really, there are only a few things that can
go wrong; of all of the pieces of AFS, I think Ubik is one of the most
bulletproof.

--Ken


Re: [OpenAFS] AFS lag

2009-03-18 Thread Felix Frank

>> I agree with Abdelkader and would recommend having at least 3 database
>> servers.  You could be walking on very thin ice with just 2.
>
> What's the reason for this?


I'm no ubik engineer, but as far as I understand it, the protocol was not
designed for even numbers of participating servers. For best results, three
or five servers seem to be optimum.

What I definitely witnessed is that servers in a cell configured with two
servers take more than a minute to elect a sync site after server restarts.
Three servers are supposed to make it in an instant.

Apart from that, my test cell runs two servers and it works just fine, so long
as no DB server restarts are necessary. It's plain annoying when I do
development on a DB service. There may be more pitfalls in 2-server setups
that I'm unaware of.

Regards
Felix


Re: [OpenAFS] AFS lag

2009-03-18 Thread Abdelkader El mastour
On Wed, Mar 18, 2009 at 5:30 PM, Pesce, Nicholas wrote:

> We just experienced significant lag issues at our AFS site for vos exam and
> vos release issues.  This seemed to be caused by a bug with Ubik callbacks
> (version 1.4.7) .  One of our database servers was restarted then all of the
> database servers did not sync properly with the sync-site (only the sync
> site was working). I got all but one of the vlserver's to run.  But until I
> got all 6 servers functioning properly (after patching) we still saw this
> issue.
>
> Have you checked udebug to ensure that all of your database server
> processes are current, up and giving a beacon?
>
>
> I agree with Abdelkader and would recommend having at least 3 database
> servers.  You could be walking on very thin ice with just 2.
>
> Sincerely,
>
> --
> Nicholas Pesce
> npe...@qualcomm.com



> I agree with Abdelkader and would recommend having at least 3 database
> servers.  You could be walking on very thin ice with just 2.

What's the reason for this?

-- 
Abdelkader El mastour
0620477723


Re: [OpenAFS] AFS lag

2009-03-18 Thread Abdelkader El mastour
On Wed, Mar 18, 2009 at 2:54 PM, Derrick Brashear  wrote:

> On Wed, Mar 18, 2009 at 5:35 AM, Abdelkader El mastour
>  wrote:
> > Configuration
> > Netbsd4
> > heimdal1.1
> > arla
> > Openafs 1.4.5 via pkgsrc
> > replicated root.afs & root.cell RO
> > 1000 user per server
> >
> > 10 servers for fileserver.
>
> what's the configuration of the fileservers?
>
> bos status (any fileserver host) fs -long
>
> and share the information?
>

FileLog:
http://perso.epitech.eu/~el-mas_a/Filelog/FileLog

> are there any multihomed machines involved, or NATs

What do you mean by multihomed machines?

-- 
Abdelkader El mastour
0620477723


RE: [OpenAFS] AFS lag

2009-03-18 Thread Pesce, Nicholas
We just experienced significant lag at our AFS site with vos exam and vos
release.  This seemed to be caused by a bug with Ubik callbacks (version
1.4.7).  One of our database servers was restarted, and then none of the
other database servers synced properly with the sync site (only the sync
site was working).  I got all but one of the vlservers to run, but until I
got all 6 servers functioning properly (after patching) we still saw this
issue.

Have you checked udebug to ensure that all of your database server processes 
are current, up and giving a beacon?


I agree with Abdelkader and would recommend having at least 3 database
servers.  You could be walking on very thin ice with just 2.

Sincerely,

--
Nicholas Pesce
npe...@qualcomm.com




Re: [OpenAFS] AFS lag

2009-03-18 Thread Derrick Brashear
On Wed, Mar 18, 2009 at 10:01 AM, Abdelkader El mastour
 wrote:
>
>
> On Wed, Mar 18, 2009 at 2:54 PM, Derrick Brashear  wrote:
>>
>> On Wed, Mar 18, 2009 at 5:35 AM, Abdelkader El mastour
>>  wrote:
>> > Configuration
>> > Netbsd4
>> > heimdal1.1
>> > arla
>> > Openafs 1.4.5 via pkgsrc
>> > replicated root.afs & root.cell RO
>> > 1000 user per server
>> >
>> > 10 servers for fileserver.
>>
>> what's the configuration of the fileservers?
>>
>> bos status (any fileserver host) fs -long
>>
>> and share the information?
>
> Instance fs, (type is fs) currently running normally.
>     Auxiliary status is: file server running.
>     Process last started at Tue Mar 17 00:46:17 2009 (4 proc starts)
>     Last exit at Tue Mar 17 00:16:46 2009
>     Command 1 is '/usr/pkg/libexec/openafs/fileserver -L -p 128'

-L -p 128
is what i hoped to see (it means you've configured the fileserver reasonably)

are there any multihomed machines involved, or NATs? what's in the
FileLog on (any of the servers)?

-- 
Derrick


Re: [OpenAFS] AFS lag

2009-03-18 Thread Derrick Brashear
On Wed, Mar 18, 2009 at 5:35 AM, Abdelkader El mastour
 wrote:
> Configuration
> Netbsd4
> heimdal1.1
> arla
> Openafs 1.4.5 via pkgsrc
> replicated root.afs & root.cell RO
> 1000 user per server
>
> 10 servers for fileserver.

what's the configuration of the fileservers? can you run

bos status (any fileserver host) fs -long

and share the output?


Re: [OpenAFS] AFS lag

2009-03-18 Thread Felix Frank

On Wed, 18 Mar 2009, Abdelkader El mastour wrote:

> Configuration
> Netbsd4
> heimdal1.1
> arla

You have Arla clients?

> Openafs 1.4.5 via pkgsrc
> replicated root.afs & root.cell RO
> 1000 user per server
>
> 10 servers for fileserver.
>
> 2 servers for vlserver and ptserver

This is not good. I've recently run some tests with 2 DB-servers, and
operation is not optimal. It can take them longer than necessary to
determine the sync site. 3 servers is pretty much ideal, but even a single
server works smoother than 2 IMHO.

> Our users have been experiencing some major lag accessing afs.
> It all began when we had a hardware problem with one of our afs servers
> (afs-1); accessing afs was laggy for every user on the server,
> so we decided to move every one of them from this server to one of the
> nine others.
> We shut down the broken server, took it off the listaddrs list and
> restarted the vlserver instance.
> The slowdown continued.
>
> We turned on the afs-1 server again, but without launching any afs
> services, and then there were no more lags accessing afs.
> Since then we've had to shut down afs-1 and take it off the listaddrs
> list, and the lags are back.
> Note#1: the afs servers have been up for a year and we've never
> experienced any issue before.
> Note#2: bos status and sysstat don't reveal any issue.
> Any guess about the reasons for the lags?

I presume afs-1 was NOT one of your DB servers. If it is,
CellServDB would be the place to start.

There may be problems with replicated volumes. root.cell should be cached at
all times (are there frequent vos releases?) but who knows...

On afflicted clients, try fs checkvolumes.

HTH
Felix