And the logs ....

> -----Original Message-----
> From: Florin Muresan [mailto:[EMAIL PROTECTED]
> Sent: Thursday, March 15, 2007 5:03 PM
> To: Jan Harkes
> Cc: codalist@coda.cs.cmu.edu
> Subject: RE: venus crash
>
> Hello!
>
> Thank you, Jan, for your answer.
>
> > -----Original Message-----
> > From: Jan Harkes [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, March 13, 2007 4:57 PM
> > To: Florin Muresan
> > Subject: Re: venus crash
> >
> > On Tue, Mar 13, 2007 at 12:45:32PM +0200, Florin Muresan wrote:
> > > Hello everybody!
> >
> > I guess this was intended to be sent to codalist,
>
> Yes, it was, but I missed the reply-all button. Sorry for that.
>
> > > I have a similar problem in my Coda realm. Venus crashes when I try
> > > to copy (overwrite)/delete many files.
> > > I use Coda in a production environment for a web hosting solution.
> > > During the testing period this situation never happened. My guess is
> > > that the problem occurs because the high number of files accessed
> > > per second triggers false conflicts.
> >
> > A high rate of accesses isn't really something that would trigger false
> > conflicts. Also, false conflicts should not cause a client to crash.
>
> I must say that you are right; I was mistaken in saying that the Coda
> client crashes. In fact it just hangs. I have read some of the emails
> posted to codalist last year and I think I now understand better how
> Coda works.
>
> > There are some problems that I have observed with our own web/ftp
> > servers, which get pretty much everything out of /coda.
> >
> > - The Coda client is a userspace process, so all file open and close
> >   requests are forwarded to this single process, where they are queued.
> >   There is a set of about 20 worker threads that pick requests off the
> >   queue and handle them. This is in a way similar to how a web server
> >   has one incoming socket where requests are queued, which are then
> >   accepted by a web-server instance (thread or process).
> >
> >   In-kernel filesystems have it a bit easier: when the webserver makes
> >   a system call, the kernel simply uses the application's context, so
> >   there is no queueing and it can handle as many requests in parallel
> >   as there are server instances.
> >
> >   Now, we only really see open and close requests; all individual read
> >   and write calls are handled by the kernel. So if the client has most
> >   or all files cached, the worker threads don't have to do much and are
> >   pretty efficient. However, if some (or a lot of) files are updated on
> >   the server, most of the locally cached data is invalidated and almost
> >   every request ends up having to fetch data from the Coda servers.
> >   Each request then takes longer, and may hold some locks on a volume
> >   because it is updating the local cache state. So in the worst case we
> >   only process a single request at a time, the queue becomes very long,
> >   and all web-server processes are blocked.
>
> I think this is exactly what happened when I tried to delete about 900
> files at one time; the result was that the Apache webserver got blocked,
> slowing down the whole system.
>
> My Coda setup consists of one Coda server and three Coda clients (for
> the moment). It is very important to have the same web documents on all
> three clients because I implemented a load-balancing solution and all
> clients must serve the same content to visitors. Coda solves this very
> elegantly.
>
> Back to the point. Trying to solve the problem, I terminated the Apache
> processes and then restarted venus. After the restart the whole /coda
> volume was in disconnected state and I feared that I had lost all the
> data. I had to move quickly because all of my websites were down, and
> the only quick solution I could think of was to purge and reinstall the
> coda-client package on the system where I had deleted the files. I
> thought this way I would avoid any conflicts that could appear.
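Jan's picture of a single upcall queue drained by a fixed pool of worker threads can be sketched with a toy model (this is not Coda's actual code; the worker count, timings, and names are illustrative). The point it demonstrates: when most requests are cache hits the pool keeps up easily, but each cache miss ties up a worker for much longer, so a burst of misses lets the queue grow and every caller behind it waits.

```python
import queue
import threading
import time

NUM_WORKERS = 20          # roughly the size of venus's worker pool
requests = queue.Queue()  # single queue, like the kernel->venus upcall channel
results = []
lock = threading.Lock()

def worker():
    while True:
        req = requests.get()
        if req is None:          # sentinel: shut this worker down
            break
        # A cache hit is cheap; a miss must fetch from the server and may
        # hold volume locks, so it occupies the worker far longer.
        time.sleep(0.001 if req["cached"] else 0.05)
        with lock:
            results.append(req["name"])
        requests.task_done()

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

# 100 open() upcalls; every 10th one misses the cache.
for i in range(100):
    requests.put({"name": f"open-{i}", "cached": i % 10 != 0})

requests.join()              # block until every request is handled
for _ in threads:
    requests.put(None)
for t in threads:
    t.join()

print(len(results))
```

If every request were a miss, the effective throughput would collapse to roughly NUM_WORKERS / miss_latency requests per second, which is the stall the thread describes.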
> The curious thing is that the other two Coda clients were hanging after
> this problem occurred.
>
> For the sake of readability I attached the relevant logs.
>
> After I reinstalled the coda-client and restarted the Coda server,
> /coda became accessible on all the clients, but I had to wait a while
> for reintegration.
>
> I believe now that it wasn't necessary to reinstall the coda-client,
> because all the conflicts would have been resolved automatically, and
> that a restart of the Coda server would have been enough.
>
> > - Another thing that can happen is that when one client is updating a
> >   file, another client sees the update before the final commit. At
> >   this point the versions are still skewed. A write first updates each
> >   replica and then uses the success/failure status to synchronize the
> >   versions. So if we see a changed object before the versions are
> >   synchronized, the reading client believes there is a conflict and
> >   triggers server-server resolution. As a result the servers lock down
> >   the volume, exchange their copies of the file or directory, compare
> >   the differences and decide which one should be considered the most
> >   up-to-date version.
> >
> >   We detect the missed version synchronization because the contents
> >   are identical; this is a 'weak-equality' type resolution, and so the
> >   servers reset the versions to be correct again. Then when the
> >   writing client finalizes the operation, the versions end up getting
> >   bumped a second time, skewing them again, requiring the reading
> >   client to refresh its cache and triggering another resolution. There
> >   is no correctness issue here, but the two additional resolution
> >   phases definitely slow everything down because they add an
> >   additional 10 roundtrips and take an exclusive lock on the volume,
> >   preventing all readers and writers, even for unrelated objects
> >   within the volume.
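The version-vector comparison behind this can be illustrated with a small sketch. This is a simplification I am adding for clarity: the function names are mine, and Coda's real version vectors additionally carry a store identifier (visible in the `VV = {...}` dumps in the logs below), which is what lets weak-equality resolution recognize that the replica contents are actually identical.

```python
# Simplified version vectors: one update counter per replica.

def compare_vv(a, b):
    """Classify the relationship between two version vectors."""
    ge = all(x >= y for x, y in zip(a, b))
    le = all(x <= y for x, y in zip(a, b))
    if ge and le:
        return "equal"
    if ge:
        return "dominates"    # a has seen every update b has, and more
    if le:
        return "submissive"
    return "inconsistent"     # concurrent updates: potential conflict

def weakly_equal(vv_a, vv_b, storeid_a, storeid_b):
    # 'Weak equality': the vectors disagree, but the last store on both
    # replicas was the same operation, so the contents are identical and
    # the servers can simply resynchronize the vectors instead of
    # declaring a real conflict.
    return compare_vv(vv_a, vv_b) != "equal" and storeid_a == storeid_b

# A store that has reached replica A but whose commit hasn't yet bumped B:
print(compare_vv([2, 1], [1, 1]))                # dominates
print(weakly_equal([2, 1], [1, 1], 6097, 6097))  # True: resync, no conflict
```

A reading client that fetches the object in this window sees mismatched vectors and triggers resolution, exactly the transient skew Jan describes.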
> >
> >   Neither of these would introduce crashes or conflicts, though;
> >   mostly a temporary performance degradation where all web servers are
> >   blocked until the system catches up again with all the queued
> >   requests.
>
> It's clear to me now why I get this performance degradation when trying
> to copy many files, but this is not at all desirable in a production
> environment.
>
> > > Do you think that if I install version 6.9.0 the problem with false
> > > conflicts will be avoided?
> >
> > Not sure if false conflicts are your problem. A crash is a bug; even
> > when there are conflicts we shouldn't crash. With 6.9 we basically end
> > up using an existing code path that was normally only used during
> > disconnections or when connectivity was poor. That code has been
> > around for a long time, but really hadn't been tested all that well
> > because it was the fallback path. So I do actually expect it to be
> > somewhat less reliable. However, as it is the same code that older
> > clients used when things went wrong, it isn't really a step back. Any
> > bugs that could happen in unusual situations have become bugs that
> > will happen.
> >
> > > What other suggestions do you have for this situation?
> >
> > What does your setup look like? Are there replicated volumes, or is
> > everything backed by a single Coda server (i.e. a single server would
> > never have resolution issues)? How many Coda clients are there? Are
> > updates written by one client or by several clients? Which client
> > crashes, the one that is writing or another that is only reading?
>
> I described the setup above, but I must make some remarks. Typically,
> all the Coda clients will write from time to time. One of them is
> mainly used to write files, but only at specific times, and it was not
> doing any writing at the moment of the incident. In this case, the
> client that hung was one of the clients that rarely writes. At that
> time I was using it to delete the files.
>
> >
> > What web server are you using? How many threads/processes does it use?
> > How many requests per second are we talking about?
>
> On every client I run Apache 2.0.54 with mpm_prefork_module, currently
> set up for 200 MaxClients. The prefork module uses only one thread per
> process. The average load varies from 20 to 50 requests per client per
> second.
>
> > What is logged in /var/log/coda/venus.err when the client crashes?
> >
> > Jan
>
> Thank you for your time.
> Florin
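For reference, the prefork setup described above would look roughly like this in httpd.conf. Only `MaxClients 200` comes from this thread; the other values are Apache 2.0's stock prefork defaults, shown for context:

```apache
<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients          200
    MaxRequestsPerChild   0
</IfModule>
```

With prefork, each of those (up to) 200 single-threaded processes blocks independently on its open() of a file in /coda, which is why a backed-up venus queue can stall the entire site at once.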
This is what was recorded in the SrvLog file, on the SCM, at that time:

15:53:21 VGetVnode: vnode 1000002.6e06 is not allocated
15:53:21 VGetVnode: vnode 1000002.7078 is not allocated
15:53:21 VGetVnode: vnode 1000002.af3a is not allocated
15:53:21 VGetVnode: memory vnode was snatched away
15:53:21 VGetVnode: memory vnode was snatched away
15:53:21 VGetVnode: memory vnode was snatched away
15:53:21 VGetVnode: vnode 1000002.b1ec is not allocated
15:53:21 VGetVnode: vnode 1000002.b08a is not allocated
15:53:21 VGetVnode: memory vnode was snatched away
15:53:21 VGetVnode: memory vnode was snatched away
15:53:21 VGetVnode: vnode 1000002.37c4 is not allocated
15:53:21 VGetVnode: vnode 1000002.7096 is not allocated
15:53:21 VGetVnode: memory vnode was snatched away
15:53:21 VGetVnode: vnode 1000002.b0a6 is not allocated
15:53:21 VGetVnode: memory vnode was snatched away
15:53:21 VGetVnode: vnode 1000002.b0b4 is not allocated
15:53:21 VGetVnode: vnode 1000002.6e7a is not allocated
15:56:43 client_GetVenusId: got new host 192.168.110.4:32768
15:56:43 Building callback conn.
15:58:31 Callback failed RPC2_DEAD (F) for ws 192.168.110.4:32781
15:58:31 Deleting client entry for user System:AnyUser at 192.168.110.4.32781 rpcid 1103096844
15:58:31 Deleting client entry for user System:AnyUser at 192.168.110.4.32781 rpcid 926787228
15:58:31 Deleting client entry for user System:AnyUser at 192.168.110.4.32781 rpcid 1567630031
15:58:31 Deleting client entry for user System:AnyUser at 192.168.110.4.32781 rpcid 2074404818
16:00:08 Callback failed RPC2_DEAD (F) for ws 192.168.110.3:34405
16:00:08 Deleting client entry for user System:AnyUser at 192.168.110.3.34405 rpcid 1984138120
16:00:08 Deleting client entry for user System:AnyUser at 192.168.110.3.34405 rpcid 1371334541
16:00:08 Deleting client entry for user System:AnyUser at 192.168.110.3.34405 rpcid 1319064205
16:00:08 Deleting client entry for user System:AnyUser at 192.168.110.3.34405 rpcid 649689690
16:00:08 Deleting client entry for user System:AnyUser at 192.168.110.3.34405 rpcid 717920852
16:00:08 Deleting client entry for user System:AnyUser at 192.168.110.3.34405 rpcid 1033575259
16:09:37 Callback failed RPC2_DEAD (F) for ws 192.168.110.4:32768
16:09:37 Deleting client entry for user System:AnyUser at 192.168.110.4.32768 rpcid 880750331
16:20:16 Shutdown received

On one of the clients I found this in the venus.err log file:

15:54:33 DispatchWorker: signal received (seq = 10634279)
15:54:43 DispatchWorker: signal received (seq = 10638352)
15:55:49 RecovTerminate: clean shutdown

The venus.err file contains this:

[ W(2566) : 0000 : 15:54:33 ] WAITING(2.7f000001.bb.819e): level = RD, readers = 0, writers = 1
[ X(00) : 0000 : 15:55:49 ] TERM: About to terminate venus
[ X(00) : 0000 : 15:55:49 ] BeginRvmFlush (1, 16720, F)
[ X(00) : 0000 : 15:55:49 ] EndRvmFlush
[ X(00) : 0000 : 15:56:40 ] Coda Venus, version 6.0.16
[ X(00) : 0000 : 15:56:40 ] Logfile initialized with LogLevel = 0 at Wed Mar 7 15:56:40 2007
[ X(00) : 0000 : 15:56:40 ] E StatsInit()
[ X(00) : 0000 : 15:56:40 ] L StatsInit()
[ X(00) : 0000 : 15:56:41 ] BeginRvmFlush (1, 60, F)
[ X(00) : 0000 : 15:56:41 ] EndRvmFlush
[ X(00) : 0000 : 15:56:41 ] BeginRvmTruncate (1, 224, F)
[ X(00) : 0000 : 15:56:41 ] EndRvmTruncate
[ X(00) : 0000 : 15:56:42 ] fsobj::Recover: invalid fso (wrtcGfkZw, 2.7f000001.7f6.d1e3), attempting to GC...
0x508f5f48 : fid = (2.7f000001.7f6.d1e3), comp = wrtcGfkZw, vol = 50867548
  state = Normal, stat = { 0, 1, 1173275016, -2, 0600, 1, File }, rc rights = 0
  VV = {[ 3 0 0 0 0 0 0 0 ] [ 0x2c0fc047 6097 ] [ 0 ]}
  voltype = [0 1 0], fake = 0, fetching = 0 local = 0
  rep = 1, data = 1, owrite = 0, dirty = 0, shadow = 0 ckmtpt
  mvstat = Normal
  parent = (2.7f000001.bb.819e, 0), children = 0
  priority = 25 (10406), hoard = [0, -2, 0], lastref = 17478357
  mle_bindings = (0, 0), cleanstat = [-1, -1]
  cachefile = [ 00/00/0B/60, 10465/10465 ]
  refs = [0 0 0], openers = [0 0 0]
  lastresolved = 0
[ X(00) : 0000 : 15:56:42 ] fsobj::Recover: invalid fso (wrtIJzt5Z, 2.7f000001.7f8.d1e4), attempting to GC...
0x50906b08 : fid = (2.7f000001.7f8.d1e4), comp = wrtIJzt5Z, vol = 50867548
  state = Normal, stat = { 0, 0, 1173275016, -2, 0600, 1, File }, rc rights = 0
  VV = {[ 1 0 0 0 0 0 0 0 ] [ 0x2c0fc047 6098 ] [ 0 ]}
  voltype = [0 1 0], fake = 0, fetching = 0 local = 0
  rep = 1, data = 1, owrite = 0, dirty = 0, shadow = 0 ckmtpt
  mvstat = Normal
  parent = (2.7f000001.bb.819e, 0), children = 0
  priority = 25 (40553), hoard = [0, -2, 0], lastref = 17478359
  mle_bindings = (0, 0), cleanstat = [-1, -1]
  cachefile = [ 00/00/0A/C7, 0/0 ]
  refs = [0 0 0], openers = [0 0 0]
  lastresolved = 0
[ X(00) : 0000 : 15:56:42 ] fsobj::Recover: invalid fso (wrtL9Tbk3, 2.7f000001.7fa.d1e5), attempting to GC...
0x509ac108 : fid = (2.7f000001.7fa.d1e5), comp = wrtL9Tbk3, vol = 50867548
  state = Runt, stat = { 0, 0, -1, -1, 0177777, 255, ??? }, rc rights = 0
  VV = {[ 0 0 0 0 0 0 0 0 ] [ 0xffffffff 0 ] [ 0 ]}
  voltype = [0 1 0], fake = 0, fetching = 0 local = 0
  rep = 0, data = 0, owrite = 0, dirty = 0, shadow = 0 ckmtpt
  mvstat = Normal
  parent = (0.0.0.0, 0), children = 0
  priority = -1 (17552), hoard = [0, -2, 0], lastref = 0
  mle_bindings = (0, 0), cleanstat = [-1, -1]
  cachefile = [ 00/00/04/DF, 0/0 ]
  refs = [0 0 0], openers = [0 0 0]
  lastresolved = 0
[ X(00) : 0000 : 15:56:42 ] BeginRvmFlush (1, 3880, F)
[ X(00) : 0000 : 15:56:42 ] EndRvmFlush
[ X(00) : 0000 : 15:56:42 ] BeginRvmTruncate (43, 4044, F)
[ X(00) : 0000 : 15:56:42 ] EndRvmTruncate
[ X(00) : 0000 : 15:56:42 ] E adv_daemon::adv_daemon: AdviceServer
[ A(13) : 0000 : 15:56:42 ] adv_daemon::main()
[ H(07) : 0000 : 15:56:42 ] HDBDaemon about to sleep on hdbdaemon_sync
[ W(15) : 0000 : 15:56:42 ] FidToNodeid: called for volume root (1.ff000001)!!!
[ W(15) : 0000 : 15:56:43 ] userent::Connect: ViceGetAttrPlusSHA(192.168.110.2)
[ W(15) : 0000 : 15:56:43 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(15) : 0000 : 15:56:43 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(15) : 0000 : 15:56:43 ] userent::Connect: ViceGetAttrPlusSHA(192.168.110.2)
[ W(15) : 0000 : 15:56:43 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(15) : 0000 : 15:56:43 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ T(01) : 0000 : 15:56:48 ] BeginRvmFlush (1, 2904, T)
[ T(01) : 0000 : 15:56:48 ] EndRvmFlush