Re: [OpenAFS] Re: Recover old vice partition

2011-10-18 Thread Tom Keiser
On Tue, Oct 18, 2011 at 11:43 AM, Andrew Deason adea...@sinenomine.net wrote:
 On Tue, 18 Oct 2011 17:13:32 +0200
 ProbaNet i...@probanet.it wrote:

 Hello!
       We have an old vice partition (/vicepc) on an old backup
 hard-disk with data that is currently not in use. Is there a way to access
 that data? We would like to bring its volumes online, if
 possible. Thank you!

 How old is it? Does it contain a directory called AFSIDat? Do you know
 what platform the fileserver was that last used the partition?

 If there is an AFSIDat directory on it, you can probably just mount the
 partition on any namei fileserver, and you should be able to get at the

With one minor caveat: namei is not endian-agnostic.  Thus, you'll
have to read your vice partition on a machine of the same endianness
as the one that originally wrote the data.
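In case it's useful, a trivial way to check which byte order a given
machine uses before deciding where to mount the old /vicepc (an
illustrative standalone snippet, not part of OpenAFS):

    #include <stdio.h>
    #include <stdint.h>

    /* Print the byte order of the host this is compiled and run on. */
    int main(void)
    {
        uint32_t probe = 0x01020304;
        unsigned char first = *(unsigned char *)&probe;

        printf("this host is %s-endian\n", first == 0x01 ? "big" : "little");
        return 0;
    }

If the answer differs from the machine that originally wrote the
partition, the namei metadata won't be interpreted correctly.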

-Tom


Re: [OpenAFS] Overview? Linux filesystem choices

2010-09-30 Thread Tom Keiser
On Thu, Sep 30, 2010 at 12:02 PM, chas williams - CONTRACTOR
c...@cmf.nrl.navy.mil wrote:
 On Thu, 30 Sep 2010 14:19:51 +0200
 Stephan Wiesand stephan.wies...@desy.de wrote:

 Hi Jeff,

 On Sep 29, 2010, at 22:18 , Jeffrey Altman wrote:

  RAID is not a replacement for ZFS.  ZRAID-3 protects against single
  bit disk corruption errors that RAID cannot.  Only ZFS stores a
  checksum of the data as part of each block and verifies it before
  delivering the data to the application.  If the checksum fails and
  there are replicas, ZFS will read the data from another copy and
  fixup the damaged version. That is what makes ZFS so special and so
  valuable.  If you have data that must be correct, you want ZFS.


 you're right, of course. This is a very desirable feature, and the
 main reason why I'd love to see ZFS become available on linux.

 I disagree with the "RAID cannot provide this" statement, though. RAID-5
 has the data to detect single bit corruption, and RAID-6 even has the
 data to correct it. Alas, verifying/correcting data upon read is not
 a common feature. I know of just one vendor (DDN) actually providing
 it. It's a mystery to me why the others don't.

 Anyway, the next best option if ZFS is not available is to run parity
 checks on all your arrays regularly. Things do get awkward when
 errors show up, but at least you know. Both Linux MD RAID and the
 better hardware solutions offer this.

 From my experience, disks don't do this at random and do not develop
 such a fault during their life span, but some broken disks do it
 frequently from the beginning. NB I only ever observed this problem
 with SATA drives.

 raid5 really isn't quite the same as what jeff is describing about zfs.
 zfs apparently maintains multiple copies of the same block across
 different devices.  if you had a single bit error in one of those
 blocks (as determined by some checksum apparently stored with this
 block), zfs will pick another block that is supposed to contain the
 same data.

 raid5 only corrects single bit errors.  it can detect multiple bit
 errors.  raid5 (to my knowledge) always verifies, even on reads and can
 correct single bit errors.  raid6 can correct two single bit

RAID-5 only provides a single parity bit.  Unfortunately, this means
that it can merely detect a single-bit error; it cannot correct
the error, since there is insufficient information to determine which of
the stripes is in error.  RAID-6 is complicated because different
implementations use different algorithms for the two orthogonal
checksums.  IIRC, all of them are able to detect two-bit errors, and
some of them can correct a single-bit error.
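A toy illustration of that point (plain XOR parity, nothing
RAID-specific): after a silent bit flip the parity check fails, but
nothing in the parity itself tells you which stripe (or whether the
parity block itself) is the bad one.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* four data "stripes" plus one XOR parity value */
        uint8_t stripe[4] = { 0xA1, 0x3C, 0x77, 0x05 };
        uint8_t parity = stripe[0] ^ stripe[1] ^ stripe[2] ^ stripe[3];

        stripe[2] ^= 0x10;                   /* silently flip one bit */

        uint8_t check = stripe[0] ^ stripe[1] ^ stripe[2] ^ stripe[3];
        if (check != parity)
            printf("corruption detected, but parity alone cannot say "
                   "which of the five blocks is wrong\n");
        return 0;
    }

ZFS gets around this by keeping a checksum of each block in the parent
block pointer, so it knows which copy is bad and which copy is good.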

-Tom


Re: [OpenAFS-devel] Re: [OpenAFS] Re: 1.6 and post-1.6 OpenAFS branch management and schedule

2010-06-18 Thread Tom Keiser
On Fri, Jun 18, 2010 at 3:55 AM, Jeffrey Hutzelman jh...@cmu.edu wrote:
 --On Thursday, June 17, 2010 04:12:48 PM -0500 Andrew Deason
 adea...@sinenomine.net wrote:

 On Thu, 17 Jun 2010 15:54:25 -0500
 Andrew Deason adea...@sinenomine.net wrote:

 And as has been mentioned elsewhere in the thread, you need to wait for
 the VG hierarchy summary scan to complete, no matter how fast salvaging
 is or how many you do in parallel. That involves reading the headers of
 all volumes on the partition, so it's not fast (but it is very fast if
 you're comparing it to the recovery time of a 1.4 unclean shutdown)

 Also, while I keep talking about this, what I haven't mentioned is that
 it may be solvable. Although I've never seen any code or even a
 complete plan for it yet, recording the VG hierarchy information on disk
 would obviate the need for this scan. Doing this would allow you to
 salvage essentially instantly in most cases, so you might be able to
 recover from an unclean shutdown and salvage 100s of volumes in a few
 seconds.

 It's also worth noting that in a namei fileserver, each VG is actually
 wholly self-contained, so there is no reason in the world why you should
 have to scan every VG on the partition before you can start salvaging any of
 them.  The salvage server design really should take this property into
 account, as it seems likely that some future backends may also have this
 property.


We _do_ treat each VG as a separate, concurrently-processed entity.
The problem is that the on-disk format's VG membership data leaves much to
be desired--all we have to work with is the parent's volume id in
VolumeHeader_t (in other words, a forest of up-trees).  Hence, given
any arbitrary volume id, you end up performing an exhaustive search to
determine the full membership set of a VG.  This is why we wrote the
VGC in the first place: so you only have to perform that exhaustive
search once.
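A rough sketch of what that exhaustive search amounts to (the structure
and function names here are illustrative, not the actual vol/VGC code):

    #include <stdint.h>

    typedef uint32_t VolumeId;

    struct vol_header {        /* only the fields that matter for the sketch */
        VolumeId id;           /* this volume */
        VolumeId parent;       /* RW "root" of its volume group (the up-tree) */
    };

    /*
     * Collect the members of the VG rooted at rw_id.  The caller has to
     * hand us every header on the partition, because the parent pointer is
     * the only membership information the on-disk format gives us -- which
     * is exactly the full-partition scan the VGC exists to amortize.
     */
    static int vg_members(const struct vol_header *hdrs, int nvols,
                          VolumeId rw_id, VolumeId *out)
    {
        int n = 0;

        for (int i = 0; i < nvols; i++)   /* O(all volumes on the partition) */
            if (hdrs[i].id == rw_id || hdrs[i].parent == rw_id)
                out[n++] = hdrs[i].id;
        return n;
    }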

-Tom


Re: [OpenAFS] Re: [OpenAFS-devel] 1.6 and post-1.6 OpenAFS branch management and schedule

2010-06-18 Thread Tom Keiser
On Thu, Jun 17, 2010 at 1:43 PM, Russ Allbery r...@stanford.edu wrote:
 Christopher D. Clausen cclau...@acm.org writes:
 Rainer Toebbicke r...@pclella.cern.ch wrote:

 No, of course not.

 It would be painful to have to put back the '--enable-fast-restart and
 --enable-bitmap-later' code if you removed them, probably dangerous. My
 plea is to keep them in as an alternative to the demand-attach
 file-server: with mandatory salvaging the non-demand-attach case is
 seriously impaired, hence disabling it is no real alternative.

 With the ambitious schedule for new releases I see this happening very
 quickly. I'd like to avoid having to stop at a particular release next
 year because of a functionality that we manage to live without, and
 miss others that we're interested in.

 I agree with Rainer on this.

 Chris, to check, are you currently using --enable-fast-restart or
 --enable-bitmap-later?

 Please understand that neither of those options are recommended now,
 whether you have DAFS enabled or not.  I consider --enable-fast-restart in
 particular to be dangerous and likely to cause or propagate file
 corruption and would not feel comfortable ever running it in production.
 I know that some people are using the existing implementation and taking
 their chances, and if they're expert AFS administrators and know what
 they're risking, that's fine, but, as I understand it, it's pretty much
 equivalent to disabling fsck and journaling on your file systems after
 crashes and just trusting that there won't be any damage or that, if there
 is, you'll fsck when you notice it.


I'll note that bitmap-later is also dangerous--it has several known
race conditions (e.g. VFreeBitmapEntry_r is just plain wrong;
GetBitmap() relies upon microarchitectural store ordering rules that
no modern processor guarantees, ...).  These can result in various
classes of corruption, from vnodes that fail to be freed until salvage,
to multiple allocations of the same vnode.
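To illustrate the class of bug (a generic publish/consume sketch, not
the actual GetBitmap()/VFreeBitmapEntry_r code): publishing a "ready"
flag with plain or relaxed stores assumes the earlier bitmap store
becomes visible first, which nothing guarantees without a release store
or barrier.

    #include <stdatomic.h>
    #include <stdint.h>

    static uint32_t   bitmap_word;     /* fragment of an allocation bitmap */
    static atomic_int bitmap_ready;    /* publication flag */

    void publish_buggy(void)
    {
        bitmap_word = 0x1;             /* mark vnode slot 0 allocated */
        /* relaxed store: another CPU may observe the flag before the
         * bitmap write, i.e. a stale bitmap, which is how the same vnode
         * slot can end up handed out twice */
        atomic_store_explicit(&bitmap_ready, 1, memory_order_relaxed);
    }

    void publish_fixed(void)
    {
        bitmap_word = 0x1;
        /* release store: the bitmap write is ordered before the flag
         * (readers must pair this with an acquire load) */
        atomic_store_explicit(&bitmap_ready, 1, memory_order_release);
    }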

-Tom


Re: [OpenAFS] Re: Max number of files in a volume

2010-04-26 Thread Tom Keiser
On Mon, Apr 26, 2010 at 10:58 AM, Rich Sudlow r...@nd.edu wrote:
 Andrew Deason wrote:

 On Mon, 26 Apr 2010 10:14:01 -0400
 Rich Sudlow r...@nd.edu wrote:

 I'm having problems with a volume going off-line and not
 coming back with Salvage - what is the maximum number
 of files per volume? I believe the volume in question
 has over 20 million.

 Looks like there were actually 30 million files.


Hi Rich,

On most platforms we build the salvager as a 32-bit binary (excluding
certain 64-bit linux platforms where the platform maintainers decided
to simplify things by making everything a 64-bit binary).  One
operation that the salvager performs is to build an in-memory index of
critical details for every vnode in the volume [see SalvageIndex() in
src/vol/vol-salvage.c].  Each entry in this array requires 56 bytes in
a 32-bit process, which comes out to 1602MB of virtual memory for 30
million files.  Likewise, we require 56 bytes per directory vnode,
which for 30 million files requires a minimum of ~462 directories, and
thus only an additional ~26KB of heap.  My suspicion is that your salvager is
core dumping because the heap and the stack have grown into each
other.  Depending on the hardware, it may be possible to build a
custom 64-bit salvager to work around this issue.

The first step here is to figure out whether your salvager binary is
32-bit or 64-bit; the output of file /usr/afs/bin/salvager should be
sufficient.
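For reference, the back-of-the-envelope arithmetic (using the 56-byte
entry size above; nothing here is OpenAFS-specific):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long files = 30000000ULL;  /* vnodes in the volume */
        unsigned long long entry = 56ULL;        /* bytes per index entry */

        /* ~1602 MiB -- a large fraction of the usable address space of a
         * 32-bit process, before any other heap or stack usage */
        printf("vnode index: %llu MiB\n", files * entry / (1024ULL * 1024ULL));
        return 0;
    }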

Cheers,

-Tom


Re: [OpenAFS] Performance issue with many volumes in a single /vicep?

2010-03-25 Thread Tom Keiser
On Wed, Mar 24, 2010 at 4:44 PM, Steve Simmons s...@umich.edu wrote:
 On Mar 24, 2010, at 4:38 PM, Russ Allbery wrote:

 Steve Simmons s...@umich.edu writes:

 Our estimate too. But before drilling down, it seemed worth checking if
 anyone else has a similar server - ext3 with 14,000 or more volumes in a
 single vice partition - and has seen a difference. Note, tho, that it's
 not #inodes or total disk usage in the partition. The servers that
 exhibited the problem had a large number of mostly empty volumes.

 That's a *lot* of volumes from our perspective.  The biggest partition
 we've got has about 7000 volumes on it.  It must be really fun when you
 have to restart that file server and reattach volumes.

 Nightmare is a better word. Fortunately very recent 1.4 releases have gotten 
 a lot faster on that front. It's also another reason why we're desperately 
 trying to carve out time so we can test dynamic attach, but that's grist for 
 another thread.


If your group (or anyone else on this list, for that matter) can find
the time, please, please test DAFS.  Any feedback whatsoever would be
helpful and deeply appreciated.  In the unlikely event that problems
ensue, then by all means open bugs, start a discussion on -devel,
contact me or Andrew Deason, etc.  Getting a 1.6 release out the door is
a high priority for all of us, and to some extent that is going to be
predicated on DAFS success stories.

As it stands, we believe the DAFS architecture shipping in 1.5.x will
provide a significant speedup for all moderate-to-large namei
fileserver deployments.  However, the true proof will be in the
pudding, and this is where we need the help of the community.  If
there are unforeseen corner cases where DAFS causes a regression, we
need to know about them ASAP.

-Tom


Re: [OpenAFS] Performance issue with many volumes in a single /vicep?

2010-03-24 Thread Tom Keiser
On Wed, Mar 24, 2010 at 4:32 PM, Steve Simmons s...@umich.edu wrote:

 On Mar 18, 2010, at 2:37 AM, Tom Keiser wrote:

 On Wed, Mar 17, 2010 at 7:41 PM, Derrick Brashear sha...@gmail.com wrote:
 On Wed, Mar 17, 2010 at 12:50 PM, Steve Simmons s...@umich.edu wrote:
 We've been seeing issues for a while that seem to relate to the number of 
 volumes in a single vice partition. The numbers and data are inexact 
 because there are so many damned possible parameters that affect 
 performance, but it appears that somewhere between 10,000 and 14,000 
 volumes performance falls off significantly. That 40% difference in volume 
 count results in 2x to 3x falloffs for performance in issues that affect 
 the /vicep as a whole - backupsys, nightly dumps, vos listvol, etc.


 First off, could you describe how you're measuring the performance drop-off?

 Wall clock, mostly. Operations which touch all the volumes on a server take 
 disproportionately longer on servers w/14,000 volumes vs servers with 10,000. 
 The best operations to show this are vos backupsys and our nightly dumps, 
 which call vos dump with various parameters on every volume on the server.


Ok.  Well, this likely rules out the volume hash chain suggestion--we
don't directly use the hash table in the volserver (although we do
perform at least two lookups as a consequence of performing fssync
ops as part of the volume transaction).  The reason I say it's
unlikely is that fssync overhead is an insignificant component of the
execution time for the vos ops you're talking about.


 The fact that this relationship b/t volumes and performance is
 superlinear makes me think you're exceeding a magic boundary (e.g
 you're now causing eviction pressure on some cache where you weren't
 previously...).

 Our estimate too. But before drilling down, it seemed worth checking if 
 anyone else has a similar server - ext3 with 14,000 or more volumes in a 
 single vice partition - and has seen a difference. Note, tho, that it's not 
 #inodes or total disk usage in the partition. The servers that exhibited the 
 problem had a large number of mostly empty volumes.


Sure.  Makes sense.   The one thing that does come to mind is that
regardless of the number of inodes, ISTR some people were having
trouble with ext performance when htree indices were turned on because
spatial locality of reference against the inode tables goes way down
when you process files in the order returned by readdir(), since
readdir() in htree mode returns files in hash chain order rather than
more-or-less inode order.  This could definitely have a huge impact on
the salvager [especially GetVolumeSummary(), and to a lesser extent
ListViceInodes() and friends].  I'm less certain how it would affect
things in the volserver, but it would certainly have an effect on
operations which delete clones, since the nuke code also calls
ListViceInodes().
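To make the htree effect concrete: readdir() on an htree directory hands
entries back in hash order, so processing them in that order hops all
over the inode tables.  Sorting by d_ino first (a common workaround,
sketched below; not something the salvager currently does, as far as I
know) restores rough inode-table locality:

    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    struct ent { ino_t ino; char name[256]; };

    static int by_ino(const void *a, const void *b)
    {
        const struct ent *x = a, *y = b;
        return (x->ino > y->ino) - (x->ino < y->ino);
    }

    int main(int argc, char **argv)
    {
        DIR *d = opendir(argc > 1 ? argv[1] : ".");
        struct ent *v = NULL;
        size_t n = 0;
        struct dirent *de;

        if (!d)
            return 1;
        while ((de = readdir(d)) != NULL) {      /* hash-chain order on htree */
            v = realloc(v, (n + 1) * sizeof(*v));
            v[n].ino = de->d_ino;
            snprintf(v[n].name, sizeof(v[n].name), "%s", de->d_name);
            n++;
        }
        closedir(d);
        qsort(v, n, sizeof(*v), by_ino);         /* back to ~inode order */
        for (size_t i = 0; i < n; i++)
            printf("%llu\t%s\n", (unsigned long long)v[i].ino, v[i].name);
        free(v);
        return 0;
    }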

In addition, with regard to ext htree indices I'll pose the
(completely untested) hypothesis that htree indices aren't necessarily
a net win for the namei workload.  Given that namei goes to great lengths
to avoid large directories (with the notable exception of the /vicepXX
root dir itself), it is conceivable that htree overhead is actually a
net loss.  I don't know for sure, but I'd say it's worth doing further
study.  In a volume with files >> dirs you're going to see on the order
of ~256 files per namei directory.  Certainly a linear search of on
average 128 entries is expensive, but it may be worth verifying this
empirically because we don't know how much overhead htree and its
side-effects produce.  Regrettably, there don't seem to be any
published results on the threshold above which htree becomes a net
win...

Finally, you did run tune2fs -O dir_index on the device before populating the file
system, right?
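If you're not sure whether the feature is enabled, something along these
lines will tell you (the device name is a placeholder):

    tune2fs -l /dev/sdXN | grep dir_index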


 Another possibility accounting for the superlinearity, which would
 very much depend upon your workload, is that by virtue of increased
 volume count you're now experiencing higher volume operation
 concurrency, thus causing higher rates of partition lock contention.
 However, this would be very specific to the volume server and
 salvager--it should not have any substantial effect on the file
 server, aside from some increased VOL_LOCK contention...

 Salvager is not involved, or at least, hasn't yet been involved. It's vos 
 backupsys and vos dump where we see it mostly.


What I was trying to say is that if the observed performance regression
involves either the volserver or the salvager, then it could involve
partition lock contention.  However, this will only come into play if
you're running a lot of vos jobs in parallel against the same vice
partition...

Cheers,

-Tom


Re: [OpenAFS] Re: Performance issue with many volumes in a single /vicep?

2010-03-24 Thread Tom Keiser
On Thu, Mar 25, 2010 at 12:39 AM, Andrew Deason adea...@sinenomine.net wrote:
 On Wed, 24 Mar 2010 23:43:32 -0400
 Tom Keiser tkei...@sinenomine.net wrote:

 What I was trying to say is if the observed performance regression
 involves either the volserver, or the salvager, then it could involve
 partition lock contention.  However, this will only come into play if
 you're running a lot of vos jobs in parallel against the same vice
 partition...

 The volserver won't cause partition lock contention with itself. If
 another thread already holds the partition lock, other threads won't
 wait for it, iirc. (At least, in 1.4)


Heh.  I should have quit while I was ahead.  Yes, what Andrew said.
These lock contention problems really only manifested themselves on
older DAFS builds where we saw lots of demand salvage procs contending
against volserver transactions for the partition lock.

-Tom


Re: [OpenAFS] Performance issue with many volumes in a single /vicep?

2010-03-18 Thread Tom Keiser
On Wed, Mar 17, 2010 at 7:41 PM, Derrick Brashear sha...@gmail.com wrote:
 On Wed, Mar 17, 2010 at 12:50 PM, Steve Simmons s...@umich.edu wrote:
 We've been seeing issues for a while that seem to relate to the number of 
 volumes in a single vice partition. The numbers and data are inexact because 
 there are so many damned possible parameters that affect performance, but it 
 appears that somewhere between 10,000 and 14,000 volumes performance falls 
 off significantly. That 40% difference in volume count results in 2x to 3x 
 falloffs for performance in issues that affect the /vicep as a whole - 
 backupsys, nightly dumps, vos listvol, etc.


First off, could you describe how you're measuring the performance drop-off?

The fact that this relationship b/t volumes and performance is
superlinear makes me think you're exceeding a magic boundary (e.g.
you're now causing eviction pressure on some cache where you weren't
previously...).

Another possibility accounting for the superlinearity, which would
very much depend upon your workload, is that by virtue of increased
volume count you're now experiencing higher volume operation
concurrency, thus causing higher rates of partition lock contention.
However, this would be very specific to the volume server and
salvager--it should not have any substantial effect on the file
server, aside from some increased VOL_LOCK contention...


 My initial inclination is to say it's a linux issue with directory searches, 
 but before pursuing this much further I'd be interested in hearing from 
 anyone who's running 14,000 or more volumes in a single vicep. No, I'm not 
 counting .backup volumes in there, so 14,000 volumes means 28,000 entries in 
 the directory.

 Another possibility: there's a hash table which is taking the bulk of
 that that you then search linearly.

Hmm.  That does sound plausible.  Although, it seems like that
generally shouldn't result in superlinear performance changes
(ignoring interaction effects between the data structure and the
memory hierarchy); it would almost have to imply that the additional
4,000 volumes have special properties with respect to the hash
function.

-Tom


Re: [OpenAFS] Limit of clones

2010-02-09 Thread Tom Keiser
On Mon, Nov 2, 2009 at 5:14 PM, Steve Simmons s...@umich.edu wrote:

 On Oct 31, 2009, at 3:42 PM, Derrick Brashear wrote:

 On Sat, Oct 31, 2009 at 11:08 AM, Anders Magnusson ra...@ltu.se wrote:

 The manpage for vos clone says there are a maximum of 7 clones using the
 namei fileserver.
 What is the reason for this limitation?

 The implementation uses only 3 bits (1 + 2 + 4 = 7)

 Given that in a classic fileserver, RW, RO, BK, temporary clone = 4,
 this wasn't really a problem.

 We've experimentally verified that you can manually create another three
 clones and all AFS operations continue to work fine.


There is an important caveat: the minute you have >= 6 clones, there
MUST be (uniq,DV) overlap for every vnode within the VG.  Otherwise,
namei_GetFreeTag() will fail during CopyOnWrite() ops due to running
out of tags (the on-disk representation of the VG tag map is one row
per vnode id, with 5x 3-bit columns per row, where each 3-bit column
is the reference count for a tag; each unique (uniq,DV) tuple maps 1:1
onto a tag).  Hence, the practical limit on clones is five; six or
seven are only ok in a probabilistic sense.  On the other extreme, the
tag map could _technically_ allow up to 35 clones within a VG, so long
as you have at most five extant (uniq,DV) tuples for every vnode, each
referenced by at most seven volumes within the VG.
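A sketch of the row layout just described, assuming the five 3-bit
refcounts are packed into a 16-bit row (the names here are illustrative,
not the actual namei code):

    #include <stdint.h>

    #define TAGS_PER_ROW  5
    #define BITS_PER_TAG  3
    #define TAG_MASK      0x7            /* max refcount of 7 per tag */

    static unsigned tag_refcount(uint16_t row, int tag)
    {
        return (row >> (tag * BITS_PER_TAG)) & TAG_MASK;
    }

    /* A CopyOnWrite needs an unused tag, i.e. some column with refcount 0.
     * With six or more clones and no (uniq,DV) overlap for a given vnode,
     * all five columns are already in use and there is nothing left to
     * hand out -- the failure mode described above. */
    static int find_free_tag(uint16_t row)
    {
        for (int tag = 0; tag < TAGS_PER_ROW; tag++)
            if (tag_refcount(row, tag) == 0)
                return tag;
        return -1;
    }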

Cheers,

-Tom


Re: [OpenAFS] New Cell setup - ideas?

2010-01-27 Thread Tom Keiser
On Wed, Jan 27, 2010 at 3:22 AM, Lars Schimmer l.schim...@cgv.tugraz.at wrote:
 - no single user (person) should be identifiable by the sharing
 organization as having accessed the data (seeing which department
 accessed it is fine, but not which individuals within that department)


The AFS-3 security model _cannot_ satisfy this anonymization
requirement.  With the current security model, each file server must
know the identity of the caller in order to perform RPC authorization.

I suppose you could give them file server binaries with auditing
support disabled and callback table dump support disabled, and then hope
that the satellite site admins don't know enough about AFS to dissect
rxkad clear packets, file server cores, or use cmdebug to make
educated inferences.  But then again, if they know enough to do any of
that, then I suppose they also know that the KeyFile effectively gives
them full control over the entire distributed infrastructure.

Cheers,

-Tom


Re: [OpenAFS] All limitations of OpenAFS

2010-01-26 Thread Tom Keiser
On Tue, Jan 26, 2010 at 7:49 AM, Lars Schimmer l.schim...@cgv.tugraz.at wrote:

 Hi!

 For a new EU project we want to set up an OpenAFS cell.
 As we could hit some limitations of OpenAFS, I would like to collect all
 known limitations of current OpenAFS systems (1.5.70 on windows and
 1.4.11 on linux that is).
 I know of:

 - one RW copy of a volume
 - max 6 RO copies from one volume

Are you referring to the number of repsites that can be represented in
the vldb, or the number of distinct clones supported within a volume
group?

repsites:  IIRC, vldb version 4 supports 13 repsites (so a max of 12
RO sites w/o BK; 11 w/ a BK repsite)

clones:  This depends on whether your backing store is namei or inode.
 For namei, the safe limit is five volumes within a volume group.
Thus, in the typical case you get one RW and four clones.  Note that
you can go beyond this limit (IIRC the true limit is 35), but there
must be DV overlap for every vnode (or clone operations will fail) --
each vnode (irrespective of differences in uniquifier) on namei can
have at most five divergent versions at any given point in time.

Cheers,

-Tom


Re: [OpenAFS] Openafs failover

2009-09-02 Thread Tom Keiser
On Mon, Jun 8, 2009 at 7:23 PM, Harald Barth h...@kth.se wrote:

 In the case server1 went down, server2:
 1. would mount vicepa ...
 2. would take over address 10.0.0.1
 3. finally would restart the vlserver volserver and fs processes.

 You have missed what to do with the outstanding callbacks that server1
 is holding (in memory). When server1 does shut down nicely, these are
 handled (clients notified) during the shutdown. If server1 crashes,

Hi Harald,

I know a lot of us have said this over the years (I'm pretty sure I'm
guilty as well), but it's not entirely accurate.  Yes, coherence is
maintained across crashes/restarts by sending one of the
InitCallBackState family of RPCs to the client.  However, the key
point is it happens _after_ the new fileserver process starts up, and
when the cm next makes contact.  When we walk the host hash table, we
fail to find a host entry, and thus perform initialization of a new
host object, host cps data, etc.  This process forces the client to
invalidate its status entries, and thus results in a new round of
FetchStatus RPCs.  The net result is 2-node active/passive failover
clusters can be equivalent to standalone fileservers in terms of cache
coherence (assuming proper Net{Info,Restrict} and rxbind
configuration).


 these are lost, so clients could in this case continue to use an
 outdated copy in cache. If I remember correctly, there has been work
 for the 1.5.x server series to write down callback information
 (continuously) to the /vicepX. That could then be used by a starting

Storing continuously is an excellent end-goal.  Unfortunately, we're
not there yet.  What we have at present (with dafs) is a mechanism to
serialize an atomic snapshot to disk.  Unfortunately, the current
implementation does not lend itself to continuous dumping.  In order
to achieve atomicity we quiesce all Rx worker threads and hold H_LOCK
across the entire operation.  Furthermore, the fsstate.dat on-disk
format is optimized for serialization, not random access.

Continuous dumping is complicated from a number of perspectives.
First of all, we'd likely want tunable consistency modes.  Secondly,
there's the question of whether extended callback data should be
serialized or not (at present, dafs+osi+xcb does not dump xcb data; it
would not be particularly hard to add support in future).  Lastly,
there is the pertinent question of where to store the data.  If/when
partition uuid extensions become supported, the issue becomes
significantly more complicated because we will likely want the host
package data to be replicated across every partition in order to
support partition-level load balancing (which is further complicated
by the existence of unmounted, unsynchronized, out-of-date clones).

-Tom


--
Tom Keiser
tkei...@sinenomine.net


Re: [OpenAFS] Automatic move of volumes

2007-10-25 Thread Tom Keiser
On 10/24/07, Steven Jenkins [EMAIL PROTECTED] wrote:
 On 10/24/07, Derrick Brashear [EMAIL PROTECTED] wrote:
  
   It has _everything_ to do with namespace management.  In absence of
   better tools, people are using vos release to do just that.  Note that
   vos release isn't a bad tool; it's just being stretched beyond its
   design because people need a way to do versioning of their namespaces.
 
  you want to dump and restore volumes. that's ugly. it's not a namespace
  issue; you want versioned volume clones.
 

 dump/restore is just a mechanism in lieu of a volume copy operation.
 Versionized clones could be interesting in this context, but I would
 prefer to stay away from that approach as it makes it harder to
 recover and see changes in the base volume.  I think having one RW per
 'generation' of ROs is reasonable.


I agree that having one RW per generation is very useful.  Ideally,
I'd like to see a volume group become a more flexible container which
contains an arbitary inheritance tree structure.  For instance, it
would be useful to allow creation of addtional forked RW volumes
within a volume group.  This would, in effect, give us the equivalent
of a CVS branch tag for a volume.  Obviously, the existing model where
a volume name is a tightly coupled 1:1 mapping onto the volume group
is a limitation which would need to be lifted.  But, there are other
reasons why making the name map more flexible would be beneficial
(e.g. providing useful names for infinite backup clones, etc.)
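Purely as a strawman (nothing like this exists in OpenAFS today), the
sort of structure I have in mind looks roughly like:

    #include <stdint.h>

    typedef uint32_t VolumeId;

    /* Hypothetical per-volume node in a VG inheritance tree: each volume
     * records which clone it was forked from, and names hang off nodes
     * rather than off the volume group as a whole. */
    struct vg_node {
        VolumeId        volid;          /* this volume */
        int             is_rw;          /* forked RW branch vs. read-only clone */
        const char     *name;           /* per-node name, e.g. "proj.dev" */
        struct vg_node *parent;         /* clone ancestry (NULL at the root RW) */
        struct vg_node *first_child;    /* oldest fork/clone taken from this node */
        struct vg_node *next_sibling;   /* other forks/clones of the same parent */
    };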


 With versionized clones, you would need to create a mechanism to have
 potentially infinite numbers of clones, with arbitrary generation
 identifiers (eg, some would be ok with '1', '2', ..., but some would
 want 'alpha', 'beta', ..., or 'dev', 'prod', etc).  IMO, that's better
 done outside of the volume itself.


Not sure I agree.  Providing this type of metadata in the volume
management system itself has value (provided it's done in a typeful,
standardized manner).  For instance, if we ever integrate an automated
volume migration/balancing system, we will want to access this type of
information to prioritize where certain volumes are stored.

-Tom


Re: [OpenAFS] File server not parallelized on restart?

2007-10-18 Thread Tom Keiser
On 10/17/07, Derrick Brashear [EMAIL PROTECTED] wrote:


 On 10/17/07, Steve Simmons [EMAIL PROTECTED] wrote:
  We had an AFS hang today (more detail after we complete the post-
  mortem). It required doing a hard reboot on the server. On reboot, it
  began salvaging the two partitions in parallel as normal. When the
  salvages completed, it started attaching the partitions sequentially.
  Here are the relevant times and events from the log. The last 4 in
  the sequence look funny to me:
 
  13:30:40 /vicepa salvage started
  13:30:40 /vicepb salvage started
  14:23:07 /vicepb salvage completed
  14:35:59 /vicepa salvage completed
  14:36:01 fs starts attaching /vicepb volumes
  14:50:16 fs finishes attaching /vicepb volumes
  14:50:16 fs starts attaching /vicepa volumes
 
  Should it have started attaching /vicepa volumes as soon as that
  salvage completed, or am I laboring under a misconception here?
 

The mode of operation is basically whole-partition-salvager XOR
fileserver+volserver.  In order to guarantee mutually exclusive
access, the bosserver won't start the fileserver and volserver until
the salvager has exited.

  Advance thanks,

 nope, it's serial unless you have 1.5, with -vattachpar set, and will do
 them in reverse in some versions due to a minor bug since fixed.


Parallel volume attachment support ships with 1.3.83 and above.
Parallel shutdown requires DAFS.  As Derrick mentioned, -vattachpar
controls parallelization of startup and shutdown in the volume
package.  Unless set explicitly, -vattachpar has a value of 1, thus
providing the classic single-threaded behavior by default.  The
single-threaded partition attachment ordering fix was committed in
time for 1.4.4.
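For anyone who wants to experiment, the flag goes on the fileserver line
of the fs bnode; the BosConfig entry would look something like the
following (the thread count of 8 is an arbitrary example), followed by a
bos restart of the instance:

    bnode fs fs 1
    parm /usr/afs/bin/fileserver -L -vattachpar 8
    parm /usr/afs/bin/volserver
    parm /usr/afs/bin/salvager
    end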

Regards,

-Tom


Re: [OpenAFS] compile error on AIX 5.3 - softsig.c

2007-06-07 Thread Tom Keiser

On 6/7/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

Get this error while compiling on AIX 5.3.  Maybe I gave
more information than needed, and any help will be
appreciated.  Thanks ahead of time.

 /openafs/openafs-1.4.4/src/pinstall/pinstall libafsauthent.a /openafs/openafs-1.4.4/lib/libafsauthent.a
Target "all" is up to date.
 case rs_aix53 in alpha_dux*|sgi_*|sun*_5*|rs_aix*|*linux*|hp_ux11*|ia64_hpux*|*[of]bsd*|*nbsd[234]*) cd src && cd tviced && make all ;; *_darwin_[1-6][0-9]) echo "Not building MT viced for rs_aix53" ;; *_darwin_*) cd src && cd tviced && make all ;; *) echo "Not building MT viced for rs_aix53" ;; esac
 xlc_r4  -O -I/openafs/openafs-1.4.4/src/config


Um, why did you set MT_CC to xlc_r4?  We don't support DCE (aka draft
4) threads.

--
Tom Keiser
[EMAIL PROTECTED]


Re: [OpenAFS] Openafs 1.4.2 on Debian Etch kernel 2.6.18 slow

2007-02-11 Thread Tom Keiser

I've got a bunch of questions.  Even if you only have time to answer a
few of them, it will help us to narrow down the root cause.

On 2/10/07, Derek Harkness [EMAIL PROTECTED] wrote:

I'm attempting to deploy/update a new AFS fileserver.  The new server is the
first to be upgraded from Debian sarge, OpenAFS 1.3.xx, kernel 2.4 to Etch,
2.6.18, AFS 1.4.2, reiserfs and a new 7 terabyte XRaid.

The upgrade went fine except that file writes to the new system are so slow the
system is unusable.  On the server iostat shows a transfer rate of ~40KB/s
and an iowait of 20 during AFS operations.  If I stop the fileserver and


First and foremost, do local volume package operations (e.g. the
salvager, vos backup, fileserver startup/shutdown, etc) run slowly, or
is it only stuff that involves Rx?  What about vos dump foo localhost
on the ailing fileserver?  The fact that iowait is going through the
roof may be indicative of an io subsystem problem, so eliminating
network/Rx problems at the top of the decision tree will be useful.

I'm not familiar with the Linux iostat utility, but if it supports
per-disk stats similar to the -x option on Solaris, or the -D
option on AIX, then please post some data while the problem is
occurring.



perform io directly on the XRaid I can read and write between
100MB/s-500MB/s.



A single Fibre Channel port (excepting 10Gb E-ports) can't transmit
500MB/s.  From what I've heard, Apple's FC RAID products only provide
a single 2Gb SFP per controller, and don't support FC multipathing.
So, you're limited to a theoretical maximum of ~203MB/s (less in AL mode).
Thus, I'm guessing your tests are, at least in some cases, only
stressing the page cache, rather than anything across the fabric (for
that matter, is there a fabric?).  In order to declare the storage
subsystem OK, we need to be sure you've tested every layer of the
storage stack.

Please tell us specifically what you did to verify direct io.  For example:

* Were you running some well-known benchmark suite?  If so, what
options did you pass?
* Did it involve one file or many?
* Were any fsync()s issued?
* Did it modify any filesystem metadata, or only file data?
* Was it single threaded or multi-threaded?
* How much data was read/written?
* How big were the files involved?
* Did you do anything to mitigate/bypass caching?

Other questions that might be useful:

* How deep are the tagged command queues for the xserve lun(s)?
* Do all the disks pass surface scans?
* Are the disks and/or controllers reporting SMART events?
* If this stuff is fabric attached, have you looked at port error
counts, port performance data, etc?



Does anyone have any suggestions on how I might troubleshoot this problem?
So far I've checked network performance, io performance directly to the
XRaid, and the reiserfs filesystem.  It all seems to be pointing me back to


How have you verified that network performance is ok?  What are the
ethernet port error counts like?  What are the packet retransmit rates
like?

I don't know much of anything about apple's storage line, but if they
have any sort of performance analysis and/or problem determination
tools, what do they say?


the same problem: the AFS fileserver.

Hardware:
HP DL380
2x2.8ghz Hyperthreaded Xeon CPU
4 Gigs of RAM
Gigabit ethernet
MPTFusion fiber channel card
Apple XRaid

I've got 2 other identical boxes currently running AFS and working fine.  The only
difference is that the other boxes are running an older OS.



Are the machines running the older kernel still running 1.3.x?

Until we can better understand your testing methodology, I'd have to
say this could be a hardware problem, a kernel driver problem, an AFS
problem, or even a network problem.  We need more information to
narrow it down.

Regards,

--
Tom Keiser
[EMAIL PROTECTED]


Re: [OpenAFS] openafs in solaris10 containers

2006-12-04 Thread Tom Keiser

On 12/3/06, Matthew Cocker [EMAIL PROTECTED] wrote:

Anyone running the AFS client in a Solaris 10 container environment? I have seen
some references saying that you cannot run AFS in the child containers but
have to run it from the main container (I may have the Solaris terms
mixed up). Is this correct?



Many people are running AFS and containers in production.  You need to
run afsd in the global zone.  Use lofs mounts to import all or part of
the AFS namespace into the child (non-global) zones.  Importing all of /afs into a
zone just requires the following zonecfg stanza:

add fs
set type=lofs
set dir=/afs
set special=/afs
end

Use set options as you like.
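For completeness, applying the stanza above from the global zone looks
roughly like this (the zone name is just an example); the zone picks the
mount up on its next boot:

    zonecfg -z myzone
    zonecfg:myzone> add fs
    zonecfg:myzone:fs> set type=lofs
    zonecfg:myzone:fs> set special=/afs
    zonecfg:myzone:fs> set dir=/afs
    zonecfg:myzone:fs> end
    zonecfg:myzone> commit
    zonecfg:myzone> exit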

--
Tom Keiser
[EMAIL PROTECTED]


Re: [OpenAFS] Two afs cells in one server

2006-11-16 Thread Tom Keiser

On 11/16/06, Walter Lamagna [EMAIL PROTECTED] wrote:

Hi, I would like to know if it is possible to have two different AFS
cells on one server.  I guess that it is not possible since the command:



Without source modification, it's very easy to do this using Solaris 10 zones.

On other platforms, it's possible, provided you don't mind modifying
the fssync source code to use a different port.  Then, you'd need to
build a bunch of installations with different install prefixes.  In
this latter configuration, don't forget to make the necessary
NetInfo/NetRestrict files (each fileserver will require its own IP
address, as the VLDB only advertises a list of IPv4 addresses and not a
list of (address, port) tuples for each fileserver UUID).
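For reference, NetInfo and NetRestrict (under each instance's server
local directory; /usr/afs/local in a stock install, or the equivalent
under your alternate prefix) are just one IPv4 address per line.  Giving
the second instance a NetInfo containing only its own address, e.g.

    192.0.2.12

(placeholder address), is enough to restrict what it registers;
NetRestrict works the other way around, listing addresses to suppress.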

--
Tom Keiser
[EMAIL PROTECTED]


Re: [OpenAFS] Two afs cells in one server

2006-11-16 Thread Tom Keiser

On 11/16/06, Christof Hanke [EMAIL PROTECTED] wrote:

Derek Atkins wrote:
 Nope, a single server can only serve a single cell.
 You could run servers in vmware on different IP Addresses to
 get different cells.  The AFS Client assumes services are on
 specific ports.

 If your server is multi-homed you could bind different servers to
 different IP Addresses on the same machine, thereby splitting the
 cells by IP Address, but I don't know if any released OpenAFS code
 supports that.

Don't forget that in case of a fileserver, you need to determine which
data (/vicep*) is served by which fileserver. I doubt that this is
implemented, but I could be wrong.


There's an implementation of a config-file-based mechanism for vice
partition attaching (vptab), which is currently winnt-specific.  It wouldn't be
too much work to generalize this.

The alternative is to run each server within its own chroot environment.

--
Tom Keiser
[EMAIL PROTECTED]


Re: [OpenAFS-devel] Re: [OpenAFS] namei interface lockf buggy on Solaris (and probably HP-UX and AIX)

2006-09-11 Thread Tom Keiser

On 9/11/06, Jeffrey Hutzelman [EMAIL PROTECTED] wrote:



On Monday, September 11, 2006 12:45:40 PM -0400 Tom Keiser
[EMAIL PROTECTED] wrote:

 As it turns out, the way we use file locks in the volume package is
 quite broken.  The spec says that once a process closes *any* file
 descriptor, all fcntl locks held for that file are immediately
 destroyed.  This means that the pthread fileserver/volserver can have
 some interesting races given how the ih package fd cache allows
 multiple concurrent descriptors per inode handle.  I have sample code
 sitting around somewhere which demonstrates this fault.

Do we ever acquire file locks on files other than the link table?
If so, why?





Namei link counts are the only usage I'm aware of (aside from the
salvager and partition lockfiles, obviously).

-Tom


Re: [OpenAFS] namei interface lockf buggy on Solaris (and probably HP-UX and AIX)

2006-09-11 Thread Tom Keiser

I propose we move this discussion to -devel.

On 9/11/06, Rainer Toebbicke [EMAIL PROTECTED] wrote:

The namei interface uses file locking extensively, implemented using
lockf() on Solaris, AIX and HP-UX.

Unfortunately, lockf() locks and unlocks from the *current position* to
whatever the argument says (e.g. end of file); moving the file pointer in
between becomes a problem for the subsequent unlock!  The result is
that frequently locks aren't released, but replaced by partial locks
on the file data just moved over.


At least on AIX and Solaris, lockf() is nothing more than an
inflexible wrapper around fcntl() byte-range locks.  My vote is to
transition to fcntl (where we can explicitly pass in a base offset and
length).  This eliminates the call semantics change introduced by your
patch, and avoids the unnecessary syscall overhead.  I further
object because I'm working on a patch which will allow us to use
pread/pwrite on platforms which support it.  This will completely
eliminate fcntl(F_DUPFD,...) and lseek() overhead in the fd package,
so any new requirements on lseek() could erode the performance
improvement I'm seeing.  However, the real motivation for switching to
pread/pwrite is due to a fairly serious locking bug:

As it turns out, the way we use file locks in the volume package is
quite broken.  The spec says that once a process closes *any* file
descriptor, all fcntl locks held for that file are immediately
destroyed.  This means that the pthread fileserver/volserver can have
some interesting races given how the ih package fd cache allows
multiple concurrent descriptors per inode handle.  I have sample code
sitting around somewhere which demonstrates this fault.
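For anyone who wants to see the POSIX behavior for themselves, a
standalone illustration (not the sample code referred to above) could
look like this: the parent locks the file through one descriptor, and
the lock silently evaporates as soon as it closes a *different*
descriptor for the same file.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Child helper: try (non-blocking) to write-lock the whole file. */
    static int try_lock(const char *path)
    {
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int fd = open(path, O_RDWR);
        int r = fcntl(fd, F_SETLK, &fl);    /* 0 = acquired, -1 = still held */
        close(fd);
        return r;
    }

    static void probe(const char *path, const char *when)
    {
        int st;
        if (fork() == 0)
            _exit(try_lock(path) == 0);     /* 1 if the child stole the lock */
        wait(&st);
        printf("%s: other process %s the lock\n", when,
               WEXITSTATUS(st) ? "STOLE" : "was denied");
    }

    int main(void)
    {
        const char *path = "/tmp/lock-demo";
        int a = open(path, O_RDWR | O_CREAT, 0600);
        int b = open(path, O_RDWR);         /* second fd, same file (cf. the fd cache) */

        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        fcntl(a, F_SETLKW, &fl);            /* lock the whole file via fd 'a' */

        probe(path, "before close(b)");     /* denied: we still hold the lock */
        close(b);                           /* close the *other* descriptor... */
        probe(path, "after close(b)");      /* stolen: POSIX dropped our lock */
        return 0;
    }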

Regards,

--
Tom Keiser
[EMAIL PROTECTED]


Re: [OpenAFS] Crash on Solaris 10 update 2

2006-07-02 Thread Tom Keiser

On 7/2/06, Andrew Cobaugh [EMAIL PROTECTED] wrote:

I am running Solaris 10 update 2, fresh install. OpenAFS 1.4.1.  I
have /afs lofs mounted into several zones. If I try to rename a file
within AFS from inside any of the zones, the machine immediately dumps
core and reboots.

I get the following on console after the crash (this particular
instance was caused by Gallery v1 running under apache):
http://www.phys.psu.edu/~phalenor/console_output

I can also reproduce this by simply mv'ing a file in afs from within a zone.

I can provide stacktraces from the core file if necessary to help in
debugging this.

Has anyone else seen this issue?



I've repro'd it with lofs in global and child zones.  gafs_rename()
incorrectly assumes v_path is always non-null.  Sometime around snv_21
the vnode path cache code in the kernel was substantially modified.
These changes were subsequently pulled up into s10u2.  See RT 34774.
A patch is also available at:

/afs/dementia.org/user/tkeiser/openafs/patches/solaris-vnode-path-cache-20060702.diff

-Tom


Re: [OpenAFS] RPC failed, code=5377

2005-03-20 Thread Tom Keiser
Code 5377 is currently UNOTSYNC.  What does udebug say about your
vlservers?  You might have a quorum problem.
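For the archives, the quick way to check is to run udebug against the
vlserver port (7003) on each of your database servers, e.g. (hostname is
a placeholder):

    udebug db1.your.cell 7003

One of them should report that it is the sync site (with recovery state
1f, IIRC) and the others should agree on who that is; if no server
claims to be sync site, or they disagree, that's your quorum problem.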

Regards,

-- 
Tom Keiser
[EMAIL PROTECTED]


On Sun, 20 Mar 2005 16:35:53 +0100, Lars Schimmer
[EMAIL PROTECTED] wrote:
 
 Hiho!
 
 After restarting all fileservers, all nodes are up again and it works.
 But at least on one of the servers (kernel 2.6.10, 1.3.79) the FileLog prints 
 out:
 Sun Mar 20 16:05:43 2005 VL_RegisterAddrs rpc failed; will retry periodically
 (code=5377, err=0)
 Sun Mar 20 16:10:43 2005 VL_RegisterAddrs rpc failed; will retry periodically
 (code=5377, err=0)
 Sun Mar 20 16:15:43 2005 VL_RegisterAddrs rpc failed; will retry periodically
 (code=5377, err=0)
 Sun Mar 20 16:20:44 2005 VL_RegisterAddrs rpc failed; will retry periodically
 (code=5377, err=0)
 Sun Mar 20 16:25:44 2005 VL_RegisterAddrs rpc failed; will retry periodically
 (code=5377, err=0)
 Sun Mar 20 16:30:44 2005 VL_RegisterAddrs rpc failed; will retry periodically
 (code=5377, err=0)
 And so on and on and on...
 Google gave nothing.
 Does anyone have an explanation?
 
 Cya & Thx
 Lars
 --
 Technische Universität Braunschweig, Institut für Computergraphik
 Tel.: +49 531 391-2109    E-Mail: [EMAIL PROTECTED]
 PGP-Key-ID: 0xB87A0E03
 
