Re: [OpenAFS] Re: Recover old vice partition
On Tue, Oct 18, 2011 at 11:43 AM, Andrew Deason adea...@sinenomine.net wrote:

> On Tue, 18 Oct 2011 17:13:32 +0200 ProbaNet i...@probanet.it wrote:
>
>> Hello! We have an old vice partition (/vicepc) on an old backup hard disk, with data that is not currently in use. Is there a way to access that data? We would like to bring its volumes online, if possible. Thank you!
>
> How old is it? Does it contain a directory called AFSIDat? Do you know what platform the fileserver was that last used the partition? If there is an AFSIDat directory on it, you can probably just mount the partition on any namei fileserver, and you should be able to get at the data.

With one minor caveat: namei is not endian-agnostic. Thus, you'll have to read your vice partition on a machine of the same endianness as the one that originally wrote the data.

-Tom

___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] Overview? Linux filesystem choices
On Thu, Sep 30, 2010 at 12:02 PM, chas williams - CONTRACTOR c...@cmf.nrl.navy.mil wrote:

> On Thu, 30 Sep 2010 14:19:51 +0200 Stephan Wiesand stephan.wies...@desy.de wrote:
>
>> Hi Jeff,
>>
>> On Sep 29, 2010, at 22:18 , Jeffrey Altman wrote:
>>
>>> RAID is not a replacement for ZFS. RAID-Z3 protects against single-bit disk corruption errors that RAID cannot. Only ZFS stores a checksum of the data as part of each block and verifies it before delivering the data to the application. If the checksum fails and there are replicas, ZFS will read the data from another copy and fix up the damaged version. That is what makes ZFS so special and so valuable. If you have data that must be correct, you want ZFS.
>>
>> you're right, of course. This is a very desirable feature, and the main reason why I'd love to see ZFS become available on Linux. I disagree with the "RAID cannot provide this" statement, though. RAID-5 has the data to detect single-bit corruption, and RAID-6 even has the data to correct it. Alas, verifying/correcting data upon read is not a common feature. I know of just one vendor (DDN) actually providing it. It's a mystery to me why the others don't. Anyway, the next best option if ZFS is not available is to run parity checks on all your arrays regularly. Things do get awkward when errors show up, but at least you know. Both Linux MD RAID and the better hardware solutions offer this. From my experience, disks don't do this at random and do not develop such a fault during their lifespan, but some broken disks do it frequently from the beginning. NB: I have only ever observed this problem with SATA drives.
>
> raid5 really isn't quite the same as what jeff is describing about zfs. zfs apparently maintains multiple copies of the same block across different devices. if you had a single bit error in one of those blocks (as determined by some checksum apparently stored with the block), zfs will pick another block that is supposed to contain the same data. raid5 only corrects single bit errors.
> it can detect multiple bit errors. raid5 (to my knowledge) always verifies, even on reads, and can correct single bit errors. raid6 can correct two single bit

RAID-5 only provides a single parity bit per position. Unfortunately, this means that it can merely detect a single-bit parity error; it cannot correct the error, since there is insufficient information to determine which of the stripes is in error. RAID-6 is complicated because different implementations use different algorithms for the two orthogonal checksums. IIRC, all of them are able to detect two-bit errors, and some of them can correct a single-bit error.

-Tom
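The detect-vs-correct distinction Tom draws can be made concrete with a short Python sketch (illustrative only, not OpenAFS or md code): XOR parity can rebuild a block whose location is already known (a failed disk), and can flag that a stripe set is inconsistent, but a silent single-bit flip cannot be localized from the parity alone.

```python
from functools import reduce

def xor_parity(blocks):
    """XOR parity across equal-length data blocks, as RAID-5 computes it."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"\x0f\xf0", b"\xaa\x55", b"\x33\xcc"]
parity = xor_parity(data)

# Case 1: a known-failed block (an erasure) is recoverable -- XOR the
# surviving blocks with the parity to rebuild the missing one.
rebuilt = xor_parity([data[0], data[2], parity])
assert rebuilt == data[1]

# Case 2: a silent single-bit flip is detectable but not correctable.
corrupted = [data[0], bytes([data[1][0] ^ 0x01]) + data[1][1:], data[2]]
assert xor_parity(corrupted) != parity   # inconsistency detected...
# ...but the parity mismatch does not identify which block flipped:
# every block is an equally plausible culprit, so RAID-5 alone cannot
# repair silent corruption -- exactly the gap ZFS checksums close.
```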
Re: [OpenAFS-devel] Re: [OpenAFS] Re: 1.6 and post-1.6 OpenAFS branch management and schedule
On Fri, Jun 18, 2010 at 3:55 AM, Jeffrey Hutzelman jh...@cmu.edu wrote:

> --On Thursday, June 17, 2010 04:12:48 PM -0500 Andrew Deason adea...@sinenomine.net wrote:
>
>> On Thu, 17 Jun 2010 15:54:25 -0500 Andrew Deason adea...@sinenomine.net wrote:
>>
>>> And as has been mentioned elsewhere in the thread, you need to wait for the VG hierarchy summary scan to complete, no matter how fast salvaging is or how many you do in parallel. That involves reading the headers of all volumes on the partition, so it's not fast (but it is very fast if you're comparing it to the recovery time of a 1.4 unclean shutdown).
>>
>> Also, while I keep talking about this, what I haven't mentioned is that it may be solvable. Although I've never seen any code or even a complete plan for it yet, recording the VG hierarchy information on disk would obviate the need for this scan. Doing this would allow you to salvage essentially instantly in most cases, so you might be able to recover from an unclean shutdown and salvage hundreds of volumes in a few seconds.
>
> It's also worth noting that in a namei fileserver, each VG is actually wholly self-contained, so there is no reason in the world why you should have to scan every VG on the partition before you can start salvaging any of them. The salvage server design really should take this property into account, as it seems likely that some future backends may also have this property.

We _do_ treat each VG as a separate, concurrently-processed entity. The problem is that the on-disk format's VG membership data leaves much to be desired--all we have to work with is the parent's volume id in VolumeHeader_t (in other words, a forest of up-trees). Hence, given any arbitrary volume id, you end up performing an exhaustive search to determine the full membership set of a VG. This is why we wrote the VGC in the first place: so you only have to perform that exhaustive search once.
-Tom
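The "forest of up-trees" problem can be sketched in a few lines of Python (hypothetical volume ids and structures, not the real OpenAFS code): each volume header records only its parent's id, so finding every member of one volume group means inverting the parent pointers, which requires touching every header on the partition once -- exactly the scan that the volume group cache (VGC) memoizes.

```python
from collections import defaultdict

# Each volume header stores (volume_id, parent_id); a volume group (VG)
# is the set of volumes sharing a parent RW volume.  Ids are made up.
headers = [
    (536870912, 536870912),  # RW volume: its own parent
    (536870913, 536870912),  # RO clone
    (536870914, 536870912),  # BK clone
    (536871000, 536871000),  # another, unrelated RW volume
]

def vg_members(parent_id, headers):
    """Exhaustive search: must inspect EVERY header to resolve one VG."""
    return {vid for vid, pid in headers if pid == parent_id}

def build_vgc(headers):
    """One full scan builds a parent -> members map; subsequent VG
    lookups are O(1), which is the point of the VGC."""
    vgc = defaultdict(set)
    for vid, pid in headers:
        vgc[pid].add(vid)
    return vgc

vgc = build_vgc(headers)
assert vg_members(536870912, headers) == {536870912, 536870913, 536870914}
assert vgc[536870912] == {536870912, 536870913, 536870914}
```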
Re: [OpenAFS] Re: [OpenAFS-devel] 1.6 and post-1.6 OpenAFS branch management and schedule
On Thu, Jun 17, 2010 at 1:43 PM, Russ Allbery r...@stanford.edu wrote:

> Christopher D. Clausen cclau...@acm.org writes:
>
>> Rainer Toebbicke r...@pclella.cern.ch wrote:
>>
>>> No, of course not. It would be painful to have to put back the --enable-fast-restart and --enable-bitmap-later code if you removed them, probably dangerous. My plea is to keep them in as an alternative to the demand-attach fileserver: with mandatory salvaging, the non-demand-attach case is seriously impaired, hence disabling it is no real alternative. With the ambitious schedule for new releases I see this happening very quickly. I'd like to avoid having to stop at a particular release next year because of a functionality that we manage to live without, and miss others that we're interested in.
>>
>> I agree with Rainer on this.
>
> Chris, to check, are you currently using --enable-fast-restart or --enable-bitmap-later? Please understand that neither of those options is recommended now, whether you have DAFS enabled or not. I consider --enable-fast-restart in particular to be dangerous and likely to cause or propagate file corruption, and would not feel comfortable ever running it in production. I know that some people are using the existing implementation and taking their chances, and if they're expert AFS administrators and know what they're risking, that's fine. But, as I understand it, it's pretty much equivalent to disabling fsck and journaling on your file systems after crashes and just trusting that there won't be any damage--or that, if there is, you'll fsck when you notice it.

I'll note that bitmap-later is also dangerous--it has several known race conditions (e.g., VFreeBitmapEntry_r is just plain wrong, and GetBitmap() relies upon microarchitectural store-ordering rules that no modern processor guarantees). These can result in various classes of corruption, from vnodes that fail to be freed until salvage to multiple allocations of the same vnode.
-Tom
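The "multiple allocations of the same vnode" failure mode is an instance of a classic unsynchronized test-and-set race. The sketch below is a toy Python model of a free-vnode bitmap (not the actual GetBitmap()/VFreeBitmapEntry_r code): with the lock held, find-lowest-free-then-set is atomic and every allocation is unique; delete the lock and two threads can both observe the same slot as free and both claim it.

```python
import threading

class VnodeBitmap:
    """Toy free-vnode bitmap.  The lock makes find-free-then-set atomic;
    without it, duplicate allocations of the same slot become possible --
    the same class of bug as the bitmap-later races described above."""
    def __init__(self, size):
        self.bits = [False] * size
        self.lock = threading.Lock()

    def alloc(self):
        with self.lock:                  # remove this lock and the
            for i, used in enumerate(self.bits):
                if not used:             # test ...
                    self.bits[i] = True  # ... and set are no longer atomic
                    return i
            raise RuntimeError("bitmap full")

bm = VnodeBitmap(1000)
got = []

def worker():
    for _ in range(100):
        got.append(bm.alloc())

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With the lock held, all 1000 allocations are distinct slots.
assert len(got) == 1000 and len(set(got)) == 1000
```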
Re: [OpenAFS] Re: Max number of files in a volume
On Mon, Apr 26, 2010 at 10:58 AM, Rich Sudlow r...@nd.edu wrote:

> Andrew Deason wrote:
>
>> On Mon, 26 Apr 2010 10:14:01 -0400 Rich Sudlow r...@nd.edu wrote:
>>
>>> I'm having problems with a volume going off-line and not coming back with Salvage - what is the maximum number of files per volume? I believe the volume in question has over 20 million.
>
> Looks like there were actually 30 million files.

Hi Rich,

On most platforms we build the salvager as a 32-bit binary (excluding certain 64-bit Linux platforms where the platform maintainers decided to simplify things by making everything a 64-bit binary). One operation that the salvager performs is to build an in-memory index of critical details for every vnode in the volume [see SalvageIndex() in src/vol/vol-salvage.c]. Each entry in this array requires 56 bytes in a 32-bit process, which comes out to 1602MB of virtual memory for 30 million files. Likewise, we require 56 bytes per directory vnode; 30 million files require a minimum of ~462 directories, and thus an additional ~26KB of heap. My suspicion is that your salvager is core dumping because the heap and the stack have grown into each other.

Depending on the hardware, it may be possible to build a custom 64-bit salvager to work around this issue. The first step here is to figure out whether your salvager binary is 32-bit or 64-bit; the output of "file /usr/afs/bin/salvager" should be sufficient.

Cheers,

-Tom
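Tom's heap arithmetic can be checked in a few lines. The sketch below assumes, as the thread states, 56 bytes per vnode-index entry and a 32-bit salvager; the ~65,000-entries-per-directory figure is my assumption about the AFS directory capacity used to derive the minimum directory count.

```python
# Back-of-the-envelope check of the salvager heap estimate.
ENTRY_BYTES = 56            # per-vnode index entry in a 32-bit process
files = 30_000_000

file_vnode_mem = files * ENTRY_BYTES
assert round(file_vnode_mem / 2**20) == 1602     # ~1602 MB of index

# Assumed: an AFS directory holds at most ~65,000 entries, so 30M
# files need at least this many directory vnodes.
min_dirs = -(-files // 65_000)                   # ceiling division
assert min_dirs == 462
assert min_dirs * ENTRY_BYTES < 32 * 1024        # only ~26 KB more

# ~1.6 GB of heap leaves very little of a 32-bit process's 2-4 GB
# address space once stacks, shared libraries, and buffers are mapped,
# which is consistent with heap/stack collision as the crash cause.
```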
Re: [OpenAFS] Performance issue with many volumes in a single /vicep?
On Wed, Mar 24, 2010 at 4:44 PM, Steve Simmons s...@umich.edu wrote:

> On Mar 24, 2010, at 4:38 PM, Russ Allbery wrote:
>
>> Steve Simmons s...@umich.edu writes:
>>
>>> Our estimate too. But before drilling down, it seemed worth checking if anyone else has a similar server - ext3 with 14,000 or more volumes in a single vice partition - and has seen a difference. Note, though, that it's not #inodes or total disk usage in the partition. The servers that exhibited the problem had a large number of mostly empty volumes.
>>
>> That's a *lot* of volumes from our perspective. The biggest partition we've got has about 7000 volumes on it. It must be really fun when you have to restart that file server and reattach volumes.
>
> Nightmare is a better word. Fortunately, very recent 1.4 releases have gotten a lot faster on that front. It's also another reason why we're desperately trying to carve out time so we can test dynamic attach, but that's grist for another thread.

If your group (or anyone else on this list, for that matter) can find the time, please, please test DAFS. Any feedback whatsoever would be helpful and deeply appreciated. In the unlikely event that problems ensue, then by all means open bugs, start a discussion on -devel, contact me or Deason, etc. Getting a 1.6 release out the door is a high priority for all of us, and to some extent that is going to be predicated on DAFS success stories.

As it stands, we believe the DAFS architecture shipping in 1.5.x will provide a significant speedup for all moderate-to-large namei fileserver deployments. However, the true proof will be in the pudding, and this is where we need the help of the community. If there are unforeseen corner cases where DAFS causes a regression, we need to know about them ASAP.

-Tom
Re: [OpenAFS] Performance issue with many volumes in a single /vicep?
On Wed, Mar 24, 2010 at 4:32 PM, Steve Simmons s...@umich.edu wrote:

> On Mar 18, 2010, at 2:37 AM, Tom Keiser wrote:
>
>> On Wed, Mar 17, 2010 at 7:41 PM, Derrick Brashear sha...@gmail.com wrote:
>>
>>> On Wed, Mar 17, 2010 at 12:50 PM, Steve Simmons s...@umich.edu wrote:
>>>
>>>> We've been seeing issues for a while that seem to relate to the number of volumes in a single vice partition. The numbers and data are inexact because there are so many damned possible parameters that affect performance, but it appears that somewhere between 10,000 and 14,000 volumes performance falls off significantly. That 40% difference in volume count results in 2x to 3x falloffs for performance in issues that affect the /vicep as a whole - backupsys, nightly dumps, vos listvol, etc.
>>
>> First off, could you describe how you're measuring the performance drop-off?
>
> Wall clock, mostly. Operations which touch all the volumes on a server take disproportionately longer on servers with 14,000 volumes vs. servers with 10,000. The best operations to show this are vos backupsys and our nightly dumps, which call vos dump with various parameters on every volume on the server.

Ok. Well, this likely rules out the volume hash chain suggestion--we don't directly use the hash table in the volserver (although we do perform at least two lookups as a consequence of performing fssync ops as part of the volume transaction). The reason I say it's unlikely is that fssync overhead is an insignificant component of the execution time for the vos ops you're talking about.

>> The fact that this relationship between volumes and performance is superlinear makes me think you're exceeding a magic boundary (e.g., you're now causing eviction pressure on some cache where you weren't previously...).
>
> Our estimate too. But before drilling down, it seemed worth checking if anyone else has a similar server - ext3 with 14,000 or more volumes in a single vice partition - and has seen a difference. Note, though, that it's not #inodes or total disk usage in the partition.
> The servers that exhibited the problem had a large number of mostly empty volumes.

Sure. Makes sense. The one thing that does come to mind is that, regardless of the number of inodes, ISTR some people were having trouble with ext performance when htree indices were turned on, because spatial locality of reference against the inode tables goes way down when you process files in the order returned by readdir(): in htree mode, readdir() returns files in hash chain order rather than more-or-less inode order. This could definitely have a huge impact on the salvager [especially GetVolumeSummary(), and to a lesser extent ListViceInodes() and friends]. I'm less certain how it would affect things in the volserver, but it would certainly have an effect on operations which delete clones, since the nuke code also calls ListViceInodes().

In addition, with regard to ext htree indices, I'll pose the (completely untested) hypothesis that htree indices aren't necessarily a net win for the namei workload. Given that namei goes to great lengths to avoid large directories (with the notable exception of the /vicepXX root dir itself), it is conceivable that htree overhead is actually a net loss. I don't know for sure, but I'd say it's worth doing further study. In a volume with files >> dirs, you're going to see on the order of ~256 files per namei directory. Certainly a linear search of on average 128 entries is expensive, but it may be worth verifying this empirically, because we don't know how much overhead htree and its side effects produce. Regrettably, there don't seem to be any published results on the threshold above which htree becomes a net win...

Finally, you did tune2fs -O dir_index dev before populating the file system, right?
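The locality problem Tom describes has a well-known mitigation (not something this thread prescribes): sort the readdir() results by inode number before stat()ing them, so the inode tables are walked roughly sequentially instead of in hash order. A synthetic Python sketch of the effect, with made-up names and inode numbers:

```python
import random

# Synthetic directory: names allocated in order, inodes roughly
# sequential on disk (not ext3 internals, just a model).
random.seed(7)
dirents = [(f"V{n:07d}.vol", n * 8 + 11) for n in range(10_000)]
hash_order = dirents[:]        # stand-in for htree hash-chain order
random.shuffle(hash_order)

def seek_distance(entries):
    """Total inode-table 'seek' distance when processed in this order."""
    inodes = [ino for _name, ino in entries]
    return sum(abs(b - a) for a, b in zip(inodes, inodes[1:]))

# Sorting by inode number restores sequential access to the inode
# tables, collapsing the total seek distance.
by_inode = sorted(hash_order, key=lambda e: e[1])
assert by_inode == dirents
assert seek_distance(by_inode) < seek_distance(hash_order)
```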
>> Another possibility accounting for the superlinearity, which would very much depend upon your workload, is that by virtue of the increased volume count you're now experiencing higher volume operation concurrency, thus causing higher rates of partition lock contention. However, this would be very specific to the volume server and salvager--it should not have any substantial effect on the file server, aside from some increased VOL_LOCK contention...
>
> Salvager is not involved, or at least, hasn't yet been involved. It's vos backupsys and vos dump where we see it mostly.

What I was trying to say is that if the observed performance regression involves either the volserver or the salvager, then it could involve partition lock contention. However, this will only come into play if you're running a lot of vos jobs in parallel against the same vice partition...

Cheers,

-Tom
Re: [OpenAFS] Re: Performance issue with many volumes in a single /vicep?
On Thu, Mar 25, 2010 at 12:39 AM, Andrew Deason adea...@sinenomine.net wrote:

> On Wed, 24 Mar 2010 23:43:32 -0400 Tom Keiser tkei...@sinenomine.net wrote:
>
>> What I was trying to say is that if the observed performance regression involves either the volserver or the salvager, then it could involve partition lock contention. However, this will only come into play if you're running a lot of vos jobs in parallel against the same vice partition...
>
> The volserver won't cause partition lock contention with itself. If another thread already holds the partition lock, other threads won't wait for it, iirc. (At least, in 1.4)

Heh. I should have quit while I was ahead. Yes, what Andrew said. These lock contention problems really only manifested themselves on older DAFS builds, where we saw lots of demand salvage procs contending against volserver transactions for the partition lock.

-Tom
Re: [OpenAFS] Performance issue with many volumes in a single /vicep?
On Wed, Mar 17, 2010 at 7:41 PM, Derrick Brashear sha...@gmail.com wrote:

> On Wed, Mar 17, 2010 at 12:50 PM, Steve Simmons s...@umich.edu wrote:
>
>> We've been seeing issues for a while that seem to relate to the number of volumes in a single vice partition. The numbers and data are inexact because there are so many damned possible parameters that affect performance, but it appears that somewhere between 10,000 and 14,000 volumes performance falls off significantly. That 40% difference in volume count results in 2x to 3x falloffs for performance in issues that affect the /vicep as a whole - backupsys, nightly dumps, vos listvol, etc.

First off, could you describe how you're measuring the performance drop-off? The fact that this relationship between volumes and performance is superlinear makes me think you're exceeding a magic boundary (e.g., you're now causing eviction pressure on some cache where you weren't previously...). Another possibility accounting for the superlinearity, which would very much depend upon your workload, is that by virtue of the increased volume count you're now experiencing higher volume operation concurrency, thus causing higher rates of partition lock contention. However, this would be very specific to the volume server and salvager--it should not have any substantial effect on the file server, aside from some increased VOL_LOCK contention...

>> My initial inclination is to say it's a linux issue with directory searches, but before pursuing this much further I'd be interested in hearing from anyone who's running 14,000 or more volumes in a single vicep. No, I'm not counting .backup volumes in there, so 14,000 volumes means 28,000 entries in the directory.
>
> Another possibility: there's a hash table which is taking the bulk of that, that you then search linearly.

Hmm. That does sound plausible.
Although, it seems like that generally shouldn't result in superlinear performance changes (ignoring interaction effects between the data structure and the memory hierarchy); it would almost have to imply that the additional 4,000 volumes have special properties with respect to the hash function.

-Tom
Re: [OpenAFS] Limit of clones
On Mon, Nov 2, 2009 at 5:14 PM, Steve Simmons s...@umich.edu wrote:

> On Oct 31, 2009, at 3:42 PM, Derrick Brashear wrote:
>
>> On Sat, Oct 31, 2009 at 11:08 AM, Anders Magnusson ra...@ltu.se wrote:
>>
>>> The manpage for vos clone says there are a maximum of 7 clones using the namei fileserver. What is the reason for this limitation?
>>
>> The implementation uses only 3 bits (1 + 2 + 4 = 7). Given that in a classic fileserver, RW, RO, BK, temporary clone = 4, this wasn't really a problem.
>
> We've experimentally verified that you can manually create another three clones, and all AFS operations continue to work fine.

There is an important caveat: the minute you have >= 6 clones, there MUST be (uniq,DV) overlap for every vnode within the VG. Otherwise, namei_GetFreeTag() will fail during CopyOnWrite() ops due to running out of tags (the on-disk representation of the VG tag map is one row per vnode id, with 5x 3-bit columns per row, where each 3-bit column is the reference count for a tag; each unique (uniq,DV) tuple maps 1:1 onto a tag). Hence, the practical limit on clones is five; six or seven are only ok in a probabilistic sense. On the other extreme, the tag map could _technically_ allow up to 35 clones within a VG, so long as you have at most five extant (uniq,DV) tuples for every vnode, each referenced by at most seven volumes within the VG.

Cheers,

-Tom
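The tag-map arithmetic can be modeled in a few lines of Python. This is a toy model of the constraint (a made-up allocation function, not the real namei link-table code): each vnode row has 5 tag columns, each with a 3-bit reference count.

```python
def get_free_tag(row):
    """Allocate a tag for a new divergent (uniq,DV) version of a vnode.
    Fails once all 5 columns are in use -- the failure mode that makes
    more than 5 fully-divergent volumes per VG unsafe."""
    for tag, refcount in enumerate(row):
        if refcount == 0:
            row[tag] = 1
            return tag
    return None  # models the namei_GetFreeTag() error case

row = [0] * 5    # one vnode's tag-map row: 5 x 3-bit reference counts

# Five divergent versions allocate fine...
assert [get_free_tag(row) for _ in range(5)] == [0, 1, 2, 3, 4]
# ...a sixth divergent version cannot get a tag: CopyOnWrite() fails.
assert get_free_tag(row) is None

# But each existing tag can be shared by up to 7 referencing volumes
# (3-bit refcount), giving the theoretical 5 * 7 = 35 ceiling when
# versions overlap.
assert 5 * (2**3 - 1) == 35
```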
Re: [OpenAFS] New Cell setup - ideas?
On Wed, Jan 27, 2010 at 3:22 AM, Lars Schimmer l.schim...@cgv.tugraz.at wrote:

> - no single user (person) should be identifiable by the sharing organization as accessing that data (seeing which department is fine, but not the individual persons of the accessing department)

The AFS-3 security model _cannot_ satisfy this anonymization requirement. With the current security model, each file server must know the identity of the caller in order to perform RPC authorization. I suppose you could give them file server binaries with auditing support disabled and call back table dump support disabled, and then hope that the satellite site admins don't know enough about AFS to dissect rxkad clear packets or file server cores, or to use cmdebug to make educated inferences. But then again, if they know enough to do any of that, then I suppose they also know that the KeyFile effectively gives them full control over the entire distributed infrastructure.

Cheers,

-Tom
Re: [OpenAFS] All limitations of OpenAFS
On Tue, Jan 26, 2010 at 7:49 AM, Lars Schimmer l.schim...@cgv.tugraz.at wrote:

> Hi! For a new EU project we want to set up an OpenAFS cell. As we could hit some limitations of OpenAFS, I would like to collect all known limitations of current OpenAFS systems (1.5.70 on Windows and 1.4.11 on Linux, that is). I know of:
>
> - one RW copy of a volume
> - max 6 RO copies of one volume

Are you referring to the number of repsites that can be represented in the VLDB, or the number of distinct clones supported within a volume group?

repsites: IIRC, VLDB version 4 supports 13 repsites (so a max of 12 RO sites without a BK repsite; 11 with a BK repsite).

clones: This depends on whether your backing store is namei or inode. For namei, the safe limit is five volumes within a volume group. Thus, in the typical case you get one RW and four clones. Note that you can go beyond this limit (IIRC the true limit is 35), but there must be DV overlap for every vnode (or clone operations will fail) -- each vnode (irrespective of differences in uniquifier) on namei can have at most five divergent versions at any given point in time.

Cheers,

-Tom
Re: [OpenAFS] Openafs failover
On Mon, Jun 8, 2009 at 7:23 PM, Harald Barth h...@kth.se wrote:

>> In the case server1 went down, server2: 1. would mount vicepa ... 2. would take over address 10.0.0.1 3. finally would restart the vlserver, volserver and fs processes.
>
> You have missed what to do with the outstanding callbacks that server1 is holding (in memory). When server1 shuts down nicely, these are handled (clients notified) during the shutdown. If server1 crashes,

Hi Harald,

I know a lot of us have said this over the years (I'm pretty sure I'm guilty as well), but it's not entirely accurate. Yes, coherence is maintained across crashes/restarts by sending one of the InitCallBackState family of RPCs to the client. However, the key point is that this happens _after_ the new fileserver process starts up, when the cm next makes contact. When we walk the host hash table, we fail to find a host entry, and thus perform initialization of a new host object, host CPS data, etc. This process forces the client to invalidate its status entries, and thus results in a new round of FetchStatus RPCs. The net result is that 2-node active/passive failover clusters can be equivalent to standalone fileservers in terms of cache coherence (assuming proper Net{Info,Restrict} and rxbind configuration).

> these are lost, so clients could in this case continue to use an outdated copy in cache. If I remember correctly, there has been work for the 1.5.x server series to write down callback information (continuously) to /vicepX. That could then be used by a starting

Storing continuously is an excellent end-goal. Unfortunately, we're not there yet. What we have at present (with DAFS) is a mechanism to serialize an atomic snapshot to disk. Unfortunately, the current implementation does not lend itself to continuous dumping. In order to achieve atomicity, we quiesce all Rx worker threads and hold H_LOCK across the entire operation. Furthermore, the fsstate.dat on-disk format is optimized for serialization, not random access.
Continuous dumping is complicated from a number of perspectives. First of all, we'd likely want tunable consistency modes. Secondly, there's the question of whether extended callback data should be serialized or not (at present, dafs+osi+xcb does not dump xcb data; it would not be particularly hard to add support in the future). Lastly, there is the pertinent question of where to store the data. If/when partition UUID extensions become supported, the issue becomes significantly more complicated, because we will likely want the host package data to be replicated across every partition in order to support partition-level load balancing (which is further complicated by the existence of unmounted, unsynchronized, out-of-date clones).

-Tom

--
Tom Keiser
tkei...@sinenomine.net
Re: [OpenAFS] Automatic move of volumes
On 10/24/07, Steven Jenkins [EMAIL PROTECTED] wrote:

> On 10/24/07, Derrick Brashear [EMAIL PROTECTED] wrote:
>
>>> It has _everything_ to do with namespace management. In the absence of better tools, people are using vos release to do just that. Note that vos release isn't a bad tool; it's just being stretched beyond its design because people need a way to do versioning of their namespaces.
>>
>> you want to dump and restore volumes. that's ugly. it's not a namespace issue; you want versioned volume clones. dump/restore is just a mechanism in lieu of a volume copy operation.
>
> Versioned clones could be interesting in this context, but I would prefer to stay away from that approach, as it makes it harder to recover/see changes in the base volume. I think having one RW per 'generation' of ROs is reasonable.

I agree that having one RW per generation is very useful. Ideally, I'd like to see a volume group become a more flexible container which holds an arbitrary inheritance tree structure. For instance, it would be useful to allow creation of additional forked RW volumes within a volume group. This would, in effect, give us the equivalent of a CVS branch tag for a volume. Obviously, the existing model, where a volume name is a tightly coupled 1:1 mapping onto the volume group, is a limitation which would need to be lifted. But there are other reasons why making the name map more flexible would be beneficial (e.g., providing useful names for infinite backup clones, etc.).

> With versioned clones, you would need to create a mechanism to have potentially infinite numbers of clones, with arbitrary generation identifiers (e.g., some would be ok with '1', '2', ..., but some would want 'alpha', 'beta', ..., or 'dev', 'prod', etc.). IMO, that's better done outside of the volume itself.

Not sure I agree. Providing this type of metadata in the volume management system itself has value (provided it's done in a typeful, standardized manner).
For instance, if we ever integrate an automated volume migration/balancing system, we will want to access this type of information to prioritize where certain volumes are stored.

-Tom
Re: [OpenAFS] File server not parallelized on restart?
On 10/17/07, Derrick Brashear [EMAIL PROTECTED] wrote:

> On 10/17/07, Steve Simmons [EMAIL PROTECTED] wrote:
>
>> We had an AFS hang today (more detail after we complete the post-mortem). It required doing a hard reboot on the server. On reboot, it began salvaging the two partitions in parallel as normal. When the salvages completed, it started attaching the partitions sequentially. Here are the relevant times and events from the log. The last 4 in the sequence look funny to me:
>>
>> 13:30:40 /vicepa salvage started
>> 13:30:40 /vicepb salvage started
>> 14:23:07 /vicepb salvage completed
>> 14:35:59 /vicepa salvage completed
>> 14:36:01 fs starts attaching /vicepb volumes
>> 14:50:16 fs finishes attaching /vicepb volumes
>> 14:50:16 fs starts attaching /vicepa volumes
>>
>> Should it have started attaching /vicepa volumes as soon as that salvage completed, or am I laboring under a misconception here?

The mode of operation is basically whole-partition-salvager XOR fileserver+volserver. In order to guarantee mutually exclusive access, the bosserver won't start the fileserver and volserver until the salvager has exited.

>> Advance thanks,
>
> nope, it's serial unless you have 1.5, with -vattachpar set, and will do them in reverse in some versions due to a minor bug since fixed.

Parallel volume attachment support ships with 1.3.83 and above. Parallel shutdown requires DAFS. As Derrick mentioned, -vattachpar controls parallelization of startup and shutdown in the volume package. Unless set explicitly, -vattachpar has a value of 1, thus providing the classic single-threaded behavior by default. The single-threaded partition attachment ordering fix was committed in time for 1.4.4.

Regards,

-Tom
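The serial-vs-parallel attachment behavior controlled by -vattachpar can be sketched with a toy worker pool (illustrative Python only; the real fileserver does this in C with its own threading, and the partition/volume names here are made up):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model: volumes per partition, as the fileserver sees at startup.
partitions = {
    "/vicepa": [f"vol{i}" for i in range(5)],
    "/vicepb": [f"vol{i}" for i in range(5, 9)],
}

attached = []   # list.append is atomic under CPython's GIL

def attach_partition(part):
    """Stand-in for walking a partition and attaching its volumes."""
    for vol in partitions[part]:
        attached.append((part, vol))
    return part

# -vattachpar=1 (the default) corresponds to max_workers=1: partitions
# are walked one after another.  -vattachpar=N lets N partitions be
# attached concurrently, which is where the startup speedup comes from.
with ThreadPoolExecutor(max_workers=2) as pool:
    done = list(pool.map(attach_partition, partitions))

assert sorted(done) == ["/vicepa", "/vicepb"]
assert len(attached) == 9   # every volume attached exactly once
```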
Re: [OpenAFS] compile error on AIX 5.3 - softsig.c
On 6/7/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

> Get this error while compiling on AIX 5.3. Maybe I gave more information than needed, and any help will be appreciated. Thanks ahead of time.
>
> /openafs/openafs-1.4.4/src/pinstall/pinstall libafsauthent.a /openafs/openafs-1.4.4/lib/libafsauthent.a
> Target all is up to date.
> case rs_aix53 in
>   alpha_dux*|sgi_*|sun*_5*|rs_aix*|*linux*|hp_ux11*|ia64_hpux*|*[of]bsd*|*nbsd[234]*)
>     cd src
>     cd tviced
>     make all
>     ;;
>   *_darwin_[1-6][0-9]) echo Not building MT viced for rs_aix53 ;;
>   *_darwin_*)
>     cd src
>     cd tviced
>     make all
>     ;;
>   *) echo Not building MT viced for rs_aix53 ;;
> esac
> xlc_r4 -O -I/openafs/openafs-1.4.4/src/config

Um, why did you set MT_CC to xlc_r4? We don't support DCE (aka draft 4) threads.

--
Tom Keiser
[EMAIL PROTECTED]
Re: [OpenAFS] Openafs 1.4.2 on Debian Etch kernel 2.6.18 slow
I've got a bunch of questions. Even if you only have time to answer a few of them, it will help us to narrow down the root cause. On 2/10/07, Derek Harkness [EMAIL PROTECTED] wrote: I'm attempting to deploy/update a new AFS fileserver. The new server is the first to upgraded from Debian sarge, OpenAFS 1.3.xx, kernel 2.4 to Etch, 2.6.18, AFS 1.4.2, reiserfs and a new 7 terabyte XRaid. The upgrade went fine except I file writes to the new system are so slow the system is unusable. On the server iostat shows a transfer rate of ~40KB/s and an iowait of 20 during AFS operations. If I stop the fileserver and First and foremost, do local volume package operations (e.g. the salvager, vos backup, fileserver startup/shutdown, etc) run slowly, or is it only stuff that involves Rx? What about vos dump foo localhost on the ailing fileserver? The fact that iowait is going through the roof may be indicative of an io subsystem problem, so eliminating network/Rx problems at the top of the decision tree will be useful. I'm not familiar with the Linux iostat utility, but if it supports per-disk stats similar similar to the -x option on Solaris, or the -D option on AIX, then please post some data while the problem is occurring. perform io directly on the XRaid I can read and write between 100MB/s-500MB/s. A single fibre channel port (excepting 10Gb E-ports) can't transmit 500MB/s. From what I've heard, apple's fc raid products only provide a single 2Gb sfp per controller, and don't support fc multipathing. So, you're limited to a max theoretical of ~203MB/s (less in AL mode). Thus, I'm guessing your tests are, at least in some cases, only stressing the page cache, rather than anything across the fabric (for that matter, is there a fabric?). In order to declare the storage subsystem OK, we need to be sure you've tested every layer of the storage stack. Please tell us specifically what you did to verify direct io. For example: * Were you running some well-known benchmark suite? 
If so, what options did you pass?
* Did it involve one file or many?
* Were any fsync()s issued?
* Did it modify any filesystem metadata, or only file data?
* Was it single threaded or multi-threaded?
* How much data was read/written?
* How big were the files involved?
* Did you do anything to mitigate/bypass caching?

Other questions that might be useful:
* How deep are the tagged command queues for the xserve lun(s)?
* Do all the disks pass surface scans?
* Are the disks and/or controllers reporting SMART events?
* If this stuff is fabric attached, have you looked at port error counts, port performance data, etc?

Does anyone have any suggestions on how I might troubleshoot this problem? So far I've checked network performance, io performance directly to the XRaid, and the reiserfs filesystem. It all seems to be pointing me back to How have you verified that network performance is ok? What are the ethernet port error counts like? What are the packet retransmit rates like? I don't know much of anything about apple's storage line, but if they have any sort of performance analysis and/or problem determination tools, what do they say? the same problem: the AFS fileserver. Hardware: HP DL380, 2x2.8GHz Hyperthreaded Xeon CPU, 4 GB of RAM, Gigabit ethernet, MPTFusion fibre channel card, Apple XRaid. I've got 2 other identical boxes currently running AFS and working fine. The only difference is the other boxes are running an older OS. Are the machines running the older kernel still running 1.3.x? Until we can better understand your testing methodology, I'd have to say this could be a hardware problem, a kernel driver problem, an AFS problem, or even a network problem. We need more information to narrow it down. Regards, -- Tom Keiser [EMAIL PROTECTED] ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] openafs in solaris10 containers
On 12/3/06, Matthew Cocker [EMAIL PROTECTED] wrote: Anyone running afs client in a solaris 10 container environment? I have seen some references that you can not run afs in the child containers but you have to run it from the main container (I may have the solaris terms mixed). Is this correct? Many people are running afs and containers in production. You need to run afsd in the global zone. Use lofs mounts to import all or part of the afs namespace into the child zones. Importing all of /afs into a zone just requires the following zonecfg stanza:

add fs
set type=lofs
set dir=/afs
set special=/afs
end

Use set options as you like. -- Tom Keiser [EMAIL PROTECTED] ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] Two afs cells in one server
On 11/16/06, Walter Lamagna [EMAIL PROTECTED] wrote: Hi, I would like to know if it is possible to have two different afs cells in one server. I guess that it is not possible since the command: Without source modification, it's very easy to do this using solaris 10 zones. On other platforms, it's possible, provided you don't mind modifying the fssync source code to use a different port. Then, you'd need to build a bunch of installations with different install prefixes. In this latter configuration, don't forget to make the necessary NetInfo/NetRestrict files (each fileserver will require its own IP address, as the vldb only advertises a list of IPv4 addresses, not a list of (address, port) tuples for each fileserver uuid). -- Tom Keiser [EMAIL PROTECTED] ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] Two afs cells in one server
On 11/16/06, Christof Hanke [EMAIL PROTECTED] wrote: Derek Atkins wrote: Nope, a single server can only serve a single cell. You could run servers in vmware on different IP Addresses to get different cells. The AFS Client assumes services are on specific ports. If your server is multi-homed you could bind different servers to different IP Addresses on the same machine, thereby splitting the cells by IP Address, but I don't know if any released OpenAFS code supports that. Don't forget that in case of a fileserver, you need to determine which data (/vicep*) is served by which fileserver. I doubt that this is implemented, but I could be wrong. There's an implementation of a config-file based mechanism for vice attaching (vptab), which is currently winnt-specific. It wouldn't be too much work to generalize this. The alternative is to run each server within its own chroot environment. -- Tom Keiser [EMAIL PROTECTED] ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS-devel] Re: [OpenAFS] namei interface lockf buggy on Solaris (and probably HP-UX and AIX)
On 9/11/06, Jeffrey Hutzelman [EMAIL PROTECTED] wrote: On Monday, September 11, 2006 12:45:40 PM -0400 Tom Keiser [EMAIL PROTECTED] wrote: As it turns out, the way we use file locks in the volume package is quite broken. The spec says that once a process closes *any* file descriptor, all fcntl locks held for that file are immediately destroyed. This means that the pthread fileserver/volserver can have some interesting races given how the ih package fd cache allows multiple concurrent descriptors per inode handle. I have sample code sitting around somewhere which demonstrates this fault. Do we ever acquire file locks on files other than the link table? If so, why? Namei link counts are the only usage I'm aware of (aside from the salvager and partition lockfiles, obviously). -Tom ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] namei interface lockf buggy on Solaris (and probably HP-UX and AIX)
I propose we move this discussion to -devel. On 9/11/06, Rainer Toebbicke [EMAIL PROTECTED] wrote: The namei interface uses file locking extensively, implemented using lockf() on Solaris, AIX and HP-UX. Unfortunately lockf() locks and unlocks from the *current position* to whatever the argument says (end of file), so moving the file pointer in between becomes a problem for the subsequent unlock! The result is that frequently locks aren't released, but replaced by partial locks on the file data just moved over. At least on AIX and Solaris, lockf() is nothing more than an inflexible wrapper around fcntl() byte-range locks. My vote is to transition to fcntl (where we can explicitly pass in a base offset and length). This eliminates the call semantics change introduced by your patch, and eliminates the unnecessary syscall overhead. I further object because I'm working on a patch which will allow us to use pread/pwrite on platforms which support it. This will completely eliminate fcntl(F_DUPFD,...) and lseek() overhead in the fd package, so any new requirements on lseek could undercut the performance improvement I'm seeing. However, the real motivation for switching to pread/pwrite is due to a fairly serious locking bug: As it turns out, the way we use file locks in the volume package is quite broken. The spec says that once a process closes *any* file descriptor, all fcntl locks held for that file are immediately destroyed. This means that the pthread fileserver/volserver can have some interesting races given how the ih package fd cache allows multiple concurrent descriptors per inode handle. I have sample code sitting around somewhere which demonstrates this fault. Regards, -- Tom Keiser [EMAIL PROTECTED] ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] Crash on Solaris 10 update 2
On 7/2/06, Andrew Cobaugh [EMAIL PROTECTED] wrote: I am running Solaris 10 update 2, fresh install. OpenAFS 1.4.1. I have /afs lofs mounted into several zones. If I try to rename a file within AFS from inside any of the zones, the machine immediately dumps core and reboots. I get the following on console after the crash (this particular instance was caused by Gallery v1 running under apache): http://www.phys.psu.edu/~phalenor/console_output I can also reproduce this by simply mv'ing a file in afs from within a zone. I can provide stacktraces from the core file if necessary to help in debugging this. Has anyone else seen this issue? I've repro'd it with lofs in global and child zones. gafs_rename() incorrectly assumes v_path is always non-null. Sometime around snv_21 the vnode path cache code in the kernel was substantially modified. These changes were subsequently pulled up into s10u2. See RT 34774. Patch is also available at: /afs/dementia.org/user/tkeiser/openafs/patches/solaris-vnode-path-cache-20060702.diff -Tom ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] RPC failed, code=5377
Code 5377 is currently UNOTSYNC. What does udebug say about your vlservers? You might have a quorum problem. Regards, -- Tom Keiser [EMAIL PROTECTED] On Sun, 20 Mar 2005 16:35:53 +0100, Lars Schimmer [EMAIL PROTECTED] wrote: Hiho! After restarting all fileservers, all nodes are up again and it works. But at least on one of the servers (kernel 2.6.10, 1.3.79) the FileLog prints out:

Sun Mar 20 16:05:43 2005 VL_RegisterAddrs rpc failed; will retry periodically (code=5377, err=0)
Sun Mar 20 16:10:43 2005 VL_RegisterAddrs rpc failed; will retry periodically (code=5377, err=0)
Sun Mar 20 16:15:43 2005 VL_RegisterAddrs rpc failed; will retry periodically (code=5377, err=0)
Sun Mar 20 16:20:44 2005 VL_RegisterAddrs rpc failed; will retry periodically (code=5377, err=0)
Sun Mar 20 16:25:44 2005 VL_RegisterAddrs rpc failed; will retry periodically (code=5377, err=0)
Sun Mar 20 16:30:44 2005 VL_RegisterAddrs rpc failed; will retry periodically (code=5377, err=0)

And so on and on and on... Google gave nothing. Anyone has an explanation? Cya Thx Lars -- Technische Universität Braunschweig, Institut für Computergraphik Tel.: +49 531 391-2109 E-Mail: [EMAIL PROTECTED] PGP-Key-ID: 0xB87A0E03 ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info