Re: FS hang when creating snapshots on a UFS SU+J setup
Hello,
I've done some tests to verify that the problem only occurs when SU+J is used, but not SU without J. In fact, I ran the following two loops on different TTYs in parallel:

  while 1
    cp -r /usr/src /root
    rm -Rf /root/src
  end

  while 1
    mksnap_ffs / /.snap/snap
    rm -f /.snap/snap
  end

With SU without J the system survives this for at least 1 hour. But as soon as SU+J is used it most likely deadlocks or even panics in the first 1 or 2 minutes. What exactly happens seems to vary... In most cases the system just deadlocks, sometimes like al...@bsdgate.org describes and sometimes it's completely unresponsive to any input. I've seen kernel messages like "fsync: giving up on dirty". Several times the system panicked, in most cases printing the generic "panic: page fault while in kernel mode" and one time printing "panic: snapacct_ufs2: bad block". I've never seen the same backtrace twice. One time the system suddenly rebooted, as if a triple fault or something like that happened. Since it's much more likely that the problems described above arise when the filesystem is loaded (for example by the first loop) while taking the snapshot, this looks like some kind of race condition or something like that. Some more information from an older debug session can be found at:
http://deponie.yamagi.org/freebsd/debug/snapshots_panic/

On Tue, 10 Jan 2012 10:30:13 -0800 Kirk McKusick mckus...@mckusick.com wrote:

  Date: Mon, 9 Jan 2012 18:30:51 +0100
  From: Yamagi Burmeister li...@yamagi.org
  To: j...@freebsd.org, mckus...@freebsd.org
  Cc: freebsd-current@freebsd.org, br...@bryce.net
  Subject: Re: FS hang when creating snapshots on a UFS SU+J setup

    Hello,
    I'm sorry to bother you, but you may not be aware of this thread and this problem. We are several people experiencing deadlocks, kernel panics and other problems when creating snapshots on file systems with SU+J. It would be nice to get some feedback, e.g. how we can help debugging and/or fixing this problem. Thank you, Yamagi

  First step in debugging is to find out if the problem is SU+J specific. To find out, turn off SU+J but leave SU. This change is done by running:

    umount filesystem
    tunefs -j disable filesystem
    mount filesystem
    cd filesystem
    rm .sujournal

  You may want to run `fsck -f' on the filesystem while you have it unmounted just to be sure that it is clean. Then run your snapshot request to see if it still fails. If it works, then we have narrowed the problem down to something related to SU+J. If it fails, then we have a broader issue to deal with. If you wish to go back to using SU+J after the test, you can reenable SU+J by running:

    umount filesystem
    tunefs -j enable filesystem
    mount filesystem

  When responding to me, it is best to use my mckus...@mckusick.com email as I tend to read it more regularly.

  Kirk McKusick

--
Homepage: www.yamagi.org
XMPP: yam...@yamagi.org
GnuPG/GPG: 0xEFBCCBCB
Re: bus dma: a flag/quirk for page zero
On Tuesday, January 10, 2012 3:18:28 pm Andriy Gapon wrote:

  Some hardware interfaces may reserve a special meaning for a (physical) memory address value of zero. One example is the OHCI specification, where a zero value in CurrentBufferPointer doesn't mean a physical address but has a reserved meaning. To be honest I don't have another example :) but don't preclude its existence. To deal with this peculiarity we could use a special flag/quirk that would instruct the bus dma code to never use page zero for communication with the hardware. Here's a proof-of-concept patch that implements the idea:
  http://people.freebsd.org/~avg/usb-dma-pagezero.diff
  Some concerns:
  - not sure if BUS_DMA_NO_PAGEZERO is the best name for the flag
  - the patch implements the flag only for x86 at the moment
  - usb code uses the flag regardless of the actual controller type
  What do you think?

I think this is fine, but you should just always exclude page zero when allocating bounce pages. Bounce pages are assigned to zones that can be shared by multiple tags, so other tags that map to the same zone can alloc bounce pages that ohci will use (add_bounce_page() should probably take the bounce zone as an arg instead of a tag). I think it's not worth creating a separate zone just for ohci, but better to forbid page zero from all zones instead. Also, please change this:

  -	if (newtag->lowaddr < ptoa((vm_paddr_t)Maxmem)
  -	    || newtag->alignment > 1)
  +	if (newtag->lowaddr < ptoa((vm_paddr_t)Maxmem) ||
  +	    newtag->alignment > 1)
  +		newtag->flags |= BUS_DMA_COULD_BOUNCE;
  +
  +	if ((newtag->flags & BUS_DMA_NO_PAGEZERO) != 0)
   		newtag->flags |= BUS_DMA_COULD_BOUNCE;

to just be one if.

--
John Baldwin
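For context, a driver wanting this behavior would presumably just pass the proposed flag at tag-creation time. A minimal sketch, assuming the BUS_DMA_NO_PAGEZERO flag from the proof-of-concept patch above (it is not a stock FreeBSD flag) and a hypothetical foo(4) driver:

    #include <sys/param.h>
    #include <sys/bus.h>
    #include <machine/bus.h>

    /*
     * Sketch: create a DMA tag that asks busdma never to hand the
     * hardware an address in page zero.
     */
    static int
    foo_create_dma_tag(device_t dev, bus_dma_tag_t *tag)
    {
            return (bus_dma_tag_create(
                bus_get_dma_tag(dev),       /* parent */
                1, 0,                       /* alignment, boundary */
                BUS_SPACE_MAXADDR_32BIT,    /* lowaddr: 32-bit bus master */
                BUS_SPACE_MAXADDR,          /* highaddr */
                NULL, NULL,                 /* filter, filterarg */
                PAGE_SIZE, 1,               /* maxsize, nsegments */
                PAGE_SIZE,                  /* maxsegsize */
                BUS_DMA_NO_PAGEZERO,        /* proposed flag */
                NULL, NULL,                 /* lockfunc, lockarg */
                tag));
    }

With the flag folded into BUS_DMA_COULD_BOUNCE as suggested, any load through such a tag that does touch page zero would simply be redirected to a bounce page.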
Re: memory barriers in bus_dmamap_sync() ?
On Tuesday, January 10, 2012 5:41:00 pm Luigi Rizzo wrote:

  On Tue, Jan 10, 2012 at 01:52:49PM -0800, Adrian Chadd wrote:

    On 10 January 2012 13:37, Luigi Rizzo ri...@iet.unipi.it wrote:

      I was glancing through manpages and implementations of bus_dma(9) and I am a bit unclear on what this API (in particular, bus_dmamap_sync()) does in terms of memory barriers. I see that the x86/amd64 and ia64 code only does the bounce buffers.

That is because x86 in general does not need memory barriers. Other platforms have them (alpha had them in bus_dmamap_sync()).

      The mips seems to do some coherency-related calls. How do we guarantee, say, that a recently built packet has made it to memory before issuing the tx command to the NIC?

    The drivers should be good examples of doing the right thing. You just do pre-map and post-map calls as appropriate. Some devices don't bother with this on register accesses and this is a bug (eg, ath/ath_hal). Others (eg iwn) do explicit flushes where needed.

  so you are saying that drivers are correct unless they are buggy :)

    For bus_dma, just use bus_dmamap_sync() and you will be fine.

  Anyways... i see that some drivers use wmb() and rmb() and redefine their own version, usually based on lfence/sfence even on i386:

    #define rmb()	__asm volatile("lfence" ::: "memory")
    #define wmb()	__asm volatile("sfence" ::: "memory")

  whereas the standard definitions are slightly different, e.g. sys/i386/include/atomic.h:

    #define rmb()	__asm __volatile("lock; addl $0,(%%esp)" : : : "memory")
    #define wmb()	__asm __volatile("lock; addl $0,(%%esp)" : : : "memory")

  and our bus_space API in sys/x86/include/bus.h is a bit unclear to me (other than the fact that having 4 unused arguments doesn't really encourage its use...)

We could use lfence/sfence on amd64, but on i386 not all processors support those. The broken drivers doing it by hand don't work on early i386 CPUs. Also, I personally don't like using membars like rmb() and wmb() by hand. If you are operating on normal memory I think atomic_load_acq() and atomic_store_rel() are better.

    static __inline void
    bus_space_barrier(bus_space_tag_t tag __unused, bus_space_handle_t bsh __unused,
        bus_size_t offset __unused, bus_size_t len __unused, int flags)
    {
    #ifdef __GNUCLIKE_ASM
    	if (flags & BUS_SPACE_BARRIER_READ)
    #ifdef __amd64__
    		__asm __volatile("lock; addl $0,0(%%rsp)" : : : "memory");
    #else
    		__asm __volatile("lock; addl $0,0(%%esp)" : : : "memory");
    #endif
    	else
    		__asm __volatile("" : : : "memory");
    #endif
    }

This is only for use with something accessed via bus_space(9). Often these are not needed, however. For example, on x86 all bus_space memory is mapped uncached, so no actual barrier is needed except for a compiler barrier.

--
John Baldwin
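To make the atomic_store_rel() suggestion concrete, here is a sketch of publishing a producer index with a release store instead of a hand-rolled wmb(); the ring structure and all names are invented for illustration:

    #include <sys/types.h>
    #include <machine/atomic.h>

    /* Hypothetical descriptor ring shared with another CPU. */
    struct ring {
            uint32_t        slots[256];
            volatile u_int  prod_idx;
    };

    static void
    ring_publish(struct ring *r, u_int idx, uint32_t val)
    {
            r->slots[idx % 256] = val;
            /*
             * The release store orders the slot write above before the
             * index update, tying the barrier to the exact store that
             * needs it instead of a free-standing wmb().
             */
            atomic_store_rel_int(&r->prod_idx, idx + 1);
    }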
Re: memory barriers in bus_dmamap_sync() ?
On Jan 10, 2012, at 2:37 PM, Luigi Rizzo wrote:

  I was glancing through manpages and implementations of bus_dma(9) and I am a bit unclear on what this API (in particular, bus_dmamap_sync()) does in terms of memory barriers. I see that the x86/amd64 and ia64 code only does the bounce buffers. The mips seems to do some coherency-related calls. How do we guarantee, say, that a recently built packet has made it to memory before issuing the tx command to the NIC?

In short, the i386 and amd64 architectures do bus snooping between the CPU cache and the memory and bus controllers, so coherency is implicit and guaranteed. No explicit barriers or flushes are needed for device-mastered DMA. Other CPU architectures have appropriate cache flushes and memory barriers built into their busdma implementations. Note that this is a different scenario from device register accesses, which are essentially host-mastered DMA.

Scott
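Scott's point implies that the portable place for flushes and barriers is the busdma sync calls themselves, so a driver that brackets its descriptor updates correctly needs no explicit barrier code of its own. A schematic sketch, with all names (foo_softc, FOO_REG_TAIL) invented:

    #include <sys/types.h>
    #include <sys/endian.h>
    #include <machine/bus.h>

    struct foo_desc { uint32_t flags; };

    struct foo_softc {
            bus_space_tag_t     mem_bt;
            bus_space_handle_t  mem_bh;
            bus_dma_tag_t       ring_tag;
            bus_dmamap_t        ring_map;
            struct foo_desc    *desc_ring;   /* ring in host memory */
    };

    #define FOO_DESC_READY  0x01
    #define FOO_REG_TAIL    0x10            /* made-up register offset */

    static void
    foo_tx_kick(struct foo_softc *sc, int idx)
    {
            /* 1. CPU fills the descriptor in host memory. */
            sc->desc_ring[idx].flags |= htole32(FOO_DESC_READY);

            /*
             * 2. Make the descriptor visible to the device: a no-op
             *    beyond bouncing on x86, cache flushes and barriers
             *    on platforms that need them.
             */
            bus_dmamap_sync(sc->ring_tag, sc->ring_map,
                BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);

            /* 3. Only now tell the device to look at it. */
            bus_space_write_4(sc->mem_bt, sc->mem_bh, FOO_REG_TAIL, idx);
    }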
Re: bus dma: a flag/quirk for page zero
An old controller in the aac driver family had a variation of this problem back when the FreeBSD contigmalloc algorithm started from the bottom of memory instead of the top. I worked around it at driver init time by basically assuring that page 0 (and page 1) were allocated and thrown away; it seemed easier to leak 8k of memory than to jump through expensive hoops in busdma.

The busdma filter is expensive, and is used so rarely that I'm not even sure it works. It was created for an old SCSI controller that had a buggy DMA engine which aliased a repeating pattern of address ranges; in other words, it was a hack. It's expensive to use, since it forces every bus_dmamap_load() request through the slow path and possibly bouncing.

With that said, your idea of a flag is probably a reasonable change for now. Alternatively, the ability to specify multiple DMA exclusion ranges has come up in the past, and would be a more complete answer to your problem; just treating page0 as special might not be enough (and I know for a fact that this is true with old i960RX pci processors). That'll involve an API change, so it is something that I'd rather not happen on a whim.

Scott

On Jan 10, 2012, at 1:18 PM, Andriy Gapon wrote:

  Some hardware interfaces may reserve a special meaning for a (physical) memory address value of zero. One example is the OHCI specification, where a zero value in CurrentBufferPointer doesn't mean a physical address but has a reserved meaning. [...]
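A rough reconstruction of the init-time workaround Scott describes; this is illustrative only, not the actual aac code:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/malloc.h>

    /*
     * Grab the lowest physical pages early so the allocator can never
     * hand page 0 (or page 1) to a controller that treats address zero
     * as special.  The 8k is deliberately leaked.
     */
    static void
    foo_eat_low_pages(void)
    {
            void *p;

            p = contigmalloc(2 * PAGE_SIZE, M_DEVBUF, M_NOWAIT,
                0,                  /* low: start of physical memory */
                2 * PAGE_SIZE,      /* high: accept only pages 0 and 1 */
                PAGE_SIZE, 0);      /* alignment, boundary */
            if (p == NULL)
                    printf("pages 0-1 already in use; nothing to do\n");
    }

Leaking the allocation is the point: as long as the driver holds pages 0 and 1, the allocator can never hand them out for DMA buffers.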
Re: memory barriers in bus_dmamap_sync() ?
On Wed, Jan 11, 2012 at 10:05:28AM -0500, John Baldwin wrote:

  [...] That is because x86 in general does not need memory barriers.

...maybe they are not called memory barriers, but for instance how do I make sure, even on the x86, that a write to the NIC ring is properly flushed before the write to the 'start' register occurs? Take for instance the following segment from head/sys/ixgbe/ixgbe.c::ixgbe_xmit():

	txd->read.cmd_type_len |= htole32(IXGBE_TXD_CMD_EOP | IXGBE_TXD_CMD_RS);
	txr->tx_avail -= nsegs;
	txr->next_avail_desc = i;

	txbuf->m_head = m_head;
	/* Swap the dma map between the first and last descriptor */
	txr->tx_buffers[first].map = txbuf->map;
	txbuf->map = map;
	bus_dmamap_sync(txr->txtag, map, BUS_DMASYNC_PREWRITE);

	/* Set the index of the descriptor that will be marked done */
	txbuf = &txr->tx_buffers[first];
	txbuf->eop_index = last;

	bus_dmamap_sync(txr->txdma.dma_tag, txr->txdma.dma_map,
	    BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
	/*
	 * Advance the Transmit Descriptor Tail (Tdt), this tells the
	 * hardware that this frame is available to transmit.
	 */
	++txr->total_packets;
	IXGBE_WRITE_REG(&adapter->hw, IXGBE_TDT(txr->me), i);

The descriptor is allocated without any caching constraint, the bus_dmamap_sync() calls are effectively NOPs on i386 and amd64, and IXGBE_WRITE_REG has no implicit guard.

  We could use lfence/sfence on amd64, but on i386 not all processors support those.

ok, then we can make machine-specific versions... this is kernel code, so we do have a list of supported CPUs.

  The broken drivers doing it by hand don't work on early i386 CPUs. Also, I personally don't like using membars like rmb() and wmb() by hand. If you are operating on normal memory I think atomic_load_acq() and atomic_store_rel() are better.

is it just a matter of names? My complaint was mostly about how many unused parameters you need to pass to bus_space_barrier(). They make life hard for both the programmer and the compiler, which might become unable to optimize them out. I understand that more parameters may help parallelism, but I wonder if it is worth the effort.

cheers
luigi
Re: memory barriers in bus_dmamap_sync() ?
On Wednesday, January 11, 2012 11:29:44 am Luigi Rizzo wrote:

  [...] maybe they are not called memory barriers, but for instance how do I make sure, even on the x86, that a write to the NIC ring is properly flushed before the write to the 'start' register occurs? Take for instance the following segment from head/sys/ixgbe/ixgbe.c::ixgbe_xmit():

  [ixgbe_xmit() snippet quoted above]

  The descriptor is allocated without any caching constraint, the bus_dmamap_sync() calls are effectively NOPs on i386 and amd64, and IXGBE_WRITE_REG has no implicit guard.

x86 doesn't need a guard as its stores are ordered. The bus_dmamap_sync() would be sufficient for platforms where stores can be reordered in this case (as those platforms should place memory barriers in their implementation of bus_dmamap_sync()).

    We could use lfence/sfence on amd64, but on i386 not all processors support those.

  ok, then we can make machine-specific versions... this is kernel code, so we do have a list of supported CPUs.

It is not worth adding the overhead for i386 to do that when all modern x86 CPUs are going to run amd64 anyway.

    The broken drivers doing it by hand don't work on early i386 CPUs. Also, I personally don't like using membars like rmb() and wmb() by hand. If you are operating on normal memory I think atomic_load_acq() and atomic_store_rel() are better.

  is it just a matter of names?

For regular memory, when you are using memory barriers you often want to tie the barrier to a specific operation (e.g. it is the store in IXGBE_WRITE_REG() above that you want ordered after any other stores). Having the load/store and membar in the same call explicitly notes that relationship.

  My complaint was mostly about how many unused parameters you need to pass to bus_space_barrier(). They make life hard for both the programmer and the compiler, which might become unable to optimize them out.

Yes, it seems overly abstracted. In NetBSD, bus_dmamap_sync() actually takes extra parameters to say which portion of the map should be sync'd. We removed those in FreeBSD to make the API simpler. bus_space_barrier() could probably use similar simplification (I believe we also adopted that API from NetBSD).
--
John Baldwin
Data corruption over NFS in -current
I'm sorry for the unspecific bug report, but I thought a heads-up is better than none.

  $ uname -a
  FreeBSD wings.cons.org 10.0-CURRENT FreeBSD 10.0-CURRENT #2: Wed Dec 28 12:19:21 EST 2011 craca...@wings.cons.org:/usr/src/sys/amd64/compile/WINGS amd64

I see filesystem corruption on NFS filesystems here. I am running a heavy shellscript that is noodling around with ASCII files, assembling them with awk and whatnot. Some actions are concurrent, with up to 21 forks doing full-CPU-load scripting. This machine is a K8 with a total of 8 cores, diskless NFS and a memory filesystem for /tmp.

I observe two problems:
- for no reason whatsoever, some files change from my (user/group) cracauer/wheel to root/cracauer
- the same files will later be corrupted. The beginning of the file is normal, but then it has what looks like parts of /usr/ports, including our CVS files and binary junk, mostly zeros

I did do some ports building lately, but not at the same time that this problem manifested itself. I speculate some ports blocks were still resident in the filesystem buffer cache.

Server is Linux.

Martin
--
%%% Martin Cracauer craca...@cons.org http://www.cons.org/cracauer/
Re: memory barriers in bus_dmamap_sync() ?
On Jan 11, 2012, at 9:29 AM, Luigi Rizzo wrote:

  [...] maybe they are not called memory barriers, but for instance how do I make sure, even on the x86, that a write to the NIC ring is properly flushed before the write to the 'start' register occurs?

Flushed from where? The CPU's cache, or the device memory and PCI bus? I already told you that x86/64 is fundamentally designed around bus snooping, and John already told you that we map device memory to be uncached. Also, PCI guarantees that reads and writes are retired in order, and that reads are therefore flushing barriers. So let's take two scenarios. In the first scenario, the NIC descriptors are in device memory, so the driver has to do bus_space accesses to write them.

Scenario 1:
1. The driver writes to the descriptors. These may or may not hang out in the CPU's cache, though they probably won't, because we map PCI device memory as uncacheable. But let's say for the sake of argument that they are cached.
2. The driver writes to the 'go' register on the card. This may or may not be in the CPU's cache, as in step 1.
3. The writes get flushed out of the CPU and onto the host bus. Again, the x86/64 architecture guarantees that these writes won't be reordered.
4. The writes get onto the PCI bus and are buffered at the first bridge.
5. PCI ordering rules keep the writes in order, and they eventually make it to the card in the same order that the driver executed them.

Scenario 2:
1. The driver writes to the descriptors in host memory. This memory is mapped as cacheable, so these writes hang out in the CPU.
2. The driver writes to the 'go' register on the card. This may or may not hang out in the CPU's cache, but likely won't, as discussed previously.
3. The 'go' write eventually makes its way down to the card, and the card starts its processing.
4. The card masters a PCI read for the descriptor data, and the request goes up the PCI bus to the host bridge.
5. Thanks to the fundamental design guarantees on x86/64, the PCI host bridge, memory controller, and CPU all snoop each other. In this case, the CPU sees the read come from the PCI host bridge, knows that it's for data that's in its cache, and intercepts and fills the request. Coherency is preserved!

Explicit barriers aren't needed in either scenario; everything will retire correctly and in order. The only caveat is the buffering that happens on the PCI bus. A write by the host might take a relatively long and indeterminate time to reach the card thanks to this buffering and the bus being busy. To guarantee that you know when the write has been delivered and retired, you can do a read immediately after the write. On some systems, this might also boost the transaction priority of the write and get it down faster, but that's really not a reliable guarantee. All you'll know is that when the read completes, the write prior to it has also completed.

Where barriers _are_ needed is in interrupt handlers, and I can discuss that if you're interested.
Scott
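The read-after-write caveat from the last paragraph looks like this in driver code; a minimal sketch with invented softc fields and register offsets:

    #include <machine/bus.h>

    #define FOO_REG_GO      0x20    /* made-up 'go' register */
    #define FOO_REG_STATUS  0x24    /* any harmless readable register */

    struct foo_softc {
            bus_space_tag_t     mem_bt;
            bus_space_handle_t  mem_bh;
    };

    static void
    foo_start(struct foo_softc *sc)
    {
            bus_space_write_4(sc->mem_bt, sc->mem_bh, FOO_REG_GO, 1);
            /*
             * PCI retires reads and writes in order, so once this read
             * completes the 'go' write has reached the device.
             */
            (void)bus_space_read_4(sc->mem_bt, sc->mem_bh, FOO_REG_STATUS);
    }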
Re: memory barriers in bus_dmamap_sync() ?
On Wed, 2012-01-11 at 11:49 -0500, John Baldwin wrote:

  [...]

  It is not worth adding the overhead for i386 to do that when all modern x86 CPUs are going to run amd64 anyway.

Harumph. I run i386 on all my x86 CPUs. For our embedded systems products it's because they're small wimpy old CPUs, and for my desktop system it's because I need to run builds for the embedded systems and avoid all the cross-build problems of trying to create i386 ports on a 64-bit host.

  For regular memory, when you are using memory barriers you often want to tie the barrier to a specific operation (e.g. it is the store in IXGBE_WRITE_REG() above that you want ordered after any other stores). Having the load/store and membar in the same call explicitly notes that relationship.

  My complaint was mostly about how many unused parameters you need to pass to bus_space_barrier(). They make life hard for both the programmer and the compiler, which might become unable to optimize them out.

  Yes, it seems overly abstracted.
  In NetBSD, bus_dmamap_sync() actually takes extra parameters to say which portion of the map should be sync'd. We removed those in FreeBSD to make the API simpler. bus_space_barrier() could probably use similar simplification (I believe we also adopted that API from NetBSD).

I've wished (in the ARM world) for the ability to sync a portion of a map. I've even kicked around the idea of proposing an API extension to do so, but I guess if FreeBSD went out of its way to remove that functionality, that idea probably won't fly. :)

--
Ian
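For reference, the range-based NetBSD prototype under discussion, next to FreeBSD's whole-map form:

    /* NetBSD: sync only [offset, offset + len) of the mapping. */
    void	bus_dmamap_sync(bus_dma_tag_t tag, bus_dmamap_t dmam,
    	    bus_addr_t offset, bus_size_t len, int ops);

    /* FreeBSD: always operates on the whole map. */
    void	bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t map,
    	    bus_dmasync_op_t op);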
Re: memory barriers in bus_dmamap_sync() ?
On Jan 11, 2012, at 10:00 AM, Ian Lepore wrote:

  I've wished (in the ARM world) for the ability to sync a portion of a map. I've even kicked around the idea of proposing an API extension to do so, but I guess if FreeBSD went out of its way to remove that functionality, that idea probably won't fly. :)

It's been discussed numerous times since mips and arm became relevant in FreeBSD, and I'm frankly surprised that it hasn't happened yet. Go forth and code; it won't be opposed.

Scott
Re: memory barriers in bus_dmamap_sync() ?
On Wed, 2012-01-11 at 09:59 -0700, Scott Long wrote:

  Where barriers _are_ needed is in interrupt handlers, and I can discuss that if you're interested.

  Scott

I'd be interested in hearing about that (and in general I'm loving the details coming out in your explanations -- thanks!).

--
Ian
Re: bus dma: a flag/quirk for page zero
on 11/01/2012 17:01 John Baldwin said the following:

  I think this is fine, but you should just always exclude page zero when allocating bounce pages. Bounce pages are assigned to zones that can be shared by multiple tags, so other tags that map to the same zone can alloc bounce pages that ohci will use (add_bounce_page() should probably take the bounce zone as an arg instead of a tag). I think it's not worth creating a separate zone just for ohci, but better to forbid page zero from all zones instead.

Thank you for the explanation. Actually, I think that on x86 we don't have to do anything special for any memory allocations that we do, including the bounce pages, as page zero is excluded from phys_avail and is not available for normal use. The only thing we have to do on x86 is to bounce page zero if it gets passed to us. (And that can happen only in very special situations, obviously. I am not sure if anything besides a system dump would do that.) And I would prefer to defer any changes to !x86 bus dma to the respective platform maintainers, obviously ;-)

  Also, please change this:

  [the diff quoted above]

  to just be one if.

Will do.

--
Andriy Gapon
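On x86 the change Andriy describes would presumably amount to one extra test in the address check; a rough sketch, where run_filter() stands in for the existing exclusion-filter helper in the x86 busdma code and BUS_DMA_NO_PAGEZERO is the proposed flag:

    /*
     * Rough sketch: a physical address must bounce either when the
     * tag's exclusion filter rejects it, or when the tag carries the
     * proposed BUS_DMA_NO_PAGEZERO flag and the address is in page 0.
     */
    static int
    must_bounce(bus_dma_tag_t dmat, bus_addr_t paddr)
    {
            if ((dmat->flags & BUS_DMA_NO_PAGEZERO) != 0 &&
                paddr < PAGE_SIZE)
                    return (1);
            return (run_filter(dmat, paddr));
    }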
Re: bus dma: a flag/quirk for page zero
on 11/01/2012 18:02 Scott Long said the following:

  [...] With that said, your idea of a flag is probably a reasonable change for now. Alternatively, the ability to specify multiple DMA exclusion ranges has come up in the past, and would be a more complete answer to your problem; just treating page0 as special might not be enough (and I know for a fact that this is true with old i960RX pci processors). That'll involve an API change, so it is something that I'd rather not happen on a whim.

Scott, thank you very much for the explanation and the insight. As I've written in some other email, on x86 page 0 is already an unavailable page, and the only way it can get into the dma layer is during a system dump. I am not sure about all other platforms; probably there is at least one where page 0 is just another normal page. Maybe excluding page 0 from both normal use and the dump is the simplest hammer for this nail...

The problem with trying to deal with page zero at the bus dma level is that it pessimizes the cases where previously no bouncing was expected, as page zero may pop up anywhere. That's why I decided to go with the flag instead of handling page 0 in all dma tags unconditionally as Matthew suggested. It feels like there could be a better solution than the flag, but I just can't come up with it. To be fair, I haven't come up with the flag either; it was John's idea.

--
Andriy Gapon
Re: FS hang when creating snapshots on a UFS SU+J setup
On Wed, Jan 11, 2012 at 10:30:39AM +0100, Yamagi Burmeister wrote:

  Hello,
  I've done some tests to verify that the problem only occurs when SU+J is used, but not SU without J. In fact, I did run the following two loops on different TTYs in parallel:

I also confirm this using a similar technique. The panic is only seen with SU+J and not with just SU. I did a similar cp -R /root /var/tmp ; rm -rf /var/tmp/root and the panic was triggered with dump -0L... I got the panic (again in less than a minute of issuing the dump command) -- I also got the "giving up on dirty" kind of message. I took a picture of the screen -- I am not sure if that helps!

http://picpaste.com/11012012519-LF0sWlpw.jpg

  Since it's much more likely that the problems described above arise when the filesystem is loaded (for example by the first loop) while taking the snapshot, this looks like some kind of race condition or something like that.

Earlier I have seen this happen with dump without any high load -- or at least very minimal -- again with /var, because some logs were being written or a cronjob was running writing to it. That didn't panic, as I indicated in my previous email -- it hogged the CPU and forced a power-cycle. Do let me know if I can try something further.

Thanks
Gautam
Re: Data corruption over NFS in -current
On 11.01.2012 at 17:57, Martin Cracauer wrote:

  I'm sorry for the unspecific bug report, but I thought a heads-up is better than none.

  $ uname -a
  FreeBSD wings.cons.org 10.0-CURRENT FreeBSD 10.0-CURRENT #2: Wed Dec 28 12:19:21 EST 2011 craca...@wings.cons.org:/usr/src/sys/amd64/compile/WINGS amd64

I'm sure Rick will want to know which NFS version, which client code (default new code, I'm assuming) and which mount options...

  [rest of the report quoted above]

--
Stefan Bethke s...@lassitu.de Fon +49 151 14070811
Re: Data corruption over NFS in -current
Stefan Bethke wrote on Wed, Jan 11, 2012 at 07:14:44PM +0100:

  I'm sure Rick will want to know which NFS version, which client code (default new code, I'm assuming) and which mount options...

It's all default, both in fstab and as reported by mount(8). This is a diskless PXE boot, but the mount affected (/usr) is not the root filesystem, so this should come in via fstab. BTW, my /usr/ports is another mount, so the corruption is cross-mount (garbage from /usr/ports entering /usr). Appending nfsstat output.

I am re-running things continuously to see how reproducible this is. This machine was recently updated from a -current almost a year old, so it's its first time with the new NFS client code.

Martin
--
%%% Martin Cracauer craca...@cons.org http://www.cons.org/cracauer/

Client Info:
Rpc Counts:
  Getattr Setattr Lookup Readlink Read Write Create Remove
  94392942513117 3637266 2577 40227237 2824593333832304567
  Rename Link Symlink Mkdir Rmdir Readdir RdirPlus Access
  32522 5121 4856 20363 13954179035 0 3534382
  Mknod Fsstat Fsinfo PathConf Commit
  5 21127240 3 2999521782
Rpc Info:
  TimedOut Invalid X Replies Retries Requests
  0 0 0 0 167678419
Cache Info:
  Attr Hits Misses  Lkup Hits Misses  BioR Hits Misses  BioW Hits Misses
  1933340911 73265447 1123380719 3636242 90975094450509 4917135 2824593
  BioRL Hits Misses  BioD Hits Misses  DirE Hits Misses  Accs Hits Misses
  54732346 2577599049142917352394 0 733726346 3534382
Server Info:
  Getattr Setattr Lookup Readlink Read Write Create Remove
  0 0 0 0 0 0 0 0
  Rename Link Symlink Mkdir Rmdir Readdir RdirPlus Access
  0 0 0 0 0 0 0 0
  Mknod Fsstat Fsinfo PathConf Commit
  0 0 0 0 0
Server Ret-Failed
  0
Server Faults
  0
Server Cache Stats:
  Inprog Idem Non-idem Misses
  0 0 0 0
Server Write Gathering:
  WriteOps WriteRPC Opsaved
  0 0 0
Re: memory barriers in bus_dmamap_sync() ?
On Jan 11, 2012, at 10:10 AM, Ian Lepore wrote:

  I'd be interested in hearing about that (and in general I'm loving the details coming out in your explanations -- thanks!).

Well, I unfortunately wasn't as clear as I should have been. Interrupt handlers need bus barriers, not cpu cache/instruction barriers. This is because the interrupt signal can arrive at the CPU before data and control words are finished being DMA'd up from the controller. Also, many controllers require an acknowledgement write to be performed before leaving the interrupt handler, so the driver needs to do a bus barrier to ensure that the write flushes. But these are two different topics, so let me start with the interrupt handler.

Legacy interrupts in PCI are carried on discrete pins and are level triggered. When the device wants to signal an interrupt, it asserts the pin. That assertion is seen at the IOAPIC on the host bridge and converted to an interrupt message, which is then sent immediately to the CPU's lAPIC. This all happens very, very quickly. Meanwhile, the interrupt condition could have been predicated on the device DMA'ing bytes up to host memory, and those DMA writes could have gotten stalled and buffered on the way up the PCI topology. The end result is often that the driver interrupt handler runs before those writes have hit host memory. To fix this, drivers do a read of a card register as the first step in the interrupt handler, even if the read is just a dummy and the result is thrown away. Thanks to PCI ordering, the read will ensure that any pending writes from the card have flushed all the way up, and everything will be coherent by the time the read completes.

MSI and MSI-X interrupts on modern PCI and PCIe fix this. These interrupts are sent as byte messages that are DMA'd to the host bridge. Since they are in-band data, they are subject to the same ordering rules as all other data on the bus, and thus ordering for them is implicit. When the MSI message reaches the host bridge, it's converted into an lAPIC message just like before. However, the driver doesn't need to do a flushing read, because it knows that the MSI message was the last write on the bus, and therefore everything prior to it has arrived and everything is coherent. Since reads are expensive in PCI, this saves a considerable amount of time in the driver. Unfortunately, it adds non-deterministic latency to the interrupt, since the MSI message is in-band and has no way to force priority flushing on a busy bus. So while MSI/MSI-X save some time in the interrupt handler, they actually make the overall latency situation potentially worse (thanks Intel!).

The acknowledgement write issue is a little more straightforward. If the card requires an acknowledgement write from the driver to know that the interrupt has been serviced (so that it'll then know to de-assert the interrupt line), that write has to be flushed to the hardware before the interrupt handler completes. Otherwise, the write could get stalled, the interrupt remain asserted, and the interrupt erroneously re-trigger on the host CPU. I've seen cases where this devolves into the card getting out of sync with the driver to the point that interrupts get missed. Also, this gets a little weird sometimes with buggy MSI hacks in both device and PCI bridge hardware.
Scott
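Both halves of the explanation show up in the classic shape of a legacy-INTx handler; a sketch with invented register offsets and softc fields, not modeled on any particular driver:

    #include <sys/types.h>
    #include <machine/bus.h>

    #define FOO_REG_ISR  0x00   /* made-up interrupt status register */
    #define FOO_REG_ACK  0x04   /* made-up write-to-acknowledge register */

    struct foo_softc {
            bus_space_tag_t     mem_bt;
            bus_space_handle_t  mem_bh;
    };

    static void
    foo_intr(void *arg)
    {
            struct foo_softc *sc = arg;
            uint32_t status;

            /*
             * Flushing read: by PCI ordering, any DMA writes the card
             * posted before raising the interrupt have reached host
             * memory once this read completes.
             */
            status = bus_space_read_4(sc->mem_bt, sc->mem_bh, FOO_REG_ISR);
            if (status == 0)
                    return;     /* shared line, not ours */

            /* ... process completed descriptors here ... */

            /*
             * Acknowledge, then read back so the ack cannot sit in a
             * bridge buffer and leave the level-triggered line asserted.
             */
            bus_space_write_4(sc->mem_bt, sc->mem_bh, FOO_REG_ACK, status);
            (void)bus_space_read_4(sc->mem_bt, sc->mem_bh, FOO_REG_ISR);
    }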
[RFT] Major snd_hda rewrite
Hi.

I would like to request testing of my work on further HDA sound driver improvement. The list of changes done this time:

- The huge old hdac driver was split into three independent pieces: the HDA controller driver (hdac), the HDA CODEC driver (hdacc) and the HDA audio function driver (hdaa). All drivers are completely independent and talk to each other only via NewBus interfaces. Using more NewBus bells and whistles allows one to properly see the HDA structure with standard system instruments, such as `devinfo -v`. The biggest driver file size is now 150K, instead of 240K before, and the code is much cleaner.

- Support for multichannel recording was added. While I've never seen it configured by default, the UAA specification tells that it is possible. Now, as the specification defines, the driver checks input associations for pins with sequence numbers 14 and 15, and if found (usually) -- works as before, mixing signals together. If it doesn't, it configures the input association as multichannel. I've found some CODECs doing strange things when configured for multichannel recording, but I've also found successfully working examples.

- The signal tracer was improved to look for cases where several DACs/ADCs in a CODEC can work with the same audio signal. If such a case is found, the driver registers an additional playback/record stream (channel) for the pcm device. Having more than one stream allows one to avoid vchans use and so avoid extra conversion to the pre-configured vchan rate and sample format. Not many CODECs allow this, especially on playback, but some do.

- A new controller streams reservation mechanism was implemented. It allows one to have more pcm devices than streams supported by the controller (usually 4 in each direction). Now the controller limits only the number of _simultaneously_ transferred audio streams, which is rarely reachable and properly reported if it happens.

- Codec pins and GPIO signals configuration was exported via a set of writable sysctls. Another sysctl, dev.hdaa.X.reconfig, allows one to trigger driver reconfiguration at run-time. The only requirement is that all pcm devices should be closed at the moment, as they will be destroyed and recreated. This should significantly simplify the process of fixing CODEC configuration. It should now even be possible to write a GUI to do it with a few mouse clicks.

- The driver now decodes pin location and connector type names. In some cases this allows hinting the user where on the system case the connectors related to the pcm device are located. The number of channels supported by a pcm device, reported now (if it is not 2), should also make the search easier.

- Added a fix for digital mic recording on some Asus laptops/netbooks.

This is how it may look now in dmesg:

  hdac0: <Intel 5 Series/3400 Series HDA Controller> mem 0xf7ef4000-0xf7ef7fff irq 22 at device 27.0 on pci0
  hdacc0: <VIA VT1708S_0 HDA CODEC> at cad 0 on hdac0
  hdaa0: <VIA VT1708S_0 HDA CODEC Audio Function Group> at nid 1 on hdacc0
  hdacc1: <Intel Ibex Peak HDA CODEC> at cad 3 on hdac0
  hdaa1: <Intel Ibex Peak HDA CODEC Audio Function Group> at nid 1 on hdacc1
  pcm0: <VIA VT1708S_0 HDA CODEC PCM (Analog)> at nid 28,29 and 26,30,27 on hdaa0
  pcm1: <VIA VT1708S_0 HDA CODEC PCM (Digital)> at nid 32 on hdaa0
  pcm2: <Intel Ibex Peak HDA CODEC PCM (DisplayPort 8ch)> at nid 6 on hdaa1

The patch can be found here:
http://people.freebsd.org/~mav/hda.rewrite.patch

The patch was generated for 10-CURRENT, but should apply to fresh 9-STABLE and 8-STABLE branches also. Special thanks to iXsystems, Inc. for supporting this work. Comments and test results are welcome!
--
Alexander Motin
Re: bus dma: a flag/quirk for page zero
on 11/01/2012 19:18 Andriy Gapon said the following:

  Actually, I think that on x86 we don't have to do anything special for any memory allocations that we do, including the bounce pages, as page zero is excluded from phys_avail and is not available for normal use.

After some additional thinking, there is probably no reason to take advantage of this fact. First, it would increase differences with other platforms. Second, it would add a hidden dependency. So it's better to be explicit here.

--
Andriy Gapon
Re: Very fresh (two days ago) 10-current becomes completely unresponsive under load
Hello, Chuck.
You wrote on 11 January 2012 at 3:47:08:

  If it were me, I would also try with the older 44BSD scheduler, just to see what happens.

It helps, both with mpd5.5 and mpd5.6. Now under network load the top lines in `top' are:

  PID USERNAME  PRI NICE   SIZE    RES STATE   TIME   WCPU COMMAND
   10 root      155 ki31     0K     8K RUN     2:19 60.74% idle
   11 root      -72    -     0K   112K WAIT    1:47 32.03% intr{swi1: netisr 0}

and the system is very responsive. ng_queue is not in the top 17 (one screen) lines of `top' any more; it looks usual to me.

I'll try to find the revision which breaks ULE + NetGraph by binary search, but it will take some time, as there are 590 revisions in head/sys between the previous version I used (which works OK with ULE) and the current version (which doesn't). So it should be ~9 iterations, and every iteration takes ~1 hour, and I can't spend 9 hours in a row on this task.

--
// Black Lion AKA Lev Serebryakov l...@freebsd.org
Re: ImageMagick: tests fail on freebsd 10
on 12/01/2012 00:22 Andriy Gapon said the following:

[snip]

  /usr/include/xlocale.h:160:3: error: unknown type name 'va_list'
  /usr/include/xlocale.h:162:3: error: unknown type name 'va_list'

[snip]

Back to the main problem. I am not sure where the difference between the base GCC and GCC 4.6 with respect to 'va_list' in xlocale.h comes from. Changing those two instances of 'va_list' to '__va_list' (which is used a lot throughout the header) seems to fix the problem with GCC 4.6. David, what do you think?

--
Andriy Gapon
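The shape of the proposed change, with a hypothetical prototype standing in for the two affected declarations (the real ones are at xlocale.h lines 160 and 162):

    /* Before: requires the caller to have included <stdarg.h> first,
     * which GCC 4.6 does not arrange implicitly here. */
    int	vfoo_l(locale_t, const char *, va_list);

    /* After: __va_list is the compiler-provided builtin alias that the
     * rest of the header already uses, so no extra include is needed. */
    int	vfoo_l(locale_t, const char *, __va_list);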
Re: couldn't log on to my -CURRENT machine after upgrade to latest PAM
On 11 Jan, Dag-Erling Smørgrav wrote:

  Could you please try this:

    # cd /usr/src/contrib
    # mv openpam openpam.orig
    # svn export svn://svn.des.no/openpam/trunk@526 openpam
    # cd ../lib/libpam
    # make depend
    # make all
    # make install

[snip]

  building shared library libpam.so.5
  make: don't know how to make openpam.3. Stop
  *** Error code 2

Other than that, it works great and doesn't get tripped up by my obsolete /etc/pam.conf. Thanks!
Re: Data corruption over NFS in -current
Martin Cracauer wrote:

  Stefan Bethke wrote on Wed, Jan 11, 2012 at 07:14:44PM +0100:

    I'm sure Rick will want to know which NFS version, which client code (default new code, I'm assuming) and which mount options...

  It's all default, both in fstab and as reported by mount(8).

I assume that by the above statement, you mean that you don't specify any mount options in your /etc/fstab entry except rw? (If this isn't correct, please post your /etc/fstab entries for the NFS mounts.)

- If I am correct, in that you just specify rw, the main difference between the old and new NFS client will be the rsize/wsize used. The new NFS client will use MAXBSIZE (64Kb), decreased to whatever the server says is the largest it can handle. This should be fine, unless the server says it can handle >= 64Kb but actually only works correctly for 32Kb (which is what the old NFS client will default to, I think?).

A few things to try/check:

- Look locally on the server to see if the file is corrupted there.

- Try the old NFS client. (Set the fs type to oldnfs instead of nfs on the lines in your /etc/fstab.) If switching to the old client helps, it might be a bug in the way the new client generates the create verifier. I just looked at the code and I'm not certain the code in the new client would work correctly for amd64. (I only have i386 to test with.) I can easily generate a patch that changes the new client to do this the same way as the old client, but there is no point unless the old client doesn't have the problem. -- Exclusive create problems might explain the incorrect ownership, since the client first does a create that will fill in user/group in whatever default way the Linux server chooses, and then does a Setattr RPC to change them to the correct values. If the Setattr RPC fails, then the file exists owned by whatever the server chose. (I don't know if Linux servers use the gid of the directory or the gid of the requestor or ???)

- If you have a non-Linux NFS server, try running against that to see if it is a Linux-server-specific problem. (Since I haven't seen any other reports like this, I suspect it might be an interoperability problem related to the Linux server.)

Also, if you can reproduce the problem fairly easily, capture a packet trace via

  # tcpdump -s 0 -w xxx host server

running on the client (or similar). Then email me xxx as an attachment and I can look at it in wireshark. (If you choose to look at it in wireshark, I would suggest you look for Create RPCs to see if they are Exclusive Creates, plus try and see where the data for the corrupt file is written.) Even if the capture is pretty large, it should be easy to find the interesting part, so long as you know the name of the corrupt file and search for that.

  This is a diskless PXE boot, but the mount affected (/usr) is not the root filesystem, so this should come in via fstab. BTW, my /usr/ports is another mount, so the corruption is cross-mount (garbage from /usr/ports entering /usr). Appending nfsstat output.

nfsstat output is pretty useless for this kind of situation. I did find it interesting that you do so many Fsstat RPCs, but that shouldn't be a problem; it's just weird to see that.
rick

  I am re-running things continuously to see how reproducible this is. This machine was recently updated from a -current almost a year old, so it's its first time with the new NFS client code.

  Martin

  [rest of the original report quoted above]
Re: Data corruption over NFS in -current
Rick Macklem wrote on Wed, Jan 11, 2012 at 08:42:25PM -0500:

  [...]

  I assume that by the above statement, you mean that you don't specify any mount options in your /etc/fstab entry except rw? (If this isn't correct, please post your /etc/fstab entries for the NFS mounts.)

172.18.30.2:/home/diskless/freebsd-current-usr  /usr        nfs  rw  0  0
172.18.30.2:/home/diskless/usr-ports            /usr/ports  nfs  rw  0  0

  - If I am correct, in that you just specify rw, the main difference between the old and new NFS client will be the rsize/wsize used. The new NFS client will use MAXBSIZE (64Kb), decreased to whatever the server says is the largest it can handle. This should be fine, unless the server says it can handle >= 64Kb but actually only works correctly for 32Kb (which is what the old NFS client will default to, I think?).

I'll try 32 KB.

  A few things to try/check:
  - Look locally on the server to see if the file is corrupted there.

Yes, it has the corrupted version of the file, and in a new run I had another file changed to root ownership, and that is the same from the server and client standpoint. The good news is that this seems fairly reproducible; the root ownership is back. This time I stopped the script when the ownership changed, so I don't know whether it would have gone forward with corrupting the file afterwards.

  - Try the old NFS client. (Set the fs type to oldnfs instead of nfs on the lines in your /etc/fstab.) If switching to the old client helps, it might be a bug in the way the new client generates the create verifier. [...]
  - If you have a non-Linux NFS server, try running against that to see if it is a Linux-server-specific problem. (Since I haven't seen any other reports like this, I suspect it might be an interoperability problem related to the Linux server.)

I should mention that I also updated the server to Linux 3.1.5 two weeks ago. I'm not sure I put heavy load on it since then. I will have a Linux NFS client do the same thing and try the FreeBSD things you mention.

  Also, if you can reproduce the problem fairly easily, capture a packet trace via

    # tcpdump -s 0 -w xxx host server

  running on the client (or similar). Then email me xxx as an attachment and I can look at it in wireshark.
(If you choose to look at it in wireshark, I would suggest you look for Create RPCs to see if they are Exclusive Creates, plus try and see where the data for the corrupt file is written.) Even if the capture is pretty large, it should be easy to find the interesting part, so long as you know the name of the corrupt file and search for that.

That's probably not practical, since we are talking about hammering the NFS server with several CPU hours' worth of parallel activity in a shell script, but I'll do my best :-)

Martin

This is a diskless PXE boot, but the affected mount (/usr) is not the root filesystem, so it should come in via fstab. BTW, my /usr/ports is another mount, so the corruption is cross-mount (garbage from /usr/ports entering /usr). Appending nfsstat output.

nfsstat output is pretty useless for this kind of situation. I did find it interesting that you do so many Fsstat RPCs, but that shouldn't be a problem; it's just weird to see that.

rick

I am re-running things continuously to see how reproducible this is. This machine was recently updated from a -current almost a year old, so this is its first time with the new NFS client code.

Martin
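For anyone wanting to try both of Rick's suggestions, the fstab lines would look roughly like this (a sketch only; paths are taken from the report above, and rsize/wsize of 32768 mimics the old client's default transfer size):

  # new client, forced to 32Kb transfers
  172.18.30.2:/home/diskless/freebsd-current-usr /usr nfs rw,rsize=32768,wsize=32768 0 0
  # old client instead
  172.18.30.2:/home/diskless/freebsd-current-usr /usr oldnfs rw 0 0

Either variant narrows the problem: if 32Kb transfers on the new client make the corruption go away, the rsize/wsize negotiation is suspect; if only oldnfs helps, the new client code is.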
Re: CAM Target Layer available
On Wed, Jan 04, 2012 at 21:53:11 -0700, Kenneth D. Merry wrote:

The CAM Target Layer (CTL) is now available for testing. I am planning to commit it to head next week, barring any major objections.

CTL is a disk and processor device emulation subsystem originally written for Copan Systems under Linux starting in 2003. It has been shipping in Copan (now SGI) products since 2005. It was ported to FreeBSD in 2008, and thanks to an agreement between SGI (who acquired Copan's assets in 2010) and Spectra Logic in 2010, CTL is available under a BSD-style license. The intent behind the agreement was that Spectra would work to get CTL into the FreeBSD tree.

The patches are against FreeBSD/head as of SVN change 229516 and are located here:

http://people.freebsd.org/~ken/ctl/ctl_diffs.20120104.4.txt.gz

The code is not perfect (few pieces of software are), but is in good shape from a functional standpoint. My intent is to get it out there for other folks to use, and perhaps help with improvements.

There are a few other CAM changes included with these diffs, some of which will be committed separately from CTL, some concurrently. This is a quick summary:

- Fix a panic in the da(4) driver when a drive disappears on boot.
- Fix locking in the CAM EDT traversal code.
- Add an optional sysctl/tunable (disabled by default) to suppress duplicate devices. This most frequently shows up with dual ported SAS drives.
- Add some very basic error injection into the da(4) driver.
- Bump the length field in the SCSI INQUIRY CDB to 2 bytes to line up with more recent SCSI specs.

CTL Features:
============

- Disk and processor device emulation.
- Tagged queueing
- SCSI task attribute support (ordered, head of queue, simple tags)
- SCSI implicit command ordering support. (e.g. if a read follows a mode select, the read will be blocked until the mode select completes.)
- Full task management support (abort, LUN reset, target reset, etc.)
- Support for multiple ports
- Support for multiple simultaneous initiators
- Support for multiple simultaneous backing stores
- Persistent reservation support
- Mode sense/select support
- Error injection support
- High Availability support (1)
- All I/O handled in-kernel, no userland context switch overhead.

(1) HA Support is just an API stub, and needs much more to be fully functional. See the to-do list below.

Configuring and Running CTL:
===========================

- After applying the CTL patchset to your tree, build world and install it on your target system.

- Add 'device ctl' to your kernel configuration file.

- If you're running with an 8Gb or 4Gb Qlogic FC board, add 'options ISP_TARGET_MODE' to your kernel config file. 'device ispfw' or loading the ispfw module is also recommended.

- Rebuild and install a new kernel.

- Reboot with the new kernel.

- To add a LUN with the RAM disk backend:

  ctladm create -b ramdisk -s 10485760
  ctladm port -o on

- You should now see the CTL disk LUN through camcontrol devlist:

  scbus6 on ctl2cam0 bus 0:
  <FREEBSD CTLDISK 0001>    at scbus6 target 1 lun 0 (da24,pass32)
  <>                        at scbus6 target -1 lun -1 ()

This is visible through the CTL CAM SIM. This allows using CTL without any physical hardware. You should be able to issue any normal SCSI commands to the device via the pass(4)/da(4) devices (a quick smoke test follows the devlist example below).

If any target-capable HBAs are in the system (e.g. isp(4)), and have target mode enabled, you should now also be able to see the CTL LUNs via that target interface.

Note that all CTL LUNs are presented to all frontends. There is no LUN masking, or separate, per-port configuration.
- Note that the ramdisk backend is a "fake" ramdisk. That is, it is backed by a small amount of RAM that is used for all I/O requests. This is useful for performance testing, but not for any data integrity tests.

- To add a LUN with the block/file backend:

  truncate -s +1T myfile
  ctladm create -b block -o file=myfile
  ctladm port -o on

- You can also see a list of LUNs and their backends like this:

  # ctladm devlist
  LUN Backend       Size (Blocks)  BS Serial Number    Device ID
    0 block            2147483648 512 MYSERIAL   0     MYDEVID   0
    1 block            2147483648 512 MYSERIAL   1     MYDEVID   1
    2 block            2147483648 512 MYSERIAL   2     MYDEVID   2
    3 block            2147483648 512 MYSERIAL   3     MYDEVID   3
    4 block            2147483648 512 MYSERIAL   4     MYDEVID   4
    5 block            2147483648 512 MYSERIAL   5     MYDEVID   5
    6 block            2147483648 512 MYSERIAL   6     MYDEVID   6
    7 block            2147483648 512 MYSERIAL   7     MYDEVID   7
    8 block            2147483648 512 MYSERIAL   8
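A quick smoke test of a fresh LUN from the initiator side could look like this (a sketch only; da24 is the device name from the camcontrol output above and will differ per system, and the dd overwrites data on the LUN):

  # confirm the emulated disk answers SCSI INQUIRY
  camcontrol inquiry da24
  # push some I/O through the block device and read it back
  dd if=/dev/zero of=/dev/da24 bs=64k count=16
  dd if=/dev/da24 of=/dev/null bs=64k count=16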
Re: Data corruption over NFS in -current
In the last episode (Jan 11), Martin Cracauer said:

Rick Macklem wrote on Wed, Jan 11, 2012 at 08:42:25PM -0500: Also, if you can reproduce the problem fairly easily, capture a packet trace via

  # tcpdump -s 0 -w xxx host server

running on the client (or similar). Then email me xxx as an attachment and I can look at it in wireshark. (If you choose to look at it in wireshark, I would suggest you look for Create RPCs to see if they are Exclusive Creates, plus try and see where the data for the corrupt file is written.) Even if the capture is pretty large, it should be easy to find the interesting part, so long as you know the name of the corrupt file and search for that.

That's probably not practical, we are talking about hammering the NFS server with several CPU hours worth of parallel activity in a shellscript but I'll do my best :-)

The tcpdump options -C and -W can help here. For example, -C 1000 -W 10 will keep the most recent 10 GB of traffic by circularly writing to ten 1-GB capture files. All you need to do is kill the tcpdump when you discover the corruption, and work backwards through the logs until you find your file.

--
Dan Nelson
dnel...@allantgroup.com

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
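Concretely, a rotating capture along those lines might look like this (the interface name, server address, and output path are placeholders for your own setup):

  # keep a ring of ten ~1GB files; kill tcpdump once the corruption appears
  tcpdump -i em0 -s 0 -C 1000 -W 10 -w /var/tmp/nfs-cap host 172.18.30.2

tcpdump numbers the rotated files, so pick the most recently modified one and work backwards from there.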
Can you use a USB3.0 hub?
Hi,

Can you use a USB 3.0 hub? I tried a USB 3.0 hub (BUFFALO BSH4A04U3BK), with 8-stable and a PCI-E card (BUFFALO IFC-PCIE2U3). The hub is sold only on the Japanese market. The card uses NEC's 720200 chip:
http://www.buffalotech.com/products/accessories/interface-card-adapters/usb-30-pci-express-interface-card/

The kernel could not recognize a USB 3.0 HDD connected to this hub, as the following log shows. But the kernel could recognize a USB 2.0 HDD connected to this hub.

Regards,
Kohji Okuno

-- log
xhci0: <XHCI (generic) USB 3.0 controller> mem 0xf7ffe000-0xf7ff irq 28 at device 0.0 on pci1
xhci0: [ITHREAD]
xhci0: 32 byte context size.
usbus0 on xhci0
...
ugen0.2: <VIA Labs, Inc.> at usbus0
uhub11: <VIA Labs, Inc. 4-Port USB 3.0 Hub, class 9/0, rev 3.00/3.74, addr 1> on usbus0
uhub11: 4 ports with 4 removable, self powered
usb_alloc_device: set address 3 failed (USB_ERR_IOERROR, ignored)
usbd_req_re_enumerate: addr=3, set address failed! (USB_ERR_IOERROR, ignored)
usbd_req_re_enumerate: addr=3, set address failed! (USB_ERR_IOERROR, ignored)
ugen0.3: <Unknown> at usbus0 (disconnected)
uhub_reattach_port: could not allocate new device
uhub_reattach_port: device problem (USB_ERR_STALLED), disabling port 4
ugen0.3: <vendor 0x2109> at usbus0
uhub12: <vendor 0x2109 USB2.0 Hub, class 9/0, rev 2.00/2.74, addr 2> on usbus0
uhub12: 4 ports with 4 removable, self powered
usb_alloc_device: set address 4 failed (USB_ERR_IOERROR, ignored)
usbd_req_re_enumerate: addr=4, set address failed! (USB_ERR_IOERROR, ignored)
usbd_req_re_enumerate: addr=4, set address failed! (USB_ERR_IOERROR, ignored)
ugen0.4: <Unknown> at usbus0 (disconnected)
uhub_reattach_port: could not allocate new device
uhub_reattach_port: device problem (USB_ERR_STALLED), disabling port 4
usb_alloc_device: set address 4 failed (USB_ERR_IOERROR, ignored)
usbd_req_re_enumerate: addr=4, set address failed! (USB_ERR_IOERROR, ignored)
usbd_req_re_enumerate: addr=4, set address failed! (USB_ERR_IOERROR, ignored)
ugen0.4: <Unknown> at usbus0 (disconnected)
uhub_reattach_port: could not allocate new device
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
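For anyone digging into a report like this, a first round of data collection might look as follows (a sketch; the 0.2 address matches the ugen0.2 hub in the log above, and the debug sysctl assumes a kernel built with 'options USB_DEBUG'):

  # enumerate all attached USB devices with their addresses and speeds
  usbconfig
  # dump the hub's device descriptor to verify how it enumerated
  usbconfig -d 0.2 dump_device_desc
  # raise xhci verbosity to log the failing SET_ADDRESS exchanges
  sysctl hw.usb.xhci.debug=15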
Build Option Survey results
Hey,

after two years I had the opportunity to run the build option survey, initially done by phk, again. It feels like the number of options has grown quite a bit. I have not even looked at the results yet, but here they are, fresh off the machine:

http://people.freebsd.org/~bz/build_option_survey_20120106/

Special thanks go to np, sbruno and bhaga for bringing worm back to life.

/bz

PS: the last run, from 2010, can still be found here:
http://people.freebsd.org/~bz/build_option_survey_20100104/

--
Bjoern A. Zeeb
You have to have visions! It does not matter how good you are. It matters what good you do!
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
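For readers who have not seen the survey before: each data point is essentially a world built with a single src.conf(5) knob flipped, then compared against a baseline build. A single iteration looks roughly like this (WITHOUT_ZFS is just one example of the many knobs):

  # build a world with exactly one option disabled
  echo 'WITHOUT_ZFS=yes' > /etc/src.conf
  cd /usr/src && make buildworld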
kernel config files outside of sys/${ARCH}/conf ?
usr.sbin/config assumes that the kernel config file lives in ${src_base}/sys/${arch}/conf, which means that building a custom kernel requires RW access to that directory. Any idea on how we can enable config to work in a generic directory?

I scanned the source code in usr.sbin/config and found that it uses hardwired paths -- specifically, it looks for the kernel source tree in ../.. and has multiple hardwired paths such as ../../conf/. There is also a somewhat undocumented mechanism that reads a file called DEFAULTS and uses it to extend the configuration you pass.

Any objections to the addition of a -s option to config(8) to specify the location of the source tree?

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
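Under the proposal, building from a scratch directory might look like this (note: -s does not exist yet; it is the suggested interface, and the paths are illustrative):

  # copy a config somewhere writable and point config at the source tree
  cp /usr/src/sys/amd64/conf/GENERIC ~/kernels/MYKERNEL
  cd ~/kernels
  config -s /usr/src/sys MYKERNEL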