Re: CONFIG_PPC_VAS depends on 64k pages...?
I don't know anything about VAS page size requirements in the kernel. I checked the user compression library and saw that we do a sysconf to get the page size, so the library should be immune to the page size by design. But it wouldn't surprise me if a 64KB constant is inadvertently hardcoded somewhere else in the library. Giving a heads up to Tulio and Raphael, who are the owners of the github repo. https://github.com/libnxz/power-gzip/blob/master/lib/nx_zlib.c#L922

If we got this wrong in the library, it might manifest itself as an error message of the sort "excessive page faults". The library must touch pages ahead of time to make them present in memory; an occasional page fault is acceptable, and the library will retry.

Bulent

From: "Sukadev Bhattiprolu"
To: "Christophe Leroy"
Cc: "Will Springer" , linuxppc-dev@lists.ozlabs.org, dan...@octaforge.org, Bulent Abali/Watson/IBM@IBM, ha...@linux.ibm.com
Date: 12/01/2020 12:53 AM
Subject: Re: CONFIG_PPC_VAS depends on 64k pages...?

Christophe Leroy [christophe.le...@csgroup.eu] wrote:
> Hi,
>
> Le 19/11/2020 à 11:58, Will Springer a écrit :
> > I learned about the POWER9 gzip accelerator a few months ago when the
> > support hit upstream Linux 5.8. However, for some reason the Kconfig
> > dictates that VAS depends on a 64k page size, which is problematic as I
> > run Void Linux, which uses a 4k-page kernel.
> >
> > Some early poking by others indicated there wasn't an obvious page size
> > dependency in the code, and suggested I try modifying the config to switch
> > it on. I did so, but was stopped by a minor complaint of an "unexpected DT
> > configuration" by the VAS code. I wasn't equipped to figure out exactly
> > what this meant, even after finding the offending condition, so after
> > writing a very drawn-out forum post asking for help, I dropped the subject.
> >
> > Fast forward to today, when I was reminded of the whole thing again, and
> > decided to debug a bit further.
> > Apparently the VAS platform device (derived from the DT node) has 5
> > resources on my 4k kernel, instead of 4 (which evidently works for others
> > who have had success on 64k kernels). I have no idea what this means in
> > practice (I don't know how to introspect it), but after making a tiny
> > patch[1], everything came up smoothly and I was doing blazing-fast gzip
> > (de)compression in no time.
> >
> > Everything seems to work fine on 4k pages. So, what's up? Are there
> > pitfalls lurking around that I've yet to stumble over? More reasonably,
> > I'm curious as to why the feature supposedly depends on 64k pages, or if
> > there's anything else I should be concerned about.

Will,

The reason I put in that config check is that we were only able to test 64K pages at that point. It is interesting that it is working for you.

The following code in skiboot, https://github.com/open-power/skiboot/blob/master/hw/vas.c, should restrict it to 64K pages. IIRC there is also a corresponding setting in some NX registers that would have to be configured to allow 4K pages.

static int init_north_ctl(struct proc_chip *chip)
{
	uint64_t val = 0ULL;

	val = SETFIELD(VAS_64K_MODE_MASK, val, true);
	val = SETFIELD(VAS_ACCEPT_PASTE_MASK, val, true);
	val = SETFIELD(VAS_ENABLE_WC_MMIO_BAR, val, true);
	val = SETFIELD(VAS_ENABLE_UWC_MMIO_BAR, val, true);
	val = SETFIELD(VAS_ENABLE_RMA_MMIO_BAR, val, true);

	return vas_scom_write(chip, VAS_MISC_N_CTL, val);
}

I am copying Bulent Abali and Haren Myneni, who have been working with VAS/NX, for their thoughts/experience.

> Maybe ask Sukadev who did the implementation and is maintaining it ?

> > I do have to say I'm quite satisfied with the results of the NX
> > accelerator, though. Being able to shuffle data to a RaptorCS box over
> > gigE and get compressed data back faster than most software gzip could
> > ever hope to achieve is no small feat, let alone the instantaneous
> > results locally.
> > :)
> >
> > Cheers,
> > Will Springer [she/her]
> >
> > [1]: https://github.com/Skirmisher/void-packages/blob/vas-4k-pages/srcpkgs/linux5.9/patches/ppc-vas-on-4k.patch

> Christophe
RE: [PATCH 1/2] powerpc/vas: Report proper error for address translation failure
Copied verbatim from the P9 DD2 Nest Accelerators Workbook, Version 3.2, Table 4-36 "CSB Non-zero CC Reported Error Types":

CC=5, Error Type: Translation, Comment: Unused, defined by RFC02130 (footnote: "DMA controller uses this CC internally in translation fault handling. Do not reuse for other purposes.")

CC=240 through 251, reserved for future firmware use, Comment: Error codes 240 - 255 (0xF0 - 0xFF) are reserved for firmware use and are not signalled by the hardware. These CCs are written in the CSB by the hypervisor to alert the partition to error conditions detected by the hypervisor. These codes have been used in past processors for this purpose and ought not be relocated.

From: Haren Myneni/Beaverton/IBM
To: Michael Ellerman
Cc: ab...@us.ibm.com, Haren Myneni , linuxppc-dev@lists.ozlabs.org, "Linuxppc-dev", rzin...@linux.ibm.com, tuli...@br.ibm.com, Haren Myneni/Beaverton/IBM@IBMUS
Date: 07/09/2020 04:01 PM
Subject: Re: [EXTERNAL] Re: [PATCH 1/2] powerpc/vas: Report proper error for address translation failure

"Linuxppc-dev" wrote on 07/09/2020 04:22:10 AM:

> From: Michael Ellerman
> To: Haren Myneni
> Cc: tuli...@br.ibm.com, ab...@us.ibm.com, linuxppc-d...@lists.ozlabs.org, rzin...@linux.ibm.com
> Date: 07/09/2020 04:21 AM
> Subject: [EXTERNAL] Re: [PATCH 1/2] powerpc/vas: Report proper error for address translation failure
> Sent by: "Linuxppc-dev"
>
> Haren Myneni writes:
> > DMA controller uses CC=5 internally for translation fault handling. So
> > OS should be using CC=250 and should report this error to the user space
> > when NX encounters address translation failure on the request buffer.
>
> That doesn't really explain *why* the OS must use CC=250.
>
> Is it documented somewhere that 5 is for hardware use, and 250 is for
> software?

Yes, it is mentioned in Table 4-36, CSB Non-zero CC Reported Error Types (P9 NX DD2 work book). Also the footnote for CC=5 says "DMA controller uses this CC internally in translation fault handling.
Do not reuse for other purposes". I will add the documentation reference to the CC=250 comment.

> > This patch defines CSB_CC_ADDRESS_TRANSLATION(250) and updates
> > CSB.CC with this proper error code for user space.
>
> We still have:
>
> #define CSB_CC_TRANSLATION (5)
>
> And it's very unclear where one or the other should be used.
>
> Can one or the other get a name that makes the distinction clear.

CSB_CC_TRANSLATION was added in the 842 driver (nx-common-powernv.c) when NX was introduced (P7+). NX will not see faults on kernel requests, neither CC=250 nor CC=5.

Table 4-36 says, for CC=5: "Translation", and for CC=250: "Address Translation Fault".

So I could say CRB_CC_ADDRESS_TRANSLATION_FAULT or CRB_CC_TRANSLATION_FAULT. This code path (also CRBs) should be generic, so it should not use something like CRB_CC_NX_FAULT.

Thanks
Haren

> cheers

> > diff --git a/Documentation/powerpc/vas-api.rst b/Documentation/powerpc/vas-api.rst
> > index 1217c2f..78627cc 100644
> > --- a/Documentation/powerpc/vas-api.rst
> > +++ b/Documentation/powerpc/vas-api.rst
> > @@ -213,7 +213,7 @@ request buffers are not in memory.
> > The operating system handles the fault by updating CSB with the following data:
> >
> >	csb.flags = CSB_V;
> > -	csb.cc = CSB_CC_TRANSLATION;
> > +	csb.cc = CSB_CC_ADDRESS_TRANSLATION;
> >	csb.ce = CSB_CE_TERMINATION;
> >	csb.address = fault_address;
> >
> > diff --git a/arch/powerpc/include/asm/icswx.h b/arch/powerpc/include/asm/icswx.h
> > index 965b1f3..b1c9a57 100644
> > --- a/arch/powerpc/include/asm/icswx.h
> > +++ b/arch/powerpc/include/asm/icswx.h
> > @@ -77,6 +77,8 @@ struct coprocessor_completion_block {
> > #define CSB_CC_CHAIN		(37)
> > #define CSB_CC_SEQUENCE	(38)
> > #define CSB_CC_HW		(39)
> > +/* User space address translation failure */
> > +#define CSB_CC_ADDRESS_TRANSLATION	(250)
> >
> > #define CSB_SIZE		(0x10)
> > #define CSB_ALIGN		CSB_SIZE
> >
> > diff --git a/arch/powerpc/platforms/powernv/vas-fault.c b/arch/powerpc/platforms/powernv/vas-fault.c
> > index 266a6ca..33e89d4 100644
> > --- a/arch/powerpc/platforms/powernv/vas-fault.c
> > +++ b/arch/powerpc/platforms/powernv/vas-fault.c
> > @@ -79,7 +79,7 @@ static void update_csb(struct vas_window *window,
> >	csb_addr = (void __user *)be64_to_cpu(crb->csb_addr);
> >
> >	memset(&csb, 0, sizeof(csb));
> > -	csb.cc = CSB_CC_TRANSLATION;
> > +	csb.cc = CSB_CC_ADDRESS_TRANSLATION;
> >	csb.ce = CSB_CE_TERMINATION;
> >	csb.cs = 0;
> >	csb.count = 0;
> > --
> > 1.8.3.1
>
bug in __tcp_inherit_port ?
I get an occasional panic in __tcp_inherit_port(sk, child). I believe the reason is that tb = sk->prev is NULL. sk->prev is set to NULL in only a few places, including __tcp_put_port(sk). Perhaps there is a serialization problem between __tcp_inherit_port and __tcp_put_port? One possibility is that sk->num != child->num, so the spin_locks in the two routines do not serialize. This code is out of my league, so I couldn't debug any further.

Ingo, this is the same problem that I posted to linux-kernel a couple of weeks ago for tcp_v4_syn_recv_sock. The problem occurs when running TUX-B6 on 2.4.5-ac4 with SPECweb99, dual PIII, and one acenic adapter. It is difficult to trigger but has occurred a few times so far. The oops and objdump output follow.

/bulent

=

/* Caller must disable local BH processing. */
static __inline__ void __tcp_inherit_port(struct sock *sk, struct sock *child)
{
	struct tcp_bind_hashbucket *head = &tcp_bhash[tcp_bhashfn(child->num)];
	struct tcp_bind_bucket *tb;

	spin_lock(&head->lock);
	tb = (struct tcp_bind_bucket *)sk->prev;	/* <** line 149 */
	if ((child->bind_next = tb->owners) != NULL)	/* <** panic here */
		tb->owners->bind_pprev = &child->bind_next;
	tb->owners = child;
	child->bind_pprev = &tb->owners;
	child->prev = (struct sock *) tb;
	spin_unlock(&head->lock);
}

__inline__ void __tcp_put_port(struct sock *sk)
{
	struct tcp_bind_hashbucket *head = &tcp_bhash[tcp_bhashfn(sk->num)];
	struct tcp_bind_bucket *tb;

	spin_lock(&head->lock);
	tb = (struct tcp_bind_bucket *) sk->prev;
	if (sk->bind_next)
		sk->bind_next->bind_pprev = sk->bind_pprev;
	*(sk->bind_pprev) = sk->bind_next;
	sk->prev = NULL;
	sk->num = 0;
	if (tb->owners == NULL) {
		if (tb->next)
			tb->next->pprev = tb->pprev;
		*(tb->pprev) = tb->next;
		kmem_cache_free(tcp_bucket_cachep, tb);
	}
	spin_unlock(&head->lock);
}

oops output

NULL pointer dereference at virtual address 0008
printing eip: c0247a34
*pde =
Oops:
CPU: 0
EIP: 0010:[c0247a34]  EFLAGS: 00010246
eax:       ebx: f74224c0  ecx:       edx: f74224c0
esi: f750  edi: f71e6cf0  ebp: f74225b4  esp: c0313c00
ds: 0018  es: 0018  ss: 0018
Process swapper (pid: 0, stackpage=c0313000)
Stack: f2a55ec4 f2d6bf64 459d1162 f74224c0 c024aff9 f74224c0 f2a55ec4 f2d6bf64
       459d1163 459d1162 459d1163 1000 f74225b4 f740f58c f7760c00 c022a3c5
       f740f58c c0231e76 e11d2a9c f7760cd8 f740083c
Call Trace: [c024aff9] [c022a3c5] [c0231e76] [c01bff2c] [f8805514] [c02ac6b0]
       [f8805000] [c0231e76] [c022a3c5] [c0231e76] [c0224962] [c0231e76]
       [c0220f5c] [c02210a8] [c02321e4] [c02444ff] [c0220f5c] [c02210a8]
       [c023d85c] [c023db31] [c0220f5c] [c02210a8] [c0247b1c] [c0247e4b]
       [c024833f] [c022f5b8] [c022f955] [c01bf55a] [c02251bb] [c0224eb2]
       [c0108d7e] [c0120e5b] [c0109189] [c0105220] [c0105220] [c0107544]
       [c0105220] [c0105220] [c010524d] [c01052d2] [c0105000] [c01001cf]
Code: 8b 41 08 89 43 18 85 c0 74 09 8b 51 08 8d 43 18 89 42 1c 89
<0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing

ksymoops output

Code;  c0247a34 <tcp_v4_syn_recv_sock+284/330>  <_EIP>:
Code;  c0247a34 <tcp_v4_syn_recv_sock+284/330>   0:  8b 41 08  mov 0x8(%ecx),%eax   // panics in child->bind_next = tb->owners
Code;  c0247a37 <tcp_v4_syn_recv_sock+287/330>   3:  89 43 18  mov %eax,0x18(%ebx)
Code;  c0247a3a <tcp_v4_syn_recv_sock+28a/330>   6:  85 c0     test %eax,%eax
Code;  c0247a3c <tcp_v4_syn_recv_sock+28c/330>   8:  74 09     je 13 <_EIP+0x13> c0247a47 <tcp_v4_syn_recv_sock+297/330>
Code;  c0247a3e <tcp_v4_syn_recv_sock+28e/330>   a:  8b 51 08  mov 0x8(%ecx),%edx
Code;  c0247a41 <tcp_v4_syn_recv_sock+291/330>   d:  8d 43 18  lea 0x18(%ebx),%eax
Code;  c0247a44 <tcp_v4_syn_recv_sock+294/330>  10:  89 42 1c  mov %eax,0x1c(%edx)
Code;  c0247a47 <tcp_v4_syn_recv_sock+297/330>  13:  89 00     mov %eax,(%eax)

objdump -S

/usr/src/linux-2.4.5-ac4/include/asm/spinlock.h:104
c0247a21:  f0 fe 0e           lock decb (%esi)
c0247a24:  0f 88 85 79 03 00  js c027f3af <stext_lock+0x5c6f>
/usr/src/linux-2.4.5-ac4/net/ipv4/tcp_ipv4.c:149
c0247a2a:  8b 54 24 14        mov 0x14(%esp,1),%edx
c0247a2e:  8b 8a a4 00 00 00  mov 0xa4(%edx),%ecx    // tb = sk->prev
/usr/src/linux-2.4.5-ac4/net/ipv4/tcp_ipv4.c:150
c0247a34:  8b 41 08           mov 0x8(%ecx),%eax     // child->bind_next = tb->owners
c0247a37:  89 43 18           mov %eax,0x18(%ebx)
c0247a3a:  85 c0              test %eax,%eax
c0247a3c:  74 09              je c0247a47 <tcp_v4_syn_recv_sock+0x297>
/usr/src/linux-2.4.5-ac4/net/ipv4/tcp_ipv4.c:151
c0247a3e:  8b 51 08           mov 0x8(%ecx),%edx
c0247a41:  8d 43 18           lea 0x18(%ebx),%eax
c0247a44:  89 42 1c           mov %eax,0x1c(%edx)
/usr/src/linux-2.4.5-ac4/net/ipv4/tcp_ipv4.c:152
c0247a47:  89 59 08           mov %ebx,0x8(%ecx)
/usr/src/linux-2.4.5-ac4/net/ipv4/tcp_ipv4.c:153
c0247a4a:  8d 41 08           lea 0x8(%ecx),%eax
c0247a4d:  89 43 1c           mov %eax,0x1c(%ebx)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: all processes waiting in TASK_UNINTERRUPTIBLE state
>> I am running in to a problem, seemingly a deadlock situation, where almost
>> all the processes end up in the TASK_UNINTERRUPTIBLE state. All the
>
> could you try to reproduce with this patch applied on top of
> 2.4.6pre5aa1 or 2.4.6pre5 vanilla?

Andrea, I would like to try your patch, but so far I can trigger the bug only when running TUX 2.0-B6, which runs on 2.4.5-ac4.

/bulent
Re: all processes waiting in TASK_UNINTERRUPTIBLE state
> [EMAIL PROTECTED] said:
>> I am running in to a problem, seemingly a deadlock situation, where
>> almost all the processes end up in the TASK_UNINTERRUPTIBLE state.
>> All the processes eventually stop responding, including the login shell:
>> no screen updates, keyboard, etc. Ping works and the sysrq key works.
>> I traced the tasks through the sysrq-t key. The processors are in the
>> idle state. The tasks all seem to get stuck in __wait_on_page or
>> __lock_page.
>
> I've seen this under UML, Rik van Riel has seen it on a physical box, and we
> suspect that they're the same problem (i.e. mine isn't a UML-specific bug).

Can you give more details? Was there an aic7xxx scsi driver on the box? run_task_queue(&tq_disk) should eventually unlock those pages, but they remain locked. I am trying to narrow it down to the fs/buffer code or, in my case, the aic7xxx SCSI driver. Thanks.

/bulent
all processes waiting in TASK_UNINTERRUPTIBLE state
keywords: tux, aic7xxx, 2.4.5-ac4, specweb99, __wait_on_page, __lock_page

Greetings,

I am running into a problem, seemingly a deadlock situation, where almost all the processes end up in the TASK_UNINTERRUPTIBLE state. All the processes eventually stop responding, including the login shell: no screen updates, keyboard, etc. Ping works and the sysrq key works. I traced the tasks through the sysrq-t key. The processors are in the idle state. The tasks all seem to get stuck in __wait_on_page or __lock_page. It appears from the source that they are waiting for pages to be unlocked. run_task_queue(&tq_disk) should eventually cause the pages to unlock, but it doesn't happen. Is anybody familiar with this problem, or has anyone seen it before? Thanks for any comments.

Bulent

Here are the conditions: dual PIII, 1GHz, 1GB of memory, aic7xxx scsi driver, acenic eth. This occurs while the TUX (2.0-B6) webserver is being driven by the SPECweb99 benchmark at a rate of 800 c/s. The system is very busy doing disk and network I/O. The problem occurs sometimes in an hour and sometimes 10-20 hours into the run.
Bulent

Process: 0, { swapper}
EIP: 0010:[c010524d]  CPU: 1  EFLAGS: 0246
EAX:       EBX: c0105220  ECX: c2afe000  EDX: 0025
ESI: c2afe000  EDI: c2afe000  EBP: c0105220  DS: 0018  ES: 0018
CR0: 8005003b  CR2: 08049df0  CR3: 268e  CR4: 06d0
Call Trace: [c01052d2] [c0119186] [c01192fb]

SysRq : Show Regs
Process: 0, { swapper}
EIP: 0010:[c010524d]  CPU: 0  EFLAGS: 0246
EAX:       EBX: c0105220  ECX: c030a000  EDX:
ESI: c030a000  EDI: c030a000  EBP: c0105220  DS: 0018  ES: 0018
CR0: 8005003b  CR2: 08049f7c  CR3: 37a63000  CR4: 06d0
Call Trace: [c01052d2] [c0105000] [c01001cf]

SysRq : Show Regs
EIP: 0010:[c010524d]  CPU: 1  EFLAGS: 0246
Using defaults from ksymoops -t elf32-i386 -a i386
EAX:       EBX: c0105220  ECX: c2afe000  EDX: 0025
ESI: c2afe000  EDI: c2afe000  EBP: c0105220  DS: 0018  ES: 0018
CR0: 8005003b  CR2: 08049df0  CR3: 268e  CR4: 06d0
Call Trace: [c01052d2] [c0119186] [c01192fb]
EIP: 0010:[c010524d]  CPU: 0  EFLAGS: 0246
EAX:       EBX: c0105220  ECX: c030a000  EDX:
ESI: c030a000  EDI: c030a000  EBP: c0105220  DS: 0018  ES: 0018
CR0: 8005003b  CR2: 08049f7c  CR3: 37a63000  CR4: 06d0
Call Trace: [c01052d2] [c0105000] [c01001cf]

>>EIP; c010524d <default_idle+2d/40> <=
Trace; c01052d2 <cpu_idle+52/70>
Trace; c0119186 <__call_console_drivers+46/60>
Trace; c01192fb <call_console_drivers+eb/100>

>>EIP; c010524d <default_idle+2d/40> <=
Trace; c01052d2 <cpu_idle+52/70>
Trace; c0105000 <prepare_namespace+0/10>
Trace; c01001cf <L6+0/2>
=

SysRq : Show Memory
Mem-info:
Free pages: 4300kB (792kB HighMem)
( Active: 200434, inactive_dirty: 26808, inactive_clean: 1472, free: 1075 (574 1148 1722) )
24*4kB 15*8kB 2*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 728kB
493*4kB 3*8kB 1*16kB 0*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 2780kB
0*4kB 1*8kB 1*16kB 0*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 792kB
Swap cache: add 2711, delete 643, find 5301/6721
Free swap: 2087996kB
253932 pages of RAM
24556 pages of HIGHMEM
7212 reserved pages
221419 pages shared
2068 pages swap cached
0 pages in page table cache
Buffer memory: 12164kB
CLEAN: 2322 buffers, 9276 kbyte, 3 used (last=2322), 2 locked, 0 protected, 0 dirty
LOCKED: 405 buffers, 1608 kbyte, 39 used (last=404), 348 locked, 0 protected, 0 dirty
DIRTY: 322 buffers, 1288 kbyte, 0 used (last=0), 322 locked, 0 protected, 322 dirty
=

async IO 0/2  D 0013  0  1061  1059  1062 (NOTLB)
Call Trace: [c012e121] [c012f059] [c02614d7] [c0258c44] [c02588c0] [c025c65a] [c0256848] [c0258478] [c0105636] [c02582a0]
Trace; c012e121 <___wait_on_page+91/c0>
Trace; c012f059 <do_generic_file_read+449/7d0>
Trace; c02614d7 <send_abuf+27/30>
Trace; c0258c44 <generic_send_file+84/100>
Trace; c02588c0 <sock_send_actor+0/1a0>
Trace; c025c65a <http_send_body+6a/100>
Trace; c0256848 <tux_schedule_atom+18/20>
Trace; c0258478 <cachemiss_thread+1d8/350>
Trace; c0105636 <kernel_thread+26/30>
Trace; c02582a0 <cachemiss_thread+0/350>
==
bash  D C2AE541C  0  920912 (NOTLB)
Call Trace: [c012e1e1] [c012e04d] [c016b880] [c012fdac] [c012a76a] [c012a8cb] [c0110018] [c02709c7] [c0113ed0] [c0114106] [c0195494] [c01417d2] [c011e25b] [c0113ed0] [c01075b8
Trace; c012e1e1 <__lock_page+91/c0>
Trace; c012e04d <read_cluster_nonblocking+17d/1c0>
Trace; c016b880 <ext2_get_block+0/5b0>
Trace; c012fdac <filemap_nopage+3fc/5b0>
Trace; c012a49a <do_swap_page+23a/2f0>
Trace; c012a76a <do_no_page+8a/150>
Trace; c012a8cb <handle_mm_fault+9b/150>
Trace; c021814c <sock_sendmsg+6c/90>
Trace; c0113ed0 <do_page_fault+0/550>
Trace; c0114106 <do_page_fault+236/550>
Trace; c0118aa5 <do_syslog+1e5/820>
Trace; c01417d2 <sys_read+c2/d0>
Trace; c011e25b <do_softirq+6b/a0>
Trace; c0113ed0 <do_page_fault+0/550>
Trace; c01075b8

void ___wait_on_page(struct page *page)
{
	struct task_struct *tsk = current;
	DECLARE_WAITQUEUE(wait, tsk);

	add_wait_queue(&page->wait, &wait);
	do {
		sync_page(page);
		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
		if (!PageLocked(page))
			break;
		run_task_queue(&tq_disk);
		schedule();
	} while (PageLocked(page));
	tsk->state = TASK_RUNNING;
	remove_wait_queue(&page->wait, &wait);
}

static void __lock_page(struct page *page)
{
	struct task_struct *tsk = current;
	DECLARE_WAITQUEUE(wait, tsk);

	add_wait_queue_exclusive(&page->wait, &wait);
	for (;;) {
		sync_page(page);
		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
		if (PageLocked(page)) {
			run_task_queue(&tq_disk);
Re: [RFQ] aic7xxx driver panics under heavy swap.
Justin,

Your patch works for me. The printk "Temporary Resource Shortage" has to go, or maybe you can make it a debug option. Here is the cleaned-up patch for 2.4.5-ac15 with the TAILQ macros replaced with LIST macros. Thanks for the help.

Bulent

--- aic7xxx_linux.c.save	Mon Jun 18 20:25:35 2001
+++ aic7xxx_linux.c	Tue Jun 19 17:35:55 2001
@@ -1516,7 +1516,11 @@
 	}
 	cmd->result = CAM_REQ_INPROG << 16;
 	TAILQ_INSERT_TAIL(&dev->busyq, (struct ahc_cmd *)cmd, acmd_links.tqe);
-	ahc_linux_run_device_queue(ahc, dev);
+	if ((dev->flags & AHC_DEV_ON_RUN_LIST) == 0) {
+		LIST_INSERT_HEAD(&ahc->platform_data->device_runq, dev, links);
+		dev->flags |= AHC_DEV_ON_RUN_LIST;
+		ahc_linux_run_device_queues(ahc);
+	}
 	ahc_unlock(ahc, &flags);
 	return (0);
 }
@@ -1532,6 +1536,9 @@
 	struct ahc_tmode_tstate *tstate;
 	uint16_t mask;
 
+	if ((dev->flags & AHC_DEV_ON_RUN_LIST) != 0)
+		panic("running device on run list");
+
 	while ((acmd = TAILQ_FIRST(&dev->busyq)) != NULL
 	    && dev->openings > 0 && dev->qfrozen == 0) {
@@ -1540,8 +1547,6 @@
 		 * running is because the whole controller Q is frozen.
 		 */
 		if (ahc->platform_data->qfrozen != 0) {
-			if ((dev->flags & AHC_DEV_ON_RUN_LIST) != 0)
-				return;
 			LIST_INSERT_HEAD(&ahc->platform_data->device_runq,
 					 dev, links);
@@ -1552,8 +1557,6 @@
 		 * Get an scb to use.
 		 */
 		if ((scb = ahc_get_scb(ahc)) == NULL) {
-			if ((dev->flags & AHC_DEV_ON_RUN_LIST) != 0)
-				panic("running device on run list");
 			LIST_INSERT_HEAD(&ahc->platform_data->device_runq,
 					 dev, links);
 			dev->flags |= AHC_DEV_ON_RUN_LIST;
[RFQ] aic7xxx driver panics under heavy swap.
Justin,

When free memory is low, I get a series of aic7xxx messages followed by a
panic.  It appears to be a race condition in the code.  Should you panic?
I tried the following patch to avoid the panic, but I am not sure if it is
functionally correct.

Bulent

scsi0: Temporary Resource Shortage
scsi0: Temporary Resource Shortage
scsi0: Temporary Resource Shortage
scsi0: Temporary Resource Shortage
scsi0: Temporary Resource Shortage
Kernel panic: running device on run list

--- aic7xxx_linux.c.save	Mon Jun 18 20:25:35 2001
+++ aic7xxx_linux.c	Mon Jun 18 20:26:29 2001
@@ -1552,12 +1552,14 @@
 		 * Get an scb to use.
 		 */
 		if ((scb = ahc_get_scb(ahc)) == NULL) {
+			ahc->flags |= AHC_RESOURCE_SHORTAGE;
 			if ((dev->flags & AHC_DEV_ON_RUN_LIST) != 0)
-				panic("running device on run list");
+				return;
+			/* panic("running device on run list"); */
 			LIST_INSERT_HEAD(&ahc->platform_data->device_runq, dev, links);
 			dev->flags |= AHC_DEV_ON_RUN_LIST;
-			ahc->flags |= AHC_RESOURCE_SHORTAGE;
+			/* ahc->flags |= AHC_RESOURCE_SHORTAGE; */
 			printf("%s: Temporary Resource Shortage\n", ahc_name(ahc));
 			return;
Re: Please test: workaround to help swapoff behaviour
> The fix is to kill the dead/orphaned swap pages before we get to
> swapoff.  At shutdown time there is practically nothing active in
> ...
> Once the dead swap pages problem is fixed it is time to optimize
> swapoff.

I think fixing the orphaned swap pages problem will eliminate the problem
altogether.  There is probably no need to optimize swapoff, because as the
system is shutting down all the processes will be killed and their pages
in swap will be orphaned.  If those pages were reaped in a timely manner
there wouldn't be any work left for swapoff.

Bulent
Re: Please test: workaround to help swapoff behaviour
> Bulent,
>
> Could you please check if 2.4.6-pre2+the schedule patch has better
> swapoff behaviour for you?

Marcelo,

It works as expected.  It doesn't lock up the box; however, swapoff keeps
burning CPU cycles.  It took 4 1/2 minutes to swapoff about 256MB of swap
content.  Shutdown took just as long.  I was hoping that shutdown would
kill the swapoff process, but it doesn't.  It just hangs there.

Shutdown is the common case.  Therefore, swapoff needs to be optimized for
shutdowns.  You can imagine users' frustration waiting for a shutdown when
there are gigabytes in the swap.  So to summarize, the schedule patch is
better than nothing but falls far short.  I would put it in 2.4.6.  Read
on.

--

The problem is with the try_to_unuse() algorithm, which is very
inefficient.  I searched the linux-mm archives and Tweedie was on to this.
This is what he wrote: "it is much cheaper to find a swap entry for a
given page than to find the swap cache page for a given swap entry."  And
he posted a patch:
http://mail.nl.linux.org/linux-mm/2001-03/msg00224.html

His patch is in the Red Hat 7.1 kernel 2.4.2-2 but not in 2.4.5.  In any
case, I believe the patch will not work as expected.  It seems to me that
he is calling the function check_orphaned_swap(page) in the wrong place:
while scanning the active_list in refill_inactive_scan().  The problem
with that is that if you wait 60 seconds or longer the orphaned swap pages
will move from the active to the inactive lists, and therefore the
function will miss the orphans in the inactive lists.  Any comments?
Re: Please test: workaround to help swapoff behaviour
>> I looked at try_to_unuse in swapfile.c.  I believe that the algorithm
>> is broken.  For each and every swap entry it is walking the entire
>> process list (for_each_task(p)).  It is also grabbing a whole bunch of
>> locks for each swap entry.  It might be worthwhile processing swap
>> entries in batches instead of one entry at a time.
>>
>> In any case, I think having this patch is worthwhile as a quick and
>> dirty remedy.
>
> Bulent,
>
> Could you please check if 2.4.6-pre2+the schedule patch has better
> swapoff behaviour for you?

No problem.  I will check it tomorrow.  I don't think it can be any worse
than it is now.  The patch looks correct in principle, and I believe it
should go into 2.4.6, but I will test it.  On small machines people don't
notice it, but if you have a few GB of memory it really hurts.  Shutdowns
take forever since swapoff takes forever.
Re: Please test: workaround to help swapoff behaviour
> This is for the people who have been experiencing the lockups while
> running swapoff.
>
> Please test.  (against 2.4.6-pre1)
>
> --- linux.orig/mm/swapfile.c	Wed Jun  6 18:16:45 2001
> +++ linux/mm/swapfile.c	Thu Jun  7 16:06:11 2001
> @@ -345,6 +345,8 @@
>  		/*
>  		 * Find a swap page in use and read it in.
>  		 */
> +		if (current->need_resched)
> +			schedule();
>  		swap_device_lock(si);
>  		for (i = 1; i < si->max ; i++) {
>  			if (si->swap_map[i] > 0 && si->swap_map[i] != SWAP_MAP_BAD) {

I tested your patch against 2.4.5.  It works.  No more lockups.

Without the patch it took 14 minutes 51 seconds to complete swapoff (this
is to recover 1.5GB of swap space).  During this time the system was
frozen.  No keyboard, no screen, etc.  Practically locked up.

With the patch there are no more lockups.  Swapoff kept running in the
background.  This is a winner.  But here is the caveat: swapoff keeps
burning 100% of the cycles until it completes.  This is not going to be a
big deal during shutdowns; only when you enter swapoff from the command
line is it going to be a problem.

I looked at try_to_unuse in swapfile.c.  I believe that the algorithm is
broken.  For each and every swap entry it is walking the entire process
list (for_each_task(p)).  It is also grabbing a whole bunch of locks for
each swap entry.  It might be worthwhile processing swap entries in
batches instead of one entry at a time.

In any case, I think having this patch is worthwhile as a quick and dirty
remedy.

Bulent Abali
Re: Break 2.4 VM in five easy steps
>> O.k.  I think I'm ready to nominate the dead swap pages for the big
>> 2.4.x VM bug award.  So we are burning cpu cycles in sys_swapoff
>> instead of being IO bound?  Just wanting to understand this the cheap
>> way :)
>
> There's no IO being done whatsoever (that I can see with only a blinky).
> I can fire up ktrace and find out exactly what's going on if that would
> be helpful.  Eating the dead swap pages from the active page list prior
> to swapoff cures all but a short freeze.  Eating the rest (few of those)
> might cure the rest, but I doubt it.
>
> -Mike

1) I second Mike's observation.  Swapoff, either from the command line or
during shutdown, just hangs there.  No disk I/O is being done, as I could
see from the blinkers.  This is not an I/O-boundness issue.  It is more
like a deadlock.  I happened to see this once with a debugger attached to
the serial port; the system was alive.  I was watching the free page count
and it was decreasing very slowly, maybe a couple of pages per second.
The bigger the swap usage, the longer it takes to do swapoff.  For
example, if I had 1GB in the swap space it would take maybe half an hour
to shut down...

2) Now, why I would have 1GB in the swap space is another problem.  Here
is what I observe, and it doesn't make much sense to me.  Let's say I have
1GB of memory and plenty of swap, and there is a process a little less
than 1GB in size.  Suppose the system starts swapping because it is short
a few megabytes of memory.  Within *seconds* of swapping, I see the swap
disk usage balloon to nearly 1GB.  Nearly the entire memory moves into the
page cache.  If you run xosview you will know what I mean; memory usage
suddenly turns from green to red :-).  And I know for a fact that my disk
cannot do 1GB per second :-).  The SHARE column of the big process in
"top" goes up by hundreds of megabytes.

So it appears to me that the MM is marking the whole process memory to be
swapped out, reserving nearly 1GB in the swap space, and furthermore
moving the entire process's pages, apparently, into the page cache.  You
would think that if you are short a few MB of memory the MM would put a
few MB worth of pages in the swap, but it wants to move entire processes
into swap.  When the 1GB process exits, the swap usage doesn't change
(dead swap pages?).  And shutdown or swapoff will take forever due to #1
above.

Bulent
Re: Break 2.4 VM in five easy steps
O.k. I think I'm ready to nominate the dead swap pages for the big 2.4.x VM bug award. So we are burning cpu cycles in sys_swapoff instead of being IO bound? Just wanting to understand this the cheap way :) There's no IO being done whatsoever (that I can see with only a blinky). I can fire up ktrace and find out exactly what's going on if that would be helpful. Eating the dead swap pages from the active page list prior to swapoff cures all but a short freeze. Eating the rest (few of those) might cure the rest, but I doubt it. -Mike 1) I second Mike's observation. swapoff either from command line or during shutdown, just hangs there. No disk I/O is being done as I could see from the blinkers. This is not a I/O boundness issue. It is more like a deadlock. I happened to saw this one with debugger attached serial port. The system was alive. I think I was watching the free page count and it was decreasing very slowly may be couple pages per second. Bigger the swap usage longer it takes to do swapoff. For example, if I had 1GB in the swap space then it would take may be an half hour to shutdown... 2) Now why I would have 1 GB in the swap space, that is another problem. Here is what I observe and it doesn't make much sense to me. Let's say I have 1GB of memory and plenty of swap. And let's say there is process with little less than 1GB size. Suppose the system starts swapping because it is short few megabytes of memory. Within *seconds* of swapping, I see that the swap disk usage balloons to nearly 1GB. Nearly entire memory moves in to the page cache. If you run xosview you will know what I mean. Memory usage suddenly turns from green to red :-). And I know for a fact that my disk cannot do 1GB per second :-). The SHARE column of the big process in top goes up by hundreds of megabytes. 
So it appears to me that MM is marking the whole process memory to be swapped out and probably reserving nearly 1 GB in the swap space and furthermore moves entire process pages to apparently to the page cache. You would think that if you are short by few MB of memory MM would put few MB worth of pages in the swap. But it wants to move entire processes in to swap. When the 1GB process exits, the swap usage doesn't change (dead swap pages?). And shutdown or swapoff will take forever due to #1 above. Bulent - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
can I call wake_up_interruptible_all from an interrupt service routine?
The interrupt service routine of a driver makes a
wake_up_interruptible_all() call to wake up a kernel thread.  Is that
legitimate?  Thanks for any advice you might have.  Please cc: your
response to me if you decide to post to the mailing list.

Bulent