Bug#596419: [xen] BUG at drivers/scsi/aacraid/aachba.c:2825
Thanks for reporting this and sorry for the long quiet. Can you reproduce this using a sid kernel for the dom0? I think the only packages that should be needed for this test from outside squeeze are the kernel image itself, linux-base, and initramfs-tools.

Jonathan

Hello, Jonathan,

unfortunately, I have changed the SW RAID to HW RAID and the server is already used in production with Debian Squeeze, so I cannot try to reproduce the problems at this time... But I will try to do it on the next server...

Thanks for the information,
Artur

-- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#603802: Acknowledgement (linux-image-2.6.32-5-xen-amd64: DomU not really resumed (hangs) after restore from disk after Dom0 restart)
Additional info: I found out that the domain is only paused, so after calling xm unpause DomId the domain responds and runs again. So maybe it is not a bug but a feature? If so, there should be a way to have the domain start running immediately after the restore.

One more remark: this does not affect fully virtualized domains, only paravirtualized ones.
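The workaround described above (finding restored domains that are still paused and unpausing them) can be sketched roughly as follows. This is only an illustration: the sample `xm list` output, the domain names and the helper names `paused_domains` and `unpause_commands` are made up for the example, and it assumes the usual `xm list` layout in which the fifth column holds the state flags and `p` marks a paused domain.

```python
import shlex

# Hypothetical sample of `xm list` output; names and numbers are invented.
SAMPLE_XM_LIST = """\
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  1024     8     r-----    312.4
pv-guest                                     3   512     1     --p---      0.7
hvm-guest                                    4  1024     2     -b----     15.2
"""


def paused_domains(xm_list_output):
    """Return the names of domains whose state flags contain 'p' (paused)."""
    paused = []
    for line in xm_list_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # not a complete domain row
        name, state = fields[0], fields[4]
        if "p" in state:
            paused.append(name)
    return paused


def unpause_commands(names):
    """Build one `xm unpause <domain>` command line per paused domain."""
    return ["xm unpause " + shlex.quote(name) for name in names]


if __name__ == "__main__":
    for command in unpause_commands(paused_domains(SAMPLE_XM_LIST)):
        print(command)
```

The sketch only prints the commands instead of running them, so it can be tried without a Xen host; on a real dom0 the output of `xm list` would be fed in instead of the sample text.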
Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)
-Original Message-
From: Konrad Rzeszutek Wilk [mailto:konrad.w...@oracle.com]
Sent: Monday, September 20, 2010 6:08 PM
To: Artur Linhart - Linux communication
Cc: 'Ian Campbell'; 596...@bugs.debian.org
Subject: Re: Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)

So, it worked either when Dom0 ran in balloon mode, with dom0_mem omitted, or, if dom0_mem is specified, when swiotlb=65536 is specified as well.

Wow. That implies that AACRAID uses quite a lot of buffers, and looking at the driver there are a bunch of quirks where it can only do DMA up to 2GB, so that would explain why it relies on SWIOTLB that much.

Unfortunately I did not try to raise dom0_mem higher than 2 GB :-(.

Based on what Ian analyzed it really looks that we just ran out of DMA buffers and the driver didn't try to retry but just bails out. We can narrow down who is using so many buffers by using the attached debug module that when loaded will print out who is using what buffers if CONFIG_DMA_API_DEBUG=y is set. But the proper workaround is the one you discovered - either raise the SWIOTLB buffer or raise the memory allocated for Dom0.

I have noticed one interesting behavior: during the successful suspension of the domains at shutdown, the first domain being suspended writes three dots very fast, then stops writing dots for some time, and then after a while a lot of dots (possibly all the remaining ones) are written to the screen very fast. For the following suspensions the dots appear continuously, one by one, without any delays. It looks like it waits for something during the first suspension (memory allocation?).

That usually means that it is stuck waiting for the disks to write out all the data.

OK, I thought so too, but when I omitted dom0_mem or specified the higher swiotlb, this behaved differently, and I think it should behave the same way, shouldn't it? At least that is what I would guess...

Generally, I find it very surprising how the aacraid module works. I am no C or kernel developer, but I would expect that something like this cannot happen: the module should allocate the memory it needs at start, or I would understand if some specific read or write operation failed because the SW RAID did not have enough memory to execute it, but I would never expect this to lead to a hangup and freeze of the whole system. The probability of

Well, to be honest, we engineers aren't known for testing all of the failure paths as well as we should. That is why folks like you are quite helpful in finding bugs :-)

I am always very pleased to have the possibility to help all of you who are doing such a great job, at least with some small piece of work, even if it unexpectedly cost me a lot of time :-)

I have actually started using HW RAID on that server instead of SW RAID, for other reasons. But for now I still have the HDD with the SW RAID configuration and would be able to test something, if you have ideas or want me to test something specific on my configuration. If not, I want to remove the software RAID completely sometime next week because I need this HDD, so let me know before then if there is something you need tested. I do not know how difficult it would be for you to reproduce the error on other machine(s). I think it should not be so difficult, but who knows.
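Since the debug module mentioned above only prints useful information when the kernel was built with CONFIG_DMA_API_DEBUG=y, it may be worth checking the build configuration first. The following is a minimal sketch, assuming the Debian convention that the configuration of the running kernel is installed as /boot/config-<release>; the helper names are hypothetical and not part of any existing tool.

```python
import os


def config_option_enabled(config_text, option):
    """Return True if `option` is built in (set to y) in kernel config text."""
    wanted = option + "=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


def running_kernel_config_path():
    """Path where Debian installs the build config of the running kernel."""
    return "/boot/config-" + os.uname().release


if __name__ == "__main__":
    path = running_kernel_config_path()
    try:
        with open(path) as handle:
            enabled = config_option_enabled(handle.read(), "CONFIG_DMA_API_DEBUG")
    except OSError as exc:
        print("could not read", path, "-", exc)
    else:
        state = "is set to y" if enabled else "is not set to y"
        print("CONFIG_DMA_API_DEBUG", state, "in", path)
```

A line such as `# CONFIG_DMA_API_DEBUG is not set` is treated as disabled, which matches how kernel config files record unset options.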
Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)
Hello, Ian,

your theory about running out of memory seems to be a step in the right direction. It looks like the problems did not really start with the installation of the new packages, but with setting the Xen kernel parameter dom0_mem=1024M, which I did at approximately the same time as the upgrades. I have now removed this option, so Dom0 has the complete 12 GB available, and the problem does not occur anymore. The domains are also suspended correctly after calling /etc/init.d/xendomains stop.

Possibly this is also the reason why I could not reproduce the problem with the non-Xen kernel: in that case the memory was not reduced to 1 GB either, and the complete 12 GB memory pool was used without any restriction, so possibly the error could not occur.

Using dom0_mem=2048 is also not enough to fix the problem for me. I have tried dom0_mem=2048, but it also leads to the hangup during domain suspension at shutdown. Only if I omit the dom0_mem parameter completely does it work correctly.

Free memory after increasing dom0_mem to 2048M:

                 total       used       free     shared    buffers     cached
    Mem:       2090832     448092    1642740          0     111600      90908
    -/+ buffers/cache:     245584    1845248
    Swap:       999416          0     999416

So there is basically no problem with the base memory amount; there is enough memory for everything.

Regarding the swiotlb parameter: I have found the following lines in kern.log from the previous reboots:

    Sep 13 17:15:13 alg-puv-xen-1 kernel: [3.105461] xen_swiotlb_fixup: buf=880005711000 size=67108864
    Sep 13 17:15:13 alg-puv-xen-1 kernel: [3.126345] xen_swiotlb_fixup: buf=880009771000 size=32768

(so the 64 MB should be there), but these lines are always repeated with the same values, regardless of whether dom0_mem was set to 1024M, 2048M or left unset.

After I specified swiotlb=65536 on the line with the Xen kernel, I got the same thing in the log as if I had done nothing (and also the hangups during domain suspension). If I put this parameter to the Linux kernel module parameters, then it also did not change the value in the log:

    Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.856096] Kernel command line: root=/dev/md0 ro console=tty0 vga=773 swiotlb=65536
    Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.856129] PID hash table entries: 4096 (order: 3, 32768 bytes)
    Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.856512] Initializing CPU#0
    Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.873864] DMA: Placing 128MB software IO TLB between 880005711000 - 88000d711000
    Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.873868] DMA: software IO TLB at phys 0x5711000 - 0xd711000
    Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.873871] xen_swiotlb_fixup: buf=880005711000 size=134217728
    Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.915338] xen_swiotlb_fixup: buf=88000d7d1000 size=32768
    Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.924636] Memory: 1891528k/2097152k available (3141k kernel code, 432k absent, 205192k reserved, 1905k data, 592k init)

But the reboot came through without the crash! :-)

Where does the swiotlb parameter have to be applied to see some effect of the swiotlb memory change in the logs?

So, it worked either when Dom0 ran in balloon mode, with dom0_mem omitted, or, if dom0_mem is specified, when swiotlb=65536 is specified as well.

I have noticed one interesting behavior: during the successful suspension of the domains at shutdown, the first domain being suspended writes three dots very fast, then stops writing dots for some time, and then after a while a lot of dots (possibly all the remaining ones) are written to the screen very fast. For the following suspensions the dots appear continuously, one by one, without any delays. It looks like it waits for something during the first suspension (memory allocation?).

Generally, I find it very surprising how the aacraid module works. I am no C or kernel developer, but I would expect that something like this cannot happen: the module should allocate the memory it needs at start, or I would understand if some specific read or write operation failed because the SW RAID did not have enough memory to execute it, but I would never expect this to lead to a hangup and freeze of the whole system. The probability of data corruption is drastically increased this way. And especially with RAID1, which in most cases is set up to achieve more data safety :-).

With regards,
Artur
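The swiotlb sizes quoted in the log lines above can be cross-checked with a little arithmetic. The swiotlb= kernel parameter is given in I/O TLB slabs, and one slab is 2048 bytes (IO_TLB_SHIFT is 11 in the kernel sources). The sketch below is only a sanity check of the numbers from the log; the function names are chosen for this example.

```python
IO_TLB_SHIFT = 11                  # one swiotlb slab is 1 << 11 = 2048 bytes
SLAB_BYTES = 1 << IO_TLB_SHIFT
MIB = 1024 * 1024


def swiotlb_bytes(slabs):
    """Bounce-buffer size in bytes for a given swiotlb=<slabs> value."""
    return slabs * SLAB_BYTES


def swiotlb_slabs(size_bytes):
    """Number of slabs that corresponds to a buffer size in bytes."""
    return size_bytes // SLAB_BYTES


if __name__ == "__main__":
    # swiotlb=65536 from the kernel command line in the 18:15:32 log
    print(swiotlb_bytes(65536), "bytes =", swiotlb_bytes(65536) // MIB, "MiB")
    # size=67108864 from the 17:15:13 log, i.e. the default buffer
    print(swiotlb_slabs(67108864), "slabs =", 67108864 // MIB, "MiB")
    # physical range reported as 0x5711000 - 0xd711000
    print(0xD711000 - 0x5711000, "bytes in the reported physical range")
```

By this calculation swiotlb=65536 corresponds to 134217728 bytes (128 MiB), which agrees with both the "Placing 128MB software IO TLB" line and the size=134217728 value in the second log excerpt, while the earlier size=67108864 corresponds to 32768 slabs (64 MiB). So the excerpt taken with the parameter on the Linux kernel line does appear to show the doubled buffer.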
Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)
Even after the downgrade of the kernel and the corresponding files to version 2.6.32-18 and the downgrade of mdadm, the problem still persists, so it is not bound specifically to this package and this version.

I have now (after the downgrade to 2.6.32-18) captured the following initial stack trace (some lines are missing from the top; I think they were no longer on the screen):

    [] ? bio_alloc_bioset+0x45/0xb7
    [] ? submit_bio+0xd6/0xf2
    [] ? md_super_write+0x84/0xb2 [md_mod]
    [] ? xen_restore_fl_direct_end+0x0/0x1
    [] ? md_update_sb+0x268/0x31e
    [] ? md_check_recovery+0x1e2/0x4b9 [md_mod]
    [] ? raid1d+0x42/0xe0b [raid1]
    [] ? finish_task_switch+0x44/0xaf
    [] ? schedule_timeout+0x2e/0xdd
    [] ? xen_restore_fl_direct_end+0x0/0x1
    [] ? xen_force_evtchn_callback+0x9/0xa
    [] ? check_events+0x12/0x20
    [] ? xen_restore_fl_direct_end+0x0/0x1
    [] ? md_thread+0xf1/0x10f [md_mod]
    [] ? autoremove_wake_function+0x0/0x2e
    [] ? md_thread+0x0/0x10f [md_mod]
    [] ? kthread+0x79/0x01
    [] ? child_rip+0xa/0x20
    [] ? int_ret_from_sys_call+0x7/0x1b
    [] ? retint_restore_args+0x5/0x6
    [] ? xen_restore_fl_direct_end+0x0/0x1
    [] ? xen_restore_fl_direct_end+0x0/0x1
    [] ? child_rip+0x0/0x20
    Code: 00 00 c7 46 0c 00 00 00 00 c7 46 10 00 00 00 00 c7 46 14 00 00 00 00 c7 46 18 00 00 00 00 e8 10 63 fa ff 83 f8 00 41 89 c6 7d 04 0f 0b eb fe 75 08 45 31 e4 e9 9c 00 00 00 49 8b 7f 58 48 89 eb
    RIP [] aac_build_sgraw+0x51/0x10a [aacraid]
    RSP 88003cd998e0
    --- [ end trace ] ---

This stack trace now also stays on the screen and nothing more happens, even after a very long time (1 hour).
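The Code: line of the trace above can be inspected a little further. On x86 a kernel BUG() is implemented with the two-byte `ud2` instruction (opcode `0f 0b`), so finding that byte pair in the dump is one way to check that the oops is consistent with the "kernel bug ... aachba.c:2825" message reported elsewhere in this thread. The sketch below simply searches the transcribed bytes; the helper names are made up for the example, and a plain byte search like this can in general produce false positives because it does not decode instruction boundaries.

```python
# Code: bytes as transcribed in the stack trace above.
CODE_LINE = (
    "00 00 c7 46 0c 00 00 00 00 c7 46 10 00 00 00 00 c7 46 14 00 00 00 00 "
    "c7 46 18 00 00 00 00 e8 10 63 fa ff 83 f8 00 41 89 c6 7d 04 0f 0b eb fe "
    "75 08 45 31 e4 e9 9c 00 00 00 49 8b 7f 58 48 89 eb"
)

UD2 = bytes([0x0F, 0x0B])  # x86 opcode of the ud2 instruction


def parse_code_bytes(code_line):
    """Turn the hex dump into bytes, dropping the <..> marker if present."""
    tokens = code_line.replace("<", "").replace(">", "").split()
    return bytes(int(token, 16) for token in tokens)


def find_ud2(code):
    """Return every byte offset at which the ud2 opcode pair occurs."""
    offsets = []
    start = code.find(UD2)
    while start != -1:
        offsets.append(start)
        start = code.find(UD2, start + 1)
    return offsets


if __name__ == "__main__":
    code = parse_code_bytes(CODE_LINE)
    print(len(code), "bytes, ud2 at offsets", find_ud2(code))
```

For this dump the pair is found once, at byte offset 43 of 64. As far as I know, x86 oops dumps of this kernel generation print 43 bytes before the faulting instruction pointer, so this offset would be exactly where RIP points, but that layout is an assumption here because the usual angle-bracket marker is missing from the transcription.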
Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)
One more remark: the last tests from the previous post were done on the synced array, so there was no other heavy load on it at the time of this last crash. The crash again happened during xendomains stop before the system shutdown. It did not happen immediately, but only after some time (several dots from the suspend process of the given virtual domain were displayed before the crash happened).
Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)
Hello,

when booting a non-Xen kernel, no problems can be seen. But it is true that exactly the same action cannot be tested there: the xendomains script can only be started when running under Xen, and there also have to be some virtual instances to suspend...

If I boot with Xen, then shut down the virtual instances and then reboot the computer, the hangup does not occur. Only when the suspend during /etc/init.d/xendomains stop happens does the crash come after some time.

Under the non-Xen kernel I have also tested creating a larger amount of data using dd or cp (approx. 5 GiB of data, which is twice as much as the memory of all virtual instances together that has to be written to the RAID), but nothing strange happens; everything works. I even tried to write the files to the same location where xendomains writes the memory snapshots (it is on the md1 RAID; the system itself is installed on md0), but everything seems to work fine without the Xen kernel.

Finally I booted the Xen kernel again and just tried to perform heavy operations on the RAID1: I produced the hangup within a few seconds, and again after a reboot.

In the first case I just started a dd of 5 GB of data (bs=1M count=5000) on the first screen, then switched to the second screen and simply started aptitude there.

In the second case (here the resync of the array from the previous crash was running) I tried to start two dd's in parallel on two screens. That was no problem for some time; then I tried to start aptitude on the third screen, which also caused nothing. I returned to screen 2 and pressed Ctrl-C twice, which led to a hangup of the system again.

So it seems very probable that this problem has nothing special to do with the xendomains script or any Xen utility, but is just a matter of running under the Xen kernel and performing more complex or heavy operations on the RAID array...

My configuration of the array is: 2 TB SATA disks, both split in the same way into 1x50 GB and 3x650 GB RAID partitions. On the first one (md0, the smallest) is the system; on the second (size 650 GB) is the RAID md1 (this is where I perform the write operations in the tests, and where xendomains writes too); and the third array, md2, is not involved in the tests or problems. The fourth 650 GB partitions are unused.

Regards,
Archie
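The load test described above (two parallel dd writers with bs=1M) can be approximated with a small script. This is a rough sketch rather than the exact procedure used in the report: the file names and the helper names `write_zeros` and `parallel_write` are invented for the example, and the demo at the bottom writes only a few MiB into a temporary directory. Pointing it at the affected array with sizes near 5000 MiB per writer should be done only on a test machine, since the report says such load hung the system.

```python
import os
import tempfile
import threading

CHUNK = 1024 * 1024  # 1 MiB per write, matching dd's bs=1M


def write_zeros(path, mib):
    """Write `mib` MiB of zeros to `path` and force them out to disk."""
    block = bytes(CHUNK)
    with open(path, "wb") as handle:
        for _ in range(mib):
            handle.write(block)
        handle.flush()
        os.fsync(handle.fileno())


def parallel_write(directory, writers, mib_each):
    """Run several writers at the same time; return the files they wrote."""
    paths = [os.path.join(directory, "load-%d.bin" % i) for i in range(writers)]
    threads = [
        threading.Thread(target=write_zeros, args=(path, mib_each))
        for path in paths
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return paths


if __name__ == "__main__":
    # Small, harmless demo run in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        for path in parallel_write(tmp, writers=2, mib_each=4):
            print(path, os.path.getsize(path), "bytes")
```

For reference, bs=1M count=5000 amounts to 5000 MiB, roughly 4.9 GiB, which is in line with the "cca 5 GiB" mentioned in the message.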
Bug#596422: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: System hangup during shutdown high probable during stopping of mdadm)
This bug is a duplicate of bug 596419, so it can be closed.

-Original Message-
From: Debian BTS [mailto:debb...@busoni.debian.org] On Behalf Of Debian Bug Tracking System
Sent: Saturday, September 11, 2010 11:06 AM
To: Artur Linhart
Subject: Bug#596422: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: System hangup during shutdown high probable during stopping of mdadm)

Thank you for filing a new Bug report with Debian. This is an automatically generated reply to let you know your message has been received. Your message is being forwarded to the package maintainers and other interested parties for their attention; they will reply in due course.

As you requested using X-Debbugs-CC, your message was also forwarded to al.li...@bcpraha.com (after having been given a Bug report number, if it did not have one).

Your message has been sent to the package maintainer(s): Debian Kernel Team debian-ker...@lists.debian.org

If you wish to submit further information on this problem, please send it to 596...@bugs.debian.org. Please do not send mail to ow...@bugs.debian.org unless you wish to report a problem with the Bug-tracking system.

--
596422: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=596422
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems
Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)
After further analysis it seems that the problem is not triggered by mdadm stop but by the call of xendomains stop, because the hangup occurs not only at shutdown but also on a simple call of /etc/init.d/xendomains stop. There are 4 domains running: 3 fully virtualized, based on qemu, and one paravirtual, also based on Debian squeeze.

It also does not hang the computer completely. It just freezes the keyboard (NumLock does not react either, etc.) and the services (for example, a concurrent ssh connection to the machine is no longer usable), but the system itself is still doing something. After some minutes of waiting, messages appear again and again, ending with the following call trace (I hope I made no mistakes writing it down from the monitor):

    Call Trace:
    [] ? smp_call_function_many+0x191/0x1af
    [] ? drain_local_pages+0x0/0xd
    [] ? smp_call_function+0x20/0x24
    [] ? on_each_cpu+0x10/0x2e
    [] ? __alloc_pages_nodemask+0x3f4/0x5ce
    [] ? check_events+0x12/0x20
    [] ? new_slab+0x42/0x1ca
    [] ? __slab_alloc+0x1f0/0x39b
    [] ? sock_alloc_send_pskb+0xbd/0x2d8
    [] ? cap_socket_getpeersec_dgram+0x0/0x6
    [] ? __kmalloc_node_track_caller+0xbb/0x11b
    [] ? sock_alloc_send_pskb+0xbd/0x2d8
    [] ? __alloc_skb+0x69/0x15a
    [] ? sock_alloc_send_pskb+0xbd/0x2d8
    [] ? pollwake+0x0/0x5b
    [] ? unix_stream_sendmsg+0x133/0x2a1
    [] ? sock_aio_write+0xb1/0xbc
    [] ? sock_aio_write+0x0/0xbc
    [] ? do_sync_readv_writev+0xc0/0x107
    [] ? autoremove_wake_function+0x0/0x2e
    [] ? rw_copy_check_nvector+0x6d/0xe4
    [] ? do_readv_writev+0xb2/0x115
    [] ? pvclock_clocksource_read+0x3a/0x70
    [] ? sys_writev+0x45/0x93
    [] ? system_call_fastpath+0x16/0x1b

This stack trace appears multiple times (for an hour or more, until I pushed the power button), and unfortunately I cannot see anything more useful in the error. At the start there was also some message mentioning aacraid, but it vanished too quickly to see what was written there.
-Original Message-
From: Debian BTS [mailto:debb...@busoni.debian.org] On Behalf Of Debian Bug Tracking System
Sent: Saturday, September 11, 2010 10:15 AM
To: Artur Linhart
Subject: Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)

--
596419: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=596419
Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)
Additional info: downgrading the package and the kernel image to 2.6.32-18 did not help. Running /etc/init.d/xendomains stop still produced an error, which also contained the following message:

    kernel bug /source_amd_xen/drivers/scsi/aacraid/aachba.c:2825!

(the dots were not there, there was some other text). Other downgrades, of the hypervisor or of qemu-dm, did not help either.

The problem may also be caused by heavy load on the SW RAID: because of the previous crashes, the RAID arrays were resyncing when the next crashes happened. The xendomains script suspends the instances to disk (approx. 2.5 GB of data), so a possible cause could be the heavy load from the virtual domain suspension combined with the running resync of the mdadm array. After the array is fully resynced, I will try again with the downgraded mdadm.