Bug#596419: [xen] BUG at drivers/scsi/aacraid/aachba.c:2825

2012-02-13 Thread Artur Linhart - Linux communication

 Thanks for reporting this and sorry for the long quiet.  Can you
 reproduce this using a sid kernel for the dom0?  I think the only
 packages that should be needed for this test from outside squeeze are
 the kernel image itself, linux-base, and initramfs-tools.
 
 Jonathan

Hello, Jonathan,
unfortunately, I have changed from SW RAID to HW RAID and the server is already
used in production with Debian Squeeze, so I cannot
try to reproduce the problem at this time... But I will try it on the
next server...

Thanks for the information, Artur





-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#603802: Acknowledgement (linux-image-2.6.32-5-xen-amd64: DomU not really resumed (hangs) after restore from disk after Dom0 restart)

2010-11-17 Thread Artur Linhart - Linux communication
Additional info - I figured out that the domain is only paused, so after calling
xm unpause DomId
the domain responds and runs again.

So maybe it is not a bug, but a feature?

If so, there should be a way to have the domain start running
immediately.

One more remark: it does not affect fully virtualized domains, only
paravirtualized domains.
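
A minimal sketch of how the paused domains could be picked out after a restore, assuming the usual `xm list` column layout (Name, ID, Mem, VCPUs, State, Time(s)) and a `p` flag in the State column for paused domains. The sample listing and the helper name `paused_domains` are made up for illustration and are not taken from the reported system.

```python
def paused_domains(xm_list_output):
    """Return names of domains whose State column contains the 'p' (paused) flag."""
    names = []
    for line in xm_list_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # not a complete domain row
        name, state = fields[0], fields[4]
        if "p" in state:
            names.append(name)
    return names


# Hypothetical listing, for illustration only.
SAMPLE = """\
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  1024     4     r-----    120.5
web-pv                                       3   512     1     --p---     10.2
win-hvm                                      4  1024     2     -b----     33.1
"""

for name in paused_domains(SAMPLE):
    # Only prints the command; nothing is executed here.
    print("xm unpause", name)
```

The sketch only prints the commands, so it can be checked by eye before anything is run against a real Dom0.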








Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)

2010-09-23 Thread Artur Linhart - Linux communication
-----Original Message-----
From: Konrad Rzeszutek Wilk [mailto:konrad.w...@oracle.com] 
Sent: Monday, September 20, 2010 6:08 PM
To: Artur Linhart - Linux communication
Cc: 'Ian Campbell'; 596...@bugs.debian.org
Subject: Re: Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: 
causes a system hangup by the shutdown of the
system, aacraid (sw raid) involved in hangup)

 So it worked either with Dom0 in balloon mode, i.e. with the dom0_mem
 specification omitted, or, if dom0_mem is specified, with
 swiotlb=65536 specified as well.

Wow. That implies that AACRAID uses quite a lot of buffers, and looking at the
driver there are a bunch of quirks where it can only do DMA up to 2GB, so that
would explain why it relies on SWIOTLB that much.

Unfortunately I did not try to raise dom0_mem higher than 2 GB :-(.


Based on what Ian analyzed it really looks that we just ran out of DMA buffers
and the driver didn't try to retry but just bails out.

We can narrow down who is using so many buffers by using the attached debug
module that when loaded will print out who is using what buffers if
CONFIG_DMA_API_DEBUG=y is set.

But the proper workaround is the one you discovered - either raise the SWIOTLB
buffer or raise the memory allocated for Dom0.

 
 I have noticed one interesting behavior - during a successful suspension
 of the domains at shutdown, the first one being suspended
 writes three dots very fast, then it stops writing dots for some time,
 and then after a while a lot of (possibly all remaining)
 dots are written on the screen very fast. For the following domains the suspension
 proceeds smoothly dot by dot without any delays. It looks like it
 waits for something during the first suspension (memory allocation?).

That usually means that it is stuck waiting for the disks to write out all the
data.

OK, I thought so too, but when I omitted dom0_mem or specified the
higher swiotlb it behaved differently, and I think it
should behave the same way, shouldn't it? At least that is what I would guess...

 
 Generally, I find it very surprising how the aacraid module works. I am
 no C or kernel developer, but I would expect that something like this cannot
 happen - the module should allocate its necessary memory at the start or, I
 would understand that some specific read or write operation can fail if the
 sw raid has not enough memory to execute it, but I would never expect this
 to lead to a hangup and freeze of the whole system. The probability of

 Well, to be honest, we engineers aren't known for testing all of the failure
 paths as well as we should. That is why folks like you are quite helpful in
 finding bugs :-)

I am always very pleased to be able to help all of you who are doing
such a great job, at least with some small piece of
work - even if it did cost me unexpectedly much time :-) I have actually switched
that server to HW RAID instead of SW
RAID, for other reasons. But at this time I still have the HDD with the SW
RAID configuration and I would be able to test
something, if you have some ideas or want me to test something concrete on
my configuration.
If not, I want to remove the software RAID completely sometime next week
because I need this HDD, so let me know by
then if there is something you need tested - I do not know how
difficult it would be for you to reproduce the error on
other machine(s). I think it should not be so difficult, but who knows.









Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)

2010-09-13 Thread Artur Linhart - Linux communication
Hello, Ian,

your theory about running out of memory seems to be a step in the
right direction.

It looks like the problems did not really start with the
installation of the new packages, but with setting the xen kernel
parameter
dom0_mem=1024M
which I did at approximately the same time as the upgrades. I have now
removed this option, so Dom0 has the complete 12GB available, and the
problem does not occur anymore. The domains are also suspended correctly
after the call of
/etc/init.d/xendomains stop

Possibly this is also the reason why I could not reproduce this problem
with the non-xen kernel - because in that case the memory was not
reduced to 1GB either, but the complete 12GB memory pool was used without any
restriction, so the error could not occur.
Using dom0_mem=2048 is also not enough to fix the problem for me. I have
tried dom0_mem=2048, but it also leads to the hangup at shutdown during
the domain suspension. Only if I omit the dom0_mem parameter completely
does it work correctly.
Free memory after increase of the dom0_mem to 2048M:
             total       used       free     shared    buffers     cached
Mem:       2090832     448092    1642740          0     111600      90908
-/+ buffers/cache:     245584    1845248
Swap:       999416          0     999416
- so there is basically no problem with the base memory amount, there is
enough memory for everything.
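
The figures in the free(1) output can be cross-checked with a little arithmetic. A minimal sketch, reading the Mem: row as used 448092 and free 1642740 (the only split of the fused digits that adds up to the total); the helper name `check_free` is made up for illustration.

```python
def check_free(total, used, free, buffers, cached):
    """Recompute the '-/+ buffers/cache' row of free(1) from the 'Mem:' row (KiB)."""
    assert used + free == total, "Mem: row does not add up"
    used_by_apps = used - buffers - cached       # memory really held by processes
    effectively_free = free + buffers + cached   # memory reclaimable on demand
    return used_by_apps, effectively_free


used_by_apps, effectively_free = check_free(
    total=2090832, used=448092, free=1642740, buffers=111600, cached=90908
)
print(used_by_apps, effectively_free)  # 245584 1845248
```

Both results match the "-/+ buffers/cache" row, so roughly 1.8 GB of the 2 GB given to Dom0 was still available in ordinary memory at that point.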

Regarding the swiotlb parameter - I have found the following lines in
kern.log from the previous reboots:

Sep 13 17:15:13 alg-puv-xen-1 kernel: [3.105461] xen_swiotlb_fixup:
buf=880005711000 size=67108864
Sep 13 17:15:13 alg-puv-xen-1 kernel: [3.126345] xen_swiotlb_fixup:
buf=880009771000 size=32768

- (so the 64MB should be there) but the given lines are always repeated there
with the same values, regardless of whether dom0_mem was
set to 1024M, 2048M or left unset. After I specified
swiotlb=65536 on the line with the xen kernel, I got the same
thing in the log as if I had done nothing (and also the hangups during domain
suspension). When I put this parameter on the linux kernel module line,
it also did not change the value in the log:
Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.856096] Kernel command line:
root=/dev/md0 ro console=tty0 vga=773 swiotlb=65536
Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.856129] PID hash table entries:
4096 (order: 3, 32768 bytes)
Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.856512] Initializing CPU#0
Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.873864] DMA: Placing 128MB
software IO TLB between 880005711000 - 88000d711000
Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.873868] DMA: software IO TLB at
phys 0x5711000 - 0xd711000
Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.873871] xen_swiotlb_fixup:
buf=880005711000 size=134217728
Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.915338] xen_swiotlb_fixup:
buf=88000d7d1000 size=32768
Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.924636] Memory:
1891528k/2097152k available (3141k kernel code, 432k absent, 205192k
reserved, 1905k data, 592k init)

But the reboot came through without the crash! :-)
Where does the swiotlb parameter have to be applied to see some effect of the
swiotlb memory change in the logs?

So it worked either with Dom0 in balloon mode, i.e. with the dom0_mem
specification omitted, or, if dom0_mem is specified, with
swiotlb=65536 specified as well.
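
The relation between the swiotlb= value and the sizes in kern.log can be worked out directly. A minimal sketch, assuming the swiotlb= parameter counts 2 KiB slabs (IO_TLB_SHIFT = 11 in the kernel source); the helper names are made up for illustration, and the log text is copied from the excerpt above, including the mail line wraps.

```python
import re

IO_TLB_SLAB_BYTES = 1 << 11  # assumption: one swiotlb slab is 2 KiB


def swiotlb_bytes(slabs):
    """Size of the bounce buffer pool for a given swiotlb=<slabs> value."""
    return slabs * IO_TLB_SLAB_BYTES


def fixup_sizes(log_text):
    """Extract size= values from xen_swiotlb_fixup lines, tolerating line wraps."""
    pattern = r"xen_swiotlb_fixup:\s*buf=\w+\s+size=(\d+)"
    return [int(size) for size in re.findall(pattern, log_text)]


LOG = """\
Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.873871] xen_swiotlb_fixup:
buf=880005711000 size=134217728
Sep 13 18:15:32 alg-puv-xen-1 kernel: [3.915338] xen_swiotlb_fixup:
buf=88000d7d1000 size=32768
"""

print(swiotlb_bytes(65536) // (1 << 20), "MiB")  # 128 MiB
print(swiotlb_bytes(32768) // (1 << 20), "MiB")  # 64 MiB
print(fixup_sizes(LOG))                          # [134217728, 32768]
```

Under that assumption swiotlb=65536 corresponds to 128 MiB, which matches the "Placing 128MB software IO TLB" line and the size=134217728 value in the second log excerpt, while the earlier size=67108864 corresponds to 32768 slabs (64 MiB).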

I have noticed one interesting behavior - during a successful suspension
of the domains at shutdown, the first one being suspended
writes three dots very fast, then it stops writing dots for some time,
and then after a while a lot of (possibly all remaining)
dots are written on the screen very fast. For the following domains the suspension
proceeds smoothly dot by dot without any delays. It looks like it
waits for something during the first suspension (memory allocation?).

Generally, I find it very surprising how the aacraid module works. I am
no C or kernel developer, but I would expect that something like this cannot
happen - the module should allocate its necessary memory at the start or, I
would understand that some specific read or write operation can fail if the
sw raid has not enough memory to execute it, but I would never expect this
to lead to a hangup and freeze of the whole system. The probability of
data corruption is thereby increased drastically. And especially with raid1, which
in most cases is set up to achieve more data safety :-).

With regards, Artur









Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)

2010-09-12 Thread Artur Linhart - Linux communication

Even after the downgrade of the kernel and of the corresponding files to
version 2.6.32-18 and the downgrade of mdadm the problem still persists, so it
is not bound specifically to this package and to this version.

I have now identified (after the downgrade to 2.6.32-18) the following
initial stack trace (some lines are missing from the top; I think they were
no longer on the screen):

[] ? bio_alloc_bioset+0x45/0xb7
[] ? submit_bio+0xd6/0xf2
[] ? md_super_write+0x84/0xb2 [md_mod]
[] ? xen_restore_fl_direct_end+0x0/0x1
[] ? md_update_sb+0x268/0x31e
[] ? md_check_recovery+0x1e2/0x4b9 [md_mod]
[] ? raid1d+0x42/0xe0b [raid1]
[] ? finish_task_switch+0x44/0xaf
[] ? schedule_timeout+0x2e/0xdd
[] ? xen_restore_fl_direct_end+0x0/0x1
[] ? xen_force_evtchn_callback+0x9/0xa
[] ? check_events+0x12/0x20
[] ? xen_restore_fl_direct_end+0x0/0x1
[] ? md_thread+0xf1/0x10f [md_mod]
[] ? autoremove_wake_function+0x0/0x2e
[] ? md_thread+0x0/0x10f [md_mod]
[] ? kthread+0x79/0x01
[] ? child_rip+0xa/0x20
[] ? int_ret_from_sys_call+0x7/0x1b
[] ? retint_restore_args+0x5/0x6
[] ? xen_restore_fl_direct_end+0x0/0x1
[] ? xen_restore_fl_direct_end+0x0/0x1
[] ? child_rip+0x0/0x20
Code: 00 00 c7 46 0c 00 00 00 00 c7 46 10 00 00 00 00 c7 46 14 00
00 00 00 c7 46 18 00 00 00 00 e8 10 63 fa ff 83 f8 00 41 89 c6 7d 04 0f 0b
eb
fe 75 08 45 31 e4 e9 9c 00 00 00 49 8b 7f 58 48 89 eb
RIP [] aac_build_sgraw+0x51/0x10a [aacraid]
 RSP 88003cd998e0
--- [ end trace  ] ---  
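
The Code: line of the trace above can be checked mechanically. A minimal sketch, assuming the bytes were transcribed correctly from the screen: it searches the dump for the x86 `ud2` opcode (`0f 0b`), which is the instruction the kernel's BUG() emits. The helper name `find_all` is made up for illustration.

```python
# Bytes from the "Code:" line of the trace, with the mail line wraps joined.
CODE = (
    "00 00 c7 46 0c 00 00 00 00 c7 46 10 00 00 00 00 c7 46 14 00 "
    "00 00 00 c7 46 18 00 00 00 00 e8 10 63 fa ff 83 f8 00 41 89 c6 7d 04 0f 0b "
    "eb "
    "fe 75 08 45 31 e4 e9 9c 00 00 00 49 8b 7f 58 48 89 eb"
)

UD2 = bytes([0x0F, 0x0B])  # x86 "ud2", the invalid-opcode trap used by BUG()


def find_all(haystack, needle):
    """Return every offset at which needle occurs in haystack."""
    offsets, start = [], 0
    while True:
        index = haystack.find(needle, start)
        if index < 0:
            return offsets
        offsets.append(index)
        start = index + 1


code = bytes.fromhex(CODE)
hits = find_all(code, UD2)
print(len(code), hits)                              # 64 [43]
print(" ".join(f"{b:02x}" for b in code[41:47]))    # 7d 04 0f 0b eb fe
```

The single match sits right after `7d 04` (a short `jge`) and a compare of the return value of the preceding call against zero, which is consistent with a BUG_ON firing on a negative return value. x86 oops dumps of this era normally place the faulting instruction at byte 43 of a 64-byte dump, but that should be treated as an assumption here, since the usual marker around the faulting byte did not survive the transcription.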

Now this stack trace also stays on the screen and nothing more happens, even
after a very long time (1 hour).







Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)

2010-09-12 Thread Artur Linhart - Linux communication
One more remark - the last tests from the previous post were done on the
synced array, so there was no other heavy load on it at the time of this
last crash. The crash also happened during
xendomains stop
before the system shutdown. It did not happen immediately, but only after
some time (several dots from the suspend process of
the given virtual domain had been displayed before the crash happened).







Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)

2010-09-12 Thread Artur Linhart - Linux communication
Hello,

When booting the non-xen kernel, no problems can be seen. But it
is true that exactly the same action cannot be tested: the xendomains script
can be started only when running under xen, and there also have to be some
virtual instances to be suspended...
If I boot with xen, then shut down the virtual instances, and
then reboot the computer, the hangup does not occur. Only when the
suspend during
/etc/init.d/xendomains stop
happens does the crash come after some time.

Under the non-xen kernel I have also tested the creation of a larger amount of
data using dd or cp (about 5 GiB of data, which is twice as much
as the memory of all virtual instances together that has to be written to
the raid), but nothing strange happens, everything works. I even tried to
write the files to the same location where xendomains writes the memory
snapshots (it is on the md1 raid; the system itself is installed on md0), but
everything seems to work fine without the xen kernel.

Finally I booted the xen kernel again and just tried to perform heavy
operations on the raid1 - I triggered the hangup within a few seconds,
and again after a reboot.
In the first case I just started a dd of 5 GB of data (bs=1M
count=5000) on the first screen, then switched to the second screen and
simply started aptitude there.
In the second case (here the resync of the array from the previous
crash was running) I started two dd's in parallel on two screens. It
was no problem for some time; then I started aptitude on the third
screen, which also caused nothing. I returned to screen 2 and pressed
ctrl-C twice, which led to a hangup of the system again.
So it seems very probable that this problem has nothing in particular to do with
the xendomains script or any xen utility, but is just a matter of running
under the xen kernel and performing more complex or heavy operations on the raid
array...
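
The dd-style write load described above can be approximated in a small script. A minimal sketch, assuming the data source was zeros (the text only gives bs=1M count=5000, not the input); the function name `write_load` and the file name are made up, and the demo writes only 4 MiB to a temporary directory rather than 5000 MiB to the raid.

```python
import os
import tempfile

MIB = 1024 * 1024


def write_load(path, count, bs=MIB):
    """Write `count` blocks of `bs` zero bytes to `path`, then force them to disk."""
    block = bytes(bs)
    written = 0
    with open(path, "wb") as f:
        for _ in range(count):
            written += f.write(block)
        f.flush()
        os.fsync(f.fileno())  # make sure the data really reaches the block layer
    return written


# Small demo run; use count=5000 and a path on the array to mimic the test.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "load.bin")
    total = write_load(target, count=4)
    size_on_disk = os.path.getsize(target)

print(total // MIB, "MiB written")
```

Running two such writers in parallel against a file on md1 would correspond to the second test case, though whether that reproduces the hang on other hardware is not known.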
My configuration of the array is:
2 TB SATA disks, both split in the same way into 1x50GB and 3x650GB
raid partitions. On the first one (md0 - the smallest) is the system, md1
(size 650 GB) is where I perform the write operations in the
tests and where xendomains writes too, and the third array md2 is not involved in
the tests or problems. The fourth 650GB partitions are unused.

Regards, Archie







Bug#596422: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: System hangup during shutdown high probable during stopping of mdadm)

2010-09-11 Thread Artur Linhart - Linux communication
This bug is a duplicate of the bug 596419, so it can be closed.


-----Original Message-----
From: Debian BTS [mailto:debb...@busoni.debian.org] On Behalf Of Debian Bug
Tracking System
Sent: Saturday, September 11, 2010 11:06 AM
To: Artur Linhart
Subject: Bug#596422: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64:
System hangup during shutdown high probable during stopping of mdadm)

Thank you for filing a new Bug report with Debian.

This is an automatically generated reply to let you know your message
has been received.

Your message is being forwarded to the package maintainers and other
interested parties for their attention; they will reply in due course.

As you requested using X-Debbugs-CC, your message was also forwarded to
  al.li...@bcpraha.com
(after having been given a Bug report number, if it did not have one).

Your message has been sent to the package maintainer(s):
 Debian Kernel Team debian-ker...@lists.debian.org

If you wish to submit further information on this problem, please
send it to 596...@bugs.debian.org.

Please do not send mail to ow...@bugs.debian.org unless you wish
to report a problem with the Bug-tracking system.

-- 
596422: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=596422
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems

__ Information from NOD32 5441 (20100910) __

This message was checked by the NOD32 antivirus system.
http://www.nod32.cz








Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)

2010-09-11 Thread Artur Linhart - Linux communication
After further analysis it seems that it occurs not at mdadm stop
but at the call of xendomains stop, because the hangup occurs not only at
shutdown, but also on a simple call of

/etc/init.d/xendomains stop

- there are 4 domains running, 3 fully virtualized based on qemu and one
paravirtual, also debian squeeze based.

It also does not hang the computer completely; it just freezes the
keyboard (numlock does not react either, etc.) and services (for example
a concurrent ssh connection to the machine is no longer usable), but the
system itself is still doing something. After some minutes of waiting,
messages appear again and again, ending with the following call trace (I hope I made
no mistakes in writing it down from the monitor):
Call Trace:
[]? smp_call_function_many+0x191/0x1af
[]? drain_local_pages+0x0/0xd
[]? smp_call_function+0x20/0x24
[]? on_each_cpu+0x10/0x2e
[]? __alloc_pages_nodemask+0x3f4/0x5ce
[]? check_events+0x12/0x20
[]? new_slab+0x42/0x1ca
[]? __slab_alloc+0x1f0/0x39b
[]? sock_alloc_send_pskb+0xbd/0x2d8
[]? cap_socket_getpeersec_dgram+0x0/0x6
[]? __kmalloc_node_track_caller+0xbb/0x11b
[]? sock_alloc_send_pskb+0xbd/0x2d8
[]? __alloc_skb+0x69/0x15a
[]? sock_alloc_send_pskb+0xbd/0x2d8
[]? pollwake+0x0/0x5b
[]? unix_stream_sendmsg+0x133/0x2a1
[]? sock_aio_write+0xb1/0xbc
[]? sock_aio_write+0x0/0xbc
[]? do_sync_readv_writev+0xc0/0x107
[]? autoremove_wake_function+0x0/0x2e
[]? rw_copy_check_nvector+0x6d/0xe4
[]? do_readv_writev+0xb2/0x115
[]? pvclock_clocksource_read+0x3a/0x70
[]? sys_writev+0x45/0x93
[]? system_call_fastpath+0x16/0x1b

This stack trace appears multiple times (for an hour or more before I pushed the
power button) and unfortunately I cannot see anything more useful in the
error. At the start there was also some message about aacraid, but it
vanished too quickly to see what was written there.

-----Original Message-----
From: Debian BTS [mailto:debb...@busoni.debian.org] On Behalf Of Debian Bug
Tracking System
Sent: Saturday, September 11, 2010 10:15 AM
To: Artur Linhart
Subject: Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64:
causes a system hangup by the shutdown of the system, aacraid (sw raid)
involved in hangup)

Thank you for filing a new Bug report with Debian.

This is an automatically generated reply to let you know your message
has been received.

Your message is being forwarded to the package maintainers and other
interested parties for their attention; they will reply in due course.

As you requested using X-Debbugs-CC, your message was also forwarded to
  al.li...@bcpraha.com
(after having been given a Bug report number, if it did not have one).

Your message has been sent to the package maintainer(s):
 Debian Kernel Team debian-ker...@lists.debian.org

If you wish to submit further information on this problem, please
send it to 596...@bugs.debian.org.

Please do not send mail to ow...@bugs.debian.org unless you wish
to report a problem with the Bug-tracking system.

-- 
596419: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=596419
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems









Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)

2010-09-11 Thread Artur Linhart - Linux communication
Additional info:

Downgrading the package and kernel image to 2.6.32-18 did not help.
Running
/etc/init.d/xendomains stop
still produced an error, which also contained the following message:

kernel bug  /source_amd_xen/drivers/scsi/aacraid/aachba.c:2825!

(the dots were not there, there was some other text).

Also other downgrades of the hypervisor or qemu-dm did not help.

The problem may also be caused by the heavy load on the sw raid - because of
the previous crashes, the raid arrays were being resynced when the next
crashes happened. The xendomains script suspends the instances to disk
(about 2.5 GB of data), so a possible cause could be the heavy load caused
by the virtual domain suspension combined with the running resync of the
mdadm array.

After the array is fully resynced, I will try it again with the downgraded mdadm.




