Hi Emmanuel.

Wow, these are good tips we can check for. Thank you!

What we started with is my thread from December:
[So Dez 27 05:17:44 2015] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[So Dez 27 05:17:44 2015] ata1.00: failed command: WRITE DMA
[So Dez 27 05:17:44 2015] ata1.00: cmd ca/00:80:b8:4e:ce/00:00:00:00:00/eb tag 0 dma 65536 out res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4(timeout)
[So Dez 27 05:17:44 2015] ata1.00: status: { DRDY }
[So Dez 27 05:17:44 2015] ata1: soft resetting link
[So Dez 27 05:17:45 2015] ata1.01: NODEV after polling detection
[So Dez 27 05:17:45 2015] ata1.00: configured for MWDMA2
[So Dez 27 05:17:45 2015] ata1.00: device reported invalid CHS sector 0
[So Dez 27 05:17:45 2015] ata1: EH complete

OR

kernel: [309438.824333] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
kernel: [309438.825198] ata1.00: failed command: FLUSH CACHE
kernel: [309438.825921] ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
kernel: [309438.825921]          res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
kernel: [309438.827996] ata1.00: status: { DRDY }
kernel: [309443.868140] ata1: link is slow to respond, please be patient (ready=0)
kernel: [309448.852147] ata1: device not ready (errno=-16), forcing hardreset
kernel: [309448.852175] ata1: soft resetting link
kernel: [309449.009123] ata1.00: configured for MWDMA2
kernel: [309449.009129] ata1.00: retrying FLUSH 0xe7 Emask 0x4
kernel: [309449.009532] ata1.00: device reported invalid CHS sector 0
kernel: [309449.009545] ata1: EH complete

The problem started with VMs simply stopping, with those messages in kernel.log inside the VM.
Everything worked fine for half a year and then it started ;)
There were several VMs on these hosts, some with older kernels, some with newer ones (e.g. Debian 8.2 with 3.16.x). BUT only the newer VMs with newer kernels stopped working (a few days after the last update of the pve-kernel 2.x). The crashing VMs are the smaller ones; the big ones with old kernels just run and run and run.
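
For anyone who wants to check their own guests for the same pattern, something like this should do it (the log path is the Debian default, adjust as needed):

# scan the guest's kernel log for the ATA timeout/reset pattern shown above
grep -E 'ata[0-9]+(\.[0-9]+)?: (exception|failed command|soft resetting)' /var/log/kern.log
# or look at the running kernel's ring buffer
dmesg | grep -E 'frozen|soft resetting link|hung_task'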

So after Dmitry's response we switched from IDE to the default virtIO.
A short time later we got hung_task_timeout_secs / "blocked for more than 120 seconds" problems. BUT again ONLY in the Debian 8.2 VMs, sporadically, and not during backup or cronjob times (daily, weekly, etc.).
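
For reference, the switch itself is just the disk bus in the VM config on the host; a minimal sketch with a made-up VM ID, storage name and disk file:

# /etc/pve/qemu-server/101.conf  (example values only)
# before:
#   ide0: local:101/vm-101-disk-1.qcow2,size=32G
#   bootdisk: ide0
# after:
#   virtio0: local:101/vm-101-disk-1.qcow2,size=32G
#   bootdisk: virtio0
# inside the guest the disk then shows up as /dev/vda instead of /dev/sda,
# so fstab/grub should reference UUIDs (Debian's default) rather than device names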


So to check your points:
1. heavy IO
IO activity is nearly zero on those VMs; only during backup times are there big peaks, and during the backups the VMs run just fine. The problems occur on an empty host with only one VM and a RAID 5 of 7200 rpm SAS drives, and also on another node with a RAID 1 of 7200 rpm SAS drives. IO-wait on the busy node peaks at about 1-2% according to the Proxmox GUI.
So I don't think heavy IO is the problem.
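
If anyone wants to double-check the IO side on their own host, the usual suspects (sysstat assumed to be installed):

# extended per-device stats, 3 samples of 5 seconds; watch %util and await
iostat -x 5 3
# or just keep an eye on the 'wa' (iowait) column
vmstat 5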

2. RAM
The VMs only have between 4 and 8 GB of RAM each (the host has only 32 GB of RAM), so this should be easily handled by the RAID controller.

3. berserker
This was also my first idea, but the problem occurs across different VMs. Yes, all VMs run Debian 8.2 and Plesk 12.5, but with different sites.
So there would have to be some identical problem in Debian or Plesk.

All Plesk servers are freshly installed and run with at most 10 to 20 domains and only small sites.
Plesk 12.5 ships with a reverse caching proxy (nginx) by default.
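
To see whether that proxy is actually in the request path on a given VM, the generic checks are enough (nothing Plesk-specific assumed):

# is nginx running, and is it the one listening on ports 80/443 in front of Apache?
ps aux | grep '[n]ginx'
ss -tlnp | grep -E ':(80|443)\b'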


----

After reading a lot of threads in the Proxmox forum and on many other sites, the discussion largely pointed towards kernel problems (see e.g. https://forum.proxmox.com/threads/linux-guest-problems-on-new-haswell-ep-processors.20372/page-4#post-124663).

Yesterday we migrated the first machine to the new Proxmox 4 with the 4.x kernel and will now see how long it stays up without errors.

And again, a big, big thank you for your ideas!

kind regards
Michael



