Hi Emmanuel.

Wow, these are good tips we can check for. Thank you!

What we started with is my thread from December:
[So Dez 27 05:17:44 2015] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[So Dez 27 05:17:44 2015] ata1.00: failed command: WRITE DMA
[So Dez 27 05:17:44 2015] ata1.00: cmd ca/00:80:b8:4e:ce/00:00:00:00:00/eb tag 0 dma 65536 out res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4(timeout)
[So Dez 27 05:17:44 2015] ata1.00: status: { DRDY }
[So Dez 27 05:17:44 2015] ata1: soft resetting link
[So Dez 27 05:17:45 2015] ata1.01: NODEV after polling detection
[So Dez 27 05:17:45 2015] ata1.00: configured for MWDMA2
[So Dez 27 05:17:45 2015] ata1.00: device reported invalid CHS sector 0
[So Dez 27 05:17:45 2015] ata1: EH complete

OR

kernel: [309438.824333] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
kernel: [309438.825198] ata1.00: failed command: FLUSH CACHE
kernel: [309438.825921] ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
kernel: [309438.825921]          res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
kernel: [309438.827996] ata1.00: status: { DRDY }
kernel: [309443.868140] ata1: link is slow to respond, please be patient (ready=0)
kernel: [309448.852147] ata1: device not ready (errno=-16), forcing hardreset
kernel: [309448.852175] ata1: soft resetting link
kernel: [309449.009123] ata1.00: configured for MWDMA2
kernel: [309449.009129] ata1.00: retrying FLUSH 0xe7 Emask 0x4
kernel: [309449.009532] ata1.00: device reported invalid CHS sector 0
kernel: [309449.009545] ata1: EH complete

The problem started with VMs simply stopping, with those messages in kernel.log inside the VM.
Everything worked fine for half a year and then it started ;)
There were several VMs on these hosts, some with older kernels, some with newer ones (e.g. Debian 8.2 with 3.16.x). BUT only the newer VMs with newer kernels stopped working (a few days after the last update of the pve-kernel 2.x). The crashing VMs are the smaller ones; the big ones with old kernels just run and run and run.
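
For anyone who wants to check their own guests for the same pattern, something like this should do it (the log path is the Debian default, adjust as needed):

# scan the guest's kernel log for the ATA timeout/reset pattern shown above
grep -E 'ata[0-9]+(\.[0-9]+)?: (exception|failed command|soft resetting)' /var/log/kern.log
# or look at the running kernel's ring buffer
dmesg | grep -E 'frozen|soft resetting link|hung_task'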

So after Dmitry's response we switched from IDE to the default virtIO.
A short time later we got hung_task_timeout_secs / "blocked for more than 120 seconds" problems. BUT again ONLY in the Debian 8.2 VMs, sporadically, and not during backup or cronjob times (daily, weekly, etc.).
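
For reference, the switch itself is just the disk bus in the VM config on the host; a minimal sketch with a made-up VM ID, storage name and disk file:

# /etc/pve/qemu-server/101.conf  (example values only)
# before:
#   ide0: local:101/vm-101-disk-1.qcow2,size=32G
#   bootdisk: ide0
# after:
#   virtio0: local:101/vm-101-disk-1.qcow2,size=32G
#   bootdisk: virtio0
# inside the guest the disk then shows up as /dev/vda instead of /dev/sda,
# so fstab/grub should reference UUIDs (Debian's default) rather than device names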


So to check your points:
1. heavy IO
IO activity is nearly zero on those VMs; only during backup times are there big peaks, and during the backups the VMs run just fine. The problems occur on an empty host with only one VM and a RAID 5 of 7200 rpm SAS drives, and also on another node with a RAID 1 of 7200 rpm SAS drives. IO-wait on the busy node peaks at about 1-2% according to the Proxmox GUI.
So I don't think heavy IO is the problem.
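
If anyone wants to double-check the IO side on their own host, the usual suspects (sysstat assumed to be installed):

# extended per-device stats, 3 samples of 5 seconds; watch %util and await
iostat -x 5 3
# or just keep an eye on the 'wa' (iowait) column
vmstat 5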

2. RAM
The VMs only have between 4 and 8 GB of RAM each (the host has only 32 GB of RAM), so this should be easily handled by the RAID controller.

3. berserker
This was also my first idea, but the problem occurs across different VMs. Yes, all VMs run Debian 8.2 and Plesk 12.5, but with different sites.
So there would have to be some identical problem in Debian or Plesk.

All Plesk servers are freshly installed and run with at most 10 to 20 domains and only small sites.
Plesk 12.5 ships with a reverse caching proxy (nginx) by default.
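
To see whether that proxy is actually in the request path on a given VM, the generic checks are enough (nothing Plesk-specific assumed):

# is nginx running, and is it the one listening on ports 80/443 in front of Apache?
ps aux | grep '[n]ginx'
ss -tlnp | grep -E ':(80|443)\b'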


----

After reading a lot of threads in the Proxmox forum and on many other sites, the discussion largely pointed towards kernel problems (see e.g. https://forum.proxmox.com/threads/linux-guest-problems-on-new-haswell-ep-processors.20372/page-4#post-124663).

Yesterday we migrated the first machine to the new Proxmox 4 with the 4.x kernel and will now see how long it stays up without errors.

And again, a big, big thank you for your ideas!

kind regards
Michael



