Hi -

I've been using the livecd set of tools to build a pxeboot image for a set of 
compute nodes in our local HPC environment.  The livecd project has allowed me 
to make all of the compute nodes diskless, and any software errors are trivial 
to fix (just reboot).

I've run into one problem: if any process on a node produces a large amount of 
disk I/O to /tmp - somewhere around 0.5GiB or more in a single operation - the 
root filesystem panics and the node must be rebooted.

Creating the image is as simple as:
# LANG=C livecd-creator --config=/local/nodes/hyades-nodes.cfg --fslabel=hyades \
    -t /local/nodes/
# livecd-iso-to-pxeboot /local/nodes/hyades.iso

The exact error caused during the I/O operation on a compute node is logged as:
May  2 16:11:32 eth-c31.cluster kernel: device-mapper: snapshots: Invalidating 
snapshot: Unable to allocate exception. 
May  2 16:11:32 eth-c31.cluster syslogd: /var/log/messages: Read-only file 
system 
May  2 16:11:32 eth-c31.cluster kernel: Buffer I/O error on device dm-0, 
logical block 997925 
May  2 16:11:32 eth-c31.cluster kernel: lost page write due to I/O error on 
dm-0 
May  2 16:11:32 eth-c31.cluster kernel: Aborting journal on device dm-0. 
May  2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head: freeing 
b_committed_data 
May  2 16:11:32 eth-c31.cluster last message repeated 5 times 
May  2 16:11:32 eth-c31.cluster kernel: journal commit I/O error 
May  2 16:11:32 eth-c31.cluster kernel: ext3_abort called. 
May  2 16:11:32 eth-c31.cluster kernel: EXT3-fs error (device dm-0): 
ext3_journal_start_sb: Detected aborted journal 
May  2 16:11:32 eth-c31.cluster kernel: Remounting filesystem read-only 
May  2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head: freeing 
b_committed_data 
May  2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head: freeing 
b_committed_data 
May  2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head: freeing 
b_frozen_data 
May  2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head: freeing 
b_frozen_data 
May  2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head: freeing 
b_committed_data 
May  2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head: freeing 
b_frozen_data 
May  2 16:11:43 eth-c31.cluster kernel: printk: 259144 messages suppressed. 
May  2 16:11:43 eth-c31.cluster kernel: Buffer I/O error on device dm-0, 
logical block 737 
May  2 16:11:43 eth-c31.cluster kernel: lost page write due to I/O error on 
dm-0 
May  2 16:11:43 eth-c31.cluster kernel: Buffer I/O error on device dm-0, 
logical block 115035 
May  2 16:11:43 eth-c31.cluster kernel: lost page write due to I/O error on 
dm-0 


Googling for information suggests that the device underlying the filesystem is 
running out of space, which would explain why the filesystem crashes. However, 
df reports that the / filesystem has plenty of free space:

[root@c31 ~]# df -h
/dev/mapper/live-rw   6.0G  1.2G  4.8G  19% /
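
If it's the device-mapper snapshot's exception store (the copy-on-write 
overlay) that is filling up rather than the ext3 filesystem itself, df would 
never show it; `dmsetup status` reports the overlay's allocated/total sectors 
directly. Here is a minimal sketch of checking that - the status line below is 
a made-up sample with hypothetical numbers, on a real node it would come from 
`dmsetup status live-rw`:

```shell
# Sample dm-snapshot status line (hypothetical numbers; on a node, use:
#   status=$(dmsetup status live-rw)
# instead). Format: <start> <length> snapshot <allocated>/<total> <meta>
status='0 8388608 snapshot 1044480/1048576 4096'

# Field 4 is <allocated>/<total> exception-store sectors; compute % used.
alloc=$(echo "$status" | awk '{ split($4, a, "/"); printf "%d", a[1]*100/a[2] }')
echo "overlay ${alloc}% full"
```

When the allocated count reaches the total, the kernel invalidates the 
snapshot - which matches the "Unable to allocate exception" line in the log 
above exactly.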

I've adjusted the "part / -size 6144" parameter in my kickstart file, but the 
only visible effect is that the size df reports changes to match what I 
specify. Writing a file larger than about 512MB to /tmp still crashes the 
filesystem, even though df reports the space as available.

Each compute node has 32GB of system memory, and is running an x86_64 kernel.
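
One workaround I'm considering (untested, and the size value is just a guess 
based on the nodes' 32GB of RAM) is to keep large /tmp writes out of the 
snapshot entirely by mounting a tmpfs over /tmp from the kickstart %post:

```shell
%post
# Mount a tmpfs over /tmp so large scratch files never touch the
# device-mapper snapshot overlay. size=8g is an arbitrary cap, well
# under the nodes' 32GB of RAM; adjust to taste.
cat >> /etc/fstab <<'EOF'
tmpfs   /tmp    tmpfs   defaults,size=8g    0 0
EOF
%end
```

Since this edits /etc/fstab inside the image at build time, every node picks 
it up at boot with no per-node configuration.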

I'm open to any suggestions on how to fix this issue.

Thanks!
Howard


Howard Powell
[email protected]





--
livecd mailing list
[email protected]
https://admin.fedoraproject.org/mailman/listinfo/livecd
