On Mon, May 2, 2011 at 4:46 PM, Howard Powell <[email protected]> wrote:
> Hi -
>
> I've been using the livecd set of tools to build a pxeboot image for a
> set of compute nodes in our local HPC environment. The livecd project
> has allowed me to make all of the compute nodes diskless, and any
> software errors are trivial to fix (just reboot).
>
> I've run into one problem - there appears to be a problem with my image
> where if any process on a node produces a large amount of disk I/O to
> /tmp - somewhere around 0.5GiB or more in one operation, causes the root
> filesystem to panic and the node must be rebooted.
>
> Creating the image is as simple as:
> # LANG=C livecd-creator --config=/local/nodes/hyades-nodes.cfg --fslabel=hyades -t /local/nodes/
> # livecd-iso-to-pxeboot /local/nodes/hyades.iso
>
> The exact error caused during the I/O operation on a compute node is
> logged as:
> May 2 16:11:32 eth-c31.cluster kernel: device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
> May 2 16:11:32 eth-c31.cluster syslogd: /var/log/messages: Read-only file system
> May 2 16:11:32 eth-c31.cluster kernel: Buffer I/O error on device dm-0, logical block 997925
> May 2 16:11:32 eth-c31.cluster kernel: lost page write due to I/O error on dm-0
> May 2 16:11:32 eth-c31.cluster kernel: Aborting journal on device dm-0.
> May 2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head: freeing b_committed_data
> May 2 16:11:32 eth-c31.cluster last message repeated 5 times
> May 2 16:11:32 eth-c31.cluster kernel: journal commit I/O error
> May 2 16:11:32 eth-c31.cluster kernel: ext3_abort called.
> May 2 16:11:32 eth-c31.cluster kernel: EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
> May 2 16:11:32 eth-c31.cluster kernel: Remounting filesystem read-only
> May 2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head: freeing b_committed_data
> May 2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head: freeing b_committed_data
> May 2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head: freeing b_frozen_data
> May 2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head: freeing b_frozen_data
> May 2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head: freeing b_committed_data
> May 2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head: freeing b_frozen_data
> May 2 16:11:43 eth-c31.cluster kernel: printk: 259144 messages suppressed.
> May 2 16:11:43 eth-c31.cluster kernel: Buffer I/O error on device dm-0, logical block 737
> May 2 16:11:43 eth-c31.cluster kernel: lost page write due to I/O error on dm-0
> May 2 16:11:43 eth-c31.cluster kernel: Buffer I/O error on device dm-0, logical block 115035
> May 2 16:11:43 eth-c31.cluster kernel: lost page write due to I/O error on dm-0
>
> Googling for information suggests that the device underlying the
> filesystem is running out of space, which explains why the filesystem
> crashes. df reports that the / filesystem should have space:
>
> [root@c31 ~]# df -h
> /dev/mapper/live-rw 6.0G 1.2G 4.8G 19% /
>
> I've adjusted the "part / -size 6144" parameter in my kickstart file,
> but I see no effective results other than the size that df reports
> changes to match what I specify. Writing a file to /tmp larger than
> about 512MB causes the filesystem to continue to crash even if the
> space is reported as available.
>
> Each compute node has 32GB of system memory, and is running an x86_64
> kernel.
>
> I'm open to any suggestions on how to fix this issue.
>
> Thanks!
> Howard
>

I'm not familiar with livecd-iso-to-pxeboot, but a standard LiveOS image
places /tmp in a tmpfs. See
http://git.fedorahosted.org/git/?p=spin-kickstarts.git;a=blob;f=fedora-live-base.ks;h=88bbf7057d099eb872f844b09fbf596bbee5eb32;hb=master#l171

You may try adjusting that line in /etc/rc.d/init.d/livesys, if that
fits your situation.

--Fred
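
For what it's worth, here is a rough shell sketch of how one might check both things on a node: whether /tmp is actually on a tmpfs, and how full the device-mapper snapshot behind / is. The function names (fstype_of, snapshot_pct) are made up for this example, the device name live-rw is taken from the df output above, and the 8g size is only a placeholder. It also assumes that `dmsetup status` prints snapshot targets as `<start> <length> snapshot <used>/<total> ...` in sectors, which is worth verifying on your kernel.

```shell
#!/bin/sh
# Sketch only: helper names and sample sizes are illustrative.

# Print the filesystem type backing a mount point, reading
# /proc/mounts-style lines ("device mountpoint fstype options ...")
# from stdin. The last matching line wins, since later mounts
# shadow earlier ones.
fstype_of() {
    awk -v mp="$1" '$2 == mp { t = $3 } END { if (t != "") print t }'
}

# Print the percentage of a snapshot's exception store in use, given
# one `dmsetup status <device>` line on stdin.
# Assumed layout: "<start> <length> snapshot <used>/<total> ...".
# An invalidated snapshot is assumed to report "Invalid" instead.
snapshot_pct() {
    awk '$3 == "snapshot" {
        if ($4 == "Invalid") { print "invalid"; exit }
        split($4, a, "/")
        if (a[2] + 0 > 0) printf "%d\n", (a[1] * 100) / a[2]
    }'
}

# Typical use on a node (dmsetup and mount need root):
#   fstype_of /tmp < /proc/mounts
#   dmsetup status live-rw | snapshot_pct
#   mount -t tmpfs -o mode=1777,size=8g tmpfs /tmp   # size is an example
```

If fstype_of reports something other than tmpfs for /tmp, writes there land on the root filesystem and consume the snapshot's exception store, which would fit the "Unable to allocate exception" message in the log. When no size= option is given, tmpfs normally defaults to half of physical RAM.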
--
livecd mailing list
[email protected]
https://admin.fedoraproject.org/mailman/listinfo/livecd
