Dear List,

I'm trying to set up a testbed for batch systems using qemu-kvm. So far, I've created two machines, a master ("torque") and an execution host ("mom") for use with torque. I'm using the following command lines to start up the virtual machines:

    qemu-kvm -smp 2 -m 768 -hda ./torque.qcow2 \
        -net nic,vlan=1,macaddr=52:54:00:12:34:56 \
        -net nic,vlan=2,macaddr=52:54:00:12:34:57 \
        -net user,vlan=2 \
        -net socket,vlan=1,listen=localhost:1234 \
        -redir tcp:26022::22 -nographic -daemonize

    qemu-kvm -smp 2 -m 768 -hda ./mom.qcow2 \
        -net nic,vlan=1,macaddr=52:54:00:12:34:58 \
        -net socket,vlan=1,connect=localhost:1234 \
        -nographic -daemonize

which I took from http://www.h7.dion.ne.jp/~qemu-win/HowToNetwork-en.html.

Everything works fine: I can see the internet from "mom" via "torque", NFS-mount the users' home directory from "torque" on "mom", and resolve users via NIS.

Here's the ifconfig of the nodes:

    torque:~ # ifconfig
    eth0      Link encap:Ethernet  HWaddr 52:54:00:12:34:56
              inet addr:192.168.42.250  Bcast:192.168.42.255  Mask:255.255.255.0
              inet6 addr: fe80::5054:ff:fe12:3456/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:707 errors:0 dropped:0 overruns:0 frame:0
              TX packets:1873 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:44388 (43.3 Kb)  TX bytes:2539091 (2.4 Mb)
              Interrupt:11 Base address:0x2000

    eth1      Link encap:Ethernet  HWaddr 52:54:00:12:34:57
              inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
              inet6 addr: fe80::5054:ff:fe12:3457/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:69 errors:0 dropped:0 overruns:0 frame:0
              TX packets:88 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:7837 (7.6 Kb)  TX bytes:13548 (13.2 Kb)
              Interrupt:10 Base address:0xc000

And "mom":

    mom:~ # ifconfig
    eth0      Link encap:Ethernet  HWaddr 52:54:00:12:34:58
              inet addr:192.168.42.1  Bcast:192.168.42.255  Mask:255.255.255.0
              inet6 addr: fe80::5054:ff:fe12:3458/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:1888 errors:0 dropped:0 overruns:0 frame:0
              TX packets:752 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:2514373 (2.3 Mb)  TX bytes:60325 (58.9 Kb)
              Interrupt:11 Base address:0x2000

The ping times
between the servers are the following:

    torque:~ # ping mom
    PING mom.qemu (192.168.42.1) 56(84) bytes of data.
    64 bytes from mom.qemu (192.168.42.1): icmp_seq=1 ttl=64 time=39.6 ms
    64 bytes from mom.qemu (192.168.42.1): icmp_seq=2 ttl=64 time=39.4 ms
    64 bytes from mom.qemu (192.168.42.1): icmp_seq=3 ttl=64 time=39.7 ms
    64 bytes from mom.qemu (192.168.42.1): icmp_seq=4 ttl=64 time=39.8 ms
    64 bytes from mom.qemu (192.168.42.1): icmp_seq=5 ttl=64 time=39.8 ms
    64 bytes from mom.qemu (192.168.42.1): icmp_seq=6 ttl=64 time=39.8 ms
    64 bytes from mom.qemu (192.168.42.1): icmp_seq=7 ttl=64 time=39.8 ms

Do these times make sense?

However, batch operations are not working properly. Jobs start fine and produce the right output, but when it comes to tidying up, the "mom" machine can't contact "torque":

    Aug 3 10:10:26 mom pbs_mom: LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
    Aug 3 10:10:27 mom pbs_mom: LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
    Aug 3 10:10:28 mom pbs_mom: LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
    Aug 3 10:10:29 mom pbs_mom: LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
    Aug 3 10:10:29 mom pbs_mom: LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
    Aug 3 10:10:29 mom pbs_mom: LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
    Aug 3 10:10:30 mom pbs_mom: LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
    Aug 3 10:10:31 mom pbs_mom: LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
    Aug 3 10:10:32 mom pbs_mom: LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
    Aug 3 10:10:33 mom pbs_mom: LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
    Aug 3 10:10:34 mom pbs_mom: LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused

At this time, tcpdump on the "torque" machine says:

    10:10:17.072582 IP mom.qemu.1023 > torque.qemu.pbs: Flags [S], seq 25915729, win 5840, options [mss 1460,sackOK,TS val 719328 ecr 0,nop,wscale 6], length 0
    10:10:17.072647 IP torque.qemu.pbs > mom.qemu.1023: Flags [S.], seq 18959859, ack 25915730, win 5792, options [mss 1460,sackOK,TS val 756722 ecr 719328,nop,wscale 6], length 0
    10:10:17.152568 IP mom.qemu.1023 > torque.qemu.pbs: Flags [R], seq 25915730, win 0, length 0
    10:10:18.084234 IP mom.qemu.1023 > torque.qemu.pbs: Flags [S], seq 41724490, win 5840, options [mss 1460,sackOK,TS val 720340 ecr 0,nop,wscale 6], length 0
    10:10:18.084297 IP torque.qemu.pbs > mom.qemu.1023: Flags [S.], seq 34766899, ack 41724491, win 5792, options [mss 1460,sackOK,TS val 757734 ecr 720340,nop,wscale 6], length 0
    10:10:18.163568 IP mom.qemu.1023 > torque.qemu.pbs: Flags [R], seq 41724491, win 0, length 0
    10:10:19.095909 IP mom.qemu.1023 > torque.qemu.pbs: Flags [S], seq 57533379, win 5840, options [mss 1460,sackOK,TS val 721352 ecr 0,nop,wscale 6], length 0
    10:10:19.095947 IP torque.qemu.pbs > mom.qemu.1023: Flags [S.], seq 50574033, ack 57533380, win 5792, options [mss 1460,sackOK,TS val 758745 ecr 721352,nop,wscale 6], length 0
    10:10:19.175628 IP mom.qemu.1023 > torque.qemu.pbs: Flags [R], seq 57533380, win 0, length 0

netstat says:

    torque:~ # netstat | grep 1023
    tcp        0      0 torque.qemu:1023        mom.qemu:pbs_mom        TIME_WAIT
    tcp        0      0 torque.qemu:1023        mom.qemu:pbs_mom        TIME_WAIT

Might the performance of my internal network connection (192.168.42.0/24) not be sufficient?

Thanks for your help,

Cheers,
Peter.

------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Registered office: Juelich
Entered in the Commercial Register of the District Court of Dueren, No. HR B 3498
Chairman of the Supervisory Board: MinDirig Dr. Karl Eugen Huthmacher
Board of Directors: Prof. Dr. Achim Bachem (Chairman),
Dr. Ulrich Krafft (Deputy Chairman),
Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
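
P.S. To put numbers on the latency question, here is a small Python sketch that takes the timestamps from the tcpdump excerpt above and compares the delay between "torque" sending its SYN-ACK and receiving the RST from "mom" with the ping round-trip times. The helper names (to_ms, synack_to_rst_gaps_ms, mean) are just mine for illustration and are not part of torque or qemu. For what it's worth, errno 115 in the pbs_mom log is EINPROGRESS on Linux, which is what a non-blocking connect() reports while the handshake is still pending; whether pbs_mom gives up on the connection before the handshake completes is only my guess, the logs don't confirm it.

```python
from datetime import datetime

# Timestamps copied from the tcpdump excerpt above, one tuple per attempt:
# (SYN received, SYN-ACK sent, RST received), all as seen on "torque".
ATTEMPTS = [
    ("10:10:17.072582", "10:10:17.072647", "10:10:17.152568"),
    ("10:10:18.084234", "10:10:18.084297", "10:10:18.163568"),
    ("10:10:19.095909", "10:10:19.095947", "10:10:19.175628"),
]

# Round-trip times from the ping output above, in milliseconds.
PING_RTT_MS = [39.6, 39.4, 39.7, 39.8, 39.8, 39.8, 39.8]


def to_ms(stamp):
    """Convert an HH:MM:SS.ffffff tcpdump timestamp to milliseconds since midnight."""
    t = datetime.strptime(stamp, "%H:%M:%S.%f")
    seconds = (t.hour * 60 + t.minute) * 60 + t.second
    return seconds * 1000.0 + t.microsecond / 1000.0


def synack_to_rst_gaps_ms(attempts):
    """Delay between sending the SYN-ACK and receiving the RST, per attempt."""
    return [to_ms(rst) - to_ms(synack) for _syn, synack, rst in attempts]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


if __name__ == "__main__":
    gaps = synack_to_rst_gaps_ms(ATTEMPTS)
    for gap in gaps:
        print(f"SYN-ACK -> RST: {gap:.1f} ms")
    print(f"mean gap:       {mean(gaps):.1f} ms")
    print(f"mean ping RTT:  {mean(PING_RTT_MS):.1f} ms")
```

With the values above, the RST arrives roughly 80 ms after the SYN-ACK in all three attempts, about twice the ping round-trip time of roughly 40 ms. I don't know yet whether that ratio means anything.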