> On Fri, 2003-09-05 at 17:06, Jason Lim wrote:
>> Hi all, <snip>
>> Just wondering... I've got a 2.4GHz Hyperthreading (100% it is
>> the hyperthreading model), and the BIOS sees it.
>> Hope you can advise... as hyperthreading is there but not
>> being used, which is a waste and could add performance.

Appended below is a mail from a friend of mine; his company does work
with Open Source systems for the larger enterprises in India. He came
across HT and, not finding any "real life" data, decided that the only
task worth doing was a kernel compile. The report is good reading. I am
cc:ing him, as he is not subscribed to this list. Hope it helps.

-- Sanjeev

==BEGIN==

Intel has released this new feature in its high-end processors, where
one processor can internally act as two processors in hardware. In
essence, there is a second set of CPU registers and other architectural
state, while the execution units and caches are shared, and there is
only one external address and data bus, so the interface between the
CPU and the rest of the hardware is (largely) like that of a single
processor. Look for Hyperthreading on www.intel.com.

The interesting thing about hyperthreading is that it is done purely in
hardware. This means that the OS kernel cannot tell the difference
between two physical processors and one Intel Xeon with hyperthreading
switched on.

We recently got a client's machine for setting up a database server. We
had asked for a two-processor machine; we got a dual-processor box with
Hyperthreading. You go into the ROM BIOS and switch hyperthreading on
or off. If it is switched on, /proc/cpuinfo (we work in Linux) shows
four processors.

I wanted to see whether we get the power of four processors when we
switch on hyperthreading, in a typical SMP Linux environment. I wanted
to run a set of parallel Unix processes with and without
hyperthreading, and see whether I got faster system throughput with
four virtual processors than with two real ones.

TYPE OF JOBS: I wanted jobs which would do some I/O but primarily a lot
of in-memory data manipulation, and I didn't have the time to write
custom code, so I chose C compilation. A "make" on a large source tree
would give me a lot of this sort of workload.

PARALLELISM CONTROL: I used the "-j" option of GNU Make. This controls
how many parallel branches are fired off by the top-level "make" for
the compilations. Clearly this is a less than perfect way to generate
parallel workloads, because a full compilation of a complete source
tree does not have a consistent degree of parallelism. I am certain
that the last part of a compilation job is sequential, but hopefully,
with a large enough source tree and a sufficiently large number of
independent modules, 95%+ of the compilation has opportunities for
dozens of parallel threads.

ACTUAL WORKLOAD: The final script that I ran did the following, one
step after the other, in /usr/src/linux:

    make -j $COUNT clean
    make -j $COUNT dep
    make -j $COUNT bzImage
    make -j $COUNT modules

As you can see, this already gives you sequential points, when one
"make" completes and the next "make" starts.

The size of the job was quite huge. I ran "make config" first, hit
"Enter" and kept the key pressed. The resulting configuration has lots
and lots of optional modules selected. For instance, the test
compilations generate 2800+ .o files; the kernel which runs on my
laptop generates just 650+ .o files when compiled.

I could see that the workload was CPU-intensive; with high parallelism,
I was getting CPU idle time of less than 1%, and user-state CPU usage
was 93%+, the rest being in system calls. This profile is expected,
given the large amount of RAM available for disk caching.
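[The report does not include the driver script itself; the following is
only a minimal sketch of what it might have looked like, assuming bash,
with the parallelism level passed in as $COUNT. The script name and
structure are illustrative, not taken from the report.]

    #!/bin/bash
    # Illustrative reconstruction of the benchmark driver, not the original script.
    # Usage: ./htbench.sh <parallelism>   e.g. ./htbench.sh 4
    COUNT=${1:-1}                 # make parallelism level; defaults to 1
    cd /usr/src/linux || exit 1

    START=$SECONDS                # bash tracks elapsed wall-clock seconds in $SECONDS
    make -j $COUNT clean
    make -j $COUNT dep
    make -j $COUNT bzImage
    make -j $COUNT modules
    echo "-j $COUNT: $((SECONDS - START)) seconds"

[Run five times per parallelism setting and averaged, this is the shape
of the measurement described next.]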
MEASUREMENT METHOD: I ran the set of "make" commands and used the Bash
$SECONDS variable to read the system clock, to within +/- 1 second.
This error was okay; my kernel compilation took 2500+ seconds with zero
parallelism. Moreover, for each parallelism setting, I ran the full set
of "make" commands five times, taking the clock-time measurement each
time, and averaged the results using integer division. The error of
+/- 5 seconds due to integer division again should not matter; typical
job runtimes were always more than 1000 seconds.

SYSTEM CONFIG: Two physical Intel Xeon processors at 1.8GHz (as per
/proc/cpuinfo), 1 GB RAM, IDE drives, ext3 filesystem. (I also tried
ext2 filesystems, but got timings identical to ext3; ext3 journalling
does not seem to add any significant load.) The OS kernel was Linux
2.4.19-64GB-SMP, a stock SuSE 8.1 SMP kernel.

At peak loads, even with maximum parallelism, I never saw any swap
space being used. This means that the only disk I/O must have been for
writing out intermediate and output files. I guess there was
practically no page faulting on the system during my test runs, though
I didn't bother to verify this. Max RAM usage was about 970MB with
-j 6 at certain points.

The disks, though IDE, are fast. A "cat /proc/ide/piix" ("piix" being
the system's IDE chipset) showed that the hard disks were running in
UDMA mode 5 (ATA/100).

RESULTS: Here are the figures. The first set is with two processors
(Hyperthreading off). All timings are averages of five full runs, as
described above.

    make with -j 1: 2501 seconds
    make with -j 2: 1218 seconds
    make with -j 3: 1198 seconds
    make with -j 4: 1196 seconds
    make with -j 5: 1234 seconds
    make with -j 6: 1215 seconds

The second set is with Hyperthreading on, i.e. with four processors as
far as Linux was concerned. I didn't do the "-j 1" run here; I saw no
point in repeating the zero-parallelism run.

    make with -j 2: 1405 seconds
    make with -j 3: 1153 seconds
    make with -j 4: 1063 seconds
    make with -j 5: 1062 seconds
    make with -j 6: 1079 seconds

So, as you can see, with just two parallel threads, Hyperthreading
actually degrades overall system throughput. With higher numbers of
parallel threads, Hyperthreading gives throughput better than the best
figures without it. Therefore, if you have a large number of parallel
CPU-intensive processes, many more than the number of virtual CPUs, I
guess you're better off with Hyperthreading enabled. But with a limited
amount of parallelism, you _MAY_ in some cases get better throughput
without Hyperthreading.

With four virtual processors, I had already hit peak performance at
-j 5; at -j 6, the times had begun to climb again. This means that the
best system throughput comes when there is not too much more
parallelism than the number of processors, which is consistent with the
folklore. But if you feel that four virtual processors can give you the
same speedups that four physical processors can, forget it. Don't even
_think_ about it.

AN ASIDE: It's amazing to see such fast processors taking 40+ minutes
for a kernel compile (single threaded). These timings were so
unbelievable that I began to doubt the entire exercise. I checked
whether the disk partitions were being mounted with the "sync" option
(synchronous write-through), which can _really_ slow down writes. But
that was not the case.
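[The report doesn't say how the mount options were checked; a check
along these lines, given here only as an illustration, would show any
partition mounted with "sync":]

    mount | grep -w sync     # lists filesystems mounted with the "sync" option
    cat /proc/mounts         # or read the options field directly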
I switched from ext3 to ext2 and got no change; the journalling of ext3
does not seem to be slowing anything down. Finally, with -j 6, I
checked the timings of each leg separately: "make bzImage" took 171
seconds and created 638 .o files; "make modules" took 991 seconds and
created 2238 .o files. So I have now concluded that the extraordinarily
long time taken by my kernel compiles is due to the large set of
modules in the default kernel config. The benchmark figures are real.

Shuvam Misra
[EMAIL PROTECTED]