Re: [Beowulf] Your thoughts on the latest RHEL drama?
I sat down and carefully studied all the comments about RH's plans.

On Mon, Jun 26, 2023 at 02:27:23PM -0400, Prentice Bisbal via Beowulf wrote:
> Somewhere around event #3 is when I started viewing RHEL as the MS of the Linux world, for obvious reasons. It seems that RH is determined to make RHEL a monopoly of the "Enterprise Linux" market. Yes, I know there's Ubuntu and SLES, but Ubuntu is viewed as a desktop more than a server OS (IMO), and SLES hasn't really caught on, at least not in the US.

For a number of years the small x86-64 clusters I supported used OpenSuSE; on the next cluster I decided to switch to CentOS 7 simply because of its great popularity in HPC and the greater breadth of available packages. BTW, I also use CentOS 7 on my home desktop (with GNOME), and in general I want to have the same distribution both at home and on the cluster. So Ubuntu is a good starting point for me in the future :-) (Nvidia likes Ubuntu on their GPU servers, but that is all for AI). But I would also like to hear your point of view on SLES / OpenSuSE - after all, the Cray HPC OS is based on SuSE. Mikhail Kuzminsky ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] milan and rhel7
In message from Michael DiDomenico (Tue, 28 Jun 2022 17:40:09 -0400):
> milan cpu's aren't officially supported on less than rhel8.3. but there's anecdotal evidence that rhel7 will run on milan cpu's. if the evidence is true, is anyone on the list doing so and can confirm?

Yes, RHEL requires upgrading to 8.3 or later to work with EPYC 7003: https://access.redhat.com/articles/5899941. Officially CentOS 7 doesn't support this hardware either. You can switch to OpenSuSE - Milan support is available in 15.3. Mikhail Kuzminsky ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] likwid vs stream (after HPCG discussion)
In message from Scott Atchley (Sun, 20 Mar 2022 14:52:10 -0400):
> On Sat, Mar 19, 2022 at 6:29 AM Mikhail Kuzminsky wrote:
>> If so, it turns out that for the HPC user, STREAM gives a more important estimate - the application is translated by the compiler (they do not write in assembler, except for modules from mathematical libraries), and STREAM will give a real estimate of what will be obtained in the application.
> When vendors advertise STREAM results, they compile the application with non-temporal loads and stores. This means that all memory accesses bypass the processor's caches. If your application of interest does a random walk through memory and there is neither temporal nor spatial locality, then using non-temporal loads and stores makes sense and STREAM is irrelevant.

STREAM was never oriented toward random memory access. In that case memory latencies are what matter, and it makes more sense to get a bandwidth estimate from mega-stream (https://github.com/UK-MAC/mega-stream). ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
[Beowulf] likwid vs stream (after HPCG discussion)
In the HPCG discussion it was proposed to use the now widely used likwid benchmark suite to estimate memory bandwidth. It gives excellent estimates of the hardware capabilities. Am I right that likwid uses its own optimized assembler code for each specific microarchitecture? If so, it turns out that for the HPC user STREAM gives the more important estimate - applications are translated by the compiler (people do not write in assembler, except for modules from mathematical libraries), so STREAM gives a realistic estimate of what will actually be obtained in an application (a sketch of the kernel in question follows this message). Mikhail Kuzminsky ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
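For reference, this is roughly what the STREAM "triad" kernel looks like as plain Fortran that the compiler has to vectorize on its own, in contrast to likwid-bench's hand-written assembly kernels. It is only a minimal sketch (the real benchmark's timing and verification code is omitted, and the array size here is just an example):

! Minimal sketch of the STREAM triad kernel in plain Fortran.
! The point is that the compiler, not hand-written assembler,
! decides how these loads and stores are vectorized and scheduled.
program triad_sketch
  implicit none
  integer, parameter :: n = 20000000        ! large enough to defeat caches
  double precision, allocatable :: a(:), b(:), c(:)
  double precision :: q
  integer :: i
  allocate(a(n), b(n), c(n))
  b = 1.0d0; c = 2.0d0; q = 3.0d0
  do i = 1, n
     a(i) = b(i) + q*c(i)                   ! triad: 2 flops, 3 memory streams
  end do
  print *, a(n)                             ! keep the loop from being optimized away
end program triad_sketch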
[Beowulf] About TofuD in A64FX and Infiniband HDR
My initial question is about interconnect bandwidth in a cluster: I want to understand the bandwidth available with TofuD in the A64FX versus InfiniBand HDR. With HDR everything is clear - 25 GB/s for a 4x link or 75 GB/s for a 12x link; but the PCIe v3 x16 interface in the A64FX will cap that bandwidth anyway. Things are trickier with TofuD. Each link has 2 lanes; 28.05 Gbps x 2 gives about 7 GB/s (really about 6.8 GB/s). 6 TNIs, i.e. 6 links, give a total of 40.8 GB/s of injection bandwidth - more than the 25 GB/s of 4x HDR. This is for 2.2 GHz; I understand that at 1.8 GHz all the numbers decrease accordingly. If I calculated correctly above (a back-of-envelope check follows this message), then: can I get close to 40.8 GB/s in a simple MPI put to another node, or will the limit be 6.8 GB/s? In the latter case HDR will give more bandwidth (on Ookami with InfiniBand: 19.4 GB/s maximum in the OSU MPI benchmarks). Does Ookami use InfiniBand rather than TofuD because of such bandwidth considerations for a not very large cluster, or for financial reasons (cost of the TofuD routers?)? Mikhail Kuzminsky ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
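The back-of-envelope arithmetic behind the numbers above, written out as a tiny Fortran check. This is only my own arithmetic from the figures quoted in the message, not official Fujitsu or NVIDIA/Mellanox data:

! Rough check of the per-link and injection bandwidth figures quoted above.
program tofud_estimate
  implicit none
  double precision :: lane_gbps, link_gbs, inject_gbs
  lane_gbps  = 28.05d0                    ! signalling rate per lane, Gbit/s
  link_gbs   = 2.0d0 * lane_gbps / 8.0d0  ! 2 lanes per link -> ~7.0 GB/s raw
  inject_gbs = 6.0d0 * 6.8d0              ! 6 TNIs x ~6.8 GB/s effective
  print '(a,f6.2,a)', 'raw TofuD link bandwidth:  ', link_gbs,   ' GB/s'
  print '(a,f6.2,a)', 'total injection bandwidth: ', inject_gbs, ' GB/s'
end program tofud_estimate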
Re: [Beowulf] AMD and AVX512
I apologize - I should have written earlier, but I cannot always work with my broken right hand. It seems to me that a reasonable basis for discussing AMD EPYC performance could be the performance data in the Daresbury benchmark set from M. Guest. Yes, newer versions of AMD EPYC and Xeon Scalable processors have appeared since then, and new compiler versions. However, Intel already had AVX-512 support then, and AMD only 256-bit AVX2. Of course, peak performance is not as important as application performance. There are applications where performance is not limited by vector work - there AVX-512 may not be needed. In AI tasks, on the other hand, vector work does matter - and GPUs are often used there. For AI the Daresbury benchmarks are accordingly less relevant. And in Zen 4, AMD seems set to support 512-bit vectors. But the performance of linear algebra does not always require a GPU. In quantum chemistry you can get an acceleration from vectors on a V100 of, say, a factor of 2 - and how much more expensive is the GPU? Of course, support for 512-bit vectors is a plus, but you really need to look at application performance and cost (including power consumption). I prefer to look at the A64FX now, although applications may need to be rebuilt. Servers with A64FX are sold now, but the price is very important.

In message from John Hearns (Sun, 20 Jun 2021 06:38:06 +0100):
> Regarding benchmarking real world codes on AMD, every year Martyn Guest presents a comprehensive set of benchmark studies to the UK Computing Insights Conference. I suggest a Sunday afternoon with the beverage of your choice is a good time to settle down and take time to read these or watch the presentation.
> 2019 https://www.scd.stfc.ac.uk/SiteAssets/Pages/CIUK-2019-Presentations/Martyn_Guest.pdf
> 2020 Video session https://ukri.zoom.us/rec/share/ajvsxdJ8RM1wzpJtnlcypw4OyrZ9J27nqsfAG7eW49Ehq_Z5igat_7gj21Ge8gWu.78Cd9I1DNIjVViPV?startTime=1607008552000 Skylake / Cascade Lake / AMD Rome
> The slides for 2020 do exist - as I remember all the slides from all talks are grouped together, but I cannot find them. Watch the video - it is an excellent presentation.

On Sat, 19 Jun 2021 at 16:49, Gerald Henriksen wrote: On Wed, 16 Jun 2021 13:15:40 -0400, you wrote:
>The answer given, and I'm not making this up, is that AMD listens to their users and gives the users what they want, and right now they're not hearing any demand for AVX512.
>
>Personally, I call BS on that one. I can't imagine anyone in the HPC community saying "we'd like processors that offer only 1/2 the floating point performance of Intel processors".
I suspect that is marketing speak, which roughly translates to: it is not that no one has asked for it, but rather that requests haven't reached a threshold where they are viewed as significant enough.
> Sure, AMD can offer more cores, but with only AVX2 you'd need twice as many cores as Intel processors, all other things being equal.
But of course all other things aren't equal. AVX512 is a mess. Look at the Wikipedia page(*) and note that AVX512 means different things depending on the processor implementing it. So what does the poor software developer target? Or, for heat reasons, it can cause CPU frequency reductions, meaning real world performance may not match theoretical - thus it is easier to just go with GPUs. The result is that most of the world is quite happily (at least for now) ignoring AVX512 and going with GPUs as necessary - particularly given the convenient libraries that Nvidia offers.
> I compared a server with dual AMD EPYC 7H12 processors (128 cores) to one with quad Intel Xeon 8268 processors (96 cores).
> From what I've heard, the AMD processors run much hotter than the Intel processors, too, so I imagine a FLOPS/Watt comparison would be even less favorable to AMD.
Spec sheets would indicate AMD runs hotter, but then again you benchmarked twice as many Intel processors. So, per the spec sheets for your processors above:
AMD - 280W - 2 processors means 560W per system
Intel - 205W - 4 processors means 820W per system
(and then you also need to factor in purchase price).
> An argument can be made that calculations that lend themselves to vectorization should be done on GPUs instead of the main processors, but the last time I checked, GPU jobs are still memory limited, and moving data in and out of GPU memory can still take time, so I can see situations where for large amounts of data using CPUs would be preferred over GPUs.
AMD's latest chips support PCIe 4 while Intel is still stuck on PCIe 3, which may or may not make a difference. But despite all of the above and the other replies, it is AMD who has been winning the HPC contracts of late, not Intel.
* - https://en.wikipedia.org/wiki/Advanced_Vector_Extensions ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
[Beowulf] GPUs Nvidia C2050 w/OpenMP 4.5 in cluster
The heterogeneous nodes in my small CentOS 7 cluster have x86-64 CPUs along with old Nvidia C2050 (Fermi) GPUs. A new Fortran program uses MPI + OpenMP. Do the modern gfortran or Intel ifort compilers support offloading work to these GPUs through OpenMP 4.5? (A sketch of the kind of construct I mean follows this message.) Mikhail Kuzminsky, Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
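To make the question concrete, below is a minimal sketch of an OpenMP 4.5 target-offload loop in Fortran. This only illustrates the directive syntax being asked about; whether any current offload toolchain still accepts a Fermi-generation C2050 as a target is exactly the open question, and the subroutine and variable names are just examples:

! Sketch of an OpenMP 4.5 "target" region in Fortran (syntax illustration only;
! Fermi-class GPUs may well be too old for current offloading toolchains).
subroutine saxpy_offload(n, a, x, y)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i
  !$omp target teams distribute parallel do map(to: x) map(tofrom: y)
  do i = 1, n
     y(i) = a*x(i) + y(i)
  end do
  !$omp end target teams distribute parallel do
end subroutine saxpy_offload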
Re: [Beowulf] Fortran is Awesome
I believe that whether it is rational to use Fortran still depends very much on the application. In quantum chemistry, where I used to program, as in computational chemistry in general, Fortran remains the main language.

> Yes, C is dangerous. You can break your code in ever so many ways if you code with less than discipline and knowledge and great care.

This may mean that in some cases writing a Fortran program can be easier, and therefore faster, than writing it in C.

> Hell, at my age I may never write serious C applications ever again, but if I write ANYTHING that requires a compiler, its going to be in C.

I haven't programmed in quantum chemistry for a very long time. But recently I wrote a tiny program for a computational chemistry task - and I did it in Fortran :-) Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Oh.. IBM eats Red Hat
There are probably several reasons for this acquisition - on both the IBM and the Red Hat side. And it is very difficult to discuss it now, when it is not clear how events will develop in the future. But it is much better for Red Hat to join IBM than if Microsoft had gotten involved :-)).

> I don't know what to make of systemd as a design decision. I'm an Old Guy, so by definition I grew up with init and the classic Unix OS structure -- I still have all of the books in my office, sadly at least semi-obsolete within the current kernels and linux layout.

I worked for a number of years on IBM mainframes with the MVS OS. I hope that IBM is not a bad choice for Red Hat. One can also mention xCAT, developed by IBM. Mikhail Kuzminsky ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] About Torque maillist
Does anyone know whether the Torque mailing list (earlier torqueus...@clusterresources.com) still works, now that Adaptive Computing has switched to commercial-only software? Mikhail Kuzminsky, Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] batch systems connection
Sorry, maybe my question is not exactly on topic for the Beowulf list. I have 2 small OpenSuSE-based clusters using different batch systems, and I want to connect them "grid-like" via the CREAM (Computing Resource Execution And Management) service (I may also add one common server for both clusters). But there are no CREAM binary RPMs for OpenSuSE (only for CentOS7/SL6 on the UMD site //repository.egi.eu/2018/03/14/release-umd-4-6-1/). I could not find where I can download the source code of the CREAM software - does anybody know? Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] cursed (and perhaps blessed) Intel microcode
In message from Mark Hahn <h...@mcmaster.ca> (Fri, 23 Mar 2018 16:02:12 -0400 (EDT)):
> There *is* an updated microcode data file: https://downloadcenter.intel.com/product/873/Processors which seems to correspond to the document above

As I understand it, this corrects a general defect present in practically all CPUs. But the microcode update may also decrease performance. Maybe Intel intends to improve this update and therefore does not recommend using this version? Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Intel kills Knights Hill, Xeon Phi line "being revised"
> Rumours flying that the Xeon Phi family is in jeopardy, but the article has an addendum to say:
> # [Update: Intel denies they are dropping the Xeon Phi line, saying only that it has "been revised based on recent customer and overall market needs."]
> This should cause some confusion. While Knights Hill was cancelled, Intel has quietly put information about Knights Mill online as the next Phi product line: https://www.anandtech.com/show/12172/intel-lists-knights-mill-xeon-phi-on-ark-up-to-72-cores-at-320w-with-qfma-and-vnni

I partially disagree about the "confusion". It's simple: KNM has minimal microarchitecture changes vs. KNL and does not focus on normal double precision. KNM focuses on single and reduced precision and is oriented toward deep learning, AI, etc. Mikhail Kuzminsky, Zelinsky Institute of Organic Chemistry, Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Intel kills Knights Hill, Xeon Phi line "being revised"
> Unfortunately I did not find the english version, but Andreas
> Essentially yes, Xeon Phi is not continued, but a new design called Xeon-H is coming.

Yes, and Xeon-H has a codename close to KNL's - Knights Cove. Maybe some microarchitecture features important for HPC will remain. But in any case the end of Xeon Phi is a plus for the new NEC SX-Aurora. Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] slurm in heterogenous cluster
In message from Christopher Samuel (Mon, 18 Sep 2017 16:03:47 +1000):
> ... The best info is in the "Upgrading" section of the Slurm quickstart guide: https://slurm.schedmd.com/quickstart_admin.html ... So basically you could have (please double check this!): slurmdbd: 17.02.x slurmctld: 17.02.x slurmd: 17.02.x & 16.05.x & 15.08.x ...

Thank you very much! I hope that modern major Slurm versions will also compile and build successfully on old Linux distributions (for example, with a 2.6 kernel). Yours, Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] slurm in heterogenous cluster
Is it possible to use different Slurm versions on different worker nodes of a cluster (with other slurmctld and slurmdbd versions on the head node)? If it is possible in principle to use different slurmd versions on different worker nodes, what are the most important restrictions? Mikhail Kuzminsky, Zelinsky Institute of Organic Chemistry RAS, Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Register article on Epyc
Returning to the first message:
> Also regarding compute power, it would be interesting to see a comparison of a single socket of these versus Xeon Phi rather than -v4 or -v5 Xeon.

I partially disagree with the general direction of the discussion. AMD Epyc looks like an excellent CPU for datacenters. But if we are talking about Beowulf and HPC, we must start first of all not from SPECfp_rate, but simply from FLOPS per cycle per core, or from something like Linpack, dgemm or similar tests. OK, it is known that the Zen core supports AVX2 only via a 128-bit base and gives only 8 DP FLOPS per cycle (see http://www.linleygroup.com/mpr/article.php?id=11666 or https://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.23-Tuesday-Epub/HC28.23.90-High-Perform-Epub/HC28.23.930-X86-core-MikeClark-AMD-final_v2-28.pdf). A Broadwell core gives 16 FLOPS/cycle, and Skylake-SP 32 FLOPS/cycle with AVX-512. Therefore SPECfp_rate2006 may look good for the Epyc 7601 because of its 32 cores per CPU instead of 22 cores for the Broadwell Xeon E5-2699A v4. Xeon Phi KNL cores also give 32 DP FLOPS per cycle. In my opinion, it is necessary to wait for results of normal HPC tests. (A rough peak-GFLOPS comparison follows this message.) Mikhail Kuzminsky ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
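For what it is worth, here is the rough peak-DP arithmetic implied by the FLOPS/cycle numbers above, using nominal base clocks and ignoring AVX frequency reduction - ballpark figures for illustration only, not measurements:

! Peak DP GFLOPS ~ cores x base GHz x FLOPS/cycle (ballpark only).
program peak_flops
  implicit none
  print '(a,f7.1,a)', 'EPYC 7601        (32 cores x 2.2 GHz x  8): ', 32*2.2d0*8,  ' GFLOPS'
  print '(a,f7.1,a)', 'Xeon E5-2699A v4 (22 cores x 2.4 GHz x 16): ', 22*2.4d0*16, ' GFLOPS'
end program peak_flops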
Re: [Beowulf] more automatic building
Many thanks for all the answers! It looks to me now that OpenHPC may be the best choice for me. One of my 2 existing clusters is based on RH, the 2nd on OpenSuSE. That OpenHPC is based on repositories is a plus for me, as is the support of MVAPICH2/OpenMPI/Intel MPI (I don't know about plain MPICH). Etc. Are there, in your opinion, any clear OpenHPC minuses? Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] more automatic building
I have always worked with very small HPC clusters and built them manually (each server). But what is reasonable to do for clusters containing some tens or hundreds of nodes? With modern Xeon (or Xeon Phi KNL) and IB EDR, during the next year for example. There are automatic provisioning systems like OSCAR or even ROCKS. But it looks like ROCKS doesn't support modern interconnects, and there may be problems with OSCAR versions supporting systemd-based distributions like CentOS 7. For next year - is it reasonable to wait for a new OSCAR version, or for something else? Mikhail Kuzminsky, Zelinsky Institute of Organic Chemistry RAS, Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Thoughts on IB EDR and Intel OmniPath
From Joe Landman <land...@scalableinformatics.com>:
> Even with RoCE2, some of the testing we did demonstrated very significant congestion related slowdowns that we couldn't easily tune for (with PFC and other bits that RoCE needs).
> I've used iWARP in the dim and distant past, and it was much better than plain old gigabit on the same systems (with Ammasso cards).

BTW, this raises the question of the choice between RoCE and iWARP. Does your "Even with RoCE2" mean that iWARP is worse than RoCE? Mikhail Kuzminsky, Zelinsky Institute of Organic Chemistry RAS, Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] modern batch management systems
In our older clusters we used PBS and SGE as batch systems to run quantum-chemical applications. Now there are also commercial versions - PBS Pro and Oracle Grid Engine - and other commercial batch management programs. But we rely on free open-source batch management systems. Which free (and likely to remain free in a few years) batch systems would you recommend? Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Haswell as supercomputer microprocessors
In my opinion, PowerPC A2 would more exactly be used as the name of the *core*, not of the IBM Blue Gene/Q *processor chip*. The Power BQC name is used in the TOP500, the GREEN500, in a lot of Internet data, and in the IBM journal - see: Sugavanam K. et al. Design for low power and power management in IBM Blue Gene/Q // IBM Journal of Research and Development. - 2013. - v. 57. - no. 1/2. - p. 3:1-3:11. PowerPC A2 is the core, see //en.wikipedia.org/wiki/Blue_Gene and //en.wikipedia.org/wiki/PowerPC_A2 Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] Haswell as supercomputer microprocessors
New specialized supercomputer microprocessors (like the IBM Power BQC and the Fujitsu SPARC64 XIfx) have 2**N + 2 cores (N=4 for the first, N=5 for the second), where the 2 extra cores are redundant - not for computation, but only for other work with Linux, or even for replacing a failed computational core. Current Intel Haswell E5 v3 parts may also have 18 = 2**4 + 2 cores. Does it make sense to try the Power BQC or SPARC64 XIfx idea (not exactly) and use only 16 Haswell cores for parallel computation? If the answer is yes, how can this be done under Linux? (A hedged sketch follows this message.) Mikhail Kuzminsky, Zelinsky Institute of Organic Chemistry RAS, Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
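One way this is commonly approximated on a stock Linux kernel - a hedged sketch, not the BlueGene/SPARC64 mechanism itself; core numbers, application name and exact flags are only examples:

# Reserve 2 of the 18 cores for the OS and bind the job to the other 16.
# Isolate cores 0-15 from the general scheduler via the kernel command line:
#     isolcpus=0-15
# The OS and daemons then stay on cores 16-17, and the parallel job is
# bound explicitly to the isolated cores:
taskset -c 0-15 ./my_app
# or, NUMA-aware:
numactl --physcpubind=0-15 --localalloc ./my_app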
[Beowulf] sorry
I apologize again for the erroneous setting of the date field in a mailer I used some years ago. Mikhail Kuzminsky ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Supermicro BIOS error (was Nvidia K20 + Supermicro mobo)
Previously I described here the situation with a K20c GPU on a Supermicro X9SCA-F mobo, where the NVIDIA driver v.319.32 (the latest version) could not be installed. NVIDIA wrote to me that it is a Supermicro board (BIOS) error: the BIOS does not allocate memory (via the BAR registers) for the device. We found that this erroneous situation is absent on a Supermicro X8-series board and on an ASUS board - the driver was installed successfully on OpenSUSE 12.3 (and 11.4 also); the nvidia-smi utility works normally. Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] PCI configuration space errors ? (was Nvidia K20 + Supermicro mobo)
Let me try to set the GPUs aside for a moment. I don't know who sets up the BARs for PCI-E devices: the BIOS or the Linux kernel (OpenSUSE 12.3, kernel 3.7.10-1.1 in my case). Below is the relevant part of /var/log/messages; at the corresponding moment of kernel boot no Nvidia GPU driver is loaded for PCI 01:00.0.
---from /var/log/messages--
2013-07-21T02:28:58.348552+04:00 c6ws4 kernel: [0.432261] ACPI: ACPI bus type pnp unregistered
2013-07-21T02:28:58.348554+04:00 c6ws4 kernel: [0.438011] pci :00:01.0: BAR 15: can't assign mem pref (size 0x1800)
2013-07-21T02:28:58.348555+04:00 c6ws4 kernel: [0.438015] pci :00:01.0: BAR 14: assigned [mem 0xd100-0xd1ff]
2013-07-21T02:28:58.348555+04:00 c6ws4 kernel: [0.438018] pci :01:00.0: BAR 1: can't assign mem pref (size 0x1000)
2013-07-21T02:28:58.348556+04:00 c6ws4 kernel: [0.438020] pci :01:00.0: BAR 3: can't assign mem pref (size 0x200)
2013-07-21T02:28:58.348557+04:00 c6ws4 kernel: [0.438023] pci :01:00.0: BAR 0: assigned [mem 0xd100-0xd1ff]
2013-07-21T02:28:58.348558+04:00 c6ws4 kernel: [0.438026] pci :01:00.0: BAR 6: can't assign mem pref (size 0x8)
2013-07-21T02:28:58.348558+04:00 c6ws4 kernel: [0.438028] pci :00:01.0: PCI bridge to [bus 01]
2013-07-21T02:28:58.348559+04:00 c6ws4 kernel: [0.438031] pci :00:01.0: bridge window [mem 0xd100-0xd1ff]
2013-07-21T02:28:58.348561+04:00 c6ws4 kernel: [0.438035] pci :00:1c.0: PCI bridge to [bus 02]
-
Of course, there are far more than 2 PCI devices in the system (based on a Supermicro X9SCA-F, latest BIOS v.2.0b), but such BAR error messages appear only for 2 of them: the PCI bridge (00:01.0, the Xeon E3-1230 PCI-E port) and the Nvidia/PNY K20c at 01:00.0. Does this mean some BIOS problem, or is it a result of the absence of the loaded nvidia driver? The BAR error messages above appear independently of the BIOS/PCI settings: a) 4G decoding enabled/disabled, b) whether PCI-E Gen.2 mode is forced (instead of Gen.3) or not. Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
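One thing that is sometimes worth trying when firmware leaves prefetchable BARs unassigned - offered only as an assumption, not a verified fix for this particular board - is letting the kernel reallocate PCI bridge resources itself and then rechecking the GPU's regions:

# Add to the kernel command line in the boot loader, then reboot:
#     pci=realloc=on
# Afterwards check whether the GPU's prefetchable BARs received addresses:
lspci -vv -s 01:00.0 | grep -i 'memory at'
dmesg | grep -i BAR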
Re: [Beowulf] Nvidia K20 + Supermicro mobo
Adam DeConinck ajde...@ajdecon.org wrote:
> I've seen similar messages on CentOS when the Nouveau drivers are loaded and a Tesla K20 is installed. You should make sure that nouveau is blacklisted so the kernel won't load it. Note that it hasn't always been enough for me to have nouveau listed in /etc/modprobe.d/blacklist; sometimes I've had to actually put rdblacklist=nouveau on the kernel line.

Loading of the nouveau driver is suppressed via /etc/modprobe.d. lsmod doesn't show the presence of the nouveau module; therefore I hope that rdblacklist as a kernel parameter is not necessary. The first group of kernel messages about BARs appears BEFORE I start the nvidia driver installation, so I think my corresponding question does not depend on the driver installation and, in particular, on nouveau. Mikhail

> Disclaimer: I work at NVIDIA, but I haven't touched OpenSUSE in forever. Cheers, Adam
> On Tue, Jul 16, 2013 at 10:29 AM, Mikhail Kuzminsky mikk...@mail.ru wrote:

___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
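For completeness, the usual shape of the nouveau blacklisting that Adam describes - a hedged sketch only (the file name is just a convention, and rdblacklist=nouveau applies to CentOS/dracut-style initrds, as he notes):

# /etc/modprobe.d/50-blacklist-nouveau.conf   (example file name)
blacklist nouveau
options nouveau modeset=0
# If the initrd still loads nouveau early, the kernel command line may also
# need rdblacklist=nouveau (CentOS/dracut), followed by an initrd rebuild.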
[Beowulf] Nvidia K20 + Supermicro mobo
I want to test an NVIDIA GPU (PNY Tesla K20c) with our own application, for future use in our cluster. But I found problems with the NVIDIA driver (v.319.32) installation (OpenSUSE 12.3, kernel 3.7.10-1.1). First of all, even before starting the driver installation I see messages about BAR registers that are strange to me:
---from /var/log/messages--
2013-07-04T01:43:43.666022+04:00 c6ws4 kernel: [ 0.421559] pci :00:01.0: BAR 15: can't assign mem pref (size 0x1800)
2013-07-04T01:43:43.666024+04:00 c6ws4 kernel: [ 0.421563] pci :00:01.0: BAR 14: assigned [mem 0xe100-0xe1ff]
2013-07-04T01:43:43.666025+04:00 c6ws4 kernel: [ 0.421566] pci :00:16.1: BAR 0: assigned [mem 0xe0001000-0xe000100f 64bit]
2013-07-04T01:43:43.666026+04:00 c6ws4 kernel: [ 0.421576] pci :01:00.0: BAR 1: can't assign mem pref (size 0x1000)
2013-07-04T01:43:43.666027+04:00 c6ws4 kernel: [ 0.421579] pci :01:00.0: BAR 3: can't assign mem pref (size 0x200)
2013-07-04T01:43:43.666027+04:00 c6ws4 kernel: [ 0.421581] pci :01:00.0: BAR 0: assigned [mem 0xe100-0xe1ff]
2013-07-04T01:43:43.666028+04:00 c6ws4 kernel: [ 0.421584] pci :01:00.0: BAR 6: can't assign mem pref (size 0x8)
2013-07-04T01:43:43.666029+04:00 c6ws4 kernel: [ 0.421586] pci :00:01.0: PCI bridge to [bus 01]
---
Maybe these are symptoms of a hardware/BIOS (Supermicro X9SCA-F, latest BIOS v.2.0b) error? I tried both BIOS modes - "above 4G Decoding" enabled and disabled. It looks to me like the NVIDIA driver uses BAR 1 (see below). Although there were also some messages in nvidia-installer.log that were unclear to me, the installer shows that the kernel interface of nvidia.ko was compiled; but then nvidia-installer.log contains:
--from nvidia-installer.log --
- Kernel module load error: No such device
- Kernel messages:
...[ 25.286079] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[ 1379.760532] nvidia: module license 'NVIDIA' taints kernel.
[ 1379.760536] Disabling lock debugging due to kernel taint
[ 1379.765158] nvidia :01:00.0: enabling device (0140 - 0142)
[ 1379.765165] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 1379.765165] NVRM: BAR1 is 0M @ 0x0 (PCI::01:00.0)
[ 1379.765166] NVRM: The system BIOS may have misconfigured your GPU.
[ 1379.765169] nvidia: probe of :01:00.0 failed with error -1
[ 1379.765177] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 1379.765178] NVRM: None of the NVIDIA graphics adapters were initialized!
-
I also add an lspci -v extraction:
01:00.0 3D controller: NVIDIA Corporation GK107 [Tesla K20c] (rev a1)
 Subsystem: NVIDIA Corporation Device 0982
 Flags: fast devsel, IRQ 11
 Memory at e100 (32-bit, non-prefetchable) [disabled] [size=16M]
 Memory at unassigned (64-bit, prefetchable) [disabled]
 Memory at unassigned (64-bit, prefetchable) [disabled]
Do the kernel messages above mean that I have hardware/BIOS problems, or may it be some NVIDIA driver problem? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Strange resume statements generated for GRUB2
Skylar Thompson skylar.thomp...@gmail.com wrote:
> Hibernation isn't strictly suspension - it's writing all allocated, non-file-backed portions of memory to the paging/swap space. When the system comes out of hibernation, it boots normally and then looks for a hibernation image in the paging space. If it finds one, it loads that back into system memory rather than proceeding with a regular boot. This is in contrast to system suspension, which depends on hardware support to place CPU, memory, and other system devices into a low power state, and wait for a signal to power things back up, bypassing the boot process.

Taking into account the small size of my swap partition (4 GB only, less than my RAM size - I wrote about this in my 1st message), the hibernation image may not fit into the swap partition. Therefore coding -part2 (for /) in the resume statement seemed preferable to me (right for the general case).

> I'm not a SuSE expert so I'm not sure what YaST is doing, but I imagine you have to make grub changes via YaST rather than editing the grub configs directly. Skylar

Generally speaking, you are right. But I strongly prefer to know what occurs at the Linux level - to have the natural possibility (enough knowledge) to work with OpenSUSE, Fedora etc. So I prefer to change the GRUB2 configuration files :-) Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
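For reference, a hedged sketch of how the resume= parameter is normally handled on openSUSE 12.3 when editing the GRUB2 files directly (device names are examples; if resume points at a partition without a usable image the kernel just logs "Image not found" and continues booting, and resume handling can be skipped entirely with noresume):

# /etc/default/grub is the file grub2-mkconfig reads, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="resume=/dev/disk/by-id/ata-WDC_...-part1 splash=silent quiet"
# To drop resume handling altogether (e.g. swap smaller than RAM), one could use:
#   GRUB_CMDLINE_LINUX_DEFAULT="noresume splash=silent quiet"
# After editing, regenerate the real configuration:
grub2-mkconfig -o /boot/grub2/grub.cfg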
[Beowulf] Prevention of cpu frequency changes in cluster nodes (Was : cpupower, acpid cpufreq)
I have installed OpenSuSE 12.3/x86-64 now. I can now explain the reasons why I am afraid that cpufreq modules are being loaded.
1) I found in /var/log/messages pairs of strings about governors, like
[kernel] cpuidle: using governor ladder
[kernel] cpuidle: using governor menu
and, strange to me,
[kernel] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[kernel] ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)
2) The presence on the installed system of the directories /sys/devices/system/cpu/cpufreq and /sys/devices/system/cpu/cpu0/cpuidle. The cpuidle directory contains state0, state1 etc. directories with non-empty files.
3) But to prevent CPU frequency changes I suppressed all such possibilities in the BIOS.
4) And I don't have (as I wrote in my previous Beowulf message) /sys/devices/system/cpu/cpu0/cpufreq files. Just the presence of this file is used by my /etc/init.d/cpufreq script as the test of whether the cpufreq kernel modules need to be loaded.
5) lsmod says that no cpufreq modules are loaded.
Any comments? Am I right everywhere here, and should I ignore my fears about the kernel messages and the presence of some /sys/devices/system/cpu/... files? (A few quick checks are sketched after this message.) Mikhail Kuzminsky Computer Assistance to Chemical Research Center RAS Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
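A few read-only checks that should confirm the situation described above (nothing here changes system state; note that cpuidle/C-states are separate from cpufreq/P-states, so the "governor ladder/menu" messages by themselves do not imply frequency scaling):

# Any frequency-scaling driver loaded as a module?
lsmod | egrep 'cpufreq|powernow'
# Is a scaling driver bound to the CPUs at all?
ls /sys/devices/system/cpu/cpu0/cpufreq 2>/dev/null \
  || echo "no cpufreq interface -> no scaling driver active"
# cpuidle (C-states) is independent of cpufreq and does not change the clock:
cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name 2>/dev/null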
[Beowulf] Strange resume statements generated for GRUB2
I have swap in the sda1 partition and / in the sda2 partition of the HDD. At installation of OpenSUSE 12.3 (where YaST2 is used) on my cluster node I found what are, in my opinion, erroneous boot loader (GRUB2) settings. YaST2 proposed (at installation) to use ... resume=/dev/disk/by-id/ata-WDC-... -part1 splash=silent ... in the configuration of GRUB2. These parameters are passed (at Linux boot) by GRUB2 to the Linux kernel. GRUB2 itself, according to my installation settings, was installed to the MBR. I changed (at the installation stage) -part1 to -part2, but after that YaST2 restored it back to the -part1 value! And after installation OpenSuSE boots successfully! I found (in the installed OpenSuSE) 2 GRUB2 configuration files with the, to me, erroneous -part1 setting. I found a possible interpretation of this behaviour in /var/log/messages, which contains the strings:
[Kernel] PM: Checking hibernation image partition /dev/disk/by-id/ata-WDC_...-part1
[Kernel] PM: Hibernation Image partition 8:1 present
[Kernel] PM: Looking for hibernation image.
[Kernel] PM: Image not found (code -22)
[Kernel] PM: Hibernation Image partitions not present or could not be loaded
What does this mean? Is the hibernation image written to the swap partition? But I believe that hibernation is really suppressed in my Linux (the cpufreq kernel modules are not loaded), and my BIOS settings do not allow any changes of CPU frequency. BTW, my swap partition is small (4 GB, but the RAM size is 8 GB). Which GRUB2/resume settings are really right, and why are they right? Mikhail Kuzminsky Computer Assistance to Chemical Research Center RAS Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] cpupower, acpid cpufreq
I plan to perform a Linux installation (openSuSE 12.3/x86-64, kernel 3.7.10) for an HPC cluster. I do not want to change CPU frequencies in the cluster nodes, and do not want to use the cpufreq kernel modules. Therefore I also don't want to use the special power-saving states of the CPUs. I performed a quick test installation and, of course, the /lib/modules/`uname -r`/kernel/drivers/cpufreq directory is present, but no cpufreq kernel modules are loaded, and /sys/devices/system/cpu/cpu0 etc. do not have cpufreq files. But generally - must I perform special steps to avoid loading of the cpufreq modules? BTW, does somebody use the acpid and/or cpupower RPM packages on cluster nodes? If yes, why are they interesting? Mikhail Kuzminsky Computer Assistance to Chemical Research Center RAS, Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] nVidia Kepler GK110 GPU is incompatible w/Intel x86 hardware in PCI-E 3.0 mode ?
I have a cluster node (with Linux, of course) based on a Supermicro X9SCA system board and a Xeon E3-1230v2 with the LGA1155 socket. Now I want to buy an Nvidia Kepler GK110 GPU with PCI-E 3.0 (the K20 compute board from PNY?) and install it in my node. The Intel Xeon E3-1230v2 and the Supermicro X9SCA both support PCI-E 3.0. But I have heard that GK110 (as available today on the market) can't work in PCI-E 3.0 mode with Intel equipment (only in PCI-E 2.0 mode), while GK104 etc., good for SP only, work with PCI-E 3.0 normally. Moreover, for successful work of GK110 with an Intel Xeon platform I would need: a) to buy the next version of the Xeon processor, which will have a new socket (when will it arrive on the market?) and new PCI-E 3.0 support, plus a new system board for this processor (i.e. I need to modernize the node hardware), and b) to buy a new GK110 version, which will have an improved PCI-E 3.0 interface block (again, when will it be on the market?). The reason, as I heard, is a different implementation of the PCI-E 3.0 standard by nVidia and by Intel in the currently available hardware - a result of the PCI-E 3.0 standard being specified in too little detail; it should be defined more exactly. Can somebody clarify this situation? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Quantum Chemistry scalability for large number of processors (cores)
Thu, 27 Sep 2012 11:11:24 +1000, from Christopher Samuel sam...@unimelb.edu.au:
> On 27/09/12 03:52, Andrew Holway wrote:
>> Let the benchmarks begin!!!
> Assuming the license agreement allows you to publish them.. :-)

For example: the Gaussian-09/03/... licenses disallow publishing any data which may harm Gaussian, Inc. Therefore if you present speedup values which show good parallelization efficiency, and without any comparison with other programs, all will be OK. There are also some other codes which are free, even with a GPL license. But I myself don't have a high number of cores :-) Mikhail

> -- Christopher Samuel, Senior Systems Administrator, VLSCI - Victorian Life Sciences Computation Initiative, Email: sam...@unimelb.edu.au, Phone: +61 (0)3 903 55545, http://www.vlsci.org.au/ http://twitter.com/vlsci

___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Q: IB message rate large core counts (per node) ?
BTW, is Cray SeaStar2+ better than IB for nodes with many cores? And I haven't seen a latency comparison for SeaStar vs. IB. Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Fortran Array size question
In message from Prentice Bisbal prent...@ias.edu (Tue, 03 Nov 2009 12:09:07 -0500):
> This question is a bit off-topic, but since it involves Fortran minutia, I figured this would be the best place to ask. This code may eventually run on my cluster, so it's not completely off topic! Question: What is the maximum number of elements you can have in a double-precision array in Fortran? I have someone creating a 4-dimensional double-precision array. When they increase the dimensions of the array to ~200 million elements, they get this error: compilation aborted (code 1). I'm sure they're hitting a Fortran limit, but I need to prove it. I haven't been able to find anything using The Google.

It is not a Fortran restriction. It may be some compiler restriction. The 64-bit ifort for EM64T allows you to use, for example, 400 million elements. (A sketch of where the usual limit appears follows this message.) Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow

> -- Prentice

___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
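A hedged illustration of where such a "compilation aborted" usually comes from with ifort on x86-64 (an assumption about Prentice's case, not a confirmed diagnosis): a *static* array pushing the static data past the 2 GB small-memory-model limit, which needs the medium memory model, whereas an ALLOCATABLE array avoids the limit entirely.

! ~400 million double precision elements (~3.2 GB) as a static array.
program bigarray
  implicit none
  integer, parameter :: n1=200, n2=200, n3=100, n4=100   ! 4.0e8 elements
  double precision, save :: a(n1,n2,n3,n4)               ! static storage
  a(1,1,1,1) = 1.0d0
  print *, a(1,1,1,1)
end program bigarray
! Static data above 2 GB typically needs:
!   ifort -mcmodel=medium -shared-intel bigarray.f90
! Declaring the array ALLOCATABLE (heap) sidesteps the static-data limit.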
Re: [Beowulf] nearly future of Larrabee
In message from Bogdan Costescu bcoste...@gmail.com (Sun, 23 Aug 2009 03:17:08 +0200):
> 2009/8/21 Mikhail Kuzminsky k...@free.net:
>> Q3. Does it mean that Larrabee will give an essential speedup also on relatively short vectors?
> I don't quite understand your question...

For example, will DAXPY give an essential speedup (a good percentage of peak performance) for N = 10 or 100, for example, and will DGEMM give high performance for medium sizes of matrices - or will we need large N values, for example 1000 or more? (The loop in question is sketched after this message.) As for gather/scatter etc. for vector processing, the compilers for the Cray T90/C90 ... Cray 1, NEC SX-6/5/4 ... performed, I believe, all the necessary things. Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
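The kernel behind the question, just to make it concrete - DAXPY in plain Fortran. The issue is whether a wide SIMD unit still pays off when n is only 10 or 100 and loop startup and remainder handling dominate:

! DAXPY: y := y + a*x, the short-vector case in question.
subroutine daxpy_sketch(n, a, x, y)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in)    :: a, x(n)
  double precision, intent(inout) :: y(n)
  integer :: i
  do i = 1, n
     y(i) = y(i) + a*x(i)
  end do
end subroutine daxpy_sketch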
[Beowulf] nearly future of Larrabee
AFAIK Larrabee-based product(s) will appear soon - at the beginning of 2010. Unfortunately I didn't see enough appropriate technical information. What new is known from SIGGRAPH 2009? There were 2 ideas for Larrabee-based hardware: a) whole computers on Larrabee CPU(s), b) a GPGPU card. Recently I didn't see any words about Larrabee-based servers - only about graphics cards. If Larrabee will work as a CPU, then I believe that the Linux kernel developers will work in this direction. But I didn't find anything about Larrabee in 2.6. So:
Q1. Are there plans to build Larrabee-based motherboards (in particular in 2010)?
If Larrabee will be in the form of a graphics card (the most probable case) -
Q2. What will the interface be - one PCI-E v.2 x16 slot?
It is known now that DP will be supported in hardware and (AFAIK) that 512-bit operands (i.e. 8 DP words) will be supported in the ISA.
Q3. Does this mean that Larrabee will give an essential speedup also on relatively short vectors? And are there some preliminary articles with an estimation of Larrabee DP performance?
One of the declared potential advantages of Larrabee is support by compilers. There is now PGI Fortran with NVidia GPGPU extensions. PGI Accelerator-2010 will include support of CUDA on the base of OpenMP-like comments to the compiler. So:
Q4. Are there some rumours about direct Larrabee support in the Intel ifort or PGI compilers in 2010? (By "direct" I mean automatic compiler vectorization of pure Fortran/C source, at most with additional comments.)
Q5. How much may Larrabee-based hardware cost in 2010? I hope it'll be lower $1. Any more exact predictions?
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] moving of Linux HDD to other node: udev problem at boot
In message from Reuti re...@staff.uni-marburg.de (Wed, 19 Aug 2009 21:07:19 +0200):
> Maybe the disk id is different from the one recorded in /etc/fstab. What about using plain /dev/sda1 or alike, or mounting by volume label?

At the moment of the problem /etc/fstab, as I understand, isn't used yet. And the /dev/sda* files are not created by udev :-( Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] Re: moving of Linux HDD to other node: udev problem at boot
In message from David Mathog mat...@caltech.edu (Thu, 20 Aug 2009 11:29:17 -0700):
> Mikhail Kuzminsky k...@free.net wrote:
>> I moved a Western Digital SATA HDD with SuSE 10.3 installed (on a dual Barcelona server) to a dual Nehalem server (as master HDD on the Nehalem server) with a Supermicro X8DTi mobo.
> Which means any number of drivers will have to change. The boot could only succeed if all of these new drivers are present in the distro AND the installation isn't hardwired to use information from the previous system. The first may be true,

That was, of course, the main hope,

> the second is almost certainly false. ...

... and the second can, in my opinion, be resolved without difficult problems.

> On Mandriva, and probably Red Hat, and maybe Suse, even cloning between identical systems requires that the file /etc/udev/rules.d/61-net_config.rules be removed before reboot as it holds a copy of the MAC from the previous system, and no two machines (should) have the same MAC even if they are otherwise identical.

SuSE has this problem, but at least 11.1 has a special setting to avoid such udev behaviour. And updating the network settings isn't a problem.

> There are a lot of other files in the same directory which I believe hold similar machine specific information. Similarly, your /etc/modprobe.conf will almost certainly load modules which are not appropriate for the new system.

Are there any modules which depend on the processors? The NIC drivers aren't a problem.

> If there is an /etc/sysconfig directory there may be files there that also hold machine specific information. The /etc/sensors.conf configuration will also certainly be incorrect.

Of course, the lm_sensors and NIC settings have to be changed. But the HDDs, for example, were the same (excluding size).

> Perhaps you can successfully boot the system in safe mode and then run whatever configuration tool Suse provides to reset all of these hardware specific files?

The problem doesn't depend on the kind of boot (safe mode or usual).

> David Mathog mat...@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech

___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] moving of Linux HDD to other node: udev problem at boot
In message from Greg Lindahl lind...@pbm.com (Thu, 20 Aug 2009 11:23:25 -0700):
> On Thu, Aug 20, 2009 at 08:06:07PM +0200, Reuti wrote:
>>> AFAIK, initrd (as the kernel itself) is universal for EM64T/x86-64,
>> The problem is not the type of CPU, but the chipset (i.e. the necessary kernel module) with which the HDD is accessed.
> There are 2 aspects to this: 1: /etc/modprobe.conf or equivalent; 2: the initrd on a non-rescue disk is generally specialized to only include modules for devices in (1). Solution? Boot a rescue disk, chroot to your system disk, modify /etc/modprobe.conf appropriately, run mkinitrd.

Thanks, it's a good idea! The problem is (I think) just in the 10.3 initrd image. Unfortunately it is somewhat at odds with my original hope - to move the HDD ASAP (As Simple As Possible :-)). (A sketch of that rescue route follows this message.) Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
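A hedged sketch of that rescue route in openSUSE 10.x terms (partition, module and file names are examples only; on SuSE the initrd module list lives in INITRD_MODULES in /etc/sysconfig/kernel rather than in modprobe.conf, and the right SATA driver for the new board has to be verified, e.g. with lspci -k - it is likely but not certainly ahci):

# From the rescue system:
mount /dev/sda2 /mnt               # root partition of the moved disk (example)
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
chroot /mnt
# Add the disk driver to INITRD_MODULES in /etc/sysconfig/kernel, e.g.:
#   INITRD_MODULES="ahci processor thermal fan"
mkinitrd                           # rebuild /boot/initrd for the new hardware
exit
reboot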
[Beowulf] moving of Linux HDD to other node: udev problem at boot
As was discussed here, there are NUMA problems with Nehalem on a set of Linux distributions/kernels. I was informed that the old OpenSuSE 10.3 default kernel (2.6.22) may work with Nehalem OK in the NUMA sense, i.e. give the right /sys/devices/system/node content. I moved a Western Digital SATA HDD with SuSE 10.3 installed (on a dual Barcelona server) to a dual Nehalem server (as master HDD on the Nehalem server) with a Supermicro X8DTi mobo. But loading of SuSE 10.3 on the Nehalem server was not successful. The grub loader (whose menu.lst configuration uses by-id identification of disk partitions) works OK. But the Linux kernel boot did not finish successfully: the /boot/04-udev.sh script (whose task is udev initialization) - I think it's from the initrd content - does not see the root partition (the 1st partition on the HDD)! At boot I see the messages:
SCSI subsystem initialized
ACPI Exception (processor_core_0787): Processor device isn't present
(a set of messages about usb) ...
Trying manual resume from /dev/sda2   /* it's the swap partition */
resume device /dev/sda2 not found (ignoring)
...
Waiting for device /dev/disk/by-id/scsi-SATA-WDC_WDname_of_disk-part1 ...   /* echo from udev.sh */
and then the proposal to try again. After this script finishes I don't see any HDDs in /dev. The BIOS setting for this SATA device is "enhanced"; "compatible" mode gives the same result. What may be the source of the problem? Maybe the HDD driver used by the initrd? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow PS. If I look (after the udev.sh script finishes) at the content of /sys, it is right in the NUMA sense, i.e. /sys/devices/system/node contains the normal node0 and node1. ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] bizarre scaling behavior on a Nehalem
In message from Bill Broadley b...@cse.ucdavis.edu (Thu, 13 Aug 2009 17:09:24 -0700): Tom Elken wrote: To add some details to what Christian says, the HPC Challenge version of STREAM uses dynamic arrays and is hard to optimize. I don't know what's best with current compiler versions, but you could try some of these that were used in past HPCC submissions with your program, Bill: Thanks for the heads up, I've checked the specbench.org compiler options for hints on where to start with optimization flags, but I didn't know about the dynamic stream. Is the HPC challenge code open source? Yes, they are open. PathScale 2.2.1 on Opteron: Base OPT flags: -O3 -OPT:Ofast:fold_reassociate=0 STREAMFLAGS=-O3 -OPT:Ofast:fold_reassociate=0 -OPT:alias=restrict:align_unsafe=on -CG:movnti=1 Alas my pathscale license expired and I believe with sci-cortex's death (RIP) I can't renew it. Now I understand that I was sage :-) (we purchased perpetual acafemic license). #1042;#1058;W, do somebody know about Pathscale compilers future (if it will be) ? Mikhail I tried open64-4.2.2 with those flags and on a nehalem single socket: $ opencc -O4 -fopenmp stream.c -o stream-open64 -static $ opencc -O4 -fopenmp stream-malloc.c -o stream-open64-malloc -static $ ./stream-open64 Total memory required = 457.8 MB. Function Rate (MB/s) Avg time Min time Max time Copy: 22061.4958 0.0145 0.0145 0.0146 Scale: 8.4705 0.0144 0.0144 0.0145 Add:20659.2638 0.0233 0.0232 0.0233 Triad: 20511.0888 0.0235 0.0234 0.0235 Dynamic: $ ./stream-open64-malloc Function Rate (MB/s) Avg time Min time Max time Copy: 14436.5155 0.0222 0.0222 0.0222 Scale: 14667.4821 0.0218 0.0218 0.0219 Add:15739.7070 0.0305 0.0305 0.0305 Triad: 15770.7775 0.0305 0.0304 0.0305 Intel C/C++ Compiler 10.1 on Harpertown CPUs: Base OPT flags: -O2 -xT -ansi-alias -ip -i-static Intel recently used Intel C/C++ Compiler 11.0.081 on Nehalem CPUs: -O2 -xSSE4.2 -ansi-alias -ip and got good STREAM results in their HPCC submission on their ENdeavor cluster. $ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream.c -o stream-icc $ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream-malloc.c -o stream-icc-malloc $ ./stream-icc | grep : STREAM version $Revision: 5.9 $ Copy: 14767.0512 0.0022 0.0022 0.0022 Scale: 14304.3513 0.0022 0.0022 0.0023 Add:15503.3568 0.0031 0.0031 0.0031 Triad: 15613.9749 0.0031 0.0031 0.0031 $ ./stream-icc-malloc | grep : STREAM version $Revision: 5.9 $ Copy: 14604.7582 0.0022 0.0022 0.0022 Scale: 14480.2814 0.0022 0.0022 0.0022 Add:15414.3321 0.0031 0.0031 0.0031 Triad: 15738.4765 0.0031 0.0030 0.0031 So ICC does manage zero penalty, alas no faster than open64 with the penalty. I'll attempt to track down the HPCC stream source code to see if their dynamic arrays are any friendlier than mine (I just use malloc). In any case many thanks for the pointer. 
Oh, my dynamic tweak:
$ diff stream.c stream-malloc.c
43a44
> # include <stdlib.h>
97c98
< static double a[N+OFFSET],
---
> /* static double a[N+OFFSET],
99c100,102
< c[N+OFFSET];
---
> c[N+OFFSET]; */
> double *a, *b, *c;
134a138,142
> a=(double *)malloc(sizeof(double)*(N+OFFSET));
> b=(double *)malloc(sizeof(double)*(N+OFFSET));
> c=(double *)malloc(sizeof(double)*(N+OFFSET));
283c291,293
---
> free(a); free(b); free(c);
___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
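For readers who want to reproduce the dynamic-array variant without applying the diff by hand, here is a minimal self-contained sketch of the same idea (this is not the HPCC code itself; the array length N and the single timed Triad pass are illustrative):

#include <stdio.h>
#include <stdlib.h>

#define N      20000000   /* 3 arrays x 8 bytes x 20M elements = ~457 MB, as in the run above */
#define OFFSET 0

int main(void)
{
    /* heap-allocated arrays instead of the static ones in the stock stream.c */
    double *a = malloc(sizeof(double) * (N + OFFSET));
    double *b = malloc(sizeof(double) * (N + OFFSET));
    double *c = malloc(sizeof(double) * (N + OFFSET));
    if (!a || !b || !c) { fprintf(stderr, "malloc failed\n"); return 1; }

    for (long j = 0; j < N; j++) { a[j] = 1.0; b[j] = 2.0; c[j] = 0.0; }

    /* the STREAM Triad kernel; the real benchmark times this loop over several repetitions */
    const double scalar = 3.0;
    for (long j = 0; j < N; j++)
        a[j] = b[j] + scalar * c[j];

    printf("a[0] = %f\n", a[0]);   /* keep the compiler from throwing the loop away */
    free(a); free(b); free(c);
    return 0;
}

Whether the compiler can still use non-temporal stores and avoid the "dynamic penalty" seen above depends on it proving that the three pointers don't alias - which is exactly why the static-array version is easier to optimize.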
Re: [Beowulf] bizarre scaling behavior on a Nehalem
In message from Bill Broadley b...@cse.ucdavis.edu (Thu, 13 Aug 2009 17:09:24 -0700):
Do I understand correctly that these results are for 4 cores / 4 OpenMP threads? And what is the DDR3 RAM: DDR3/1066?
Mikhail
I tried open64-4.2.2 with those flags and on a nehalem single socket: [...]
___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] bizarre scaling behavior on a Nehalem
In message from Tom Elken tom.el...@qlogic.com (Fri, 14 Aug 2009 13:57:53 -0700): On Behalf Of Bill Broadley: I put DDR3-1333 in the machine, but the bios seems to want to run them at 1066.
How many dimms per memory channel do you have? My understanding (which may be a few months old) is that if you have more than one dimm per memory channel, DDR3-1333 dimms will run at 1066 speed; i.e. on your 1-CPU system, if you have 6 dimms, you have 2 per memory channel. I'm not sure exactly what speed they are running at. Your results look excellent, so I wouldn't be surprised if they are running at 1333.
I have 12-18 GB/s on 4 threads of STREAM/ifort w/DDR3-1066 on a dual E5520 server. But it works under a NUMA-bad kernel, w/o control of NUMA-efficient allocation.
Mikhail
-Tom
___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] bizarre scaling behavior on a Nehalem
In message from Bill Broadley b...@cse.ucdavis.edu (Fri, 14 Aug 2009 16:13:21 -0700): Mikhail Kuzminsky wrote: Your results look excellent, so I wouldn't be surprised if they are running at 1333. I have 12-18 GB/s on 4 threads of STREAM/ifort w/DDR3-1066 on a dual E5520 server. But it works under a NUMA-bad kernel w/o control of NUMA-efficient allocation.
Sounds pretty bad. Why 4 threads? You need 8 cores to keep all 6 memory busses busy.
For comparison w/your tests: you have only 4 cores. On 8 threads I have 20-26 GB/s.
Which compiler?
The ifort mentioned above means Intel Fortran 11.0.38.
Mikhail
open64 does substantially better than gcc.
___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] performance tweaks and optimum memory configs for a Nehalem
In message from Rahul Nabar rpna...@gmail.com (Sun, 9 Aug 2009 22:42:25 -0500): (a) I am seeing strange scaling behaviours with Nehalem cores. eg A specific DFT (Density Functional Theory) code we use is maxing out performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are actually slower than 2 and 4 cores (depending on setup)
If these results are for HyperThreading ON, it may be not too strange, because of competition between the virtual cores. But if these results are with HyperThreading switched off - it's strange. I usually have good DFT scaling w/the number of cores on G03 - about a factor of 7 on 8 cores.
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
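A quick way to tell whether the 8-core slowdown is the HyperThreading case is to compare logical CPUs with physical cores: if "siblings" is twice "cpu cores", SMT is on and 8 threads are really sharing 4 physical cores. These are standard /proc/cpuinfo fields, nothing Nehalem-specific:
$ grep -c "^processor" /proc/cpuinfo    # logical CPUs the kernel sees
$ grep -m1 "cpu cores" /proc/cpuinfo    # physical cores per package
$ grep -m1 "siblings" /proc/cpuinfo     # logical CPUs per package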
Re: [Beowulf] numactl SuSE11.1
It's interesting that for this hardware/software configuration, disabling NUMA in the BIOS gives higher STREAM results than with NUMA enabled. I.e. for NUMA off: 8723/8232/10388/10317 MB/s; for NUMA on: 5620/5217/6795/6767 MB/s (both for OMP_NUM_THREADS=1 and the ifort 11.1 compiler). The situation for Opterons is the opposite: NUMA mode gives higher throughput.
In message from Mikhail Kuzminsky k...@free.net (Mon, 10 Aug 2009 21:43:56 +0400): I'm sorry for my mistake: the problem is on a Nehalem Xeon under SuSE 11.1, but w/kernel 2.6.27.7-9 (w/a Supermicro X8DT mobo). [...]
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
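When chasing this kind of wrong node numbering, it helps to look at the topology the kernel actually built and compare it with the BIOS SRAT it parsed at boot; numactl, sysfs and dmesg should all agree. These are standard commands, nothing distribution-specific:
$ numactl --hardware                 # nodes, their CPUs and memory sizes, and the distance matrix
$ ls /sys/devices/system/node/       # should show node0 and node1 on a two-socket Nehalem box
$ dmesg | grep -i -e SRAT -e NUMA    # how the BIOS SRAT table was parsed at boot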
Re: [Beowulf] numactl SuSE11.1
I'm sorry for my mistake: the problem is on a Nehalem Xeon under SuSE 11.1, but w/kernel 2.6.27.7-9 (w/a Supermicro X8DT mobo). For the Opteron 2350 w/SuSE 10.3 (w/the older 2.6.22.5-31 - I erroneously inserted that string in my previous message) numactl works OK (w/a Tyan mobo). NUMA is enabled in the BIOS. Of course, CONFIG_NUMA (and CONFIG_NUMA_EMU) are set to y in both kernels. Unfortunately I (i.e. root) can't change the files in /sys/devices/system/node (or rename the directory node2 to node1) :-( - as is possible w/some files in the /proc filesystem. It's interesting that the extract from dmesg shows that IT WAS node 1, but then node 2 appears!
ACPI: SRAT BF79A4B0, 0150 (r1 041409 OEMSRAT 1 INTL1)
ACPI: SSDT BF79FAC0, 249F (r1 DpgPmmCpuPm 12 INTL 20051117)
ACPI: Local APIC address 0xfee0
SRAT: PXM 0 - APIC 0 - Node 0
SRAT: PXM 0 - APIC 2 - Node 0
SRAT: PXM 0 - APIC 4 - Node 0
SRAT: PXM 0 - APIC 6 - Node 0
SRAT: PXM 1 - APIC 16 - Node 1
SRAT: PXM 1 - APIC 18 - Node 1
SRAT: PXM 1 - APIC 20 - Node 1
SRAT: PXM 1 - APIC 22 - Node 1
SRAT: Node 0 PXM 0 0-a
SRAT: Node 0 PXM 0 10-c000
SRAT: Node 0 PXM 0 1-1c000
SRAT: Node 2 PXM 257 1c000-34000 (here !!)
NUMA: Allocated memnodemap from 1c000 - 22880
NUMA: Using 20 for the hash shift.
Bootmem setup node 0 -0001c000
NODE_DATA [00022880 - 0003a87f]
bootmap [0003b000 - 00072fff] pages 38
(8 early reservations) == bootmem [00 - 01c000]
#0 [00 - 001000] BIOS data page == [00 - 001000]
#1 [006000 - 008000] TRAMPOLINE == [006000 - 008000]
#2 [20 - bf27b8] TEXT DATA BSS == [20 - bf27b8]
#3 [0037a3b000 - 0037fef104] RAMDISK == [0037a3b000 - 0037fef104]
#4 [09cc00 - 10] BIOS reserved == [09cc00 - 10]
#5 [01 - 013000] PGTABLE == [01 - 013000]
#6 [013000 - 01c000] PGTABLE == [013000 - 01c000]
#7 [01c000 - 022880] MEMNODEMAP == [01c000 - 022880]
Bootmem setup node 2 0001c000-00034000
NODE_DATA [0001c000 - 0001c0017fff]
bootmap [0001c0018000 - 0001c0047fff] pages 30
(8 early reservations) == bootmem [01c000 - 034000]
#0 [00 - 001000] BIOS data page
#1 [006000 - 008000] TRAMPOLINE
#2 [20 - bf27b8] TEXT DATA BSS
#3 [0037a3b000 - 0037fef104] RAMDISK
#4 [09cc00 - 10] BIOS reserved
#5 [01 - 013000] PGTABLE
#6 [013000 - 01c000] PGTABLE
#7 [01c000 - 022880] MEMNODEMAP
found SMP MP-table at [880ff780] 000ff780
[e200-e20006ff] PMD - [88002820-88002e1f] on node 0
[e2000700-e2000cff] PMD - [8801c020-8801c61f] on node 2
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] Tyan S7002 for Nehalem-based nodes
Are there any contraindications to using the Tyan S7002 AG2NR w/Xeon 5520 for cluster nodes? Maybe somebody has some experience w/the S7002? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] FPU performance of Intel CPUs
In message from John Hearns hear...@googlemail.com (Mon, 6 Apr 2009 17:45:37 +0100): 2009/4/6 Jones de Andrade johanne...@gmail.com: That's a thing that raises a question for me... Will beowulfers start to accept manufacturers' auto-overclock as a feature... or will you choose motherboards that allow you to disable this? ;) Concerning Nehalems, of course. I read up about this. You can always disable it using ACPI.
If you run a well-parallelized program w/high CPU utilization, I believe you SHOULD disable turbo-boost mode :-)
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] X5500
In message from Kilian CAVALOTTI kilian.cavalotti.w...@gmail.com (Tue, 31 Mar 2009 10:27:55 +0200): ... Any other numbers, people?
I believe there are also some other important numbers - the prices for Xeon 55XX and system boards ;-) I didn't see prices on pricegrabber, for example. Is there some price information available?
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Lowered latency with multi-rail IB?
In message from Dow Hurst DPHURST dphu...@uncg.edu (Thu, 26 Mar 2009 23:32:23 -0400): We've got a couple of weeks max to finalize spec'ing a new cluster. Has anyone knowledge of lowering latency for NAMD by implementing a multi-rail IB solution using MVAPICH or Intel's MPI? My research tells me low latency is key to scaling our code of choice, NAMD, effectively. Has anyone cut down real effective latency to below 1.0us using multi-rail IB for molecular dynamics codes such as Gromacs, Amber, CHARMM, or NAMD? What about lowered latency for parallel ab initio calculations involving NWChem, Jaguar, or Gaussian using multi-rail IB?
In contrast to molecular dynamics programs (Gromacs/Amber/Charmm), where low latency is necessary, some quantum chemical programs (Gaussian, Gamess-US) have a relatively low dependency on the interconnect. I measured message lengths for Gaussian-03 for a set of calculation methods, and these messages are middle-to-large in size. NWChem is the only quantum-chemical program I know of which requires high interconnect performance. I don't know about Jaguar.
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow
If so, what was the configuration of cards and software? Any caveats involved, except price? ;-) Multi-rail IB is not something I know much about so am trying to get up to speed on what is possible and what is not. I do understand that lowering latency using multi-rail has to come from the MPI layer knowing how to use the hardware properly and some MPI implementations have such options and others don't. I understand that MVAPICH has some capabilities to use multi-rail and that NAMD is run on top of MVAPICH on many IB based clusters. Any links or pointers to how I can quickly educate myself on the topic would be appreciated. Best wishes, Dow __ Dow P. Hurst, Research Scientist Department of Chemistry and Biochemistry University of North Carolina at Greensboro 435 New Science Bldg. Greensboro, NC 27402-6170 dphu...@uncg.edu dow.hu...@mindspring.com 336-334-5122 office 336-334-4766 lab 336-334-5402 fax -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] Sun X4600 STREAM results
Sorry, does somebody have X4600 M2 STREAM results (or the corresponding URLs) for DDR2/667 - as a function of the number of processor cores? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Grid scheduler for Windows XP
In message from Sangamesh B forum@gmail.com (Thu, 5 Mar 2009 01:29:09 -0500): Hello everyone, Is there a Grid scheduler (only open source, like SGE) tool which can be installed/run on Windows XP Desktop systems (there is strictly no Linux involvement)? The applications used under this grid are native to Windows XP.
The GRAM component of the Globus Toolkit (http://www.globus.org/) gives you some batch queue system capabilities, and there are SGE interfaces to Globus.
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow
Thanks, Sangamesh ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] RE:small distro for PXE boot, autostarts sshd?
In message from Greg Keller g...@keller.net (Fri, 27 Feb 2009 10:20:50 -0600): Have you ever considered Perceus (Caos has it baked in) from infiscale? ... http://www.infiscale.com/
It looks like there is only one way to understand in more detail what Perceus does - to download it :-) What is known about OpenSuSE/SLES distros + Perceus? (Or any other choice for SuSE distros.)
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] How many double precision floating point operations per clock cycle for AMD Barcelona?
In message from Prakashan Korambath p...@ats.ucla.edu (Tue, 10 Feb 2009 08:23:05 -0800): Could someone confirm the number of double precision floating point operations (FLOPS) for AMD Barcelona chips? The URL below seems to indicate 4 FLOPS per cycle. I just want to confirm it. Thanks.
Yes - 4 double precision FLOPs per clock cycle, per core.
Mikhail Kuzminskiy, Computer Assistance to Chemical Research Center, Zelinsky Institute of Organic Chemistry RAS, Moscow
http://forums.amd.com/devblog/blogpost.cfm?catid=253threadid=87799 Prakashan ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
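For anyone who wants the chip- or node-level number rather than the per-core one, the peak simply multiplies out (a paper figure only - real codes sustain much less):
peak DP FLOPS = (FLOPs per cycle per core) x (clock) x (number of cores)
one Barcelona core  @ 2.0 GHz: 4 x 2.0 GHz      =  8 GFLOPS
one quad-core chip  @ 2.0 GHz: 4 x 2.0 GHz x 4  = 32 GFLOPS
dual-socket node    @ 2.0 GHz: 4 x 2.0 GHz x 8  = 64 GFLOPS
which is where the 64 GFLOPS figure quoted later in this digest for dual Opteron 2350 nodes comes from.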
Re: [Beowulf] Hadoop
In message from Gerry Creager gerry.crea...@tamu.edu (Mon, 29 Dec 2008 09:01:21 -0600): As for Fortran vs C, real scientists program in Fortran. Real Old Scientists program in Fortran-66. Carbon-dated scientists can still recall IBM FORTRAN-G and -H. :-) I didn't check, but may be I just have Fortran-G and H on my PC - as a part of free Turnkey MVS distribution working w/(free) Hercules emulator for IBM mainframes. Actually, a number of our mathematicians use C for their codes, but don't seem to be doing much more than theoretical codes. The guys who're wwriting/rewriting practical codes (weather models, computational chemistry, reservoir simulations in solid earth) seem to stick to Fortran here. Our group works in area of computational chemistry, and of course we write the programs on Fortran (95) :-) But I'm afraid that we'll start here the new cycle of religious language war :-) Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow gerry Jeff Layton wrote: I hate to tangent (hijack?) this subject, but I'm curious about your class poll. Did the people who were interested in Matlab consider Octave? Thanks! Jeff *From:* Joe Landman land...@scalableinformatics.com *To:* Jeff Layton layto...@att.net *Cc:* Gerry Creager gerry.crea...@tamu.edu; Beowulf Mailing List beowulf@beowulf.org *Sent:* Saturday, December 27, 2008 11:11:20 AM *Subject:* Re: [Beowulf] Hadoop N.B. the recent MPI class we gave suggested that we need to re-tool it to focus more upon Fortran than C. There was no interest in Java from the class I polled. Some researchers want to use Matlab for their work, but most university computing facilities are loathe to spend the money to get site licenses for Matlab. Unfortunate, as Matlab is a very cool tool (been playing with it first in 1988 ...) its just not fast. The folks at Interactive Supercomputing might be able to help with this with their compiler. -- Gerry Creager -- gerry.crea...@tamu.edu Texas Mesonet -- AATLT, Texas AM University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Parallel software for chemists
In message from Dr Cool Santa drcoolsa...@gmail.com (Wed, 10 Dec 2008 19:21:43 +0530): Currently in the lab we use Schrodinger and we are looking into NWChem. We'd be interested in knowing about software that a chemist could use that makes use of a parallel supercomputer. And better if it is linux.
To put it shortly, practically all modern software for molecular modelling calculations can run in parallel on Linux clusters.
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] Clos network vs fat tree
Sorry, is it correct to say that a fat tree topology is equal to a *NON-BLOCKING* Clos network w/the addition of uplinks? I.e. does any non-blocking Clos network, w/the corresponding addition of uplinks, give a fat tree? I read somewhere that the exact proof of the non-blocking property was done for Clos networks with >= 3 levels. But the most popular Infiniband fat trees have only 2 levels. (Yes, I know that non-blocking for a Clos network isn't absolute :-)) Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
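For reference, the classical non-blocking result is stated for the 3-stage Clos network: with n inputs per ingress switch and m middle-stage switches,
  strictly non-blocking:       m >= 2n - 1   (Clos, 1953)
  rearrangeably non-blocking:  m >= n
A 2-level IB fat tree is the folded form of that 3-stage network - each leaf switch plays both the ingress and the egress role, and the spine is the middle stage - which is why such fabrics are usually called non-blocking when every leaf has as many uplinks as host-facing ports. This is only a summary of the textbook result, not a proof for deeper multi-level trees.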
Re: Re[2]: [Beowulf] Shanghai vs Barcelona, Shanghai vs Nehalem
In message from Jan Heichler [EMAIL PROTECTED] (Wed, 22 Oct 2008 20:27:40 +0200): Hello Mikhail, on Wednesday, 22 October 2008, you wrote:
MK In message from Ivan Oleynik [EMAIL PROTECTED] (Tue, 21 Oct 2008 MK 18:15:49 -0400): I have heard that AMD Shanghai will be available in Nov 2008. Does someone know the pricing and performance info and how it compares with Barcelona? Are there some informal comparisons of Shanghai vs Nehalem?
MK I believe that the Shanghai performance increase in comparison w/Barcelona MK will in practice be defined only by possibly higher Shanghai MK frequencies.
You can expect to see better performance in SPEC_CPU for Shanghai vs. Barcelona when comparing identical clockspeeds. But of course the increased clockspeed is a big argument for Shanghai (or the same clockspeed with less energy consumption). And Shanghai has some more features like faster memory and HT3 in some of the later revisions, I hope...
Yes, I think HT3 *must* be there. It was declared for Barcelona, but AFAIK it is really supported now only for desktop chips.
Mikhail
Jan ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Shanghai vs Barcelona, Shanghai vs Nehalem
In message from Ivan Oleynik [EMAIL PROTECTED] (Tue, 21 Oct 2008 18:15:49 -0400): I have heard that AMD Shanghai will be available in Nov 2008. Does someone know the pricing and performance info and how it compares with Barcelona? Are there some informal comparisons of Shanghai vs Nehalem?
I believe that the Shanghai performance increase in comparison w/Barcelona will in practice be defined only by possibly higher Shanghai frequencies.
Mikhail
Thanks, Ivan ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Shanghai vs Barcelona, Shanghai vs Nehalem
In message from Mark Hahn [EMAIL PROTECTED] (Wed, 22 Oct 2008 13:23:08 -0400 (EDT)): Are there some informal comparisons of Shanghai vs Nehalem? I beleive that Shanghai performance increase in comparison w/Barcelona will be practically defined only by possible higher Shanghai frequencies. is that based on anything hands-on? No, I'm not under NDA - because I don't have Shanghai chips in hands :-) Mikhail IMO, AMD needs to get a bit more serious about competing. if I7 ships with ~15 GB/s per socket and working multi-socket scalability, it's hard to imagine why anyone would bother to look at AMD. either: - there is some sort of significant flaw with I7 (runs like a dog in 64b mode or Hf turns into blue cheese after a year, etc). - AMD gets its act together (lower-latency L3, highly efficient ddr3/1333 interface, directory-based coherence). - AMD satisfies itself with bottom-feeding (which probably also means only low-end consumer stuff with little HPC interest). I've had good reason to be an AMD fan in recent-ish years, but if Intel is firing on all cylinders, AMD needs to be the rotary engine or have more cylinders, or something... ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Nehalem Xeons
In message from Håkon Bugge [EMAIL PROTECTED] (Tue, 14 Oct 2008 07:50:32 +0200): They are _definitively_ worth waiting for, although I am not familiar with the release timing. But I have been running on a dual-socket with 8 cores and 16 SMTs. And I say they are worth waiting for. Q1'2009 - unfortunately, I don't know more exactly :-( Mikhail Kuzminsky Computer Assistance to Chemical Research Center, Zelinsky Institute of Organic Chemistry Moscow Håkon At 01:57 14.10.2008, Ivan Oleynik wrote: I am still in process of purchasing a new cluster and consider whether is worth waiting for new Intel Xeons. Does someone know when Intel will start selling Xeons based on Nehalem architecture? They announced desktop launch (Nov 16) but were quiet about server release. Thanks, Ivan ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Håkon Bugge Chief Technologist mob. +47 92 48 45 14 off. +47 21 37 93 19 fax. +47 22 23 36 66 [EMAIL PROTECTED] Skype: hakon_bugge Platform Computing, Inc. ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] Re: Beowulf Digest, Vol 55, Issue 2
In message from Li, Bo [EMAIL PROTECTED] (Thu, 4 Sep 2008 14:34:00 +0800): Hello, Is it too expensive for the platform? The easy solution is: And X48 level motherboard with CF support, about $150 Q6600 Processor, about $170 Two 4870X2 $1,100 Do somebody know, are ACML routines parallelized for using of few GPGPUs ? Mikhail Two Seagate SATA Harddisk 500G for Raid1, about $140 4*2G DDR2 RAM, about $150 PSU 1000W, about $200 A big box, about $100 That's all, in total, $2,010. Regards, Li, Bo - Original Message - From: Maurice Hilarius To: beowulf@beowulf.org Cc: [EMAIL PROTECTED] ; [EMAIL PROTECTED] ; [EMAIL PROTECTED] Sent: Thursday, September 04, 2008 6:51 AM Subject: Re: Beowulf Digest, Vol 55, Issue 2 Li, bo wrote: .. From: Li, Bo [EMAIL PROTECTED] Subject: Re: [Beowulf] gpgpu Hello, It seemed that you had got a very good example for GPGPU. As I said before, it's not the time for GPGPU to do the DP calculation at the moment. If you can bear SP computation, you will find more about it. NVidia just sent me some special offer about their Tesla platforms, which said that the workstation equipped with two GTX280 level professional cards costs about $5000, not bad. But my intention is still to lower the core frequency of a gaming card, and use it for computation. Regards, Li, Bo Looking at AMD/ATI Firestream and 4850 pricing, it is not too bad: AMD FIRESTREAM 9250 STREAM PROCESSOR (P/N: 100-505563)$880 VISIONTEK RADEON HD4870X2 2GB PCI-E (P/N: 900250) $575 VISIONTEK RADEON HD 4870 512MB PCI-E (P/N: 900244) $355 The 4870 and X2 also run the AMD code. So, given a decent machine, with 4 cores and a pair of the 4870X2, one can achieve some pretty amazing GPU performance levels for a system well under $4,000. With dualX2s ( 4 GPU engines) around $4700 ( extra PSU capacity and cooling is needed for that level). I hear that AMD have a new Firestream coming, with the 48x0 family chips on it, but that will likely be a bit on the pricier side.. Anyway, the Firestream has GPUs with Double-Precision Floating Point. Something the nVidia offerings do not. Worth considering. http://ati.amd.com/technology/streamcomputing/product_firestream_9250.html SDK: http://ati.amd.com/technology/streamcomputing/sdkdwnld.html -- With our best regards, Maurice W. Hilarius Telephone: 01-780-456-9771 Hard Data Ltd.FAX: 01-780-456-9772 11060 - 166 Avenue email:[EMAIL PROTECTED] Edmonton, AB, Canada http://www.harddata.com/ T5X 1Y3 ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] gpgpu
In message from Li, Bo [EMAIL PROTECTED] (Thu, 28 Aug 2008 14:20:15 +0800): ... Currently, the DP performance of GPUs is not as good as we expected, only about 1/8 - 1/10 of the SP Flops. It is also a problem.
AMD data: Firestream 9170 SP performance is 5 GFLOPS/W vs 1 GFLOPS/W for DP. It's 5 times slower than SP. The Firestream 9250 has 1 TFLOPS for SP, therefore 1/5 of that is about 200 GFLOPS DP. The price will be, I suppose, about $2000 - as for the 9170. Let me look at a modern dual socket quad-core beowulf node w/a price of about $4000+, for example. For the Opteron 2350/2 Ghz chips (which I use) peak DP performance is 64 GFLOPS (8 cores). For 3 Ghz Xeon chips - about 100 GFLOPS. Therefore GPGPU peak DP performance is 1.5-2 times higher than w/CPUs. Is it enough for an essential calculation speedup - taking into account the time for data transmission to/from the GPU? (See the rough transfer-time estimate after this message.)
I would suggest hybrid computation platforms, with GPU, CPU, and processors like Clearspeed. It may be a good topic for a programming model.
Clearspeed, if there is no new hardware now, does not have enough DP performance in comparison w/typical modern servers on quad-core CPUs.
Yours Mikhail
Regards, Li, Bo
- Original Message - From: Vincent Diepeveen [EMAIL PROTECTED] To: Li, Bo [EMAIL PROTECTED] Cc: Mikhail Kuzminsky [EMAIL PROTECTED]; Beowulf beowulf@beowulf.org Sent: Thursday, August 28, 2008 12:22 AM Subject: Re: [Beowulf] gpgpu
Hi Bo, Thanks for your message. What library do i call to find primes? Currently it's searching here for primes (PRPs) of the form p = (2^n + 1) / 3; n is here about 1.5 million bits roughly as we speak. For SSE2 type processors there is the George Woltman assembler code (MiT) to do the squaring + implicit modulo; how do you plan to beat that type of really optimized number crunching on a GPU? You'll have to figure out a way to find an instruction level parallelism of at least 32, which also doesn't write to the same cacheline, i *guess* (no documentation to verify that in fact). So that's a range of 256 * 32 = 2^8 * 2^5 = 2^13 = 8192 bytes. In fact the first problem to solve is to do some sort of squaring real quickly. If you figured that out at a PC, experience teaches you're still losing a potential factor of 8, thanks to another zillion optimizations. You're not allowed to lose factor 8. That 52 gflop a gpu can deliver on paper @ 250 watt TDP (you bet it will consume that when you let it work so hard) means the GPU delivers effectively less than 7 gflops double precision thanks to inefficient code. Additionally remember the P4. On paper the claim in integers when it released was that it would be able to execute 4 integers a cycle; the reality is that it was a processor getting an IPC far under 1 for most integer codes. All kinds of stuff sucked at it. Experience teaches this is the same for todays GPUs; the scientists who have run codes on them so far and are really experienced CUDA programmers figured out the speed it delivers is a very big bummer. Additionally 250 watt TDP for massive number crunching is too much. It's well over factor 2 the power consumption of a quadcore. Now i can take a look soon in China myself what power prices are over there, but i can assure you they will rise soon. Now that's a lot less than a quadcore delivers with a tdp far under 100 watt. Now i explicitly mention the n's i'm searching here, as it should fit within caches. So the very secret bandwidth you can practically achieve (as we know nvidia lobotomized bandwidth in the GPU cards, only the Tesla type seems to be not lobotomized), i'm not even teasing you with that.
This is true for any type of code. You're losing it to the details. Only custom tailored solutions will work, simply because they're factors faster. Thanks, Vincent
On Aug 27, 2008, at 2:50 AM, Li, Bo wrote: Hello, IMHO, it is better to call BLAS or a similar library rather than programming your own functions. And CUDA treats the GPU as a cluster, so .CU is not working like our normal codes. If you have got too many matrix or vector computations, it is better to use Brook+/CAL, which can show the great power of the AMD gpu. Regards, Li, Bo
- Original Message - From: Mikhail Kuzminsky [EMAIL PROTECTED] To: Vincent Diepeveen [EMAIL PROTECTED] Cc: Beowulf beowulf@beowulf.org Sent: Wednesday, August 27, 2008 2:35 AM Subject: Re: [Beowulf] gpgpu
In message from Vincent Diepeveen [EMAIL PROTECTED] (Tue, 26 Aug 2008 00:30:30 +0200): Hi Mikhail, I'd say they're ok for black box 32 bits calculations that can do with a GB or 2 RAM, other than that they're just luxurious electric heating.
I also want to have a simple blackbox, but 64-bit (Tesla C1060 or Firestream 9170 or 9250). Unfortunately life isn't restricted to BLAS/LAPACK/FFT :-) So I'll need to program something else. People say that the best choice is CUDA for Nvidia. When I look at the sgemm source, it has about 1 thousand (or more) lines
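The data-transmission question raised above can be made concrete with a one-line estimate. Treating the PCIe copy and the kernel as sequential steps, offloading a routine that performs F floating-point operations on B bytes of data only pays off when
  B / G_pcie + F / R_gpu  <  F / R_cpu
where G_pcie is the effective host-to-device copy bandwidth and R_gpu, R_cpu are sustained (not peak) rates. With a few GB/s of effective PCIe bandwidth, kernels that do O(1) flops per transferred byte - like the two-vector addition example mentioned in the next message - cannot win this way; it is the dense operations with high reuse per byte, such as sgemm, that can. This is only a back-of-the-envelope model; it ignores overlapping the copy with compute.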
Re: [Beowulf] gpgpu
In message from Vincent Diepeveen [EMAIL PROTECTED] (Tue, 26 Aug 2008 00:30:30 +0200): Hi Mikhail, I'd say they're ok for black box 32 bits calculations that can do with a GB or 2 RAM, other than that they're just luxurious electric heating.
I also want to have a simple blackbox, but 64-bit (Tesla C1060 or Firestream 9170 or 9250). Unfortunately life isn't restricted to BLAS/LAPACK/FFT :-) So I'll need to program something else. People say that the best choice is CUDA for Nvidia. When I look at the sgemm source, it has about a thousand (or more) lines in *.cu files. Therefore I think that a somewhat more difficult algorithm, such as some special matrix diagonalization, will require a lot of programming work :-(. It's interesting that when I read the Firestream Brook+ kernel function source example - for the addition of 2 vectors (Building a High Level Language Compiler For GPGPU, Bixia Zheng ([EMAIL PROTECTED]) Derek Gladding ([EMAIL PROTECTED]) Micah Villmow ([EMAIL PROTECTED]) June 8th, 2008) - it looks SIMPLE. Maybe there are a lot of details/source lines which were omitted from this example?
Vincent p.s. if you ask me, honestly, 250 watt or so for the latest gpu is really too much.
250 W is TDP; the average value declared is about 160 W. I don't remember which GPU - from AMD or Nvidia - has a lot of special functional units for sin/cos/exp/etc. If they are not used, maybe the power will be a bit lower. As for the Firestream 9250, AMD says about 150 W (although I'm not absolutely sure that it's TDP) - that's the same as for some Intel Xeon quad-core chips w/names beginning with X.
Mikhail
On Aug 23, 2008, at 10:31 PM, Mikhail Kuzminsky wrote: BTW, why are GPGPUs considered to be vector systems? Taking into account that GPGPUs contain many (equal) execution units, I think it might be not the SIMD, but the SPMD model. Or does it depend on the software tools used (CUDA etc)?
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] hang-up of HPC Challenge
In message from Greg Lindahl [EMAIL PROTECTED] (Tue, 19 Aug 2008 19:39:38 -0700): On Wed, Aug 20, 2008 at 03:45:43AM +0400, Mikhail Kuzminsky wrote: To localize the possible reason for the problem, I ran the pure HPL test instead of HPCC. HPL writes its output directly to the screen instead of to a file. Using MPICH w/np=8 I obtained a normal HPL result for N=35000 - including the 3 PASSED lines for the ||Ax-b|| checks. BUT! Linux hangs up immediately after these lines are printed.
Well, what did your configuration file tell HPL to do? Does it have another test, perhaps a bigger one, or is it supposed to exit? We aren't mind-readers.
Please excuse me: I have now run 2 HPL cases for the same N=1: (1st) a single HPL run, i.e. ONE N=1, ONE blocksize value, and ONE value of every other HPL.dat parameter; (2nd) a multiple HPL run w/the same (one) N=1 and blocksize=100, but with a set of PFACTs etc (see the output below). The 1st run finished successfully, the 2nd led to a Linux hang-up. Yours Mikhail
single HPL run:
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
An explanation of the input/output parameters follows:
T/V: Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 1
NB : 100
PMAP : Row-major process mapping
P : 2
Q : 4
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 16 double precision words
- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
1) ||Ax-b||_oo / ( eps * ||A||_1 * N)
2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
T/V          N      NB   P   Q    Time      Gflops
WR11C2R4     1      100  2   4    23.32     2.859e+01
||Ax-b||_oo / ( eps * ||A||_1 * N) = 0.0767386 .. PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0181586 .. PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0040588 .. PASSED
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
End of Tests.
[1]+ Done    mpirun -np 8 xhpl
multiple HPL run:
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
An explanation of the input/output parameters follows:
T/V: Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 1
NB : 100
PMAP : Row-major process mapping
P : 2
Q : 4
PFACT : Left Crout Right
NBMIN : 2 4
NDIV : 2
RFACT : Left Crout Right
BCAST : 1ring
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 16 double precision words
- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed: 1) ||Ax-b||_oo / ( eps * ||A||_1 * N) 2) ||Ax-b||_oo / ( eps
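For completeness: the difference between the two runs above lives entirely in HPL.dat. The single-case file asks for exactly one value of every parameter, while the sweep file lists several variants and HPL then runs every combination in one process. A sketch of the relevant lines (the line labels follow the stock HPL.dat shipped with the benchmark; the entries are chosen to match the parameter listings printed above, and the rest of the 31-line file is elided):
Single case (excerpt):
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
Sweep (excerpt) - the kind of file that produced the hang:
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)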
Re: [Beowulf] hang-up of HPC Challenge
In message from Chris Samuel [EMAIL PROTECTED] (Wed, 20 Aug 2008 11:12:52 +1000 (EST)): - Mikhail Kuzminsky [EMAIL PROTECTED] wrote: What else may be the reason for the hangup?
Depends what you mean by hangup really.. Does the code crash, does it just stop idle, does it busy loop, does the node oops, does it lockup, etc?
I believe that a program crash is not a hangup. When I wrote about a Linux hangup, I meant that Linux doesn't respond to anything - not to the keyboard, not to ssh client requests, etc.
If you're not already running a mainline kernel (say 2.6.26.2) it might also be worth giving that a go too, we're happily doing it on our Barcelonas (though on CentOS not SuSE).
I use the 2.6.22.5-31 kernel from the SuSE 10.3 distribution.
Mikhail
cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] new flash SSDs
FYI: Intel presented at IDF new SATA 2.5" SSDs (based on NAND flash) for servers. These SSDs (X25-E Extreme, 32 GB) support command queueing (32 operations), R/W throughput of 250/170 MB/s, and 75 usec read latency; 35000 reads per second and 3300 writes per second for 4 KB blocks. A 64 GB SSD is expected in Q1'2009. I hope this will lead to a decrease in SSD market prices. Unfortunately I have no information about prices or about lifetime. But I'm not too enthusiastic about the prices: even a Samsung PATA 2.5"/32 GB SSD costs about $300, and the IBM SATA ones are much more expensive. Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] hang-up of HPC Challenge
To localize the possible reason for the problem, I ran the pure HPL test instead of HPCC. HPL writes its output directly to the screen instead of to a file. Using MPICH w/np=8 I obtained a normal HPL result for N=35000 - including the 3 PASSED lines for the ||Ax-b|| checks. BUT! Linux hangs up immediately after these lines are printed.
Mikhail
In message from Mikhail Kuzminsky [EMAIL PROTECTED] (Mon, 18 Aug 2008 22:20:16 +0400): I ran a set of HPC Challenge benchmarks on ONE dual socket quad-core Opteron 2350 (Rev. B3) based server (8 logical CPUs). RAM size is 16 Gbytes. The tests were performed under SuSE 10.3/x86-64, for LAM MPI 7.1.4 and MPICH 1.2.7 from the SuSE distribution, using Atlas 3.9. Unfortunately there is only one such cluster node, and I can't reproduce the run on another node :-( For N (matrix size) up to 1 all looks OK. But for larger N (15000/2/...) hpcc execution (mpirun -np 8 hpcc) leads to a Linux hang-up. In the top output I see 8 hpcc instances, each eating about 100% of a CPU, reasonable amounts of virtual and RSS memory per hpcc process, and no swap usage. Usually there are no PTRANS results in the hpccoutf.txt results file, but in a few cases (when I actively watched the hpcc execution by issuing ps/top) I see reasonable PTRANS results but no HPLinpack results. One time I obtained PTRANS, HPL and DGEMM results for N=2, but a hangup later - on the STREAM tests. Maybe that is simply because the output buffer is not finally written to the output file on the HDD at the hangup. One possible reason for the hang-ups is a memory hardware problem, but what about possible software reasons for hangups? The hpcc executable is 64-bit, dynamically linked. /etc/security/limits.conf is empty. The stacksize limit (for the user issuing mpirun) is unlimited, the main memory limit is about 14 GB, the virtual memory limit about 30 GB. Atlas was compiled for 32-bit integers, but that's enough for such N values. Even /proc/sys/kernel/shmmax is 2^63-1. What else may be the reason for the hangup?
Mikhail Kuzminskiy Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Building new cluster - estimate
In message from Gerry Creager [EMAIL PROTECTED] (Wed, 06 Aug 2008 09:59:59 -0500): Robert Kubrick wrote: Or use solid-state data disks? Does anybody here have experience with SSD disks in HPC? Not on OUR budget! ;-)
It was a proposal for the journal part only ;-) SSD/flash disks try, when it's physically possible, not to really erase (rewrite) the same cells - to increase their lifetime. But if I use practically the whole HDD partition for scratch files (and therefore the whole SSD), IMHO it will be impossible not to erase the flash RAM. What will happen to the SSD disk's lifetime in that case?
Mikhail Kuzminsky Computer Assistance to Chemical Research Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Building new cluster - estimate
In message from Joshua Baker-LePain [EMAIL PROTECTED] (Tue, 5 Aug 2008 14:10:33 -0400 (EDT)): On Tue, 5 Aug 2008 at 8:34pm, Mikhail Kuzminsky wrote: xfs has a rich set of utilities, but AFAIK no defragmentation tools (I don't know what will be after xfsdump/xfsrestore). But which modern linux
Not true -- see xfs_fsr(8).
Thanks!! I haven't looked at xfs details for many years :-( - it's my mistake.
Back in the IRIX days, it was recommended to run this regularly.
I don't remember xfs_fsr being included in the IRIX 6.1-6.4 we used.
Mikhail
However, ISTR that the current recommendation is as needed, but it really shouldn't be needed.
-- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
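For anyone else who had missed the tool: xfs_fsr reorganizes the extents of regular files on a mounted XFS filesystem, and xfs_db can report how fragmented a filesystem actually is before you bother. The mount point and device name below are only examples:
$ xfs_db -c frag -r /dev/sdb1    # read-only report of the fragmentation factor
$ xfs_fsr -v /scratch            # defragment every file on the /scratch filesystem, verbosely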
Re: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In message from Mark Hahn [EMAIL PROTECTED] (Fri, 1 Aug 2008 10:06:17 -0400 (EDT)): ... Plus, with a lot of those PDUs you can add thermal sensors and trigger power off on high temperature conditions.
IPMI normally provides all the motherboard's sensors as well. it seems like those are far more relevant than the temp of the PDU... using lm_sensors is a poor substitute for IPMI.
IMHO the only disadvantage of lm_sensors is the problem of building the right sensors.conf file.
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
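For what it's worth, the sensors.conf work usually comes down to one short stanza per monitoring chip: name the chip the kernel driver exposes, label the inputs that are actually wired on the board, set sane limits, and ignore the rest. A minimal illustrative sketch - the chip names, labels and the limit value below are assumptions for a generic dual-Opteron board, not taken from any real config:
chip "k8temp-pci-*"
    label temp1 "CPU0 core"
    label temp3 "CPU1 core"
chip "w83627hf-*"
    label in0 "VCore"
    set temp2_max 60
    ignore in5
The chip/label/set/ignore keywords are the standard lm_sensors configuration statements; the hard part in practice is mapping the inputs to what the board vendor actually connected, which is exactly the problem mentioned above.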
[Beowulf] MPI: over OFED and over IBGD
Is there some MPI implementation/version which may be installed on some nodes - to work over the Mellanox IBGD 1.8.0 (Gold Distribution) IB stack - and on other nodes for work w/OFED-1.2? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] MPI: over OFED and over IBGD
In message from Gilad Shainer [EMAIL PROTECTED] (Thu, 3 Jul 2008 09:41:01 -0700): Mikhail Kuzminsky wrote: Is there some MPI implementation/version which may be installed on some nodes - to work over the Mellanox IBGD 1.8.0 (Gold Distribution) IB stack - and on other nodes for work w/OFED-1.2?
IBGD is out of date, and AFAIK none of the latest versions of the various MPIs were tested against it.
It's clear, but I didn't ask about the *LATEST* MPI versions ;-)
I would recommend updating the install to OFED from IBGD, and if you need some help let me know.
Thank you very much for your help!
If you must keep it
Yes. There is a Russian romance w/the words: "You can't understand, you can't understand, you can't understand my sorrows" :-))
, then MVAPICH 0.9.6 might work.
Eh, I used 0.9.5 and 0.9.9 :-) Now I will look at the mvapich archives. Thanks!
Mikhail
Gilad. ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Strange Opteron 2350 performance: Gaussian-03
In message from Bernd Schubert [EMAIL PROTECTED] (Sat, 28 Jun 2008 19:04:50 +0200): On Saturday 28 June 2008, Li, Bo wrote: Hello, Sorry, I don't have the same applications as you. Did you compile them with gcc? If gcc, then -O3 can do some optimization. -march=k8 is enough, I think.
As Mikhail wrote in his first mail, he uses binaries from Gaussian Inc. Can gfortran in the meantime compile gaussian? Even if it can, it might be a problem for publications, since the only officially supported compiler is pgf77. Mikhail, do you have the source at all? Due to the different cache model of the Barcelona a recompilation might really help.
No, I have no source :-( I absolutely agree w/you - the DFT used is cache-friendly. Moreover, this big performance gap corresponds to DFT w/FMM (the Fast Multipole Method). For usual DFT, the Opteron 2350 cores are also slower than Opteron 246, but only by 33%.
And you make sure the CPU is running at the default frequency. Sometimes
Yeah, can you check the scaling governor isn't set to ondemand or conservative?
Yes, I looked at the frequency many times (like crazy :-)). There is no powersaved daemon, and I saw only 2 Ghz in /proc/cpuinfo :-)
Yours Mikhail
Cheers, Bernd ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] Strange Opteron 2350 performance: Gaussian-03
I'm running a set of quad-core Opteron 2350 benchmarks, in particular using Gaussian-03 (the binary version from Gaussian, Inc., i.e. translated by an older - than current - pgf77 version, for the Opteron target). I compare in particular *one core* of Opteron 2350 w/an Opteron 246 having the same 2 Ghz frequency and the same amount of cache per core (512K L2 + 0.25*2 MB L3 for Opteron 2350 vs just 1 MB L2 for Opteron 246). The Opteron 246 even has faster DDR2-667 RAM. The Gaussian-03 performance in some cases is close for both Opterons (remember that the compilation didn't know about Barcelona!), but for the very popular DFT method the Opteron 2350 cores look slow: one job gives 33% worse performance than the Opteron 246. But on the standard Gaussian-03 test397.com DFT/B3LYP test: *one* (1) Opteron 2350 core takes 15667 sec. (both start-to-stop and cpu) vs 8709 sec. on (one) Opteron 246!! There is no powersaved daemon, so the frequency of the Opteron 2350 is fixed at 2 Ghz. I reproduced this result twice on the Opteron 2350, in particular once using forced good numactl placement. I'm reproducing it on the Opteron 246 again :-) but I have indirect confirmation of these timings (based on a 2-cpu Opteron 246 parallel test). Yes, AFAIK the DFT method is cache-friendly, and the slower L3 cache in Opteron 2350 may give worse performance. But 1.8 times worse?? Any comments are welcome. Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
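One detail worth spelling out, since an unpinned single-core job on a NUMA box can quietly end up with all its memory on the remote socket: forcing local placement with numactl looks like the lines below. The g03 invocation is only illustrative; the numactl options themselves are the standard ones:
$ numactl --hardware                   # see which CPUs and how much memory belong to node 0
$ numactl --cpunodebind=0 --membind=0 g03 < test397.com > test397.log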
Re: [Beowulf] Strange Opteron 2350 performance: Gaussian-03
In message from Li, Bo [EMAIL PROTECTED] (Sun, 29 Jun 2008 00:07:07 +0800): Hello, I am afraid there must be something wrong with your experiment. How did you get the performance? Were your DFT codes running in parallel? Any optimization involved?
I was afraid of the same, but the results have been reproduced twice. As I wrote in my message: - these were ONE-CORE (one CPU for the Opteron 246) runs - the optimization was performed for the OLD Opteron 246 (because Gaussian, Inc. does not offer binaries optimized specially for Barcelona). DFT test397 (like any other DFT job) parallelizes well, and on Opteron 246 it gives a 1.9 times speedup on 2 CPUs. But I didn't run a 2-core parallel job on the Opteron 2350: I was stressed by the results obtained for 1 core.
In most of my tests, K8L or K10 can beat the old opteron at the same frequency with about a 20% improvement.
Sorry, do you have this for Gaussian-03 and for DFT in particular? Did you compile it on K10 using target=barcelona (i.e. optimized for Barcelona)?
Yours Mikhail
Regards, Li, Bo
- Original Message - From: Mikhail Kuzminsky [EMAIL PROTECTED] To: beowulf@beowulf.org Sent: Saturday, June 28, 2008 11:48 PM Subject: [Beowulf] Strange Opteron 2350 performance: Gaussian-03 [...]
___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Strange Opteron 2350 performance: Gaussian-03
In message from Li, Bo [EMAIL PROTECTED] (Sun, 29 Jun 2008 00:37:12 +0800): The problem is present only in the Gaussian-03 binary version we have. If I compile Linpack myself, for example, the Opteron 2350 core is faster. Yes - of course it's Linux x86-64, SuSE 10.3. The powersave daemon is not running. Mikhail Hello, Sorry, I don't have the same applications as you. Did you compile them with gcc? If gcc, then -O3 can do some optimization; -march=k8 is enough, I think. And make sure the CPU is running at the default frequency - sometimes PowerNow! is active by default. And BTW, what's your platform? Linux? Which release? x86-64? Regards, Li, Bo - Original Message - From: Mikhail Kuzminsky [EMAIL PROTECTED] To: Li, Bo [EMAIL PROTECTED] Cc: beowulf@beowulf.org Sent: Sunday, June 29, 2008 12:23 AM Subject: Re: [Beowulf] Strange Opteron 2350 performance: Gaussian-03 ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Strange Opteron 2350 performance: Gaussian-03
In message from Joe Landman [EMAIL PROTECTED] (Sat, 28 Jun 2008 14:48:02 -0400): This is possible, depending upon the compiler used. Though I have to admit that I find it odd that it would be the case within the Opteron family and not between Opteron and Xeon. Intel compilers used to (haven't checked 10.1) switch between fast (SSE*) and slow (x87 FP) paths as a function of a processor version string. If this is code built with an old Intel compiler, it is possible that the code paths are different, though as noted, I would find that surprising within the Opteron family. Well, I thought about the (absence of) SSE use in the binary Gaussian-03 Rev. C02 version I used, but even if x87 code was really generated by pgf77 - why does this x87-based code give such high performance on the Opteron 246 in comparison with an Opteron 2350 core? On both CPUs I ran the same Gaussian binaries! Modern PGI compilers (the suggested default for Gaussian-03, last I checked) have the ability to do this as well, though I don't know how they implement it (capability testing, hopefully?). Out of curiosity, how does STREAM run on both systems? I ran STREAM on Opteron 242 and 244 a few years ago. The scalability and the throughput itself were OK. Recently I ran STREAM on my Opteron 2350-based dual-socket server. In line with the faster DDR2-667 I obtained higher throughput. In particular I reproduced the 8-core result presented in McCalpin's table (sent from AMD), and some data presented earlier on our Beowulf mailing list. (BTW, there is one bad thing for STREAM on this server - the corresponding data are absent from McCalpin's table: the throughput scales well from 1 to 2 OpenMP threads and gives a good result for 8 threads, but the throughput for 4 threads is about the same as for 2 threads. The reason, IMHO, is that for 8 threads RAM is allocated by the kernel on both nodes, but for 4 threads the allocated RAM is placed on one node, and the 4 threads compete badly for memory access.) Taking into account that Gaussian-03 was slow on an Opteron 2350 core in a sequential run, the Opteron 2350's RAM can only be an advantage compared with the Opteron 246. I didn't run STREAM on the Opteron 246, but that is clear to me. Also, it is possible, with a larger cache, that you might be running into some odd cache effects (tlb/page thrashing). But DFTs are usually small and thus sensitive to cache size. You might be able to instrument the run within a papi wrapper, and see if you observe a large number of cache/tlb flushes for some reason. On a related note: are you using a stepping before B3 of the 2350? That could impact performance, if you have the patch in place or have the tlb/cache turned off in the BIOS (some motherboard makers created a patch to do this). Gaussian-03 fails in link302 on Barcelona B2 because of this error. I use stepping B3. Yours Mikhail Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: [EMAIL PROTECTED] web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
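If the 4-thread dip really comes from all pages landing on one node, forcing page interleaving across both nodes should recover most of the bandwidth. A sketch, assuming an OpenMP build of STREAM called ./stream (the binary name is a placeholder):

  # Default first-touch placement
  export OMP_NUM_THREADS=4
  ./stream
  # Same run with pages spread round-robin over both memory controllers
  numactl --interleave=all ./stream
  # Node sizes and free memory, for reference
  numactl --hardware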
Re: [Beowulf] Again about NUMA (numactl and taskset)
In message from Håkon Bugge [EMAIL PROTECTED] (Thu, 26 Jun 2008 11:16:17 +0200): Numastat statistics before the Gaussian-03 run (OpenMP, 8 threads, 8 cores; it requires 512 MB of shared memory plus a bit more, and could fit in the memory of either node - I have 8 GB per node, a bit under 6 GB free in node0 and a bit over 7 GB free in node1):
node0: numa_hit 14594588 numa_miss 0 numa_foreign 0 interleave_hit 14587 local_node 14470168 other_node 124420
node1: numa_hit 11743071 numa_miss 0 numa_foreign 0 interleave_hit 14584 local_node 11727424 other_node 15647
--- Statistics after the run:
node0: numa_hit 15466972 numa_miss 0 numa_foreign 0 interleave_hit 14587 local_node 15342552 other_node 124420
node1: numa_hit 12960452 numa_miss 0 numa_foreign 0 interleave_hit 14584 local_node 12944805 other_node 15647
--- Unfortunately I don't know what exactly these lines mean!! :-( (BTW, does somebody know?!) But intuitively it looks (taking into account the increase of the numa_hit and local_node values) as if the RAM allocation was performed from BOTH nodes (and more RAM was allocated from node1 memory - node1 initially had more free RAM). This is contrary to my expectation of contiguous RAM allocation from the RAM of one node! Mikhail Kuzminsky, Computer Assistance to Chemical Research Zelinsky Institute of Organic Chemistry Moscow At 18:34 25.06.2008, Mikhail Kuzminsky wrote: [the question is quoted in full elsewhere in this digest] Guess the answer is, it depends. The memory will be allocated on the node where the thread first touching it is running. But you could use numastat to investigate the issue. Håkon ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
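For what it's worth, the usual reading of those counters is: numa_hit - pages allocated on the node the kernel intended; numa_miss - pages allocated on this node although another node was preferred; numa_foreign - allocations intended for this node that were satisfied elsewhere; interleave_hit - interleaved allocations that landed on the intended node; local_node / other_node - whether the process requesting the allocation was running on this node or on another one. A small sketch of how the placement can be watched while such a job runs (the job name is a placeholder):

  # Refresh the per-node counters every 2 seconds during the run
  watch -n 2 numastat
  # Or diff snapshots taken before and after the job
  numastat > before.txt ; ./g03_job ; numastat > after.txt
  diff before.txt after.txt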
Re: [Beowulf] Again about NUMA (numactl and taskset)
Let me now assume the following situation. I have an OpenMP-parallelized application whose number of threads equals the number of CPU cores per server. And let me assume that this application does not use too much virtual memory, so all the real memory used can fit in the RAM of *one* node. This is not an abstract question - a lot of the Gaussian-03 jobs we have fit this situation, and all 8 cores of a dual-socket quad-core Opteron server will be well loaded. Is it right that all the application memory (without numactl) will be allocated (by the Linux kernel) on *one* node? Then only one memory controller will be used. OK, then if I have the same server but with half the memory (still enough to run this Gaussian-03 job!) and the DIMMs populate both nodes, the performance of this server will be higher! - because both memory controllers (and therefore more memory channels) will work simultaneously. Is it right that the cheaper server will have higher performance in cases like this?? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
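The two placements being compared can be forced explicitly with numactl, which makes the measurement straightforward. A sketch, with the binary name as a placeholder:

  # Threads and pages confined to node 0: one memory controller only
  numactl --cpunodebind=0 --membind=0 ./g03_openmp_job
  # Pages spread round-robin over both nodes: both controllers share the load
  numactl --interleave=0,1 ./g03_openmp_job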
[Beowulf] Timers and TSC behaviour on SMP/x86
As I remember, TSCs on SMP/x86 are synchronized by the Linux kernel during boot. But the only message (about the TSC) I see after Linux boots, in dmesg (or /var/log/messages), with SuSE 10.3 and its default 2.6.22 kernel on a quad-core dual-socket Opteron server, is: Marking TSC unstable due to TSCs unsynchronized. Does it mean that an RDTSC-based timer (I use it for microbenchmarks) will give wrong results? :-( Some additional information, according to the Software Optimization Guide for AMD Family 10h Processors (quad-core) from Apr 4th, 2008: previously each AMD core had its own TSC. Now quad-core processors have one common clock source in the northbridge (BTW, is the northbridge in this case the one integrated into the CPU chip - i.e. including the integrated memory controller and HT link support? - M.K.) for all the TSCs of the CPUs (cores? - M.K.). The synchronization accuracy should be a few tens of cycles. Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
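When the kernel prints that message it stops using the TSC as the system clocksource (gettimeofday then comes from HPET or the ACPI PM timer), but raw RDTSC from user space still reads the per-core counters, which may drift apart. A quick check of what the kernel decided - sysfs paths may differ between kernel versions - and pinning the microbenchmark to a single core sidesteps cross-core TSC skew:

  dmesg | grep -i tsc
  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
  cat /sys/devices/system/clocksource/clocksource0/available_clocksource
  # read the TSC only from one core during the measurement
  taskset -c 0 ./my_microbenchmark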
[Beowulf] Again about NUMA (numactl and taskset)
I'm testing my first dual-socket quad-core Opteron 2350-based server. Let me assume that the RAM used by the kernel and system processes is zero, there is no physical RAM fragmentation, and the affinity of processes to CPU cores is maintained. I assume also that both nodes are populated with an equal number of identical DIMMs. If I run a thread-parallelized (for example, OpenMP) application with 8 threads (8 = number of server CPU cores), the ideal case for all the (equal) threads is: the shared memory used by each of the 2 CPUs (by each of the 2 quads of processes) should be divided equally between the 2 nodes, and the local memory used by each process should be mapped analogously. Theoretically such an ideal case might be realized if my application (8 threads) uses practically all the RAM and uses only shared memory (I assume here also that all the RAM addresses have the same load, and that the size of the program code is zero :-) ). The questions are:
1) Is there some way to distribute the local memory of the threads analogously (I assume it has the same size for each thread) using a reasonable NUMA allocation?
2) Is it right that using numactl for applications may give a performance improvement in the following case: the number of application processes equals the number of cores of one CPU *AND* the RAM amount necessary for the application fits in one node's DIMMs (I assume that RAM is allocated contiguously)? What will happen to performance (when using numactl) in the case where the required RAM size is higher than the RAM available on one node, so the program cannot take advantage of (load-balanced) simultaneous use of the memory controllers on both CPUs? (I again assume that RAM is allocated contiguously.)
3) Is there some reason to use things like mpirun -np N /usr/bin/numactl numactl_parameters my_application ?
4) If I use malloc() and don't use numactl, how can I find out from which node Linux will begin the real memory allocation? (Remember that I assume all the RAM is free.) And how can I find out where the DIMMs corresponding to the higher or lower RAM addresses are placed?
5) In which cases is it reasonable to switch on node memory interleaving (in the BIOS) for an application which uses more memory than is present on one node?
And BTW: if I use taskset -c CPU1,CPU2, ... program_file and the program_file creates some new processes, will all these processes run only on the CPUs defined in the taskset command? Mikhail Kuzminsky Computer Assistance to Chemical Research Center, Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
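For questions 2), 3) and the taskset question, the command forms being discussed look roughly like this (binary names and node numbers are illustrative):

  # (2) Keep a job that fits one node on that node's cores and DIMMs
  numactl --cpunodebind=0 --membind=0 ./my_application
  # (3) Give every MPI rank the same NUMA policy; per-rank node selection
  #     would need a small wrapper script around numactl
  mpirun -np 8 numactl --localalloc ./my_application
  # taskset: child processes inherit the affinity mask of the parent
  taskset -c 0,1,2,3 ./my_application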
Re: [Beowulf] Again about NUMA (numactl and taskset)
In message from Vincent Diepeveen [EMAIL PROTECTED] (Mon, 23 Jun 2008 18:41:21 +0200): I would add to this: how sure are we that a process (or thread) that allocated, initialized and writes to memory on a single specific memory node also keeps getting scheduled on a core of that memory node? It seems to me that sometimes (like every second or so) threads jump from one memory node to another. I could be wrong, but I certainly have that impression with the Linux kernels. Dear Vincent, do I understand you correctly that simply using taskset is not enough to prevent process migration to another core/node?? Mikhail That said, it has improved a lot; now all we need is a better compiler for Linux. For my chess program GCC generates an executable that is 22% slower in positions per second than Visual C++ 2005. Thanks, Vincent On Jun 23, 2008, at 4:01 PM, Mikhail Kuzminsky wrote: [the original question is quoted in full elsewhere in this digest] ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
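taskset only constrains the set of CPUs a process (and its children, which inherit the mask) may run on; within that set the scheduler is still free to move threads around, so restricting each process to a single core is the strict way to rule out migration. A sketch for checking and tightening the mask of a running job (the PID is hypothetical):

  taskset -pc 12345        # show the allowed-CPU list of PID 12345
  taskset -pc 2 12345      # pin PID 12345 to core 2 only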
Re: [Beowulf] SuperMicro and lm_sensors
In message from Bernard Li [EMAIL PROTECTED] (Thu, 19 Jun 2008 11:28:08 -0700): Hi David: On Thu, Jun 19, 2008 at 6:50 AM, Lombard, David N [EMAIL PROTECTED] wrote: Did you look for /proc/acpi/thermal_zone/*/temperature ? The glob is for your BIOS-defined ID. If it does exist, that's the value that drives /proc/acpi/thermal_zone/*/trip_points. See also /proc/acpi/thermal_zone/*/polling_frequency. I have always wondered about /proc/acpi/thermal_zone. I noticed that on some servers the files exist, but on others that directory is empty. I guess this depends on whether the BIOS exposes the information to the kernel? Or are there modules that I need to install to get it working? AFAIK it depends on the BIOS. On my Tyan S2932 with the latest BIOS version this directory is empty. Mikhail Thanks, Bernard ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
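A quick way to see which of the two sources a given node actually provides - ACPI thermal zones exported by the BIOS, or hardware-monitoring chips read via lm_sensors:

  ls /proc/acpi/thermal_zone/
  cat /proc/acpi/thermal_zone/*/temperature 2>/dev/null
  # fallback: monitoring chips, if lm_sensors is configured for the board
  sensors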
[Beowulf] Tyan S2932 and lm_sensors
Sorry, does somebody have a correct sensors.conf file for the Tyan S2932 motherboard? There is no lm_sensors configuration file for this mobo on the Tyan site :-( Yours Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Tyan S2932 and lm_sensors
In message from Seth Bardash [EMAIL PROTECTED] (Wed, 18 Jun 2008 10:32:17 -0600): ftp://ftp.tyan.com/softwave/lms/2932.sensors.conf Seth Bardash Integrated Solutions and Systems 1510 Old North Gate Road Colorado Springs, CO 80921 719-495-5866 719-495-5870 Fax 719-337-4779 Cell http://www.integratedsolutions.org Failure can not cope with knowledge and perseverance! ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Tyan S2932 and lm_sensors
In message from Seth Bardash [EMAIL PROTECTED] (Wed, 18 Jun 2008 10:32:17 -0600): ftp://ftp.tyan.com/softwave/lms/2932.sensors.conf Seth Bardash Thank you very much!! It's strange, but I didn't find this file in the Tyan archive! Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
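For completeness, the usual way to put a downloaded board description to work with the lm_sensors 2.x layout used on SuSE 10.3 (newer lm_sensors 3.x reads /etc/sensors3.conf instead):

  wget ftp://ftp.tyan.com/softwave/lms/2932.sensors.conf
  cp 2932.sensors.conf /etc/sensors.conf
  sensors -s    # apply the 'set' statements from the new configuration
  sensors       # read temperatures, voltages and fan speeds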
[Beowulf] Powersave on Beowulf nodes
What about using powersaved (and the dbus and HAL daemons) on Beowulf nodes? Currently I have installed SuSE 10.3, where all the corresponding daemons run (by default) at runlevel 3. I simply added issuing powersave -f at the end of booting. /proc/acpi/thermal_zone/ is empty, and powersave can't give me temperature and fan information. I don't see any serious advantage in using the powersaved daemon in performance mode (using the performance scheme). We have many jobs in SGE at any moment, and an underload situation (where it would be reasonable to decrease the CPU frequency) is not a danger for us :-) So I'm thinking about simply stopping all the corresponding daemons. Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
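If the nodes never idle, stopping the daemons is straightforward; a sketch using the SuSE 10.x init-script names (they may differ on other distributions, and dbus/HAL should only be disabled if nothing else on the node needs them):

  /etc/init.d/powersaved stop
  chkconfig powersaved off
  chkconfig haldaemon off
  chkconfig dbus off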
Re: [Beowulf] size of swap partition
In message from Mark Hahn [EMAIL PROTECTED] (Tue, 10 Jun 2008 00:58:12 -0400 (EDT)): ... for instance, you can always avoid OOM with the vm.overcommit_memory=2 sysctl (you'll need to tune vm.overcommit_ratio and the amount of swap to get the desired limits.) in this mode, the kernel tracks how much VM it actually needs (worst-case, reflected in Committed_AS in /proc/meminfo) and compares that to a commit limit that reflects ram and swap. if you don't use overcommit_memory=2, you are basically borrowing VM space in hopes of not needing it. that can still be reasonable, considering how often processes have a lot of shared VM, and how many processes allocate but never touch lots of pages. but you have to ask yourself: would I like a system that was actually _using_ 16 GB of swap? if you have 16x disks, perhaps, but 16G will suck if you only have 1 disk. at least for overcommit_memory != 2, I don't see the point of configuring a lot of swap, since the only time you'd use it is if you were thrashing. sort of a quality of life argument. But what are the recommendations of modern practice? it depends a lot on the size variance of your jobs, as well as their real/virtual ratio. the kernel only enforces RLIMIT_AS (vsz in ps), assuming a 2.6 kernel - I forget whether 2.4 did RLIMIT_RSS or not. if you use overcommit_memory=2, your desired max VM size determines the amount of swap. otherwise, go with something modest - memory size or so. but given that the smallest reasonable single disk these days is probably about 320GB, it's hard to justify being _too_ tight. :-) The disks we use in the nodes are SATA WD 10K RPM with 70 GB :-)) We didn't set overcommit_memory=2, but we do use a strongly restricted scheduling policy for SGE batch jobs, with only a few applications. We have only batch jobs (no interactive ones), moreover - practically only *long* batch jobs. As a result the total VM requested per node is equal to (or lower than) the RAM. There is practically zero swap activity. The only exception is (seldom executed) small test jobs, non-parallelized, mainly for checking input data. They use a small amount of RAM. So it looks to me that I can set the swap size even lower than 1.5*RAM (I think RAM + 4 GB = 20 GB will be enough). In message from Walid [EMAIL PROTECTED] (Tue, 10 Jun 2008 19:27:43 +0300): Hi, For an 8GB dual-socket quad-core node, choosing --recommended in the kickstart file instead of specifying a size, RHEL5 allocates 1GB. our developers say that they should not swap as this will cause an overhead, and they try to avoid it as much as possible OpenSuSE 10.3 recommends a swap size of 2 GB only, but I don't know whether the SuSE installation software performs some estimation based on the server RAM or not. Yours Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
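A sketch of the strict-accounting setup Mark describes, with the numbers from the node above (16 GB RAM, 20 GB swap, default overcommit_ratio of 50, giving CommitLimit = 20 GB + 0.5 * 16 GB = 28 GB):

  sysctl -w vm.overcommit_memory=2
  sysctl -w vm.overcommit_ratio=50
  grep -E 'CommitLimit|Committed_AS' /proc/meminfo
  # persist across reboots
  echo 'vm.overcommit_memory = 2' >> /etc/sysctl.conf
  echo 'vm.overcommit_ratio = 50' >> /etc/sysctl.conf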
[Beowulf] size of swap partition
A long time ago a simple rule was formulated for the swap partition size (equal to the main memory size). Currently we all have relatively large RAM on the nodes (typically, I believe, 2 or more GB per core; we have 16 GB per dual-socket quad-core Opteron node). What is a typical modern swap size today? I understand that it depends on the applications ;-) We, in particular, practically don't have jobs which run out of RAM. For single-core dual-socket Opteron nodes with 4 GB RAM per node and a molecular modelling workload we used a 4 GB swap partition. But what are the recommendations of modern practice? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Inst. of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] Barcelona hardware error: how to detect
How is it possible to detect whether a particular AMD Barcelona CPU has - or doesn't have - the known hardware erratum? To be more exact, is Rev. B2 of the Opteron 2350 the CPU stepping with the error or without it? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Inst. of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Barcelona hardware error: how to detect
In message from Mark Hahn [EMAIL PROTECTED] (Thu, 5 Jun 2008 11:57:28 -0400 (EDT)): To be more exact, is Rev. B2 of the Opteron 2350 the CPU stepping with the error or without it? AMD, like Intel, does a reasonable job of disclosing such info: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/41322.PDF the well-known problem is erratum 298, I think, and fixed in B3. Yes, this AMD errata document says that the error will be fixed in the B3 revision. I heard that new CPUs without the TLB+L3 error are shipping now, but are these CPUs really B3, or perhaps some even newer revision? Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
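The revision can usually be read from /proc/cpuinfo without opening the box: as I understand the AMD revision guide, family 10h B2 parts report cpuid stepping 2 and B3 parts report stepping 3 (worth cross-checking against the guide's table for the exact model):

  grep -E 'cpu family|model|stepping' /proc/cpuinfo | sort -u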
Re: [Beowulf] Barcelona hardware error: how to detect
In message from Mark Hahn [EMAIL PROTECTED] (Thu, 5 Jun 2008 13:30:57 -0400 (EDT)): http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/41322.PDF the well-known problem is erratum 298, I think, and fixed in B3. Yes, this AMD errata document says that the error will be fixed in the B3 revision. I believe the absence of an 'x' in the B3 column of the table on p. 15 means that it _is_ fixed in B3. I have just received some preliminary data about Gaussian-03 run problems with B2 and about the absence of these problems with B3. Yours Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Barcelona hardware error: how to detect
In message from Mark Hahn [EMAIL PROTECTED] (Thu, 5 Jun 2008 13:55:01 -0400 (EDT)): I believe the absence of an 'x' in the B3 column of the table on p. 15 means that it _is_ fixed in B3. I have just received some preliminary data about Gaussian-03 run problems with B2 and about the absence of these problems with B3. I'm mystified by this: B2 was broken, so using it without the BIOS workaround is just a mistake or masochism. the workaround _did_ apparently have performance implications, but that's why B3 exists... do you mean you know of G03 problems on B2 systems which are operating _with_ the workaround? I don't know exactly, but I think the crash happened in the absence of the workaround, because I was not informed of any kernel patches or BIOS changes. This was also interesting for me, because I have no information on how this hardware problem manifests itself in real life. Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Barcelona hardware error: how to detect
In message from Jason Clinton [EMAIL PROTECTED] (Thu, 5 Jun 2008 13:16:33 -0500): On Thu, Jun 5, 2008 at 1:09 PM, Mikhail Kuzminsky [EMAIL PROTECTED] wrote: In message from Mark Hahn [EMAIL PROTECTED] (Thu, 5 Jun 2008 13:55:01 -0400 (EDT)): I'm mystified by this: B2 was broken, so using it without the BIOS workaround is just a mistake or masochism. the workaround _did_ apparently have performance implications, but that's why B3 exists... do you mean you know of G03 problems on B2 systems which are operating _with_ the workaround? I don't know exactly, but I think the crash happened in the absence of the workaround, because I was not informed of any kernel patches or BIOS changes. This was also interesting for me, because I have no information on how this hardware problem manifests itself in real life. Mikhail The B2 BIOS workaround is to disable the L3 cache, which gives you a 10-20% performance hit with no reduction in power consumption. The kernel patch is very extensive and, last I heard, under NDA. AMD has said publicly that the patch gives you a 1-2% performance hit. This URL is old, but may give some information: https://www.x86-64.org/pipermail/discuss/2007-December/010260.html Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Nvidia, cuda, tesla and... where's my double floating point?
In message from Ricardo Reis [EMAIL PROTECTED] (Fri, 2 May 2008 14:05:25 +0100 (WEST)): Does anyone know if/when there will be double-precision floating point on those little toys from Nvidia? Next-generation Tesla, but I don't know when. Or use an AMD FireStream 9170 instead :-) Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Inst. of Organic Chemistry Moscow greets, Ricardo Reis 'Non Serviam' PhD student @ Lasef Computational Fluid Dynamics, High Performance Computing, Turbulence http://www.lasef.ist.utl.pt Cultural Instigator @ Rádio Zero http://www.radiozero.pt http://www.flickr.com/photos/rreis/ ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] IB DDR: mvapich2 vs mvapich performance
In message from Eric Thibodeau [EMAIL PROTECTED] (Wed, 23 Apr 2008 16:48:04 -0400): Mikhail Kuzminsky wrote: In message from Greg Lindahl [EMAIL PROTECTED] (Wed, 23 Apr 2008 00:36:44 -0700): On Wed, Apr 23, 2008 at 07:04:51AM +0400, Mikhail Kuzminsky wrote: Is this throughput difference the result of the MPI-2 vs. MPI implementation, or should I believe that this difference (about 4% for my mvapich vs. mvapich2 at SC'07) is not significant - in the sense that it is simply due to some measurement errors (inaccuracies)? I dunno, does it help your real applications? Significantly - of course not :-) But our application is really bound by throughput! Throughput or latency? Throughput! osu_bw was interesting just for bandwidth ;-) Mikhail Yours Mikhail -- greg Eric ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
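For reference, the OSU bandwidth and latency micro-benchmarks are usually run between two nodes roughly like this (host names are placeholders; the launcher depends on how MVAPICH/MVAPICH2 was built - mpirun_rsh is the classic MVAPICH one):

  mpirun_rsh -np 2 node01 node02 ./osu_bw
  mpirun_rsh -np 2 node01 node02 ./osu_latency   # to separate bandwidth from latency effects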